The Economics of GPUs: How to Train Your AI Model Without Going Broke

Many companies have high hopes for AI to revolutionize their business, but those hopes can quickly be dashed by the staggering cost of training sophisticated AI systems. As Elon Musk has pointed out, engineering problems are often the reason progress stalls. This is especially true when it comes to optimizing hardware such as GPUs to efficiently handle the enormous computational demands of training and fine-tuning large language models.

While the big tech giants can spend millions, and sometimes billions, on training and optimization, small and medium-sized companies and startups with shorter runways often fall behind. In this article, we explore a few strategies that even developers with limited resources can use to train AI models without breaking the bank.

In for a penny, in for a pound

As you may know, developing and launching an AI product, whether a base model/large language model (LLM) or a fine-tuned downstream application, relies heavily on specialized AI chips, especially GPUs. These GPUs are so expensive and hard to come by that SemiAnalysis coined the terms “GPU-rich” and “GPU-poor” within the machine learning (ML) community. The cost of training LLMs stems primarily from the hardware, including acquisition and maintenance, rather than from the ML algorithms or expert knowledge.

Training these models requires extensive computation on powerful clusters, and larger models take even longer. For example, training LLaMA 2 70B involved exposing its 70 billion parameters to 2 trillion tokens, which required at least 10^24 floating point operations. Should you give up if you don’t have enough GPUs? No.
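For context, that figure is consistent with a common rule of thumb from the scaling-law literature, which estimates training compute at roughly six floating point operations per parameter per token (a heuristic for intuition, not a claim made in this article):

```latex
% Rule-of-thumb training compute: about 6 FLOPs per parameter per token.
C \approx 6ND
  = 6 \times \underbrace{(7 \times 10^{10})}_{\text{parameters}}
      \times \underbrace{(2 \times 10^{12})}_{\text{tokens}}
  \approx 8.4 \times 10^{23} \approx 10^{24}\ \text{FLOPs}
```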

Alternative strategies

Today, technology companies employ several strategies to find alternative solutions, reduce their dependence on expensive hardware, and ultimately save money.

One approach is to optimize and tweak the training hardware. Although this path is still largely experimental and investment-intensive, it holds promise for the future optimization of LLM training. Examples of such hardware solutions include custom AI chips from Microsoft and Meta, new semiconductor initiatives from Nvidia and OpenAI, dedicated compute clusters from Baidu, rentable GPUs from Vast.ai, and Sohu chips from Etched, among others.

Although this is an important step for progress, this method is better suited to large companies that can afford to invest heavily now in order to reduce expenses later. It is not an option for new entrants with limited financial resources who want to build AI products today.

What to do: Innovative software

If you are on a tight budget, there is another way to optimize LLM training and reduce costs: innovative software. This approach is far more affordable and accessible to most ML engineers, whether they are seasoned professionals or aspiring AI enthusiasts and software developers looking to enter the field. Let’s examine some of these code-based optimization tools in more detail.

Mixed precision training

What it is: Imagine your company has 20 employees but rents office space for 200. Obviously, that would be a waste of resources. A similar inefficiency occurs in model training, where ML frameworks often allocate more memory than is actually necessary. Mixed precision training eliminates this waste, improving both speed and memory usage.

How it works: To achieve this, lower-precision bfloat16 and float16 operations are combined with standard float32 operations, so much of the arithmetic is done in compact 16-bit formats that are faster to compute and cheaper to store. To a layman, this may sound like technical mumbo jumbo, but essentially it means that an AI model can process data faster and use less memory without compromising on accuracy.

Improvement metrics: This technique can lead to runtime improvements of up to 6x on GPUs and 2-3x on TPUs (Google’s Tensor Processing Unit). Open-source frameworks such as Nvidia’s APEX and Meta AI’s PyTorch support mixed-precision training and are therefore easily accessible for pipeline integration. By implementing this method, companies can significantly reduce GPU costs while maintaining an acceptable level of model performance.
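As a rough illustration, here is a minimal sketch of a mixed-precision training step using PyTorch’s built-in automatic mixed precision (AMP); the model, data, and hyperparameters are placeholders invented for this example:

```python
import torch
from torch import nn

# Toy model and optimizer; real LLM training would differ substantially.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid float16 underflow

for step in range(100):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in float16 where it is safe; PyTorch keeps
    # precision-sensitive operations (e.g. large reductions) in float32.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()                # adapts the loss scale for the next iteration
```

On GPUs that support it, `torch.bfloat16` can be used instead of `torch.float16`, which typically removes the need for loss scaling.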

Activation checkpointing

What it is: If memory is your constraint but you are willing to spend more time, activation checkpointing may be the technique for you. In short, it significantly reduces memory consumption by keeping only a minimum of intermediate results in memory, enabling LLM training without upgrading your hardware.

How it works: The main idea of activation checkpointing is to store only a subset of important values during model training and recompute the rest when they are needed. Instead of keeping all intermediate data in memory, the system retains just the essentials, freeing up memory in the process. It is similar to the principle of “we’ll deal with it when we get to it”: don’t worry about less urgent matters until they require attention.

Improvement metrics: In most cases, enabling activation checkpointing reduces memory consumption by up to 70%, at the cost of lengthening the training phase by roughly 15–25%. This is a fair trade-off: companies can train large AI models on their existing hardware without investing additional funds in infrastructure. The PyTorch library mentioned above supports checkpointing, making it straightforward to implement.
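Here is a minimal sketch of how this could look with PyTorch’s `torch.utils.checkpoint` utilities; the layer stack and tensor sizes are illustrative assumptions, not taken from the article:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy 16-block feed-forward stack standing in for a much larger model.
layers = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# Split the stack into 4 segments: only activations at segment boundaries
# are kept; everything in between is recomputed during the backward pass,
# trading extra compute for a much smaller activation memory footprint.
out = checkpoint_sequential(layers, 4, x, use_reentrant=False)
out.sum().backward()
```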

Multi-GPU training

What it is: Imagine a small bakery that needs to produce a large batch of baguettes quickly. One baker working alone will take a long time. Two bakers speed things up. A third baker makes it faster still. Multi-GPU training works in a similar way.

How it works: Instead of one GPU, you use several at once, distributing the training of the AI model across them so that they work in parallel. Logically, this is somewhat the opposite of the previous method, checkpointing, which accepts a longer runtime to avoid hardware acquisition costs. Here, we use more hardware but get the most out of it, maximizing efficiency and shortening runtime so that operating costs fall instead (a minimal sketch with PyTorch’s FSDP follows the tool list below).

Improvement metrics: Here are three robust tools for training LLMs with a multi-GPU setup, listed in ascending order of efficiency based on experimental results:

  • DeepSpeed: A library specifically designed for training AI models with multiple GPUs, capable of achieving speeds up to 10x faster than traditional training approaches.
  • FSDP: One of the most popular frameworks in PyTorch that addresses some of the inherent limitations of DeepSpeed and increases computational performance by an additional 15-20%.
  • YaFSDP: A recently released enhanced version of FSDP for model training, which offers a speedup of 10–25% over the original FSDP method.
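To make the setup concrete, here is a minimal, single-node sketch using PyTorch’s FullyShardedDataParallel (FSDP); the model, loss, and training loop are placeholders invented for this example, not a prescription:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model standing in for an LLM; FSDP shards its parameters,
# gradients, and optimizer state across all participating GPUs.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    inputs = torch.randn(32, 1024, device="cuda")
    loss = model(inputs).square().mean()  # placeholder loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

dist.destroy_process_group()
```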

Conclusion

By using techniques such as mixed precision training, activation checkpointing, and multi-GPU training, even small and medium-sized companies can make significant progress in AI training, both in fine-tuning and model building. These tools improve compute efficiency, reduce runtime, and lower overall costs. Additionally, they allow larger models to be trained on existing hardware, reducing the need for expensive upgrades. By democratizing access to advanced AI capabilities, these approaches enable a wider range of technology companies to innovate and remain competitive in this rapidly evolving space.

As the saying goes, “AI will not replace you, but someone who uses AI will.” It’s time to embrace AI, and with the strategies above, it’s possible even on a low budget.

Ksenia Se is the founder of Turing Post.

