cloud automation

How to Solve the GPU Shortage Problem With Automation

GPU instances have never been as precious and sought-after as they have since generative AI captured the industry’s attention. Whether it’s due to broken supply chains or the sudden demand spike, one thing is clear: Getting a GPU-powered virtual machine is harder than ever, even if a team is fishing in the relatively large pond of the top three cloud providers. One analysis confirmed “a huge supply shortage of NVIDIA GPUs and networking equipment from Broadcom and NVIDIA due to a massive spike in demand.” Even the company behind the rise of generative AI–OpenAI–suffers from a lack of GPUs. And companies have started adopting rather unusual tactics to get their hands on these machines (like repurposing old video gaming chips). What can teams do when facing a quota issue and the cloud provider runs out of GPU-based instances? And once they somehow score the right instance, how can you make sure no GPUs go to waste? Automation is the answer. Teams can use it to accomplish two goals: Find the best GPU instances for their needs and maximize their utilization to get more bang for their buck. Automation Makes Finding GPU Instances Easier The three major cloud providers offer many types and sizes of GPU-powered instances. And they’re constantly rolling out new ones–an excellent example of that is AWS P5, launched in July 2023. To give a complete picture, here’s an overview of instance families with GPUs from AWS, Google Cloud and Microsoft Azure: AWS P3 P4d G3 G4 (this group includes G4dn and G4ad instances) G5 Note: AWS offers Inferentia machines optimized for deep learning inference apps and Trainium for deep learning training of 100B+ parameter models. Google Cloud Microsoft Azure NCv3-series NC T4_v3-series ND A100 v4-series NDm A100 v4-series When picking instances manually, teams may easily miss out on opportunities to snatch up golden GPUs from the market. Cloud automation solutions help them find a much larger supply of GPU instances with the right performance and cost parameters. Considering GPU Spot Instances Spot instances offer significant discounts–even 90% off on-demand rates–but they come at a price. The potential interruptions make them a risky choice for important jobs. However, running some jobs on GPU spot instances is a good idea as they accelerate the training process, leading to savings. ML training usually takes a very long time–from hours to even weeks. If interruptions occur, the deep learning job must start over, resulting in significant data loss and higher costs. Automation can prevent that, allowing teams to get attractively-priced GPUs still available on the market to cut training and inference expenses while reducing the risk of interruptions. In machine learning, checkpointing is an important practice that allows for the saving of model states at different intervals during training. This practice is especially beneficial in lengthy and resource-intensive training procedures, enabling the resumption of training from a checkpoint in case of interruptions rather than starting anew. Furthermore, checkpointing facilitates the evaluation of models at different stages of training, which can be enlightening for understanding the training dynamics. Zoom in on Checkpointing PyTorch, a popular ML framework, provides native functionalities for checkpointing models and optimizers during training. Additionally, higher-level libraries such as PyTorch Lightning abstract away much of the boilerplate code associated with training, evaluation, and checkpointing in PyTorch. Let’s take a […]

Read More