How to Solve the GPU Shortage Problem With Automation

GPU instances have never been as precious and sought-after as they are now that generative AI has captured the industry’s attention. Whether the cause is broken supply chains or the sudden spike in demand, one thing is clear: Getting a GPU-powered virtual machine is harder than ever, even for teams fishing in the relatively large pond of the top three cloud providers.

One analysis confirmed “a huge supply shortage of NVIDIA GPUs and networking equipment from Broadcom and NVIDIA due to a massive spike in demand.” Even OpenAI, the company behind the rise of generative AI, suffers from a lack of GPUs. Companies have started adopting rather unusual tactics to get their hands on these machines, like repurposing old video gaming chips.

What can teams do when they face a quota issue and the cloud provider runs out of GPU-based instances? And once they somehow score the right instance, how can they make sure no GPUs go to waste?

Automation is the answer. Teams can use it to accomplish two goals: Find the best GPU instances for their needs and maximize their utilization to get more bang for their buck.

Automation Makes Finding GPU Instances Easier

The three major cloud providers offer many types and sizes of GPU-powered instances. And they’re constantly rolling out new ones; an excellent example is the AWS P5 instance, launched in July 2023.

To give a complete picture, here’s an overview of instance families with GPUs from AWS, Google Cloud and Microsoft Azure:

AWS

  • P3
  • P4d
  • G3
  • G4 (this group includes G4dn and G4ad instances)
  • G5

Note: AWS also offers Inferentia instances optimized for deep learning inference and Trainium instances for training models with 100B+ parameters; both rely on AWS’s custom accelerators rather than GPUs.

Google Cloud

  • A2 (NVIDIA A100 GPUs)
  • A3 (NVIDIA H100 GPUs)
  • G2 (NVIDIA L4 GPUs)
  • N1 (with attachable NVIDIA T4, P100 and V100 GPUs)

Microsoft Azure

  • NCv3-series
  • NC T4_v3-series
  • ND A100 v4-series
  • NDm A100 v4-series

When picking instances manually, teams may easily miss out on opportunities to snatch up golden GPUs from the market.

Cloud automation solutions help them find a much larger supply of GPU instances with the right performance and cost parameters.

Considering GPU Spot Instances

Spot instances offer significant discounts, even 90% off on-demand rates, but those savings come at a price: Potential interruptions make them a risky choice for important jobs. Still, running suitable jobs on GPU spot instances is a good idea, because you get the same GPU acceleration for a fraction of the on-demand cost.

ML training usually takes a long time, from hours to even weeks. If an interruption occurs and no safeguards are in place, the deep learning job must start over, losing the progress made so far and driving up costs. Automation can prevent that, helping teams grab attractively priced GPUs that are still available on the market to cut training and inference expenses while reducing the risk posed by interruptions.

In machine learning, checkpointing is the practice of saving model state at intervals during training. It is especially beneficial in lengthy, resource-intensive training runs, because training can resume from a checkpoint after an interruption rather than starting anew.

Furthermore, checkpointing facilitates the evaluation of models at different stages of training, which can be enlightening for understanding the training dynamics.

Zoom in on Checkpointing

PyTorch, a popular ML framework, provides native functionalities for checkpointing models and optimizers during training. Additionally, higher-level libraries such as PyTorch Lightning abstract away much of the boilerplate code associated with training, evaluation, and checkpointing in PyTorch.

Let’s take a look at an example:

import torch
import torch.nn as nn
import torch.optim as optim

# Assume model is an instance of a PyTorch nn.Module
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Placeholder values that would normally come from the training loop
epoch = 10
loss = 0.05

# Define a path to save the checkpoint
checkpoint_path = 'checkpoint.pth'

# Saving checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, checkpoint_path)

# Loading checkpoint
checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

# Resume training
model.train()  # or model.eval() if you are not resuming training

In the snippet above, torch.save is called to save a dictionary containing the model and optimizer states and other relevant information like the epoch number and loss value to a file. To restore the training state, torch.load is used to load the checkpoint, and the load_state_dict methods are used to restore the model and optimizer states.

From a Kubernetes perspective, saving to a local file is not ideal, because a pod’s filesystem is ephemeral. Instead, you can use a PersistentVolumeClaim to attach a non-ephemeral file system to your training job, or go a step further and use cloud-provided object storage.
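
As a minimal sketch, assuming the PersistentVolumeClaim is mounted into the training pod at /mnt/checkpoints (a placeholder path, not part of the original example), the only change to the earlier snippet is where the checkpoint is written:

import os
import torch
import torch.nn as nn
import torch.optim as optim

# Same toy model, optimizer and training-loop values as in the earlier snippet
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
epoch = 10
loss = 0.05

# Directory backed by the PersistentVolumeClaim mounted into the pod
checkpoint_dir = '/mnt/checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, 'checkpoint.pth')

# Saving the checkpoint works exactly as before; the file now outlives the pod
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, checkpoint_path)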

If you modify the snippet to save the checkpoints to an AWS S3 bucket instead, here is what the code looks like:

import io

import boto3
import torch
import torch.nn as nn
import torch.optim as optim

# Assume model is an instance of a PyTorch nn.Module
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Placeholder values that would normally come from the training loop
epoch = 10
loss = 0.05

# AWS S3 configuration
s3 = boto3.client('s3')
bucket_name = 'your-bucket-name'
checkpoint_key = 'checkpoints/checkpoint.pth'

# Saving checkpoint: serialize to an in-memory buffer, then upload it to S3
checkpoint_buffer = io.BytesIO()
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, checkpoint_buffer)
s3.put_object(Bucket=bucket_name, Key=checkpoint_key, Body=checkpoint_buffer.getvalue())

# Loading checkpoint: download into a buffer, then restore the saved states
checkpoint_obj = s3.get_object(Bucket=bucket_name, Key=checkpoint_key)
checkpoint_buffer = io.BytesIO(checkpoint_obj['Body'].read())
checkpoint = torch.load(checkpoint_buffer)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

The code snippet is not fully complete, as it omits boto3 authentication. However, it illustrates how the put_object call uploads the model checkpoint to object storage rather than saving it to a local file.
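
As a brief, hedged sketch of how those credentials could be supplied (the profile and region names below are placeholders, not part of the original example): in most Kubernetes setups, boto3 resolves credentials automatically from the environment, such as an IAM role attached to the node or service account; alternatively, you can build a session explicitly.

import boto3

# Option 1: rely on the environment. boto3 automatically resolves credentials
# from an IAM role, an instance profile or the AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY environment variables.
s3 = boto3.client('s3')

# Option 2: create a session from a named profile in ~/.aws/credentials
# ('training' and 'us-east-1' are placeholder values).
session = boto3.Session(profile_name='training', region_name='us-east-1')
s3 = session.client('s3')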

You can also explore higher-level approaches, such as using PyTorch Lightning, a library designed to simplify much of the boilerplate code associated with training PyTorch models.
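
As a rough sketch of what that can look like (the module, dataset and hyperparameters below are illustrative assumptions, not taken from the original example), Lightning's ModelCheckpoint callback saves checkpoints automatically, and Trainer.fit can resume from one:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, TensorDataset

class LinearRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# Toy dataset so the sketch runs end to end
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=8)

# Save a checkpoint every epoch and keep a rolling last.ckpt; Lightning
# records the model weights, optimizer state and epoch counter for you.
checkpoint_callback = ModelCheckpoint(dirpath='checkpoints/', save_last=True, every_n_epochs=1)

trainer = pl.Trainer(max_epochs=5, callbacks=[checkpoint_callback])
trainer.fit(LinearRegressor(), loader)

# After an interruption, resume from the last saved checkpoint
trainer = pl.Trainer(max_epochs=10, callbacks=[checkpoint_callback])
trainer.fit(LinearRegressor(), loader, ckpt_path='checkpoints/last.ckpt')

Because checkpointing lives in a callback rather than in the training loop itself, this is exactly the kind of boilerplate Lightning is designed to take off your hands.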

How to Find GPUs and Boost Their Utilization: Case Study

One of our customers was working on an AI-powered intelligence solution for detecting social and news media dangers in real time. The engine analyzes millions of texts simultaneously to detect developing storylines and lets users create Natural Language Processing (NLP) models for defense.

Since the platform uses massive volumes of both classified and public data, its workloads frequently require GPU-enabled instances.

Teams working in the public cloud often create node pools (Auto Scaling groups) to be more efficient. While node pools accelerate the process of provisioning new instances, they can also be expensive, since teams may end up paying for capacity that never gets used.

Thanks to automation features such as the autoscaler and Node Templates, the team could leverage more cost-efficient machine types, such as spot instances. Through a simple graphical UI, Node Templates let users set specific instance preferences instead of picking machines manually. You set the scope, and the autoscaler does the rest, choosing from a wider pool of instances.

Moreover, Node Templates automatically enroll applications on newly introduced instance types, such as AWS P5. This is another task that automation removes from the engineer’s to-do list.

When setting up a Node Template, our customer described their requirements, such as instance types, the lifespan of the new nodes to be added and provisioning configurations. The team also defined constraints, such as instance families it didn’t want to use (P4d, P3d and P2) and the preferred GPU manufacturer (in this case, NVIDIA).

The automation platform discovered five machines matching these parameters, and the autoscaler now adheres to these constraints when adding new ones.

When the GPU task is completed, the autoscaler immediately decommissions instances to help minimize the costs of the entire operation.

Wrap Up

Automation is indispensable in a modern team’s toolkit, especially for teams that work with cloud-native technologies like Kubernetes and are looking for the best cost-performance ratio for their GPU resources.

GPUs often generate high costs, so using automation to find the best options and maximize their utilization brings a two-fold benefit: It makes engineers’ lives easier and finance teams happier. Spot instance automation helps reduce the cost of model training, and together with checkpointing, it helps teams avoid the expense and disruption of interruptions.


To hear more about cloud-native topics, join the Cloud Native Computing Foundation and the cloud-native community at KubeCon+CloudNativeCon North America 2023 – November 6-9, 2023.