Managing Storage Costs for GPU-Intensive ML Workloads

In the world of traditional CPUs, I/O has become less and less likely to produce major cost headaches. With relatively low prices for x86 or Arm chips, running them at modest utilization is inefficient, but not a fatal financial situation. Moreover, since AWS has made EBS a requirement for many instance types and added NVMe-based SSDs to others, there are plenty of options for accelerating I/O so that CPUs aren't left waiting on data. I/O remains a significant challenge for GPUs and neural networks, however, where compute costs are often 5-10x higher and the economics of training or inference can fall apart when I/O can't keep up.

In addition to the high cost of idle GPUs, the network file systems and block stores used in traditional CPU-based applications cannot handle the data volumes or IOPS required to process data and train modern foundation models (FMs). As a result, many foundation models and LLMs are trained using parallel file systems such as Amazon FSx for Lustre, which provide the I/O throughput to keep latency down and to prevent GPUs from sitting underutilized while waiting on data transfers.

Managing Costs of Parallel File Systems

While parallel file systems may solve the problem of idle GPUs, they introduce additional costs of their own. SSD-backed Lustre costs up to roughly 30 cents per GB-month, or about 12x the cost of S3 Standard storage. As a result, optimizing I/O and storage costs has become a key, and often overlooked, part of managing AI infrastructure spend.
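To make that concrete, here is a rough back-of-the-envelope comparison for a hypothetical 50 TB training dataset. The per-GB prices are assumptions that vary by region and configuration.

```python
# Back-of-the-envelope monthly storage cost for a 50 TB dataset.
# Per-GB prices below are assumptions and vary by region/configuration.
DATASET_GB = 50 * 1024

LUSTRE_SSD_PER_GB_MONTH = 0.30    # assumed SSD-backed FSx for Lustre price
S3_STANDARD_PER_GB_MONTH = 0.023  # assumed S3 Standard price

lustre_monthly = DATASET_GB * LUSTRE_SSD_PER_GB_MONTH
s3_monthly = DATASET_GB * S3_STANDARD_PER_GB_MONTH

print(f"Lustre SSD:  ${lustre_monthly:,.0f}/month")
print(f"S3 Standard: ${s3_monthly:,.0f}/month")
print(f"Ratio:       {lustre_monthly / s3_monthly:.0f}x")
```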

Scratch space can go a long way toward keeping Lustre costs from outweighing the gains in GPU utilization. Specifically, FSx for Lustre offers a temporary "scratch" deployment type that can feed models without paying for persistent storage, generally at about half the cost of persistent Lustre storage. The durable copy of the data remains in S3, and you can link your S3 bucket directly to the scratch file system.
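As a rough sketch of that setup with boto3, the call below creates a scratch file system linked to an existing bucket. The bucket name, subnet ID, and capacity are placeholders, and newer file systems can also be linked through data repository associations rather than the import/export paths shown here.

```python
import boto3

fsx = boto3.client("fsx")

# Create a temporary (scratch) FSx for Lustre file system linked to an
# existing S3 bucket. Bucket, subnet, and sizing values are placeholders.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=2400,            # GiB; scratch sizes are 1.2 TiB or multiples of 2.4 TiB
    SubnetIds=["subnet-0abc1234"],   # same AZ as the GPU instances
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://my-training-data",        # lazy-load objects from S3
        "ExportPath": "s3://my-training-data/export",
        "ImportedFileChunkSize": 1024,                 # MiB; see striping discussion below
        "AutoImportPolicy": "NEW",                     # only pick up newly created objects
    },
)
print(response["FileSystem"]["FileSystemId"])
```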

Once the file system and bucket are linked, objects in S3 are loaded into Lustre lazily, on first access. There are other, manual ways to "promote" data out of S3, but lazy loading is the most commonly used method. Either way, a key cost optimization is to ensure striping corresponds to file sizes: larger files generally call for chunks of 1 GB or more, while smaller files typically use 128 or 256 MB chunks. Getting this right avoids wasting space within the scratch file system. In essence, the scratch file system is similar to a cache in a traditional compute system, and it's best to treat it as such in terms of volatility.
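For data imported from S3, the chunk size is controlled by the ImportedFileChunkSize setting chosen when the file system is created (shown in the earlier sketch). For data written directly to the file system, stripe settings can be applied per directory with the lfs tool, as in the illustrative sketch below; the directory names and sizes are assumptions, and the commands must run on a client that has the file system mounted.

```python
import subprocess

# Stripe large files (e.g., packed training shards) across all OSTs in
# 1 GiB chunks; keep small files on a single OST with a smaller stripe.
# Paths and sizes are illustrative, not prescriptions.
subprocess.run(
    ["lfs", "setstripe", "-S", "1G", "-c", "-1", "/fsx/train/shards"],
    check=True,
)
subprocess.run(
    ["lfs", "setstripe", "-S", "128M", "-c", "1", "/fsx/train/metadata"],
    check=True,
)
```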

Augmenting the Savings from Scratch Space

In addition to using scratch space, it's generally advisable to avoid cross-AZ traffic: the instances consuming the data should sit in the same Availability Zone as the file system so there are no data transfer costs. The file system can also be mounted across many instances, which allows the cached data to be reused within the AZ of operation. Importantly, as a cache, this layer should not be used to meet any kind of durability requirement; durability should remain the job of the S3 bucket.
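A quick way to verify that alignment is to compare the file system's subnet AZ against the training instances' placement. The sketch below does this with boto3; the file system and instance IDs are placeholders.

```python
import boto3

fsx = boto3.client("fsx")
ec2 = boto3.client("ec2")

# Placeholder identifiers for the Lustre file system and training instances.
FS_ID = "fs-0123456789abcdef0"
INSTANCE_IDS = ["i-0aaa1111bbb22222c", "i-0ddd3333eee44444f"]

# The file system's subnet determines its Availability Zone.
fs = fsx.describe_file_systems(FileSystemIds=[FS_ID])["FileSystems"][0]
subnet = ec2.describe_subnets(SubnetIds=fs["SubnetIds"])["Subnets"][0]
fs_az = subnet["AvailabilityZone"]

# Compare against each GPU instance's AZ to flag potential cross-AZ traffic.
reservations = ec2.describe_instances(InstanceIds=INSTANCE_IDS)["Reservations"]
for res in reservations:
    for inst in res["Instances"]:
        inst_az = inst["Placement"]["AvailabilityZone"]
        status = "OK" if inst_az == fs_az else "cross-AZ!"
        print(f'{inst["InstanceId"]}: {inst_az} vs {fs_az} -> {status}')
```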

Another important consideration is minimizing API calls to the S3 bucket. The file system's auto-import policy can be set to recognize new objects only, ignoring edits and deletes, which limits both API traffic and unnecessary promotion into scratch space. Compressing files is also worth considering, especially for use cases with heavy amounts of text. Finally, and perhaps most importantly, set up a cron job to clean out files older than two or three days: there is no built-in garbage collection for scratch space, so a simple scheduled job, like the one sketched below, has to do the work.
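A minimal sketch of such a cleanup job, intended to be scheduled with cron on a client that mounts the file system, is below. The mount point and retention window are assumptions, and on very large directory trees an lfs-find-based approach is typically faster.

```python
#!/usr/bin/env python3
"""Delete files older than RETENTION_DAYS from the scratch mount.

Intended to be run from cron, e.g. once per day. The mount point and
retention window are assumptions; the durable copy of the data stays in S3.
"""
import os
import time

SCRATCH_MOUNT = "/fsx"   # assumed Lustre mount point
RETENTION_DAYS = 3
cutoff = time.time() - RETENTION_DAYS * 86400

for root, _dirs, files in os.walk(SCRATCH_MOUNT):
    for name in files:
        path = os.path.join(root, name)
        try:
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
        except OSError:
            # File may have been removed by another process; skip it.
            pass
```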

While storage tiering is a well-known cost-optimization technique in traditional compute, Lustre scratch space offers comparable benefits for ML model training, and it is especially important for avoiding the high costs of both idle GPUs and high-I/O parallel file storage.




