Managing Storage Costs for GPU-Intensive ML Workloads

In the world of traditional CPUs, I/O has become less and less likely to produce major cost headaches. With relatively low prices for x86 or Arm chips, running them at modest utilization is inefficient, but not a fatal financial situation. Moreover, since AWS has made EBS a requirement for many instance types and added NVMe-based SSDs to others, there are plenty of options for accelerating I/O so that CPUs aren't left waiting on data. I/O remains a significant challenge for GPUs and neural networks, however, where compute costs are often 5-10x higher and the economics of training or inference can fall apart when I/O can't keep up.

In addition to the high cost of idle GPUs, the network file systems and block stores used in traditional CPU-based applications cannot handle the data volumes or IOPS required to process data and train modern foundation models (FMs). As a result, many foundation models and LLMs are trained using parallel file systems such as Amazon FSx for Lustre, which provide the I/O throughput to keep latency down and to prevent GPUs from sitting underutilized while waiting on data transfers.

Managing Costs of Parallel File Systems

While parallel file systems may solve the problem of idle GPUs, they introduce additional costs of their own. SSD-backed Lustre costs up to roughly 30 cents per GB-month, or about 12x the cost of S3 Standard storage. As a result, optimizing I/O and storage costs has become a key, and often overlooked, part of managing AI infrastructure spend.
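To make that concrete, here is a rough back-of-the-envelope comparison for a hypothetical 50 TB training dataset. The per-GB prices are assumptions that vary by region and configuration.

```python
# Back-of-the-envelope monthly storage cost for a 50 TB dataset.
# Per-GB prices below are assumptions and vary by region/configuration.
DATASET_GB = 50 * 1024

LUSTRE_SSD_PER_GB_MONTH = 0.30    # assumed SSD-backed FSx for Lustre price
S3_STANDARD_PER_GB_MONTH = 0.023  # assumed S3 Standard price

lustre_monthly = DATASET_GB * LUSTRE_SSD_PER_GB_MONTH
s3_monthly = DATASET_GB * S3_STANDARD_PER_GB_MONTH

print(f"Lustre SSD:  ${lustre_monthly:,.0f}/month")
print(f"S3 Standard: ${s3_monthly:,.0f}/month")
print(f"Ratio:       {lustre_monthly / s3_monthly:.0f}x")
```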

Scratch space can go a long way toward keeping Lustre costs from outweighing the gains in GPU utilization. Specifically, FSx for Lustre offers a temporary "scratch" deployment type that can feed models without paying for persistent storage, generally at about half the cost of persistent Lustre storage. The durable copy of the data remains in S3, and you can link your S3 bucket directly to the scratch file system.
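As a rough sketch of that setup with boto3, the call below creates a scratch file system linked to an existing bucket. The bucket name, subnet ID, and capacity are placeholders, and newer file systems can also be linked through data repository associations rather than the import/export paths shown here.

```python
import boto3

fsx = boto3.client("fsx")

# Create a temporary (scratch) FSx for Lustre file system linked to an
# existing S3 bucket. Bucket, subnet, and sizing values are placeholders.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=2400,            # GiB; scratch sizes are 1.2 TiB or multiples of 2.4 TiB
    SubnetIds=["subnet-0abc1234"],   # same AZ as the GPU instances
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://my-training-data",        # lazy-load objects from S3
        "ExportPath": "s3://my-training-data/export",
        "ImportedFileChunkSize": 1024,                 # MiB; see striping discussion below
        "AutoImportPolicy": "NEW",                     # only pick up newly created objects
    },
)
print(response["FileSystem"]["FileSystemId"])
```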

Once the file system and bucket are linked, objects in S3 are loaded into Lustre lazily, on first access. There are other, manual ways to "promote" data out of S3, but lazy loading is the most commonly used method. Either way, a key cost optimization is to ensure striping corresponds to file sizes: larger files generally call for chunks of 1 GB or more, while smaller files typically use 128 or 256 MB chunks. Getting this right avoids wasting space within the scratch file system. In essence, the scratch file system is similar to a cache in a traditional compute system, and it's best to treat it as such in terms of volatility.
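For data imported from S3, the chunk size is controlled by the ImportedFileChunkSize setting chosen when the file system is created (shown in the earlier sketch). For data written directly to the file system, stripe settings can be applied per directory with the lfs tool, as in the illustrative sketch below; the directory names and sizes are assumptions, and the commands must run on a client that has the file system mounted.

```python
import subprocess

# Stripe large files (e.g., packed training shards) across all OSTs in
# 1 GiB chunks; keep small files on a single OST with a smaller stripe.
# Paths and sizes are illustrative, not prescriptions.
subprocess.run(
    ["lfs", "setstripe", "-S", "1G", "-c", "-1", "/fsx/train/shards"],
    check=True,
)
subprocess.run(
    ["lfs", "setstripe", "-S", "128M", "-c", "1", "/fsx/train/metadata"],
    check=True,
)
```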

Augmenting the Savings from Scratch Space

In addition to using scratch space, it's generally advisable to avoid cross-AZ traffic: the instances consuming the data should sit in the same Availability Zone as the file system so there are no data transfer costs. The file system can also be mounted across many instances, which allows the cached data to be reused within the AZ of operation. Importantly, as a cache, this layer should not be used to meet any kind of durability requirement; durability should remain the job of the S3 bucket.
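A quick way to verify that alignment is to compare the file system's subnet AZ against the training instances' placement. The sketch below does this with boto3; the file system and instance IDs are placeholders.

```python
import boto3

fsx = boto3.client("fsx")
ec2 = boto3.client("ec2")

# Placeholder identifiers for the Lustre file system and training instances.
FS_ID = "fs-0123456789abcdef0"
INSTANCE_IDS = ["i-0aaa1111bbb22222c", "i-0ddd3333eee44444f"]

# The file system's subnet determines its Availability Zone.
fs = fsx.describe_file_systems(FileSystemIds=[FS_ID])["FileSystems"][0]
subnet = ec2.describe_subnets(SubnetIds=fs["SubnetIds"])["Subnets"][0]
fs_az = subnet["AvailabilityZone"]

# Compare against each GPU instance's AZ to flag potential cross-AZ traffic.
reservations = ec2.describe_instances(InstanceIds=INSTANCE_IDS)["Reservations"]
for res in reservations:
    for inst in res["Instances"]:
        inst_az = inst["Placement"]["AvailabilityZone"]
        status = "OK" if inst_az == fs_az else "cross-AZ!"
        print(f'{inst["InstanceId"]}: {inst_az} vs {fs_az} -> {status}')
```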

Another important consideration is minimizing API calls to the S3 bucket. The file system's auto-import policy can be set to recognize new objects only, ignoring edits and deletes, which limits both API traffic and unnecessary promotion into scratch space. Compressing files is also worth considering, especially for use cases with heavy amounts of text. Finally, and perhaps most importantly, set up a cron job to clean out files older than two or three days: there is no built-in garbage collection for scratch space, so a simple scheduled job, like the one sketched below, has to do the work.
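A minimal sketch of such a cleanup job, intended to be scheduled with cron on a client that mounts the file system, is below. The mount point and retention window are assumptions, and on very large directory trees an lfs-find-based approach is typically faster.

```python
#!/usr/bin/env python3
"""Delete files older than RETENTION_DAYS from the scratch mount.

Intended to be run from cron, e.g. once per day. The mount point and
retention window are assumptions; the durable copy of the data stays in S3.
"""
import os
import time

SCRATCH_MOUNT = "/fsx"   # assumed Lustre mount point
RETENTION_DAYS = 3
cutoff = time.time() - RETENTION_DAYS * 86400

for root, _dirs, files in os.walk(SCRATCH_MOUNT):
    for name in files:
        path = os.path.join(root, name)
        try:
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
        except OSError:
            # File may have been removed by another process; skip it.
            pass
```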

While storage tiering is a well-known cost-optimization technique in traditional compute, Lustre scratch space offers comparable benefits for ML model training, and it is especially important for avoiding the high costs of both idle GPUs and high-I/O parallel file storage.




