Optimizing AWS Bedrock Costs

As AI usage grows, FinOps teams need to extend their optimization efforts into rapidly growing AI services like Amazon SageMaker and Bedrock. In this article, I’ll focus on Bedrock, which gives developers access to existing Foundation Models (FMs), including Amazon’s own Titan text and image models, Anthropic’s Claude, AI21, Meta’s Llama, Cohere, and others. By providing access to existing FMs, Bedrock lets developers take advantage of pre-trained models that can be customized and fine-tuned for industry-specific use cases or used to build applications like chatbots. Models vary in multimedia capabilities, number of parameters, and fine-tuning options. In addition to the costs associated with input and output tokens, these models can be run with batch processing, with provisioned throughput, or on demand, and those choices represent key decisions for FinOps teams.

Understanding Costs per Input and Output Token

The first step in understanding Bedrock billing is the per-token cost of each model, with newer models generally costing significantly more. Claude 3 Sonnet, for example, costs 0.3 cents per 1,000 input tokens, while the smaller Haiku model costs 12 times less, or 0.025 cents per 1,000 input tokens. The ratio of output to input token pricing also varies widely by model, ranging from about 1.25 for smaller models like Llama 2 or Amazon’s own Titan Text to 5 for Claude. With careful prompt engineering, you can test prompts to assess your output/input ratio before making a full commitment to any model, or optimize prompts to limit output generation as part of preparing for implementation.
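To make this concrete, below is a minimal sketch of estimating per-request cost from token counts. The prices and the 3:1 output/input ratio used here are illustrative assumptions rather than quotes, so substitute current Bedrock pricing for your model and region.

# Minimal sketch: estimate the cost of a single model invocation from token counts.
# Prices are illustrative placeholders; check current Bedrock pricing for your model and region.

def estimate_request_cost(input_tokens: int, output_tokens: int,
                          input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the estimated USD cost of one invocation."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k

# Example: 800 input tokens producing 2,400 output tokens (a 3:1 output/input ratio)
# against assumed Claude 3 Sonnet pricing of $0.003 / $0.015 per 1,000 tokens.
cost = estimate_request_cost(800, 2400, input_price_per_1k=0.003, output_price_per_1k=0.015)
print(f"Estimated cost per request: ${cost:.4f}")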

Many users select the FM before selecting Bedrock as a platform, but in addition to older models costing less, text-only models are generally cheaper, as are Amazon’s own Titan text models. Bedrock also offers embedding models that transform input tokens into vectors, with pricing significantly lower than for other FMs.
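If you are still comparing models, a quick way to survey what is available in a given region, including embedding models and their input and output modalities, is the Bedrock control-plane API. The sketch below assumes boto3 credentials are already configured; field names can vary slightly by SDK version, so treat it as a starting point.

import boto3

# List the foundation models available to the account in this region,
# along with their input/output modalities (text, image, embeddings, etc.).
bedrock = boto3.client("bedrock", region_name="us-east-1")

for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["modelId"],
          model.get("inputModalities"),
          model.get("outputModalities"))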

Provisioned Throughput and Batch Processing

In terms of optimizing costs, one of the first decisions is whether the model can be run asynchronously. If it can, Bedrock offers a batch option, where output is written to an S3 bucket for access at a later time. Batch is not available in every region for every model, but where it is, it is approximately 50% cheaper than on-demand per-token pricing. This can be useful for large datasets, report generation, overnight jobs, and similar use cases.
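As a rough illustration, submitting a batch job from code looks something like the sketch below. The bucket paths, IAM role ARN, and model ID are hypothetical placeholders, the input is typically a JSONL file of prepared model requests, and batch eligibility should be confirmed for your model and region.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Submit an asynchronous (batch) inference job; results land in the output S3 prefix.
response = bedrock.create_model_invocation_job(
    jobName="nightly-report-generation",                        # hypothetical job name
    modelId="anthropic.claude-3-haiku-20240307-v1:0",           # assumed batch-eligible model
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # placeholder IAM role
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://example-bucket/batch-input/requests.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://example-bucket/batch-output/"}
    },
)
print("Job ARN:", response["jobArn"])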

Where jobs aren’t suitable for batch output, another way to save on per-token costs is to run the model with provisioned throughput rather than on demand. This is a simple dropdown option when selecting the model in the console. Provisioned throughput is available in 1-month or 6-month commitments and uses “Model Units” as a billing construct, with each Model Unit, or MU, representing a certain number of input tokens per minute for each FM. The problem is that AWS does not publish this token count, leaving you, the customer, to experiment on your own to assess whether it will save money over on demand. So while selecting provisioned throughput appears simple in the dropdown menu, it is really intended for larger models and customers. AWS customers with Enterprise Support should work with their Technical Account Manager (TAM) and the broader account team before selecting this option. Being both relatively new and priced against undisclosed throughput levels, it can be a perilous way to try to save money, especially if any variability is expected in the pace of model inputs.
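For teams that do go down this path, the purchase itself is a single API call (or the console dropdown mentioned above). The sketch below is a hedged example: the model ID, name, and Model Unit count are placeholders, and actual throughput per MU should be validated against your own traffic before committing to a term.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Purchase provisioned throughput for a model; billing is per Model Unit for the term.
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="claude-sonnet-prod",           # hypothetical name
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",   # placeholder model ID
    modelUnits=1,                                        # throughput per MU is not published, so test first
    commitmentDuration="OneMonth",                       # or "SixMonths"
)
print("Provisioned model ARN:", response["provisionedModelArn"])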


Operating Metrics to Track

Even with the lack of transparency behind Bedrock MUs, it is still a good idea to track key consumption metrics, including the rate of input tokens consumed, the rate of output tokens consumed, token usage per application, idle time, and failure rates. In addition to averages, tracking the variance in each of these metrics will help guide optimization efforts as well as provisioned throughput decisions.
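Bedrock publishes per-model runtime metrics such as InputTokenCount and OutputTokenCount to CloudWatch under the AWS/Bedrock namespace, which makes basic tracking straightforward. The sketch below pulls hourly input-token totals for an assumed model ID; verify the exact metric names and dimensions available in your account.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hourly sum of input tokens consumed by one model over the last 24 hours.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId",
                 "Value": "anthropic.claude-3-sonnet-20240229-v1:0"}],  # assumed model ID
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))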

Beyond batch processing, provisioned throughput, and model selection, optimizations can be made through the model itself, particularly with prompt engineering and caching frequent results. Limits on token generation can also be set to prevent costs from exceeding certain levels, as in the sketch below. Overall, Bedrock offers a new set of optimization challenges and opportunities. As more and more organizations use the service, FinOps must play a key role and partner with Engineering teams to optimize input tokens, model throughput, and batch processing strategy.
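To illustrate that last point about limiting token generation, here is a minimal sketch of capping output tokens on a single Claude invocation through Bedrock. The model ID, prompt, and cap value are placeholders to adapt to your own use case.

import boto3
import json

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Cap output generation for this request; the model stops at max_tokens even if
# it would otherwise keep generating, which bounds the output-token cost.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 500,  # hard cap on output tokens for this request
    "messages": [
        {"role": "user", "content": "Summarize this quarter's cost report in three bullets."}
    ],
}

response = runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])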
