Optimizing SageMaker Inference Costs…and the Costs to Monitor Inference Performance
AWS has recently bowed to competitive pressure and cut prices for many of its NVIDIA-backed GPU instances. These price cuts extend to SageMaker usage, making the service a little more attractive financially. While you can train a model on SageMaker and deploy it elsewhere, many companies choose to remain on the platform for inference. Therefore, understanding the deployment options is essential for keeping inference costs down on the platform.
AWS offers three options for SageMaker inference: Real-Time, Asynchronous, and Serverless. Real-Time allows you to host more than one model per endpoint, Asynchronous reduces the instance hours needed per model, and Serverless opens you up to the fine AWS game of not knowing which resources you're actually using, billing instead for total inference duration in seconds. Given the memory-intensive nature of inference, AWS does allow you to specify the amount of memory with Serverless.
Each endpoint represents a URL to which your model has been deployed, and AWS manages the infrastructure behind it. However, there are different types of endpoints as described by the service names above, and selecting which type to use is essential to keeping costs down.
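The endpoint type is fixed at endpoint-config creation time. As a sketch of what that looks like for Serverless, here is a helper that builds the request body for the SageMaker `create_endpoint_config` API, where memory and max concurrency are the two knobs you control. The config and model names, and the specific memory/concurrency values, are hypothetical placeholders.

```python
def serverless_endpoint_config(config_name, model_name,
                               memory_mb=4096, max_concurrency=10):
    """Build the request body for sagemaker.create_endpoint_config().

    With Serverless, you size the endpoint by memory (1024-6144 MB) and a
    max-concurrency ceiling; you pay per second of inference duration
    rather than per instance-hour.
    """
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,      # pick based on model footprint
                "MaxConcurrency": max_concurrency,  # invocations before throttling
            },
        }],
    }

# Pass the dict straight to boto3 (requires AWS credentials):
#   boto3.client("sagemaker").create_endpoint_config(
#       **serverless_endpoint_config("my-serverless-config", "my-model"))
```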
Real-Time vs. Asynchronous vs. Serverless
The first use case to consider is batch inference, because “Real-Time” endpoints can actually be a good option for batch processing. The reason is that, unlike Asynchronous, which would seem the natural fit for batch use cases, Real-Time allows you to host multiple models on each instance. Therefore, if you can host just two models per endpoint, you’ll end up with the same compute costs as running Asynchronous 50% of the time. Asynchronous does not offer this feature, so understanding the trade-off here is essential for batch workloads.
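The break-even arithmetic behind that 50% figure can be sketched in a few lines. The hourly instance rate here is a made-up placeholder, not an actual SageMaker price:

```python
INSTANCE_HOURLY = 1.21  # hypothetical hourly rate; check real SageMaker pricing

def realtime_multi_model_cost(hours, n_models):
    # One Real-Time instance runs the whole time and serves all n_models.
    return INSTANCE_HOURLY * hours

def async_per_model_cost(hours, n_models, utilization):
    # Each model gets its own Asynchronous endpoint, billed only while scaled up.
    return INSTANCE_HOURLY * hours * utilization * n_models

month = 730  # hours in an average month

# Two models sharing one Real-Time endpoint...
shared = realtime_multi_model_cost(month, 2)
# ...cost the same as two Asynchronous endpoints each busy 50% of the time.
separate = async_per_model_cost(month, 2, 0.5)
```

Pack more models per endpoint, or drop Asynchronous utilization lower, and the balance tips one way or the other.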
For real-time inference, it is generally impractical to run multiple models on one endpoint: in addition to capacity issues, there is cold-start latency when switching among models that share an endpoint. Therefore, you’ll end up running Real-Time for many 24/7 workloads, paying the full cost of an instance (or cluster of instances) for each endpoint. I’ll talk more in a bit about monitoring, which becomes key to optimizing costs here.
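A first step in that monitoring is simply checking how busy each always-on endpoint actually is. SageMaker publishes per-endpoint metrics to CloudWatch under the `AWS/SageMaker` namespace; this helper builds the request body for the CloudWatch `get_metric_statistics` API to pull hourly invocation counts (the endpoint name is a placeholder, and "AllTraffic" assumes the default variant name):

```python
import datetime

def endpoint_utilization_query(endpoint_name, hours=24):
    """Request body for cloudwatch.get_metric_statistics(): hourly
    invocation counts, the starting point for spotting idle endpoints."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "StartTime": now - datetime.timedelta(hours=hours),
        "EndTime": now,
        "Period": 3600,         # one bucket per hour
        "Statistics": ["Sum"],
    }

# With credentials configured:
#   stats = boto3.client("cloudwatch").get_metric_statistics(
#       **endpoint_utilization_query("prod-endpoint"))
```

An endpoint showing hour after hour of zero invocations is a strong candidate for consolidation or deletion.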
For workloads that can tolerate latency, or can run as batch, another option is Serverless, where you select the amount of memory but pay per second of inference time. Inference time is a request-level metric, not a token-level one, so understanding latency and time per request is essential. Generally speaking, this is more expensive than running Asynchronous unless the workloads are particularly spiky and can go hours with extremely light activity.
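To make that comparison concrete, here is a rough break-even sketch. Both rates below are hypothetical placeholders; pull the real per-GB-second Serverless rate and instance-hour rate for your region from the SageMaker pricing page:

```python
# Hypothetical rates -- substitute current SageMaker pricing for your region.
SERVERLESS_PER_GB_SECOND = 0.00002
ASYNC_INSTANCE_HOURLY = 0.23   # e.g. a small CPU instance

def serverless_monthly(requests, seconds_per_request, memory_gb):
    # Serverless bills only for inference duration, scaled by configured memory.
    return requests * seconds_per_request * memory_gb * SERVERLESS_PER_GB_SECOND

def async_monthly(busy_hours):
    # Asynchronous bills instance-hours while the endpoint is scaled up.
    return ASYNC_INSTANCE_HOURLY * busy_hours

# A spiky workload: 50k requests/month at 0.5 s each on 4 GB of memory...
spiky = serverless_monthly(50_000, 0.5, 4)
# ...versus an Asynchronous endpoint kept warm 8 hours a day.
steady = async_monthly(8 * 30)
```

At these (made-up) rates the spiky workload is far cheaper on Serverless; multiply the request volume or per-request duration by 50x and the comparison flips, which is exactly why per-request latency is the number to know.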
Be Mindful of the Observability Tax
As is the case with non-AI/ML workloads, you’ll generally pay an “observability tax” to Datadog, Grafana, or your monitoring vendor. That amount can work out to anywhere from 10% to 30% of your SageMaker bill, so while monitoring is important for optimizing SageMaker inference costs, it needs to be optimized itself.
Datadog will monitor CPU/GPU utilization, memory, disk, and other key usage components, as well as error rates, latency, and the number of model invocation requests. These are all important and helpful, but you’ll be charged as you add more types of custom metrics. You get 200 free custom metrics per month for each infra host you commit to contractually; however, you’ll pay $23 per month (list) for each monitored host, which is more than the cost of a small instance. Moreover, each host adds its own utilization metrics, which increase your billable Datadog metrics. So ultimately you need to find the minimal billing point for the bundle of infra hosts and metrics.

Another consideration is that Datadog bills by the number of hosts running agents - Datadog’s or otherwise - that report into the observability system, regardless of how much those hosts cost to run within AWS. Therefore, selecting smaller instances to save on SageMaker costs can end up increasing the “observability tax” you’ll pay to Datadog.
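Finding that minimal billing point is simple arithmetic once the rates are pinned down. The host price and metric allocation below follow the figures above; the per-metric overage rate is a hypothetical placeholder, so confirm all three against your actual Datadog contract:

```python
# List price per monitored host and per-host custom-metric allocation,
# per the discussion above; the overage rate is a hypothetical placeholder.
HOST_MONTHLY = 23.0
FREE_METRICS_PER_HOST = 200
METRIC_OVERAGE = 0.05  # assumed cost per custom metric beyond the allocation

def datadog_monthly(hosts, custom_metrics):
    """Monthly Datadog bill: flat per-host fee plus metric overage."""
    allocation = hosts * FREE_METRICS_PER_HOST
    overage = max(0, custom_metrics - allocation)
    return hosts * HOST_MONTHLY + overage * METRIC_OVERAGE

# Consolidating 10 small instances onto 4 larger ones can cut the host bill
# even when the custom-metric count stays the same:
many_small = datadog_monthly(10, 1500)  # big allocation, but 10 host fees
few_large = datadog_monthly(4, 1500)    # some metric overage, fewer host fees
```

At these rates the four-host configuration wins despite paying metric overage, illustrating why shrinking instances to trim the SageMaker bill can backfire on the Datadog side.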
Ultimately, managing SageMaker inference costs requires careful planning and detailed monitoring. But without planning the monitoring costs as well, you can end up with a lot of surprises in your observability bill.