Developing a Unit Cost Measure for Inference
As AI, and generative AI in particular, continues to grow rapidly, one of the main challenges for FinOps is developing cost measures for deployment. Managing inference costs, especially, can be burdensome without a method for measuring those costs, and the familiar approach of starting with billing data might not be the best way to get there.
FinOps is often portrayed as a descendant of finance or DevOps, but its true predecessor is capacity planning and performance management. Some of that heritage has been lost in recent years as companies scrambled to tame a wild cloud bill. FinOps, however, is not just about reducing costs per se; in many cases it is about optimizing price/performance, or cost per unit of work done. This is especially true for foundation model inference, where results must be produced with low latency.
Cost Is Meaningless Without Measuring Against Latency
It is not terribly hard to calculate a cost per inference. You can build a cost pool of allocated hardware, and then divide that by the number of requests processed. Generally speaking, inference-specific software like TensorRT and Triton Inference Server is free, and labor costs are small and fixed, so focusing on inference hardware keeps the cost model simple. But while cost per inference is a good metric to know, it leaves out the very important factor of latency. Cost per inference can easily be decreased while allowing performance to slow, severely limiting the usefulness of this metric. Incorporating latency measures is essential to both measuring and managing inference costs.
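As a minimal sketch of that arithmetic (with placeholder rates and request counts, since every environment differs), the core calculation is just a cost pool divided by request volume:

```python
# Minimal sketch: cost per inference from an allocated hardware cost pool.
# The hourly rate and request count are placeholders, not benchmarks.

gpu_node_hourly_cost = 30.00      # allocated cost of a GPU node per hour (placeholder)
hours_in_period = 24
requests_processed = 4_500_000    # requests served by that node in the period (placeholder)

cost_pool = gpu_node_hourly_cost * hours_in_period
cost_per_inference = cost_pool / requests_processed

print(f"Cost pool: ${cost_pool:,.2f}")
print(f"Cost per inference: ${cost_per_inference:.6f}")
```

This gives a number, but as noted above it says nothing about how long those requests took, which is why latency has to enter the model.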
The starting point for managing inference costs therefore shouldn't be the billing console, Cost Explorer, or bill management tools like Apptio/Cloudability or CloudHealth, but rather observability tools like Grafana or Datadog. That is where performance and latency metrics are tracked, and where you'll find the data needed to manage not just cost, but what each millisecond of performance costs.
Observability tools like those mentioned above can ingest metrics from Triton Inference Server and from NVIDIA DCGM (Data Center GPU Manager). Triton exposes key inference performance metrics such as cache hit rate, cache hit duration, inference queuing duration, and inference success counts, while DCGM provides GPU operational metrics like utilization, power consumption, and memory usage. You can then trend the cost of inference against these metrics and assess the benefits of autoscaling, the cost sensitivity of lower latency, or other levers for managing cost and performance. Merge this data with your billing data and you have not just a cost per inference, but an average cost per inference at varying levels of performance.
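One hedged way to do that merge: the sketch below joins hourly GPU billing data with hourly inference metrics exported from your observability backend, then looks at cost per successful inference across latency bands. The file names and column names are illustrative assumptions, not a standard export format.

```python
# Sketch: join hourly billing data with hourly inference metrics to get
# average cost per inference at different latency levels.
# File names and column names are illustrative assumptions.
import pandas as pd

# Hourly billing export, e.g. from CUR/Cost Explorer: hour, gpu_cost_usd
billing = pd.read_csv("gpu_billing_hourly.csv", parse_dates=["hour"])

# Hourly metrics export, e.g. from Prometheus/Datadog scraping Triton:
# hour, inference_count, success_count, p95_latency_ms
metrics = pd.read_csv("triton_metrics_hourly.csv", parse_dates=["hour"])

df = billing.merge(metrics, on="hour", how="inner")

# Cost per inference, and cost per successful inference, for each hour
df["cost_per_inference"] = df["gpu_cost_usd"] / df["inference_count"]
df["cost_per_success"] = df["gpu_cost_usd"] / df["success_count"]

# Bucket hours by observed p95 latency to see cost at varying performance levels
df["latency_band"] = pd.cut(
    df["p95_latency_ms"],
    bins=[0, 50, 100, 250, float("inf")],
    labels=["<50ms", "50-100ms", "100-250ms", ">250ms"],
)

summary = df.groupby("latency_band", observed=True)["cost_per_success"].mean()
print(summary)
```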
Developing Unit Costs
To really focus on this, you can create a cost per successful inference microsecond, or scale it up to milliseconds (thousandths of a second), or even a full second to get the numbers out of tiny decimals, even though NVIDIA Triton reports durations in microseconds (millionths of a second). You can then break that down by input, output, inference, and queuing time. Triton reports cache metrics separately, but you can extend the analysis to compare the cost of a cache hit against a cache miss. The next step is deciding how to amortize the cost of the GPUs; for simplicity, it can make sense to divide the system cost over GPU time. The added effort of amortizing memory, storage, and network separately might not be worth it, but that can be assessed case by case. This approach still lets you compare single precision to mixed precision, or FP32 to INT8. The summary cost analysis could look something like the sketch below:
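Since the actual figures depend entirely on your deployment, here is a hedged sketch of how such a summary might be assembled from Triton's cumulative duration counters (reported in microseconds) and an amortized GPU system cost; every input value is a placeholder.

```python
# Sketch: unit costs per millisecond of inference work, amortizing the GPU
# system cost over GPU time. All input values are placeholders; Triton's
# duration counters are cumulative and reported in microseconds.

# Amortized cost of the GPU system for the measurement window (placeholder)
gpu_system_cost_usd = 720.00          # e.g., 24 hours of an amortized hourly rate
successful_inferences = 4_000_000     # from a request-success counter (placeholder)

# Cumulative durations over the same window, in microseconds (placeholders)
durations_us = {
    "queue":   9.0e9,
    "input":   2.5e9,
    "compute": 6.0e10,
    "output":  3.0e9,
}

total_us = sum(durations_us.values())
total_ms = total_us / 1_000.0
cost_per_ms = gpu_system_cost_usd / total_ms
cost_per_success = gpu_system_cost_usd / successful_inferences

print(f"Cost per successful inference: ${cost_per_success:.6f}")
print(f"Cost per inference millisecond: ${cost_per_ms:.8f}")

# Allocate cost to each phase in proportion to the time it consumed
for phase, us in durations_us.items():
    share = us / total_us
    print(f"{phase:>8}: {share:6.1%} of time -> ${gpu_system_cost_usd * share:,.2f}")
```

The same allocation-by-time approach works for cache hits versus misses, or for comparing FP32 and INT8 deployments of the same model, as long as each variant's durations and costs are tracked separately.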
One factor to keep in mind is that observability costs typically run 15-30% of a cloud bill, and most tools charge for custom metrics. There's no need to go overboard importing every Triton data point into your observability tool unless it delivers some value in managing or optimizing costs.
Another point to be wary of is overgeneralizing power and space costs. If you include them, do so on a basis that uses a cost per kWh for metered power and a cost per kW for lease costs, or for depreciation if it's an owned facility. Labor costs should be minimal, especially for inference. When the objective is to optimize inference, throwing in data preprocessing, training, or human annotation costs creates misleading distortions.
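If you do include power and space, a hedged sketch of putting them on a per-GPU-hour basis might look like the following; the rates and average power draw are assumptions to be replaced with your own metered and lease or depreciation figures.

```python
# Sketch: converting power and space costs into per-GPU-hour adders.
# All rates are placeholders.

price_per_kwh = 0.12             # metered power, $/kWh (placeholder)
avg_gpu_power_kw = 0.45          # average draw per GPU, e.g. from DCGM power metrics (placeholder)

lease_cost_per_kw_month = 150.0  # lease or depreciation basis, $/kW-month (placeholder)
hours_per_month = 730

power_cost_per_gpu_hour = avg_gpu_power_kw * price_per_kwh
space_cost_per_gpu_hour = (avg_gpu_power_kw * lease_cost_per_kw_month) / hours_per_month

print(f"Power adder: ${power_cost_per_gpu_hour:.4f} per GPU-hour")
print(f"Space adder: ${space_cost_per_gpu_hour:.4f} per GPU-hour")
```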
Conclusion
By aligning compute time with system costs and inference resource consumption, you can develop a cost per inference that provides a trackable summary measure, along with component costs to help monitor and optimize model deployments. Moreover, these can then be put into a time series to measure cost trends, and to develop strategies to reduce costs, latency, or both as model usage grows.