Top 7 Inference KPIs
Introduction
AI is transforming FinOps. The profession grew up in the late 2010s and early 2020s around commodity hardware, SaaS, and cloud services, and had considerable success providing business value across the economy. However, AI infrastructure is quickly shifting the focus for FinOps practitioners, requiring a deeper understanding of hardware and software configurations to sustain the usefulness the profession established earlier in the decade.
“Shift Left” is a term commonly used in the FinOps community for managing costs proactively rather than reactively managing last month’s cloud bill. With AI, this is not just a question of getting a spot in the release process or following an existing development process. Rather, the immaturity of AI requires that FinOps create processes others can follow, providing strategic direction across business and engineering functions, not just participating in cost management. The foundation of this leadership will come from an area in which FinOps has always excelled: managing and measuring the data on which major decisions are made. However, for FinOps for AI to have business impact, these proactive measures need to incorporate price/performance and business metrics that don’t just track cost per inference or cost per token, but the KPIs that drive them.
Addressing the Memory Wall
One of the hallmarks of effective FinOps is communicating clear priorities, and not getting trapped in a “death by dashboards” list of metrics that leaves users wondering what to focus on. In the case of inference, these priorities need to center on its scarce resource, which is memory. Yes, other components are important and influence costs, but the most commonly constrained resource in inference is memory.
HBM (High Bandwidth Memory) capacity and bandwidth are doubling roughly every three years. This is not enough to keep up with growing context windows, parameter counts, and cached tokens. Model hosting providers are increasingly looking at offloading the KV (Key Value) Cache onto SSDs or other non-DRAM devices to store the key/value tensors of previously processed tokens, and more innovations here are likely. Model users, who often can’t control KV Cache algorithms, can still control prompt caches and can manage memory resources via GPU selection. Yes, GPU FLOPs are important and need to be measured. However, with a variety of caching options that reduce GPU re-computation, it’s easier to control GPU utilization with appropriate memory management techniques than the other way around.
The KPIs here have been considered in the context of both vLLM and TensorRT-LLM environments. The first four center on price/performance, while the three that follow are general efficiency measures.
Top 7 KPIs
1. Cache Hit Rate
Calculation: Cache Hit Rate = (Cache Hits / Total Prompt Requests) × 100%
Target Value: Varies by application and cost profile, but typically 65-90%.
Background: While there are many cache hit rates, two primary uses take up the most space: Prompt Caching and KV Caching. If you’re not hosting a model, the Prompt Cache Hit Rate will be your top priority.
Impact - Pricing: Nothing directs price/performance more than the Prompt Cache Hit Rate. First, let’s look at price. Model providers generally charge a 200-1000% premium for cache misses. DeepSeek v3, for example, charges 7 cents per million cache hit tokens, but 27 cents per million cache miss tokens. Claude Sonnet 4 charges 30 cents per million cache hit tokens, but $3.00 per million for cache miss tokens, which are billed the same as a standard API call because the system has to process the data from scratch.
When selecting GPUs, the more memory you have, the more you can cache. There are also optimization techniques to improve cache performance, including reusable prefixes and standardizing prompt templates. While GPUs present a cost tradeoff that needs to be assessed, the optimizations make sense in a majority of cases.
Impact - Performance: Actual performance will vary, but the cache miss penalty tends to grow with longer prompts. In any event, a cache miss will increase TTFT (Time to First Token) anywhere from 2x to 20x.
While most of the KPIs that follow measure performance against a cost or efficiency metric, cache hit rate is so important that it’s worth measuring as a simple calculation on its own.
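To make the pricing impact concrete, here is a minimal Python sketch that blends the cache-hit and cache-miss prices into an effective cost per million input tokens at a given hit rate. The DeepSeek v3 rates quoted above are used purely as illustrative defaults; substitute your provider’s pricing.

```python
# Minimal sketch: blended input-token cost as a function of prompt cache hit rate.
# The default prices are the illustrative DeepSeek v3 per-million-token rates cited
# above; substitute your own provider's cache-hit and cache-miss pricing.

def blended_cost_per_million(hit_rate: float,
                             hit_price: float = 0.07,
                             miss_price: float = 0.27) -> float:
    """Weighted-average price per million input tokens at a given cache hit rate."""
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price

for hit_rate in (0.0, 0.65, 0.90):
    print(f"hit rate {hit_rate:.0%}: "
          f"${blended_cost_per_million(hit_rate):.3f} per million input tokens")
```

At the illustrative rates, moving from no caching to a 90% hit rate cuts the effective input-token price by roughly two-thirds, which is why this KPI leads the list.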
2. Cost per ms - Time-to-First Token (TTFT)
Calculation: Infrastructure Cost per Second × Average TTFT in Seconds
The question you want to answer here is how much you will spend to reduce TTFT. Mathematically, the relationship is not linear, so you can add a logarithmic factor on top of this if needed.
Background: One of the most important factors behind additional infrastructure investment for inference is latency tolerance. By quantifying the value of dropping from, say, 500 ms to 400 ms, you can assess whether the additional cost is worth it. This doesn’t just apply to raw GPU power or memory bandwidth, but also to reducing cold starts or optimizing caches.
Impact - Price/Performance: Provides a way to calculate the cost of reducing latency, an important consideration for configuration decisions.
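As a rough illustration of how this KPI supports a configuration decision, the sketch below prices the 500 ms to 400 ms improvement mentioned above under two hypothetical hourly rates. The dollar figures and latencies are assumptions, not benchmarks.

```python
# Hedged sketch of the TTFT cost KPI: infrastructure cost per second times average
# TTFT, compared across two hypothetical configurations. The hourly rates and
# latencies below are placeholder assumptions, not benchmarks.

def ttft_cost_per_request(cost_per_hour: float, avg_ttft_seconds: float) -> float:
    """Infrastructure cost attributable to TTFT for an average request."""
    cost_per_second = cost_per_hour / 3600.0
    return cost_per_second * avg_ttft_seconds

baseline = ttft_cost_per_request(cost_per_hour=2.00, avg_ttft_seconds=0.500)
upgraded = ttft_cost_per_request(cost_per_hour=3.50, avg_ttft_seconds=0.400)

print(f"baseline (500 ms TTFT): ${baseline:.6f} per request")
print(f"upgraded (400 ms TTFT): ${upgraded:.6f} per request")
print(f"premium paid for the 100 ms improvement: ${upgraded - baseline:.6f} per request")
```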
3. Cost per ms - Time-Between-Tokens (TBT)
Calculation: Infrastructure Cost per Second × Average TBT in Seconds
Background: Similar to Cost per ms for TTFT, except with TBT. It becomes increasingly important for longer output lengths.
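A short sketch of the same idea applied to TBT, showing how decode-side cost scales with output length; the hourly rate and per-token latency are placeholder assumptions.

```python
# Hedged sketch: the same cost-times-latency idea applied to TBT. Because TBT is
# paid on every output token, decode-side cost scales with output length. The
# hourly rate and per-token latency are placeholder assumptions.

def decode_cost(cost_per_hour: float, avg_tbt_seconds: float, output_tokens: int) -> float:
    """Approximate infrastructure cost of generating an output of a given length."""
    cost_per_second = cost_per_hour / 3600.0
    return cost_per_second * avg_tbt_seconds * output_tokens

for output_tokens in (100, 1_000, 10_000):
    cost = decode_cost(cost_per_hour=2.00, avg_tbt_seconds=0.030, output_tokens=output_tokens)
    print(f"{output_tokens:>6} output tokens -> ${cost:.5f} of decode time")
```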
4. Used Memory Bandwidth Cost per GB/s
Calculation: Cost per Hour / (Memory Bandwidth × Utilization)
Target Value: Varies by GPU and pricing, but roughly $1.00-1.50 per TB/s per hour of used bandwidth (0.10-0.15 cents per GB/s), in line with the worked example below.
If you’re using a provider, you’re paying for the GPU, not just the memory, but for simplicity it’s easiest to calculate from the GPU’s cost per hour. So an H100 that provides 3.35 TB/s of memory bandwidth and costs $2.00 per hour works out to approximately 60 cents per TB/s per hour. At 60% utilization, that’s about $1.00 per TB/s per hour of bandwidth actually used.
Background: Memory bandwidth is the scarce resource in many training and inference applications. Tracking both the cost and the utilization provides a good measure of how efficiently it’s being used.
Impact - Price/Performance: A key measure for selecting the right instance type and quantifying upgrade points. As a summary measure, it works well as a stand-alone assessment of how efficiently memory is being used.
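A quick Python check of the H100 arithmetic above, assuming the $2.00-per-hour rate and 3.35 TB/s of bandwidth from the example:

```python
# A quick check of the H100 arithmetic above: hourly cost divided by the memory
# bandwidth actually used. Figures come from the example in the text.

def used_bandwidth_cost(cost_per_hour: float, bandwidth_tb_s: float, utilization: float) -> float:
    """Dollars per hour per TB/s of memory bandwidth actually consumed."""
    return cost_per_hour / (bandwidth_tb_s * utilization)

raw = used_bandwidth_cost(cost_per_hour=2.00, bandwidth_tb_s=3.35, utilization=1.0)
used = used_bandwidth_cost(cost_per_hour=2.00, bandwidth_tb_s=3.35, utilization=0.60)

print(f"at 100% utilization: ${raw:.2f} per TB/s per hour")   # ~$0.60
print(f"at  60% utilization: ${used:.2f} per TB/s per hour")  # ~$1.00
```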
5. Cost per Answer
Calculation: Cost Per Answer = Total Inference Costs / Successful Answers Delivered
Background: A better measure than cost per request because it reflects business value, and can be improved by reducing the percentage of requests that fail.
Impact - Price/Performance: A summary metric that’s less about the milliseconds reflected in other KPIs and more about directing optimizations towards business outcomes.
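A minimal sketch of the calculation, illustrating how reducing the failure rate lowers the unit cost even when total spend stays flat; the spend and request volumes are placeholders.

```python
# Minimal sketch of Cost per Answer, showing how cutting the failure rate lowers
# the unit cost even when total spend is unchanged. Spend and volumes are placeholders.

def cost_per_answer(total_inference_cost: float, total_requests: int, failure_rate: float) -> float:
    """Total inference spend divided by the requests that actually produced an answer."""
    successful_answers = total_requests * (1.0 - failure_rate)
    return total_inference_cost / successful_answers

print(f"5% failures: ${cost_per_answer(10_000, 1_000_000, 0.05):.5f} per answer")
print(f"1% failures: ${cost_per_answer(10_000, 1_000_000, 0.01):.5f} per answer")
```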
6. Model Serving Cost per Hour
Calculation: Model Serving Cost/Hour = (Compute + Network + Storage + Shared Costs) / (Hours of Service)
Background: A good all-in metric to track what it costs to run a model. It will vary in practice based on context windows and parameters, but it is also a good way to identify inefficiencies. The proportion of costs will vary depending on whether you host models; those who do will spend more on storage.
One caveat for lower-throughput models: when buying a cloud provider’s serverless inference, there can be a range of billing units - tokens, compute, data inputs - that need to be assessed within the context of those services.
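A minimal sketch of the all-in rate, using placeholder monthly cost components; adjust the buckets to match how your own shared costs are allocated.

```python
# Minimal sketch of the all-in serving rate. The monthly cost components are
# placeholder figures; adjust the buckets to match how your shared costs are allocated.

def model_serving_cost_per_hour(compute: float, network: float, storage: float,
                                shared: float, hours_of_service: float) -> float:
    """(Compute + Network + Storage + Shared Costs) / Hours of Service."""
    return (compute + network + storage + shared) / hours_of_service

hourly = model_serving_cost_per_hour(
    compute=14_600.0,        # GPU instances for the month
    network=900.0,           # egress and inter-zone traffic
    storage=400.0,           # model weights, logs, caches
    shared=1_100.0,          # amortized platform and monitoring costs
    hours_of_service=730.0,  # hours in the month the model was served
)
print(f"model serving cost: ${hourly:.2f} per hour")
```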
7. Productive Runtime
Calculation: Productive Runtime (%) = (Productive Processing Time / Total Allocated Time) × 100%
Where: Total Allocated Time = all time compute resources are reserved; Productive Processing Time = time spent generating useful, successful outputs.
Productive Processing Time is derived from GPU utilization, memory utilization, throughput efficiency (average throughput/max theoretical), and request success rate.
The hourly costs of GPU, memory, storage, and network resources can then be summarized in a cost per successful request.
Background: By pulling together GPU utilization, queue wait times, memory utilization, and request failures, you can create an all-in-one utilization metric that both pinpoints areas of inefficiency and provides a summary of total utilization across components.
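How Productive Processing Time is derived from those signals is a modeling choice rather than a fixed formula; one simple approach, sketched below with placeholder values, is to scale allocated time by the product of the efficiency factors.

```python
# Hedged sketch of Productive Runtime. How Productive Processing Time is derived
# from the underlying signals is a modeling choice; here it is approximated as
# allocated time scaled by the product of the efficiency factors named above.
# All input values are placeholders.

def productive_runtime_pct(total_allocated_hours: float,
                           gpu_utilization: float,
                           memory_utilization: float,
                           throughput_efficiency: float,
                           request_success_rate: float) -> float:
    """Share of reserved compute time spent producing useful, successful output."""
    productive_hours = (total_allocated_hours * gpu_utilization * memory_utilization
                        * throughput_efficiency * request_success_rate)
    return productive_hours / total_allocated_hours * 100.0

pct = productive_runtime_pct(
    total_allocated_hours=720.0,
    gpu_utilization=0.70,
    memory_utilization=0.80,
    throughput_efficiency=0.75,   # average throughput / max theoretical
    request_success_rate=0.97,
)
print(f"productive runtime: {pct:.1f}% of allocated time")
```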
Optimizations
I’ll go into optimizations more in a future edition, but these KPIs are all conducive to prioritizing dynamic batching, quantization, prompt engineering, and other ways to improve price/performance and efficiency. In any case, KPIs give FinOps a way not just to measure costs, but to take the lead in driving inference strategies.