Four Ways to Manage Datadog Costs for AI

Datadog recently upped its revenue guidance, and there's new optimism around its stock, with LLM and GPU monitoring seen as catalysts for more growth.

There is some logic behind the enthusiasm for Datadog's growth, even if there's none in how it bills for its services. Application awareness is far more essential to managing LLM and GPU costs than it is in commodity hardware FinOps.

You need to measure price/performance metrics like TTFT (Time-to-First-Token) latency and cache hit ratios, and you need to be able to trace the resources consumed by your top API and LLM endpoints. Datadog can play a valuable role in that.
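As an illustration, here's a minimal sketch of capturing TTFT with the official datadog Python client's DogStatsD interface. The token iterator and the endpoint tag are stand-ins for whatever your LLM client actually produces; treat this as a sketch, not a reference implementation:

```python
# Minimal sketch: measure Time-to-First-Token for a streaming LLM response and
# submit it to Datadog via DogStatsD. Assumes a local Datadog agent.
import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def consume_with_ttft(token_stream, endpoint: str) -> list:
    """Consume a token iterator, emitting TTFT (ms) when the first token arrives.

    TTFT is measured from when we begin waiting on the stream, which slightly
    undercounts if your client opens the connection before this call.
    """
    start = time.monotonic()
    tokens = []
    for i, token in enumerate(token_stream):
        if i == 0:
            ttft_ms = (time.monotonic() - start) * 1000
            # distribution() aggregates server-side, keeping percentiles accurate
            statsd.distribution("llm.ttft_ms", ttft_ms, tags=[f"endpoint:{endpoint}"])
        tokens.append(token)
    return tokens
```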

However, putting more logs and metrics into Datadog without optimization is expensive, especially when logs and custom metrics alone can account for over half a Datadog bill. The main thing to think about is how much precision you need.

For all of the twists and turns in Datadog billing across its logging and custom metrics SKUs, the main tradeoff is precision vs. cost. Too much precision is often not worth the cost; too little leaves SREs without enough data to manage SLAs and incidents, and FinOps without enough data to assess price/performance.

Getting the precision/cost balance right is worth the time investment given the size of a typical Datadog bill, and the impact the data has on infrastructure operations.

Following are four steps you can take to implement a balanced approach to cost vs. precision:

1. Separate Security and SRE Logs

This is always good practice, but it's especially important with AI, given the growing volume of logs. Merging security and SRE logs forces a superset approach to logging, where ingestion and indexing are sized to satisfy both teams' combined requirements, well in excess of what either needs on its own.

Security usually has much tighter requirements, driven by HIPAA, SOX, and other regulations, that typically far exceed what SRE or ops needs. That can force excess logging into Datadog that security could instead handle on a cheaper, security-focused platform. Moreover, the excess doesn't hit just one Datadog SKU.

Datadog bills for logs up to five times, creating what you might call quintuple taxation:

  • Per GB at ingest

  • Per log event for indexing

  • Per GB for index retention from 3 to 30 days

  • Per log event for Flex Logs Storage, which provides warm storage for up to 15 months

  • Per size tier (XS, S, M, L) for Flex Logs Compute to query stored Flex Logs; Datadog doesn't say what the sizes correspond to in terms of CPU or memory

With logs billed across five separate SKUs, some based on events and others on GB, you can quickly find that an optimization acceptable to ops is a problem for security. So excess logs don't take a hit on just one SKU, but on up to five.
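To see how a single decision ripples across the SKUs, here's a rough back-of-the-envelope model. Every rate below is a placeholder to swap for your contract pricing, not a Datadog list price:

```python
# Back-of-the-envelope model of the stacked log SKUs. All rates are PLACEHOLDERS;
# substitute your negotiated contract rates before drawing conclusions.
RATE_INGEST_PER_GB = 0.10        # placeholder: $/GB at ingest
RATE_INDEX_PER_M_EVENTS = 1.70   # placeholder: $/million indexed events
RATE_RETENTION_PER_GB = 0.02     # placeholder: $/GB of index retention
RATE_FLEX_PER_M_EVENTS = 0.05    # placeholder: $/million events in Flex storage

def monthly_log_cost(gb_ingested, m_events_indexed, gb_retained,
                     m_events_flex, flex_compute_tier_cost):
    """Sum the five charges; Flex Compute is a flat tier (XS/S/M/L) cost."""
    return (gb_ingested * RATE_INGEST_PER_GB
            + m_events_indexed * RATE_INDEX_PER_M_EVENTS
            + gb_retained * RATE_RETENTION_PER_GB
            + m_events_flex * RATE_FLEX_PER_M_EVENTS
            + flex_compute_tier_cost)

# Halving indexed events shrinks the indexing, retention, and Flex lines at
# once, which is why excess security logging hurts on multiple SKUs.
```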

To avoid over-logging to accommodate security, it's best to at least move security logs to a separate index, or often to a completely different vendor. Datadog's strength is observability, not security. Finance or procurement might object that consolidating with one vendor could yield better discounts, but I've lived through this at multiple companies, and that's rarely the case.

The savings from reducing SRE logs, especially indexed logs, are usually well beyond whatever extra volume discount is available. So this is typically a win for Finance, SecOps, and SRE alike.
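As an illustration of the split, here's a minimal sketch of tag-based routing at the log shipper. The source names and both sink functions are hypothetical placeholders for your actual transports:

```python
# Minimal sketch: route security events away from Datadog's indexed path at the
# shipper. Source names and sinks are hypothetical; adapt to your pipeline.
SECURITY_SOURCES = {"auth", "audit", "network_policy"}  # assumed examples

def send_to_security_platform(event: dict) -> None:
    ...  # placeholder: ship to your SIEM or cheap object storage

def send_to_datadog(event: dict) -> None:
    ...  # placeholder: ship via your existing Datadog log pipeline

def route_log(event: dict) -> None:
    if event.get("source") in SECURITY_SOURCES or event.get("team") == "secops":
        send_to_security_platform(event)
    else:
        send_to_datadog(event)  # index only what SRE actually needs
```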

2. Determine Sampling Sensitivity to Longer Intervals

One of the logging traps many companies fall into is logging CPU or memory utilization every 10 seconds, along with temperature. This is often overkill, especially outside of peak hours.

Downsampling to 60-second intervals off-peak cuts sample volume during those periods by a factor of six, reducing costs significantly. Here again, it pays to be segregated from security: they may not be able to downsample for compliance or other reasons, while SRE can.
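A minimal sketch of what that can look like in the collection layer, assuming a fixed local peak window (the hours below are an assumption to tune for your own traffic):

```python
# Minimal sketch: pick a sampling interval based on an assumed peak window.
from datetime import datetime
from typing import Optional

PEAK_HOURS = range(8, 20)  # assumption: 08:00-19:59 local time is peak

def sampling_interval_seconds(now: Optional[datetime] = None) -> int:
    now = now or datetime.now()
    return 10 if now.hour in PEAK_HOURS else 60  # 6x fewer samples off-peak
```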

3. Use Metrics Buckets

A lot of what I've discussed so far relates to logging costs, but there are additional ingest and indexing cost considerations around custom metrics. Datadog charges 5 cents per month per custom metric (beyond a base allotment that depends on the number of infra hosts monitored).

Let's just consider what this means for collecting latency values for an application where latencies range from 10 to 1009 ms:

800 API endpoints × 1,000 possible ms values = 800,000 custom metric combinations × $0.05 = $40,000/month

Custom metrics like latency are billed by "cardinality", or the number of distinct values. So if you have 1,000 possible values between 10 and 1009 ms, as above, you pay for 1,000 custom metrics per endpoint. But if you bucket those into 10 ms segments (10-19 ms, 20-29 ms, 30-39 ms, and so on), you pay for just 100.

800 API endpoints × 100 possible ms values = 80,000 custom metric combinations × $0.05 = $4,000/month

Don't worry if this billing doesn't make sense; it's not supposed to. But it exists, so you pay a high premium for excessive precision, just as you do for excessive logging. The tradeoff is that you lose some granularity.

The question is what the benefit is of knowing the exact latency of an API call down to the millisecond, rather than knowing it's between 100 and 110 ms, or between 160 and 170 ms. There's an optimal point between precision and cost that depends on your business needs; even larger buckets, and therefore fewer potential values, might work for you.
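If you submit metrics through DogStatsD, cardinality comes from tag values, so bucketing can happen at the point of submission. A minimal sketch, assuming a 10 ms bucket width and a hypothetical api.requests metric name:

```python
# Minimal sketch: collapse exact ms latencies into 10 ms buckets before tagging,
# so each endpoint contributes ~100 tag values instead of ~1,000.
from datadog import statsd

BUCKET_MS = 10  # widen to 25 or 50 ms for even lower cardinality

def record_latency(endpoint: str, latency_ms: float) -> None:
    low = int(latency_ms // BUCKET_MS) * BUCKET_MS
    bucket = f"{low}-{low + BUCKET_MS - 1}ms"  # e.g., 100-109ms
    statsd.increment("api.requests",
                     tags=[f"endpoint:{endpoint}", f"latency_bucket:{bucket}"])
```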

4. Target Your Observability Budget against Overall Cloud Spend

Like anything else, observability costs should have some budget foundation beyond a random dollar amount. If AI is pushing observability past 25% of your cloud bill, that's a good sign you need to optimize observability. If it's under 10% of your cloud bill, that's a good sign you should be optimizing cloud spend instead, which can include better logging and metrics for GPU and LLM activity in Datadog.
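Here's that rule of thumb as a trivial check, with the 10% and 25% thresholds taken straight from above:

```python
# Minimal sketch: the 10%/25% rule of thumb as a budget check.
def observability_guidance(observability_spend: float, cloud_spend: float) -> str:
    ratio = observability_spend / cloud_spend
    if ratio > 0.25:
        return "Optimize observability cost first"
    if ratio < 0.10:
        return "Optimize cloud spend, including GPU/LLM telemetry coverage"
    return "Ratio is in a reasonable band; tune within each budget"
```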

Summary Checklist

Here's a summary checklist to review your key decisions around precision vs. cost:

□ Observability > 25% of cloud spend? → Optimize observability cost first

□ Using exact latency values for alerting? → Keep precision

□ Using latency for trends/SLA monitoring? → Consider bucketing

□ Security and SRE using same logs? → Separate immediately

Datadog billing is more complex than public cloud billing, so I've tried to keep this as understandable as possible. There are additional hooks regarding infra hosts and custom metrics that might play into your environment and could merit additional consideration. Feel free to DM me if you have specific questions around any of these areas.
