Blog · 20 May 2026 · 8 min read

Datadog Bill Shock: The Three Signals That Justify a Rebuild

Most Datadog complaints aren't rebuild candidates. Three diagnostics separate the bills you fight from the bills you rebuild around.


“Datadog bill shock” is now a recognised term in mid-market engineering. Search any procurement-themed LinkedIn thread from 2025–2026 and you’ll find the same story: a bill that grew 30–50% year-over-year while headcount grew 10–15%, a CFO who started circling the line item, and a CTO defending the spend because “we’d have to rip out our observability stack.”

Most of those bills don't justify a rebuild. Three diagnostics separate the bills you negotiate from the bills you rebuild around.

Signal 1: The bill grew faster than headcount over 24 months

This is the single most reliable trigger. If your bill grew at 6× the headcount growth rate over that 24-month window, you have usage drift: a pattern where the per-host, per-container, per-log-GB, and per-custom-metric pricing axes compound silently as the system gains complexity.

The mid-market companies we’ve worked with hit this trigger at $35K–$60K/month. At that scale, the bill is now the largest single line item in the infra budget — frequently larger than the AWS bill it was meant to monitor.

A median data point we cite often: a typical mid-market engineering team now pays roughly $123K/year for monitoring alone. The structural problem is that the pricing axes were designed for a 2017-era stack profile: a small number of long-lived hosts, low log volume, few custom metrics. A 2026 stack (containers, ephemeral compute, OpenTelemetry-everything, AI workloads) generates 10–50× more telemetry per dollar of business value.

The bill grew because the architecture changed. The pricing model didn’t.
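The comparison behind Signal 1 can be sketched in a few lines. This is a minimal illustration, not a tool we ship: the function names and the sample figures are hypothetical, and it assumes you compare annualized (compound) growth rates rather than raw 24-month totals.

```python
def annualized_growth(start: float, end: float, months: int) -> float:
    """Compound annual growth rate between two values observed `months` apart."""
    if start <= 0 or months <= 0:
        raise ValueError("start and months must be positive")
    return (end / start) ** (12 / months) - 1


def usage_drift_ratio(bill_start: float, bill_end: float,
                      heads_start: float, heads_end: float,
                      months: int = 24) -> float:
    """How many times faster the bill grew than headcount (annualized)."""
    bill_growth = annualized_growth(bill_start, bill_end, months)
    head_growth = annualized_growth(heads_start, heads_end, months)
    if head_growth <= 0:
        # Flat or shrinking headcount: any bill growth is unbounded drift.
        return float("inf") if bill_growth > 0 else 0.0
    return bill_growth / head_growth


# Hypothetical inputs: monthly bill $20.0K -> $51.2K, headcount 400 -> 441,
# both measured 24 months apart.
ratio = usage_drift_ratio(20_000, 51_200, 400, 441)
print(f"bill grew {ratio:.1f}x faster than headcount")
```

Compare the resulting ratio against the trigger threshold described above; the threshold is deliberately left out of the code so you can apply your own.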

Signal 2: Custom metrics are >30% of the bill

Custom metrics are the line item that sneaks up. Every “let’s just instrument that” decision adds a few cents. Multiply by the number of engineers who’ve made that decision over three years and you’re looking at $5K–$25K/month in metrics nobody uses.

The diagnostic is simple. Pull the Datadog admin report on cardinality and metric usage over the last 90 days. Sort by cost contribution. If the top 50 metrics are >70% of the spend and the bottom 5,000 are <5%, you have a vestigial-metrics problem.

Two paths here:

  • The cheap one: a metrics audit + drop the bottom 5,000. Saves $5K–$15K/month immediately. No rebuild needed.
  • The structural one: rebuild on OpenTelemetry + Prometheus. A self-hosted Prometheus has no per-metric fee, so cardinality costs you memory and storage rather than a line item. Same instrumentation cost, no metric-tax.

Pick the cheap path first. If you do it and the bill still grows because new vestigial metrics appear by month four, that's a signal that the pricing model is the structural problem.

Signal 3: You have an in-house infra team that has shipped non-trivial systems

This one is about capability, not cost. A Datadog rebuild is real engineering: 11–14 weeks, with a parallel-run period, dashboard migration, and alerting validation. If your infra team has never shipped a comparable migration, the rebuild is a high-risk undertaking.

The teams that successfully run this rebuild have:

  • 3+ infra engineers with on-call experience
  • Existing investment in OpenTelemetry instrumentation (or willingness to adopt it)
  • A CFO willing to fund the build period (the rebuild costs money upfront)
  • An incident-response culture that won’t blame the rebuild for the next outage

If two of those four are missing, the rebuild is the wrong call. Renegotiate the contract instead — at $50K+/month, vendor concessions on the order of 20–30% are achievable. The math is worse, but the risk is bounded.
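The "two of four" rule is simple enough to write down explicitly. The sketch below is only an illustration of that rule; the class and field names are hypothetical, and each field is a judgment call you make about your own team.

```python
from dataclasses import dataclass


@dataclass
class TeamReadiness:
    infra_engineers_with_on_call: int
    otel_invested_or_willing: bool
    cfo_funds_build_period: bool
    blameless_incident_culture: bool

    def missing(self) -> list[str]:
        """Names of the prerequisites from the list above that are not met."""
        checks = {
            "3+ infra engineers with on-call experience":
                self.infra_engineers_with_on_call >= 3,
            "OpenTelemetry investment or willingness":
                self.otel_invested_or_willing,
            "CFO willing to fund the build period":
                self.cfo_funds_build_period,
            "incident culture that won't blame the rebuild":
                self.blameless_incident_culture,
        }
        return [name for name, ok in checks.items() if not ok]


def recommend(team: TeamReadiness) -> str:
    """Two or more missing prerequisites means renegotiate, per the text."""
    return "renegotiate" if len(team.missing()) >= 2 else "rebuild candidate"


# Hypothetical team: two infra engineers, no build budget yet.
print(recommend(TeamReadiness(2, True, False, True)))
```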

When the rebuild is the right answer

The math, when the three signals are present:

  • Current bill: $480K–$600K/year
  • Replacement stack run cost (Prometheus + Grafana + Tempo + OpenObserve): $25K–$45K/year
  • Build cost: $90K–$160K fixed-price over 11–14 weeks
  • Year-1 reclaim: $300K–$450K (net of build cost)
  • Year-2+ reclaim: $420K–$555K/year

These numbers come from real engagements between 2024 and 2026. Your specific reclaim depends on stack size and log volume, so we built a calculator for this exact decision (linked at the bottom of this post).
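For a back-of-envelope version of the same math, here is a simplified sketch. It is not the calculator: the function name is hypothetical, the inputs are midpoints of the ranges above, and it assumes the old bill stops in full once the build completes, ignoring any overlap during the parallel-run period. The ranges quoted above come from engagements, not from this formula, so expect small differences.

```python
def rebuild_reclaim(current_bill: float, run_cost: float,
                    build_cost: float) -> tuple[float, float, float]:
    """Return (year-1 reclaim, year-2+ reclaim, payback in months).

    All inputs are annual dollar figures except build_cost, which is one-off.
    """
    steady_state = current_bill - run_cost      # yearly saving once live
    if steady_state <= 0:
        raise ValueError("replacement stack costs as much as the current bill")
    year_one = steady_state - build_cost        # net of the one-off build
    payback_months = build_cost / (steady_state / 12)
    return year_one, steady_state, payback_months


# Midpoints of the ranges listed above (illustrative only).
year_one, steady_state, payback = rebuild_reclaim(
    current_bill=540_000, run_cost=35_000, build_cost=125_000
)
print(f"year 1: ${year_one:,.0f}, year 2+: ${steady_state:,.0f}, "
      f"payback: {payback:.1f} months")
```

With these midpoints the build pays for itself in roughly three months of avoided spend, which is why the capability question in Signal 3 matters more than the arithmetic.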

When the rebuild is the wrong answer

Three patterns where we tell clients not to rebuild:

  1. Total spend under $200K/year. The rebuild engineering cost dominates the reclaim. Renegotiate instead.
  2. No in-house infra team. A rebuild that depends on the same vendor (us or anyone) to operate forever is not a rebuild — it’s a procurement swap.
  3. Datadog usage is concentrated in features competitors don’t replicate well: RUM, synthetic monitoring at very high frequency, certain CSPM workflows. Keep these on Datadog; at most, rebuild the rest of the stack around them.

The contract-side play (often the first move)

Before any rebuild conversation, the procurement-side play is straightforward:

  • Pull cardinality, log volume, and custom metric reports
  • Identify 20–30% of usage that’s vestigial
  • Walk into the renewal with a credible “we will reduce this by 25% one way or another” position
  • Vendors at $50K+/month will move

This works once or twice. By the third renewal, the bill has compounded again and the math has shifted.

That’s when the rebuild conversation becomes serious.
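To see why the concession wears off, here is a small projection. It is a sketch under stated assumptions, not a forecast: the function name is hypothetical, growth is fixed at 40% per year (inside the 30–50% range quoted at the top of the post), and each concession won is modelled as a lasting 25% cut to the run-rate.

```python
def project_bill(start: float, annual_growth: float, concession: float,
                 concessions_won: int, renewals: int) -> list[int]:
    """Annual bill at each renewal, rounded to whole dollars.

    The bill grows by `annual_growth` every year; the first
    `concessions_won` renewals each take `concession` off the run-rate.
    """
    bill = start
    history = []
    for renewal in range(1, renewals + 1):
        bill *= 1 + annual_growth
        if renewal <= concessions_won:
            bill *= 1 - concession
        history.append(round(bill))
    return history


# Hypothetical: $480K/year today, 40% growth, two successful 25% concessions.
history = project_bill(480_000, 0.40, 0.25, concessions_won=2, renewals=3)
for renewal, bill in enumerate(history, start=1):
    print(f"renewal {renewal}: ${bill:,}")
```

Under these assumptions the first two renewals hold the bill to about 5% growth each, and the third lands roughly 54% above where you started.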

What we ship

For clients where the math is real and the team is capable: fixed-price Datadog rebuild engagements, 11–14 weeks, code in your GitHub from day one, parallel-run period built into the schedule. The Datadog cost + rebuild ROI calculator below estimates payback months for your specific configuration.


Read more: /upstream/datadog-alternative · /upstream/ · /calculators/datadog-cost
