Datadog Bill Shock: The Three Signals That Justify a Rebuild
Most Datadog complaints aren't rebuild candidates. Three diagnostics separate the bills you fight from the bills you rebuild around.

“Datadog bill shock” is now a recognised term in mid-market engineering. Search any procurement-themed LinkedIn thread from 2025–2026 and you’ll find the same story: a bill that grew 30–50% year-over-year while headcount grew 10–15%, a CFO who started circling the line item, and a CTO defending the spend because “we’d have to rip out our observability stack.”
Most of those bills don’t justify a rebuild. Three diagnostics separate the bills you negotiate from the bills you rebuild around.
Signal 1: The bill grew faster than headcount over 24 months
This is the single most reliable trigger. If your bill grew at 6× the headcount growth rate over a multi-year window, you have usage drift — a pattern where per-host, per-container, per-log-GB, and per-custom-metric pricing axes compound silently as the system gains complexity.
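A minimal sketch of that comparison, assuming you have the bill and the engineering headcount at the start and end of the same 24-month window. The function names and the sample figures are hypothetical; only the 6× threshold comes from the text above.

```python
def growth_rate(start: float, end: float) -> float:
    """Fractional growth over the window, e.g. 0.25 for +25%."""
    if start <= 0:
        raise ValueError("start value must be positive")
    return (end - start) / start


def bill_vs_headcount(bill_start: float, bill_end: float,
                      heads_start: float, heads_end: float) -> float:
    """Ratio of bill growth to headcount growth over the same window."""
    bill_growth = growth_rate(bill_start, bill_end)
    head_growth = growth_rate(heads_start, heads_end)
    if head_growth <= 0:
        # Flat or shrinking headcount: any bill growth counts as unbounded drift.
        return float("inf") if bill_growth > 0 else 0.0
    return bill_growth / head_growth


DRIFT_THRESHOLD = 6.0  # the 6x figure quoted in this post

# Hypothetical window: monthly bill $20K -> $50K, headcount 40 -> 50.
ratio = bill_vs_headcount(20_000, 50_000, 40, 50)
usage_drift = ratio >= DRIFT_THRESHOLD
print(f"bill grew {ratio:.1f}x as fast as headcount; drift={usage_drift}")
```

Growth is compared as a fraction of the starting value, so a bill that went up 150% against headcount that went up 25% sits exactly on the threshold.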
The mid-market companies we’ve worked with hit this trigger at $35K–$60K/month. At that scale, the bill is now the largest single line item in the infra budget — frequently larger than the AWS bill it was meant to monitor.
A median data point we cite often: roughly $123K/year is what a typical mid-market engineering team is now paying for monitoring alone. The structural problem is that the pricing axes were designed for a 2017-era stack profile: a small number of long-lived hosts, low log volume, few custom metrics. A 2026 stack (containers, ephemeral compute, OpenTelemetry-everything, AI workloads) generates 10–50× more telemetry per dollar of business value.
The bill grew because the architecture changed. The pricing model didn’t.
Signal 2: Custom metrics are >30% of the bill
Custom metrics are the line item that sneaks up. Every “let’s just instrument that” decision adds a few cents. Multiply by the number of engineers who’ve made that decision over three years and you’re looking at $5K–$25K/month in metrics nobody uses.
The diagnostic is simple. Pull the Datadog admin report on cardinality and metric usage over the last 90 days. Sort by cost contribution. If the top 50 metrics are >70% of the spend and the bottom 5,000 are <5%, you have a vestigial-metrics problem.
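The concentration check can be sketched in a few lines once the per-metric costs are in hand. This assumes you have already exported the report into a mapping of metric name to monthly cost; that input shape, the function names, and the synthetic data are all hypothetical, and nothing here calls a Datadog API.

```python
def spend_concentration(costs: dict[str, float],
                        top_n: int = 50,
                        bottom_n: int = 5000) -> tuple[float, float]:
    """Return (top_share, bottom_share) of total custom-metric spend.

    `costs` maps metric name -> monthly cost in dollars.
    """
    ranked = sorted(costs.values(), reverse=True)
    total = sum(ranked)
    if total <= 0:
        return 0.0, 0.0
    top = ranked[:top_n]
    # Take the bottom slice from what remains so it never overlaps the top.
    rest = ranked[top_n:]
    bottom = rest[-bottom_n:] if bottom_n > 0 else []
    return sum(top) / total, sum(bottom) / total


def looks_vestigial(costs: dict[str, float],
                    top_threshold: float = 0.70,
                    bottom_threshold: float = 0.05) -> bool:
    """Apply the >70% / <5% rule of thumb from this post."""
    top_share, bottom_share = spend_concentration(costs)
    return top_share > top_threshold and bottom_share < bottom_threshold


# Synthetic example: 50 heavy metrics, 100 mid-weight, 5,000 near-free stragglers.
costs = {f"core.metric.{i}": 100.0 for i in range(50)}
costs.update({f"team.metric.{i}": 10.0 for i in range(100)})
costs.update({f"legacy.metric.{i}": 0.01 for i in range(5000)})

top_share, bottom_share = spend_concentration(costs)
print(f"top 50: {top_share:.1%}, bottom 5,000: {bottom_share:.1%}")
print("vestigial-metrics problem:", looks_vestigial(costs))
```

The same ranking works on query counts instead of dollars if the question is which metrics nobody reads rather than which ones cost the most.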
Two paths here:
- The cheap one: a metrics audit + drop the bottom 5,000. Saves $5K–$15K/month immediately. No rebuild needed.
- The structural one: rebuild on OpenTelemetry + Prometheus, which doesn’t price by cardinality. Same instrumentation cost, no metric-tax.
Pick the cheap path first. If you do it and the bill still grows because new vestigial metrics appear by month four, that’s a signal that the pricing model is the structural problem.
Signal 3: You have an in-house infra team that has shipped non-trivial systems
This one is about capability, not cost. A Datadog rebuild is real engineering: 11–14 weeks, a parallel-run period, dashboard migration, alerting validation. If your infra team has never shipped a comparable migration, the rebuild is a high-risk undertaking.
The teams that successfully run this rebuild have:
- 3+ infra engineers with on-call experience
- Existing investment in OpenTelemetry instrumentation (or willingness to adopt it)
- A CFO willing to fund the build period (the rebuild costs money upfront)
- An incident-response culture that won’t blame the rebuild for the next outage
If two of those four are missing, the rebuild is the wrong call. Renegotiate the contract instead — at $50K+/month, vendor concessions on the order of 20–30% are achievable. The math is worse, but the risk is bounded.
When the rebuild is the right answer
The math, when the three signals are present:
- Current bill: $480K–$600K/year
- Replacement stack run cost (Prometheus + Grafana + Tempo + OpenObserve): $25K–$45K/year
- Build cost: $90K–$160K fixed-price over 11–14 weeks
- Year-1 reclaim: $300K–$450K (net of build cost)
- Year-2+ reclaim: $420K–$555K/year
These numbers come from real engagements in 2024–2026. Your specific reclaim depends on stack size and log volume; we built a calculator for this exact decision (linked at the bottom of this post).
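The arithmetic behind the list above can be sketched as follows. This is a simplified model, not the calculator itself: it assumes the whole Datadog bill disappears after cutover and ignores the parallel-run period when both stacks are paid for, so it will not reproduce the engagement ranges exactly. The class and field names are hypothetical, and the inputs are midpoints of the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebuildCase:
    current_bill_per_year: float          # what you pay Datadog today
    replacement_run_cost_per_year: float  # self-hosted stack run cost
    build_cost: float                     # one-off, fixed-price build

    @property
    def steady_state_reclaim(self) -> float:
        """Annual saving once the build is paid off (year 2 onward)."""
        return self.current_bill_per_year - self.replacement_run_cost_per_year

    @property
    def year_one_reclaim(self) -> float:
        """First-year saving, net of the build cost."""
        return self.steady_state_reclaim - self.build_cost

    @property
    def payback_months(self) -> float:
        """Months of steady-state saving needed to cover the build."""
        if self.steady_state_reclaim <= 0:
            return float("inf")
        return self.build_cost / (self.steady_state_reclaim / 12)


# Midpoints of the ranges in this post: $540K bill, $35K run cost, $125K build.
case = RebuildCase(540_000, 35_000, 125_000)
print(f"year-1 reclaim:  ${case.year_one_reclaim:,.0f}")
print(f"year-2+ reclaim: ${case.steady_state_reclaim:,.0f}")
print(f"payback:         {case.payback_months:.1f} months")
```

Plugging in a $200K/year bill with the same run and build costs leaves a year-1 reclaim of only $40K, which shows how quickly the build cost comes to dominate at lower spend.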
When the rebuild is the wrong answer
Three patterns where we tell clients not to rebuild:
- Total spend under $200K/year. The rebuild engineering cost dominates the reclaim. Renegotiate instead.
- No in-house infra team. A rebuild that depends on the same vendor (us or anyone) to operate forever is not a rebuild — it’s a procurement swap.
- Datadog usage is concentrated in features competitors don’t replicate well: RUM, synthetic monitoring at very high frequency, certain CSPM workflows. These can be kept on Datadog while you rebuild around them.
The contract-side play (often the first move)
Before any rebuild conversation, the procurement-side play is straightforward:
- Pull cardinality, log volume, and custom metric reports
- Identify 20–30% of usage that’s vestigial
- Walk into the renewal with a credible “we will reduce this by 25% one way or another” position
- Vendors at $50K+/month will move
This works once or twice. By the third renewal, the bill has compounded again and the math has shifted.
That’s when the rebuild conversation becomes serious.
What we ship
For clients where the math is real and the team is capable: fixed-price Datadog rebuild engagements, 11–14 weeks, code in your GitHub from day one, parallel-run period built into the schedule. The Datadog cost + rebuild ROI calculator below estimates payback months for your specific configuration.
Read more: /upstream/datadog-alternative · /upstream/ · /calculators/datadog-cost