The hidden cost of stack sprawl: a Datadog rebuild walkthrough
When observability tooling costs more than the systems it monitors, it's time to rebuild.

Datadog is excellent software. The pricing model is the problem.
In 2024–2026, we’ve worked with five mid-market and enterprise companies whose Datadog bill went from “expensive but acceptable” to “the largest single line item in our infra budget” without a corresponding increase in product complexity. The pattern was the same across all five: usage drift on per-host, per-container, per-log-GB, per-custom-metric pricing axes that compound quietly as the system grows.
This post is the walkthrough of one of those rebuilds — what we replaced Datadog with, what we kept, and what the post-rebuild cost looks like.
The starting state
The client was a B2B SaaS company at $40M ARR, ~80 production services, ~600 hosts, ingesting ~2 TB/month of structured logs. The Datadog bill at engagement start: $48K/month, $576K/yr. Three years prior the bill had been ~$8K/month. Headcount had grown 2.4×; bill had grown 6×.
The CFO had flagged the line item. The CTO’s defence was that Datadog was load-bearing for incident response and could not be ripped out without an engineering project.
That engineering project is what we shipped.
The architectural decision
Datadog as deployed at this client covered five distinct functions:
- Infrastructure metrics (host CPU, memory, disk).
- Application metrics (custom counters, histograms, gauges from app code).
- Distributed tracing (request traces across services).
- Logs aggregation and search.
- Alerting and on-call routing.
We did not try to find a single replacement. The economics work better when you split the functions across three or four purpose-fit tools, each priced on usage rather than per-host.
The replacement stack:
| Function | Replacement | Annualised cost |
|---|---|---|
| Infrastructure metrics | Prometheus + Grafana (self-hosted) | ~$8K/yr (compute + storage) |
| Application metrics | OpenTelemetry SDK → Prometheus | ~$0 (uses same Prom stack) |
| Distributed tracing | Tempo (Grafana stack); Jaeger was the evaluated alternative | ~$6K/yr |
| Logs aggregation | OpenObserve (self-hosted) | ~$12K/yr |
| Alerting / on-call | PagerDuty (existing) + Grafana alerts | $0 incremental (PagerDuty was already in stack) |
Total replacement run cost: ~$26K/yr, vs $576K Datadog. Build cost: $94K fixed-price over 14 weeks. Y1 reclaim: $456K. Y2+ reclaim: $550K/yr.
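The reclaim arithmetic above is worth making explicit. A minimal sketch using the engagement's round numbers (the dictionary keys and helper are illustrative, not a product):

```python
# Hypothetical cost-model sketch with this engagement's round numbers.
DATADOG_ANNUAL = 576_000        # $48K/month at engagement start

replacement_run = {             # annualised run cost per function
    "infra_metrics": 8_000,     # Prometheus + Grafana compute/storage
    "app_metrics":   0,         # rides on the same Prometheus stack
    "tracing":       6_000,     # Tempo
    "logs":          12_000,    # OpenObserve
    "alerting":      0,         # PagerDuty already in stack
}
BUILD_COST = 94_000             # fixed-price build, 14 weeks

run_total  = sum(replacement_run.values())
y1_reclaim = DATADOG_ANNUAL - run_total - BUILD_COST
y2_reclaim = DATADOG_ANNUAL - run_total

print(run_total, y1_reclaim, y2_reclaim)  # 26000 456000 550000
```

The build cost amortises inside year one; from year two the reclaim is pure run-rate delta.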
The 14-week schedule
| Weeks | Milestone |
|---|---|
| 1–2 | Audit current Datadog usage, identify which features are load-bearing vs vestigial |
| 3–4 | Stand up Prometheus + Grafana + Tempo + OpenObserve in client AWS account |
| 5–7 | Migrate infrastructure + application metrics; parallel-run alongside Datadog |
| 8–10 | Migrate distributed tracing; rewrite custom dashboards |
| 11–12 | Migrate logs ingestion; replicate Datadog log queries in OpenObserve |
| 13 | Migrate alerts to Grafana alerting; on-call validation week |
| 14 | Datadog cutover, contract non-renewal communicated to vendor |
The parallel-run weeks (5–13) were critical. We ran both stacks in production simultaneously, comparing alert fidelity and dashboard accuracy. By week 13 the team trusted the new stack enough to cut over.
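The alert-fidelity comparison during the parallel run can be sketched in a few lines. This is an illustration of the check, not the production tooling; `alert_fidelity` and its inputs are hypothetical:

```python
# Illustrative parallel-run check: given alert firing times exported from
# each stack, count how many old-stack alerts the new stack also fired
# within a tolerance window.
from datetime import datetime, timedelta

def alert_fidelity(old_alerts, new_alerts, tolerance=timedelta(minutes=2)):
    """Fraction of old-stack alerts matched by a new-stack alert
    firing within `tolerance` of the original time."""
    matched = 0
    remaining = sorted(new_alerts)
    for t in sorted(old_alerts):
        hit = next((n for n in remaining if abs(n - t) <= tolerance), None)
        if hit is not None:
            matched += 1
            remaining.remove(hit)   # each new alert matches at most once
    return matched / len(old_alerts) if old_alerts else 1.0

dd = [datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 1, 14, 30)]
gf = [datetime(2025, 3, 1, 9, 1), datetime(2025, 3, 1, 14, 33)]
print(alert_fidelity(dd, gf))  # 0.5 — second alert fired 3 minutes late
```

A fidelity score trending toward 1.0 over the parallel-run weeks is the signal that the cutover is safe.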
What we kept Datadog-equivalent for
Three small functions we didn’t try to replace:
- Synthetic monitoring. A handful of business-critical synthetic checks. Replaced with Better Uptime ($300/mo). Marginal gain not worth in-house build.
- Real User Monitoring (RUM). The client wasn’t using this heavily. Dropped entirely; revisit if needed.
- Cloud Security Posture (CSPM). Replaced with native AWS Security Hub + Wiz (existing tools). Datadog’s CSPM was redundant here.
Total replacement cost for these: ~$4K/yr on top of the ~$26K stack above — an all-in run cost still under 6% of the old Datadog bill.
What we did differently this time
Two architecture choices that mattered:
- OpenTelemetry as the instrumentation contract. Application code emits OTel spans and metrics. The backend (Prometheus, Tempo) is swappable without re-instrumenting code. If OpenObserve or Tempo turn out to be the wrong choice in two years, the rip-and-replace is a backend change, not a code change.
- Logs in object storage, not hot indexes. OpenObserve uses S3 as its primary log store with intelligent indexing. The hot-index economics that drive Datadog’s log pricing simply don’t apply. At 2 TB/month ingestion, the difference is ~$300K/yr.
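The instrumentation-contract idea can be sketched without the OTel SDK itself. This is the pattern, not OpenTelemetry's actual API; all class names below are hypothetical:

```python
# Minimal sketch of the "instrumentation contract" pattern: app code talks
# to a narrow facade; the backend is one adapter, swappable at startup.
from abc import ABC, abstractmethod

class MetricBackend(ABC):           # the swappable part
    @abstractmethod
    def record(self, name: str, value: float, tags: dict) -> None: ...

class PrometheusBackend(MetricBackend):
    def __init__(self):
        self.samples = []
    def record(self, name, value, tags):
        # Real code would update a prometheus_client metric here.
        self.samples.append((name, value, tags))

class Instrumentation:              # what app code imports — never a backend
    def __init__(self, backend: MetricBackend):
        self._backend = backend
    def counter(self, name, tags=None):
        self._backend.record(name, 1.0, tags or {})

# Swapping Prometheus for another backend is a one-line change here,
# not a re-instrumentation of 80 services.
telemetry = Instrumentation(PrometheusBackend())
telemetry.counter("checkout.completed", tags={"region": "eu-west-1"})
```

OpenTelemetry provides exactly this separation for real: application code depends on the API package, while the exporter wiring lives in startup configuration.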
What this rebuild isn’t
It isn’t a claim that Datadog is bad. For the client at $8K/month three years ago, Datadog was the right answer. The function-to-cost match worked. At $48K/month it had stopped working.
The trigger for rebuild is almost always a usage-curve inflection, not a tool quality issue. Datadog hadn’t gotten worse — the pricing model just compounds exactly as it was designed to.
How to evaluate your own Datadog spend
Three diagnostics that flag a rebuild candidate:
- The bill grew faster than headcount over the last 24 months.
- Custom metrics are >30% of the bill. This is the line item that sneaks up.
- You have an in-house infra team that has shipped non-trivial systems. The rebuild is real engineering; you need engineers who can absorb it.
If yes to all three, the math is worth running.
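The three diagnostics reduce to a small predicate. A hedged sketch (the function and the 35% custom-metric share are illustrative, not from the engagement):

```python
# Hypothetical rebuild-candidate check mirroring the three diagnostics.
def rebuild_candidate(bill_growth: float, headcount_growth: float,
                      custom_metric_share: float, has_infra_team: bool) -> bool:
    """All three must fire: bill outgrew headcount, custom metrics
    exceed 30% of the bill, and a capable infra team exists."""
    return (bill_growth > headcount_growth
            and custom_metric_share > 0.30
            and has_infra_team)

# This engagement: bill grew 6x while headcount grew 2.4x.
print(rebuild_candidate(6.0, 2.4, 0.35, True))  # True
```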
Read more: /upstream/datadog-alternative · /upstream/ · /case-studies/
Run the matching free calculator. It takes 3 minutes and emails you an 8-page memo.