Blog · 29 Aug 2025 · 11 min read

The hidden cost of stack sprawl: a Datadog rebuild walkthrough

When observability tooling costs more than the systems it monitors, it's time to rebuild.


Datadog is excellent software. The pricing model is the problem.

In 2024–2025, we’ve worked with five mid-market and enterprise companies whose Datadog bill went from “expensive but acceptable” to “the largest single line item in our infra budget” without a corresponding increase in product complexity. The pattern was the same across all five: usage drift along per-host, per-container, per-log-GB, and per-custom-metric pricing axes that compounds quietly as the system grows.

This post is the walkthrough of one of those rebuilds — what we replaced Datadog with, what we kept, and what the post-rebuild cost looks like.

The starting state

The client was a B2B SaaS company at $40M ARR, with ~80 production services, ~600 hosts, and ~2 TB/month of structured log ingestion. The Datadog bill at engagement start: $48K/month ($576K/yr). Three years earlier the bill had been ~$8K/month. Headcount had grown 2.4× over that period; the bill had grown 6×.

The CFO had flagged the line item. The CTO’s defence was that Datadog was load-bearing for incident response and could not be ripped out without an engineering project.

That engineering project is what we shipped.

The architectural decision

Datadog as deployed at this client covered five distinct functions:

  1. Infrastructure metrics (host CPU, memory, disk).
  2. Application metrics (custom counters, histograms, gauges from app code).
  3. Distributed tracing (request traces across services).
  4. Logs aggregation and search.
  5. Alerting and on-call routing.

We did not try to find a single replacement. The economics work better when you split the functions across three or four purpose-fit tools, each running on raw compute and storage rather than on vendor per-host or per-GB metering.

The replacement stack:

| Function | Replacement | Annualised cost |
|---|---|---|
| Infrastructure metrics | Prometheus + Grafana (self-hosted) | ~$8K/yr (compute + storage) |
| Application metrics | OpenTelemetry SDK → Prometheus | ~$0 (uses the same Prometheus stack) |
| Distributed tracing | Tempo (Grafana stack) or Jaeger | ~$6K/yr |
| Logs aggregation | OpenObserve (self-hosted) | ~$12K/yr |
| Alerting / on-call | PagerDuty (existing) + Grafana alerts | $0 incremental (PagerDuty was already in stack) |

Total run cost for the core stack: ~$26K/yr, vs $576K for Datadog. Build cost: $94K fixed-price over 14 weeks. Y1 reclaim: $456K. Y2+ reclaim: $550K/yr.
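
To make the wiring concrete, here is a minimal sketch of an OpenTelemetry Collector config for this topology. The hostnames, ports, and credentials are placeholders, not the client’s actual config:

```yaml
# Illustrative otel-collector config: one ingest path, three backends.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:            # metrics → Prometheus
    # requires --web.enable-remote-write-receiver on the Prometheus side
    endpoint: http://prometheus.internal:9090/api/v1/write
  otlp/tempo:                       # traces → Tempo
    endpoint: tempo.internal:4317
    tls:
      insecure: true
  otlphttp/openobserve:             # logs → OpenObserve
    endpoint: http://openobserve.internal:5080/api/default
    headers:
      Authorization: "Basic <redacted>"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/openobserve]
```

The point of the single-collector topology: application code only ever talks to the collector, so everything behind it is swappable. That is the contract decision covered further down.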

The 14-week schedule

| Weeks | Milestone |
|---|---|
| 1–2 | Audit current Datadog usage; identify which features are load-bearing vs vestigial |
| 3–4 | Stand up Prometheus + Grafana + Tempo + OpenObserve in the client’s AWS account |
| 5–7 | Migrate infrastructure and application metrics; parallel-run alongside Datadog |
| 8–10 | Migrate distributed tracing; rewrite custom dashboards |
| 11–12 | Migrate logs ingestion; replicate Datadog log queries in OpenObserve |
| 13 | Migrate alerts to Grafana alerting; on-call validation week |
| 14 | Datadog cutover; contract non-renewal communicated to the vendor |

The parallel-run weeks (5–13) were critical. We ran both stacks in production simultaneously, comparing alert fidelity and dashboard accuracy. By week 13 the team trusted the new stack enough to cut over.
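
Much of that comparison can be scripted. A sketch of the kind of parity spot-check we mean, assuming API access to both systems; the metric pair, the environment variable names, and the 5% tolerance are illustrative, not the client’s actual checks:

```python
"""Spot-check metric parity between Datadog and Prometheus during a parallel run.

Illustrative sketch: queries, env var names, and tolerance are assumptions.
"""
import os
import time
import requests

DD_API = "https://api.datadoghq.com/api/v1/query"  # adjust for EU site if needed
PROM_API = "http://prometheus.internal:9090/api/v1/query"  # placeholder host


def datadog_avg(query: str, window_s: int = 300) -> float:
    """Average the last `window_s` seconds of a Datadog metric query."""
    now = int(time.time())
    resp = requests.get(
        DD_API,
        params={"from": now - window_s, "to": now, "query": query},
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        timeout=10,
    )
    resp.raise_for_status()
    points = resp.json()["series"][0]["pointlist"]
    vals = [v for _, v in points if v is not None]
    return sum(vals) / len(vals)


def prometheus_value(query: str) -> float:
    """Evaluate an instant PromQL query and return the first result."""
    resp = requests.get(PROM_API, params={"query": query}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])


# Pairs of (Datadog query, PromQL equivalent); illustrative only.
CHECKS = [
    ("avg:system.cpu.user{env:prod}",
     'avg(rate(node_cpu_seconds_total{mode="user",env="prod"}[5m])) * 100'),
]

for dd_q, prom_q in CHECKS:
    dd, prom = datadog_avg(dd_q), prometheus_value(prom_q)
    drift = abs(dd - prom) / max(dd, 1e-9)
    status = "OK" if drift < 0.05 else "DRIFT"
    print(f"{status}  datadog={dd:.2f}  prom={prom:.2f}  drift={drift:.1%}  {dd_q}")
```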

What we didn’t rebuild in-house

Three smaller functions we didn’t try to self-host:

  1. Synthetic monitoring. A handful of business-critical synthetic checks, replaced with Better Uptime ($300/mo). The marginal gain from an in-house build wasn’t worth it.
  2. Real User Monitoring (RUM). The client wasn’t using this heavily. Dropped entirely; revisit if needed.
  3. Cloud Security Posture (CSPM). Replaced with native AWS Security Hub + Wiz (existing tools). Datadog’s CSPM was redundant here.

Total replacement cost for these: ~$4K/yr on top of the ~$26K core stack, putting the all-in run cost at roughly $30K/yr.

What we did differently this time

Two architecture choices that mattered:

  1. OpenTelemetry as the instrumentation contract. Application code emits OTel spans and metrics. The backend (Prometheus, Tempo) is swappable without re-instrumenting code. If OpenObserve or Tempo turns out to be the wrong choice in two years, the rip-and-replace is a backend change, not a code change (see the sketch after this list).
  2. Logs in object storage, not hot indexes. OpenObserve uses S3 as its primary log store with intelligent indexing. The hot-index economics that drive Datadog’s log pricing simply don’t apply. At 2 TB/month ingestion, the difference is ~$300K/yr.
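
A minimal sketch of what that contract looks like in application code, using the Python OTel SDK; the service name, collector endpoint, and metric are illustrative:

```python
"""Vendor-neutral instrumentation: the app only knows OpenTelemetry.

Sketch assuming the Python OTel SDK; names and endpoints are placeholders.
"""
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

resource = Resource.create({"service.name": "checkout"})  # placeholder name

# Everything points at the collector, never at a vendor backend.
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
metrics.set_meter_provider(MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
    )],
))

tracer = trace.get_tracer("checkout")
orders = metrics.get_meter("checkout").create_counter("orders_processed")


def handle_order(order_id: str) -> None:
    # One span plus one custom counter; no per-custom-metric fee applies.
    with tracer.start_as_current_span("handle_order"):
        orders.add(1, {"region": "eu"})
```

Nothing here imports a Datadog, Grafana, or OpenObserve client; the vendor decision lives entirely in the collector config.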

What this rebuild isn’t

It isn’t a claim that Datadog is bad. For the client at $8K/month three years ago, Datadog was the right answer. The function-to-cost match worked. At $48K/month it had stopped working.

The trigger for a rebuild is almost always a usage-curve inflection, not a tool-quality issue. Datadog hadn’t gotten worse; the pricing model just compounds, exactly as it was designed to.

How to evaluate your own Datadog spend

Three diagnostics that flag a rebuild candidate:

  1. The bill grew faster than headcount over the last 24 months.
  2. Custom metrics are >30% of the bill. This is the line item that sneaks up.
  3. You have an in-house infra team that has shipped non-trivial systems. The rebuild is real engineering; you need engineers who can absorb it.

If yes to all three, the math is worth running.
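
A back-of-envelope version of that math, using this engagement’s numbers as defaults; swap in your own bill and estimates:

```python
# Back-of-envelope rebuild payback, seeded with this engagement's numbers.
monthly_bill = 48_000          # current Datadog spend ($/mo)
build_cost = 94_000            # fixed-price rebuild ($)
replacement_run = 26_000 / 12  # core-stack run cost ($/mo); ~$30K/yr all-in

monthly_saving = monthly_bill - replacement_run
payback_months = build_cost / monthly_saving
year_one = monthly_saving * 12 - build_cost

print(f"payback: {payback_months:.1f} months")   # ~2.1 months
print(f"year-one reclaim: ${year_one:,.0f}")     # ~$456,000
```

At these numbers the build pays for itself in about two months; even a 3× overrun on the build estimate would keep payback under seven months.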


Read more: /upstream/datadog-alternative · /upstream/ · /case-studies/

#upstream #datadog #observability #rebuild
Want this kind of work for your stack? Book a 30-min call →