Replaced a high-five-figure observability vendor with a 400-line in-house pipeline, Apoorv Mittal

Context

We had a high-five-figure annual contract with an observability vendor whose strength was full-stack tracing across services. The product was good. The mismatch was that their ingest model assumed we'd add an OpenTelemetry SDK to every service we wanted observed.

We had two services on a custom RPC framework that didn't fit the SDK. The vendor's roadmap had nothing on adding a non-OTel adapter; they were doubling down on OTel as the universal pipe. Our roadmap had nothing on rewriting those two services onto OTel, because the rewrite wasn't worth the investment for a tracing nice-to-have.

We were paying full price for a tool that covered roughly 60% of our production surface.

What we'd lose if we left

I named this explicitly before proposing anything else, because naming what you give up is the only honest version of a build-vs-buy call.

Distributed-trace visualisation across services. Their flame-graph UI was genuinely better than anything we'd build.
The "smart" anomaly detection. Useful, but never the thing that paged someone; that was always a threshold-based alert.
Vendor-managed retention. We'd own this now.

What we'd keep

Query-level dashboards (Grafana over Postgres-backed timeseries).
Incident timelines (a 50-line Slack-archive scraper plus a Postgres table).
Threshold-based alerts (already in PagerDuty, vendor-independent).
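
For a sense of scale on the incident-timeline piece, here is a minimal sketch of the parsing half of such a scraper. It is not our actual script: the function name to_timeline_rows and the row shape are made up for illustration, and it assumes messages shaped like a Slack export (a "ts" string, a "user", a "text", and an optional "subtype" on join/leave notices).

```python
from datetime import datetime, timezone


def to_timeline_rows(messages, incident_id):
    """Turn Slack-export-style message dicts into time-sorted timeline rows.

    Each row is (incident_id, timestamp_utc, user, text), ready to be
    inserted into a timeline table. Names and shapes are hypothetical.
    """
    rows = []
    for msg in messages:
        # Skip join/leave notices (they carry a subtype) and empty messages.
        if msg.get("subtype") or not msg.get("text"):
            continue
        at = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        rows.append((incident_id, at, msg.get("user", "unknown"), msg["text"]))
    rows.sort(key=lambda row: row[1])
    return rows
```

The rows come back in chronological order regardless of the order in the archive, which is the property an incident review cares about.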

The decision

Replace the vendor with a 400-line ingestion pipeline, write to Postgres, query with Grafana. Pay one engineer-month upfront, save the contract.
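
To make the shape of that pipeline concrete, here is a rough sketch of the ingest core: parse, validate, batch-insert. It is an illustration under stated assumptions, not the production code. The table name metric_points, the JSON field names, and both function names are hypothetical, and it uses Python's built-in sqlite3 as a stand-in for Postgres so it runs without a server.

```python
import json
import sqlite3

# Hypothetical schema; the real table lived in Postgres.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metric_points (
    service TEXT NOT NULL,
    name    TEXT NOT NULL,
    ts      REAL NOT NULL,
    value   REAL NOT NULL
)
"""


def parse_point(line):
    """Return (service, name, ts, value), or None if the line is malformed."""
    try:
        doc = json.loads(line)
        return (str(doc["service"]), str(doc["name"]),
                float(doc["ts"]), float(doc["value"]))
    except (ValueError, KeyError, TypeError):
        # Bad JSON, missing field, or wrong type: reject the line.
        return None


def ingest(conn, lines):
    """Insert all valid points in one batch; return (accepted, rejected)."""
    points = [parse_point(line) for line in lines]
    good = [p for p in points if p is not None]
    with conn:  # commits on success, rolls back if the insert raises
        conn.executemany("INSERT INTO metric_points VALUES (?, ?, ?, ?)", good)
    return len(good), len(points) - len(good)
```

Returning the rejected count rather than raising on the first bad line is a design choice for this sketch: one malformed point from one service shouldn't block the rest of the batch.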

What played out

The build came in at three sprints, not one. The slip was on the boring parts I underestimated:

Multi-tenant scoping in the ingest API (one sprint of unbudgeted auth work).
Backfill of two months of historical data so the "before / after" graphs in incident reviews wouldn't have a hole.
The retention story (what to keep, what to roll up, what to drop) ate a sprint of tuning.
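
The keep / roll up / drop split in the retention item can be sketched as a single pass over the points. This is a simplified illustration, not our tuned policy: the 7-day and 90-day windows, the hourly bucket size, and the name apply_retention are all placeholder assumptions.

```python
from collections import defaultdict

DAY = 86400   # seconds
HOUR = 3600   # seconds


def apply_retention(points, now, raw_days=7, rollup_days=90):
    """Split (ts, value) points into raw points and hourly-average rollups.

    - newer than raw_days: kept as-is
    - between raw_days and rollup_days: averaged per hour bucket
    - older than rollup_days: dropped
    The default windows are hypothetical placeholders.
    """
    raw_cutoff = now - raw_days * DAY
    drop_cutoff = now - rollup_days * DAY
    raw = []
    buckets = defaultdict(list)
    for ts, value in points:
        if ts >= raw_cutoff:
            raw.append((ts, value))
        elif ts >= drop_cutoff:
            # Key each bucket by the start of the hour the point falls in.
            buckets[ts - ts % HOUR].append(value)
        # Anything older than drop_cutoff is intentionally discarded.
    rollups = sorted((start, sum(vals) / len(vals))
                     for start, vals in buckets.items())
    return raw, rollups
```

The code is the easy part. The sprint went into choosing the two cutoffs and the rollup granularity so that dashboards over older ranges still looked right.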

Eighteen months later: still in production, hasn't been the source of an incident. The team that lost distributed-trace visualisation recovered most of it from a stitched-together view of structured logs; not as good, but good enough for the cases where they actually used it.

What I'd do differently

I'd budget three sprints from day one, not one. The estimate was optimistic for the obvious reason: I budgeted the core work and not the surrounding scaffolding. The auth, the multi-tenant scoping, the retention tuning, and the backfill were all the actual work, and they were predictable enough that I should have padded for them.

I'd also write the "what we'd lose" doc as a one-pager and circulate it to the team before the build started, rather than only including it in my pitch to leadership. Two engineers raised the lost-flame-graph concern in the first week of the build. If they'd seen the doc earlier, we could have either addressed it cleanly or, more likely, agreed that the loss was acceptable, and moved faster.
