Cutting p50 latency by 30%+ on a public API with 50+ consumers, Apoorv Mittal

Performance work has a specific shape. You don't start by writing code. You start by being honest about what slow means, what fast would mean, and which of those things you can actually measure. Most of the work is the measuring, the disagreement about what the numbers say, and the occasional moment of clarity where one flamegraph collapses three hypotheses into the right answer.

This is the story of one of those moments, and the six weeks of instrumentation that made it possible.

The starting state

The API in question was the most-consumed public read path in the product. Fifty-plus downstream integrations on web, desktop, native iOS, native Android, and a partner-facing read API depended on it. Every team's user-visible latency budget had this one endpoint on the critical path.

The numbers when I picked it up:

Percentile	Latency	Pain felt at
p50	mid three-digit ms	the median user, every day
p75	low four-digit ms	the average user occasionally
p95	high four-digit ms	the listing-page bounce rate
p99	low single-digit seconds	the partner-API SLO

The team had attempted to chase this twice before. Both attempts had focused on p99, the worst-case latency that triggers paging, and had moved it incrementally without changing the median experience. Most of the user-visible improvement comes from moving the p50, but p50 work is harder to motivate because nobody pages you for it.
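
To make the p50-versus-p99 distinction concrete, here is a minimal sketch using the nearest-rank percentile definition. The latency samples are made up for illustration and are not the endpoint's real numbers; the function name is hypothetical. It shows how a fix that only touches the slowest request moves p99 while leaving the median where it was.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: readonly number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Made-up latencies in ms, for illustration only.
const baseline = [100, 200, 300, 400, 500, 600, 700, 800, 900, 5000];
// The same traffic after a hypothetical tail-only fix: only the slowest request changes.
const tailFixed = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1500];

const medianBefore = percentile(baseline, 50);
const medianAfter = percentile(tailFixed, 50);
const p99Before = percentile(baseline, 99);
const p99After = percentile(tailFixed, 99);

console.log({ medianBefore, medianAfter, p99Before, p99After });
```

In this toy data the p99 drops sharply and the p50 does not move at all, which is the pattern the two earlier attempts produced.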

Constraints

Don't break the consumers. Fifty downstream integrations, most with their own caches and client-side assumptions about response shape. Wire-level changes were on the table; semantic changes were not.
Production data only. Synthetic load told us nothing about the shape of real traffic: request distribution, cache-hit ratios, time-of-day patterns.
Two weeks of instrumentation budget. I'd promised the squad a measurable win in six weeks. Two of those weeks went into making the thing measurable.

Approach

Six weeks, three phases.

Weeks 1–2: instrumentation

The endpoint had basic OTel traces and DataDog APM. They were not useful. The traces were dominated by a single span called graphql.execute that wrapped 80% of the wall-clock time and gave us no insight into what was happening inside it.

I went in and added:

Per-resolver timing, attached to traces as span attributes. The GraphQL Tools resolver-level instrumentation already existed; it was just turned off behind a feature flag because of overhead concerns. I profiled that overhead: 0.4ms at p50. Turned it on globally.
Per-dataloader batch size and key cardinality, captured as histograms. We had four dataloaders inside this resolver chain; exactly none of them were instrumented.
A flamegraph endpoint behind an internal-only header that ran pprof on the running Node process and shipped the SVG to a scratch S3 bucket. Production-safe (read-only, sampled, gated).
A traffic mirror to a non-production replica, so we could replay representative production load against changes without risking the real users.
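
The first two items in that list can be sketched roughly as follows. This is not the project's instrumentation: `Recorder`, `timed`, and `instrumentBatch` are hypothetical names, the recorder is an in-memory stand-in for span attributes or a histogram client, and the wrappers are synchronous for brevity (an async resolver would record in a `finally` after awaiting).

```typescript
// Hypothetical in-memory sink; a real service would forward these values
// to span attributes or a metrics histogram instead.
type Measurement = { name: string; value: number };

class Recorder {
  readonly samples: Measurement[] = [];
  record(name: string, value: number): void {
    this.samples.push({ name, value });
  }
}

// Wrap a resolver-like function and record its wall-clock duration in ms.
// The clock is injectable so the wrapper can be exercised deterministically.
function timed<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
  recorder: Recorder,
  now: () => number = Date.now,
): (...args: A) => R {
  return (...args: A): R => {
    const start = now();
    try {
      return fn(...args);
    } finally {
      // Runs whether fn returned or threw.
      recorder.record(name, now() - start);
    }
  };
}

// Wrap a batch function (keys in, values out) and record, per call,
// how many keys were requested and how many of them were distinct.
function instrumentBatch<K, V>(
  name: string,
  batchFn: (keys: readonly K[]) => V[],
  recorder: Recorder,
): (keys: readonly K[]) => V[] {
  return (keys: readonly K[]): V[] => {
    recorder.record(`${name}.batch_size`, keys.length);
    recorder.record(`${name}.unique_keys`, new Set(keys).size);
    return batchFn(keys);
  };
}
```

A batch size that is consistently 1, or a unique-key count far below the batch size, is the kind of signal these two numbers are meant to surface.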

Two weeks. Boring work. Without it, everything that followed would have been a guess.

Weeks 3–4: investigation

The flamegraph that mattered showed up on day fifteen.

Twelve percent of the median request, a meaningful chunk of a three-digit-ms budget, was being spent inside a single function:

function attachVehicleAttributes(listings: Listing[], lookups: AttributeMap) {
  return listings.map((listing) => ({
    ...listing,
    attributes: enrichAttributes(listing, lookups),
  }));
}

enrichAttributes was a 200-line function that, deep inside, called a helper that called a helper that called JSON.parse(JSON.stringify(...)) to deep-clone the lookup map for each listing. With ~24 listings on a typical page and a lookup map of ~3000 entries, this was 24 full deep clones per request, roughly 72,000 entries serialized and re-parsed, on the median path.
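
The arithmetic can be checked with a small sketch. Everything here is a stand-in: the shape of `AttributeMap` and `Listing`, the two `enrich...` helpers, and the generated data are assumptions made for illustration, since the real 200-line function is not shown. The point is only that a per-listing deep clone copies the whole map each time, while read-only access returns the same answers with no copy.

```typescript
// Assumed shapes; the real types are not shown in this write-up.
type AttributeMap = Record<string, { label: string }>;
type Listing = { id: number; attributeKey: string };

// Build a hypothetical lookup map with `size` entries.
function buildLookups(size: number): AttributeMap {
  const map: AttributeMap = {};
  for (let i = 0; i < size; i++) {
    map[`attr-${i}`] = { label: `Attribute ${i}` };
  }
  return map;
}

// Counts how many map entries get copied across all calls.
let entriesCopied = 0;

// The costly pattern: deep-clone the entire map, then read one entry.
function enrichWithClone(listing: Listing, lookups: AttributeMap): string {
  const copy: AttributeMap = JSON.parse(JSON.stringify(lookups));
  entriesCopied += Object.keys(copy).length;
  return copy[listing.attributeKey].label;
}

// Read-only access needs no copy at all.
function enrichShared(listing: Listing, lookups: Readonly<AttributeMap>): string {
  return lookups[listing.attributeKey].label;
}

const lookups = buildLookups(3000);
const listings: Listing[] = Array.from({ length: 24 }, (_, i) => ({
  id: i,
  attributeKey: `attr-${i}`,
}));

const cloned = listings.map((listing) => enrichWithClone(listing, lookups));

// Object.freeze is shallow: it guards the top-level map, not the nested objects.
const sharedLookups = Object.freeze(lookups);
const shared = listings.map((listing) => enrichShared(listing, sharedLookups));

console.log(entriesCopied); // 24 listings x 3000 entries = 72000
```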

I'd love to say I noticed this from a code review. The flamegraph noticed it. The engineer who'd written it three years ago is one of the strongest people on the team. This was a real-world example of "you can't review what's spec'd into the codebase as a function call."

// before: enrichAttributes deep-clones the lookup map for every listing
return listings.map(listing => ({
  ...listing,
  attributes: enrichAttributes(listing, lookups),
}));

// after: the lookup map is read-only, so share one frozen instance
// (the deep clone inside the helper chain is removed; note that
// Object.freeze is shallow and freezes `lookups` itself)
const sharedLookups = Object.freeze(lookups);
return listings.map(listing => ({
  ...listing,
  attributes: enrichAttributes(listing, sharedLookups),
}));

That was the biggest single change. Two more emerged from the same flamegraph:

Dataloader cache leakage. One of the four dataloaders was being instantiated per-request when it should have been per-context. Its cache was getting blown away on every request. The fix was a 6-line change in the context factory.
A redundant authorization check. The graph layer was calling the auth service twice, once to load the user, once to verify the user could access the listings. The second call was always a no-op because the auth tokens included a scope claim. Eliminated entirely.
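
The dataloader issue comes down to cache lifetime: a loader's cache lives and dies with the loader instance. The sketch below is a simplified, synchronous stand-in, not the real dataloader library and not the context factory from this project (which is not shown). `CachingLoader` and `fetchAttribute` are hypothetical names. It only illustrates why constructing a fresh instance at the wrong point means the cache is never hit.

```typescript
// Simplified stand-in for a loader with a per-instance cache.
class CachingLoader<K, V> {
  private readonly cache = new Map<K, V>();
  private readonly fetchOne: (key: K) => V;

  constructor(fetchOne: (key: K) => V) {
    this.fetchOne = fetchOne;
  }

  load(key: K): V {
    if (this.cache.has(key)) {
      return this.cache.get(key) as V;
    }
    const value = this.fetchOne(key);
    this.cache.set(key, value);
    return value;
  }
}

// Hypothetical backing fetch that counts how often it is called.
let fetches = 0;
const fetchAttribute = (id: string): string => {
  fetches++;
  return `attribute:${id}`;
};

const requestedKeys = ["a", "b", "a", "a"];

// One instance reused across all loads: repeated keys are served from cache.
const reusedLoader = new CachingLoader<string, string>(fetchAttribute);
requestedKeys.forEach((key) => reusedLoader.load(key));
const fetchesWithReusedLoader = fetches;

// A new instance for every load: each cache starts empty, so nothing is reused.
fetches = 0;
requestedKeys.forEach((key) => new CachingLoader<string, string>(fetchAttribute).load(key));
const fetchesWithFreshLoaders = fetches;

console.log({ fetchesWithReusedLoader, fetchesWithFreshLoaders });
```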

Weeks 5–6: rollout and validation

All three changes shipped behind a feature flag, ramped 5% → 25% → 50% → 100% over five days. The traffic mirror confirmed the wins on non-production load before any of it reached real users; the staged rollout let me back out instantly if synthetic monitoring tripped.

It didn't trip.
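
For readers who have not built a percentage ramp before, one common way to do it is deterministic bucketing: hash a stable identifier into a bucket from 0 to 99 and enable the flag when the bucket is below the current percentage. This is a generic sketch, not the flag system used here. The identifier (a consumer or request id), the function names, and the choice of an FNV-1a style hash are all assumptions.

```typescript
// FNV-1a style 32-bit hash over the string's character codes.
function hash32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a stable identifier to a bucket in [0, 100).
function bucketFor(id: string): number {
  return hash32(id) % 100;
}

// An id is in the rollout when its bucket falls below the current percentage.
function inRollout(id: string, percent: number): boolean {
  return bucketFor(id) < percent;
}

// The ramp from the rollout above.
const rampSteps = [5, 25, 50, 100];

// Hypothetical consumer id, shown at each step of the ramp.
const exampleId = "consumer-17";
console.log(rampSteps.map((percent) => ({ percent, enabled: inRollout(exampleId, percent) })));
```

Because the bucket depends only on the identifier, anything enabled at 5% stays enabled at 25% and beyond, and setting the percentage to 0 turns the change off for everyone at once.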

Outcome

Percentile	Before	After	Delta
p50	mid three-digit ms	low three-digit ms	~30% reduction
p75	low four-digit ms	mid three-digit ms	similar
p95	high four-digit ms	low four-digit ms	smaller, but real
p99	low single-digit seconds	low single-digit seconds	unchanged, the long tail was elsewhere

The p99 didn't move. That was important context: this work was about the median experience, not the worst case. The team that owned the long-tail work needed to do a separate investigation. Performance work on the median and performance work on the tail are different projects that get conflated all the time.

User-visible knock-on: the listing page render time dropped, and the bounce rate on tracked journeys followed it down by a low single-digit percentage. Other teams whose critical path included this endpoint saw their own dashboards improve without doing anything.

What I'd do differently

I'd have shipped the flamegraph endpoint first, before anything else. It turned out to be the single most useful artifact of the project. The first two weeks of investigation could have been the first three days if I'd had it earlier. The hesitancy was about safety (running pprof in production sounded like a bad idea); the actual risk profile was much smaller than the perceived one.

I'd have measured per-consumer latency before rollout. The fifty downstream integrations had their own latency dashboards, and the patterns of improvement weren't uniform: some teams saw bigger wins than others depending on which fields they queried. Sharing per-consumer breakdowns at the rollout review would have built more goodwill and surfaced two consumer-side issues sooner.

I'd have written up the flamegraph methodology as an internal guide. Three engineers asked me how I'd done it after the fact. The methodology (what to instrument, how to gate it, what to look at) was reusable across many endpoints. Documenting it would have multiplied the work's effect.
