Replacing a live GraphQL API without a cutover, Apoorv Mittal

The cleanest way to migrate a GraphQL server is not to migrate it.

We had a public read API powering the marketplace: listings, search, pricing, dealer cache, the lot. The framework had drifted out of support and the incentives to upgrade in place were terrible: dozens of consumers across web, native, and partner integrations, no maintenance window on offer, and a regression bar of "users won't notice." Every in-place upgrade plan I sketched ended in a freeze, a big-bang weekend, or both.

So I stopped sketching upgrades. We built public-search-api-v2 as an entirely new service (graphql-yoga v5 + hono, no shared code with v1) and ran it in parallel for four months while the data pipeline talked to both. By the time we touched real consumer traffic, v2's response for every production query had already been replayed against v1's at scale, and we knew which fields disagreed and by how much.

Three mechanics made that work.

Mechanic 1: a parallel codebase, not a refactor

The temptation with a strangler-fig migration is to keep one foot in the old codebase. Share types, share clients, share resolver glue, all to "make the migration easier." Every time I've seen that done, the old code drags the new code into its design constraints.

We didn't share. v2 was a new repository with a new HTTP framework (hono), a new GraphQL server (graphql-yoga v5, fetch-based to align with hono), a new runtime validation layer (arktype for DynamoDB record shapes), an SDL-first typing strategy (gql.tada + @graphql-codegen over rootSchema.graphql), and a new context model: eight DataLoaders instantiated per request in the yoga context factory (listings, sellers, external customers, finance and insurance, exclusive offers, OCS info, OCS twin listings, price evaluation).
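To make the per-request point concrete, here is a minimal sketch of a context factory. TinyLoader is a hypothetical stand-in for the dataloader package so the example runs on its own; it is not that library's implementation. The Backends shape, the entity types, and createContext are likewise assumptions rather than v2's actual code, and only two of the eight loaders are shown.

```typescript
// Hypothetical stand-in for the `dataloader` package: batches every load()
// issued in the same tick into one call and caches results per instance.
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (error: unknown) => void;
}

class TinyLoader<K, V> {
  private readonly batchFn: BatchFn<K, V>;
  private readonly cache = new Map<K, Promise<V>>();
  private queue: Pending<K, V>[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const promise = new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // The first key of a batch schedules the flush; later keys just join it.
      if (this.queue.length === 1) queueMicrotask(() => this.flush());
    });
    this.cache.set(key, promise);
    return promise;
  }

  private flush(): void {
    const batch = this.queue;
    this.queue = [];
    this.batchFn(batch.map((p) => p.key)).then(
      (values) => batch.forEach((p, i) => p.resolve(values[i] as V)),
      (error) => batch.forEach((p) => p.reject(error)),
    );
  }
}

// Assumed entity and backend shapes, for illustration only.
interface Listing { id: string; sellerId: string }
interface Seller { id: string; name: string }

interface Backends {
  getListings: BatchFn<string, Listing>;
  getSellers: BatchFn<string, Seller>;
}

// Called once per request, so every request gets fresh loaders and caches.
function createContext(backends: Backends) {
  return {
    loaders: {
      listings: new TinyLoader(backends.getListings),
      sellers: new TinyLoader(backends.getSellers),
    },
  };
}
```

Because createContext runs once per request, each loader's cache lives only as long as that request, so two concurrent requests never see each other's cached records.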

The first foundation PR landed mid-November. The first resolver at v1 parity, listings via DataLoader, landed a week later. From there it was a month of one-resolver-per-PR: vehicle taxonomy, price history, OCS twin listings, exclusive offers, seals, translations, dealer cache, price info, cost model, SuperDeal, vehicle engine, financing, seller, search results (searchByFilters), ranking, equipment, media, location, Toguru toggles + userData. Each PR carried a "should match between v1 and v2" test. The list is long because the API surface is wide; the invariant on every PR was small: behave like v1.

The cost of a parallel codebase is real. You re-implement plumbing. You burn calendar time before shipping a single user-visible thing. You discover that some "obvious" v1 behaviors were never specified anywhere; the resolver code was the specification. The benefit, paid back later, is that v2's design is shaped by what v2 needs, not by what v1 had.

Mechanic 2: a shadow-diff pipeline as the confidence engine

Resolver-by-resolver parity tests catch what you remember to test. The unknowns (production query shapes nobody on the team has written down, long-tail clients with weird selection sets, locale-specific quirks) only show up against real traffic.

So in mid-January we built the diffing pipeline. It does one thing: every production request that hits v2 is also replayed against v1 asynchronously, the responses are diffed with jsondiffpatch, and both diff size and diff percentage are emitted as Datadog metrics.

The mechanic in code is small:

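// app.ts: clone the request before yoga consumes its body
const cloned = diffingEnabled ? request.clone() : null;
const response = await yoga.fetch(request, env);

if (cloned) {
  // Clone the response now: once the server starts reading the original,
  // its body is locked and a later clone() would throw.
  const responseForDiff = response.clone();
  // Defer the diff to a later task so it never delays the user's response
  setTimeout(() => performQueryDiff(cloned, responseForDiff), 0);
}

return response;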

performQueryDiff pulls query + variables from the cloned request, calls fetchFromLegacyService, normalises a few known-inconsistent fields (the adTargeting JSON string was the worst offender), runs the diff with object-hash array matching, emits graph_v1_v2_diff.diff_size and graph_v1_v2_diff.diff_percentage, and optionally persists the diff to S3 keyed by content hash so repeats deduplicate.
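The normalise-then-measure step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the leaf-path comparison is a simple stand-in for jsondiffpatch (it matches array items by index rather than by object hash), and the definitions of diffSize and diffPercentage are guesses at what such metrics could mean, since the actual formulas are not given here. The function names are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Keep the raw string if it is not valid JSON.
function parseOrKeep(raw: string): Json {
  try {
    return JSON.parse(raw) as Json;
  } catch {
    return raw;
  }
}

// Parse every `adTargeting` string into structured JSON so that whitespace
// and key order inside the string do not register as differences.
function normalise(value: Json): Json {
  if (Array.isArray(value)) return value.map(normalise);
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] =
        key === "adTargeting" && typeof child === "string"
          ? parseOrKeep(child)
          : normalise(child);
    }
    return out;
  }
  return value;
}

// Flatten a document into "leaf path -> serialised value".
function leaves(value: Json, path = "$", out = new Map<string, string>()): Map<string, string> {
  if (value !== null && typeof value === "object") {
    const entries: [string, Json][] = Array.isArray(value)
      ? value.map((item, index): [string, Json] => [String(index), item])
      : Object.entries(value);
    if (entries.length === 0) out.set(path, Array.isArray(value) ? "[]" : "{}");
    for (const [key, child] of entries) leaves(child, `${path}.${key}`, out);
  } else {
    out.set(path, JSON.stringify(value));
  }
  return out;
}

// Assumed metric definitions: diffSize counts leaf paths whose values differ
// (or exist on one side only); diffPercentage is that count over all paths.
function diffMetrics(v1: Json, v2: Json): { diffSize: number; diffPercentage: number } {
  const a = leaves(normalise(v1));
  const b = leaves(normalise(v2));
  const paths = new Set<string>(Array.from(a.keys()).concat(Array.from(b.keys())));
  let diffSize = 0;
  paths.forEach((p) => {
    if (a.get(p) !== b.get(p)) diffSize++;
  });
  const diffPercentage = paths.size === 0 ? 0 : (diffSize / paths.size) * 100;
  return { diffSize, diffPercentage };
}
```

With this shape, two responses that differ only in how the adTargeting string is formatted produce a diff size of zero, which keeps a known inconsistency from drowning out the differences worth investigating.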

Two Toguru toggles control it independently:

enable-query-diffing-v2: turns the pipeline on
enable-query-diffing-v2-save-s3: controls S3 storage

Splitting them mattered more than I expected. The metrics were cheap; the S3 writes were the noisy thing. We ran with metrics on 24/7 and S3 on for windows when we wanted to inspect. A small web UI at /tools/diffs browses the persisted diffs. Most days the team didn't open it; the days they did, it was the first thing they reached for.
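A minimal sketch of how the two toggles and content-hash keying could fit together. Only the toggle names come from the setup described above. The Toggles and DiffSinks interfaces, the diffs/ key prefix, and the choice of SHA-256 are assumptions for illustration; this is not the Toguru client API or the real S3 layout. The sketch also assumes the S3 toggle only takes effect while the pipeline toggle is on.

```typescript
import { createHash } from "node:crypto";

// Serialise with sorted object keys so equal diffs always hash identically.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return value === undefined ? "null" : JSON.stringify(value);
}

// Same diff content -> same object key, so a repeat overwrites the existing
// object instead of creating a new one.
function diffObjectKey(diff: unknown): string {
  const hash = createHash("sha256").update(stableStringify(diff)).digest("hex");
  return `diffs/${hash}.json`; // hypothetical key layout
}

// Hypothetical abstractions over the toggle client, metrics, and storage.
interface Toggles {
  isOn(name: string): boolean;
}
interface DiffSinks {
  emitMetrics(diff: unknown): void;
  putObject(key: string, body: string): void;
}

function handleDiff(diff: unknown, toggles: Toggles, sinks: DiffSinks): void {
  if (!toggles.isOn("enable-query-diffing-v2")) return;
  sinks.emitMetrics(diff); // cheap, so it can stay on around the clock
  if (toggles.isOn("enable-query-diffing-v2-save-s3")) {
    sinks.putObject(diffObjectKey(diff), stableStringify(diff));
  }
}
```

Keying by a hash of the canonical content makes the write idempotent: the thousandth occurrence of the same disagreement costs a write but no additional storage and no extra entry to browse.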

The diffing pipeline is the part of this migration I'd lift wholesale into a future project. It's the reason "should we cut traffic over" became a question with a numeric answer instead of a feel.

Mechanic 3: schema drift on two rails

A v1/v2 strangler fig has a quiet failure mode: v1's schema keeps evolving while v2 is being built. You think you have parity; you don't, because someone added a field on a Tuesday.

We caught this with two parallel mechanisms.

Hourly drift monitoring. A scheduled job introspects the live v1 schema and diffs it against v2's rootSchema.graphql. Drift posts to Slack. It doesn't fail anything; it's a pure signal. The job is boring because the PR gate below catches almost everything; the hourly run is the safety net for changes that bypass the PR gate (hotfixes, manual deploys, anything weird).

PR validation that blocks merge. Any PR to main that touches the v2 schema runs a comparison against the main branch and posts a comment classifying every change as additive, breaking, or dangerous. Breaking and dangerous changes block merge until someone with the right scope acknowledges them.
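The three-way classification can be illustrated with a deliberately simplified model. Everything here is an assumption for the sake of a runnable sketch: SchemaShape flattens a schema to field paths and enum values, and the rules are a conservative subset. A real gate would more likely compare full schemas, for example with graphql-js's findBreakingChanges and findDangerousChanges utilities, which cover many more cases (arguments, unions, interfaces, input types).

```typescript
// Simplified schema model (assumption): "Type.field" -> SDL type string,
// plus enum name -> list of values.
interface SchemaShape {
  fields: Record<string, string>;
  enums: Record<string, string[]>;
}

type Severity = "additive" | "breaking" | "dangerous";

interface Change {
  severity: Severity;
  description: string;
}

function classifyChanges(base: SchemaShape, head: SchemaShape): Change[] {
  const changes: Change[] = [];
  const add = (severity: Severity, description: string): void => {
    changes.push({ severity, description });
  };

  // Removed or retyped fields break existing queries or typed clients.
  for (const [path, baseType] of Object.entries(base.fields)) {
    const headType = head.fields[path];
    if (headType === undefined) add("breaking", `${path} was removed`);
    else if (headType !== baseType) add("breaking", `${path} changed from ${baseType} to ${headType}`);
  }
  // New fields are invisible to clients that do not select them.
  for (const path of Object.keys(head.fields)) {
    if (!(path in base.fields)) add("additive", `${path} was added`);
  }

  for (const [name, baseValues] of Object.entries(base.enums)) {
    const headValues = head.enums[name];
    if (headValues === undefined) {
      add("breaking", `enum ${name} was removed`);
      continue;
    }
    for (const value of baseValues) {
      if (!headValues.includes(value)) add("breaking", `${name}.${value} was removed`);
    }
    // A new enum value is valid GraphQL but can surprise exhaustive clients.
    for (const value of headValues) {
      if (!baseValues.includes(value)) add("dangerous", `${name}.${value} was added`);
    }
  }
  for (const name of Object.keys(head.enums)) {
    if (!(name in base.enums)) add("additive", `enum ${name} was added`);
  }
  return changes;
}

// Anything that is not purely additive needs an explicit acknowledgement.
function blocksMerge(changes: Change[]): boolean {
  return changes.some((change) => change.severity !== "additive");
}
```

Treating every type change as breaking is stricter than necessary, but for a merge gate a false "breaking" costs one acknowledgement while a false "additive" costs a consumer incident.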

Two rails because the failure modes are different. The PR gate stops people from shipping breaking changes inside the project; the hourly job catches changes that happen outside it.

What I'd do differently

The diffing pipeline would land in week one, not week ten. We spent six weeks building resolvers in parallel without anything production-shaped to validate against. The parity tests were useful but bounded; the diff pipeline pointed at shadow traffic, even shaped from staging, would have surfaced the adTargeting normalisation, the price-info field mismatches, and a handful of locale quirks two months earlier. Build the confidence engine before you build the thing it has confidence in.

Custom yoga plugins should ship alongside their first use, not later. useOperationCounter, useOperationErrorCounter, the structured AppError with the expected flag, and the response logger middleware all landed in February, after three months of resolvers without observability matching v1's. By the time they were in we'd already debugged a couple of incidents on partial data. They should have been the third or fourth PR, not the thirtieth. Leaving the framework defaults means recreating them; budget for that on day one or you spend it later under load.

Be stricter about which v1 quirks v2 inherits. Some v1 responses had inconsistencies the original team had filed under "clients tolerate it." We diffed those into v2 because v1 was the oracle. By month three we were maintaining "deliberate non-parity" notes, fields where v2 was correct and v1 was the bug. A cleaner policy from day one would have been: parity is the default; intentional deviations get documented with a link to the consumer who needs to notice.

If you're considering a similar migration and want to compare notes, the contact page has the easiest channels.