The cleanest way to migrate a GraphQL server is not to migrate it.
We had a public read API powering the marketplace: listings, search, pricing, dealer cache, the lot. The framework had drifted out of support, and the incentives to upgrade in place were terrible: dozens of consumers across web, native, and partner integrations, no maintenance window on offer, and a regression bar of "users won't notice." Every in-place upgrade plan I sketched ended in a freeze, a big-bang weekend, or both.
So I stopped sketching upgrades. We built public-search-api-v2 as
an entirely new service (graphql-yoga v5 + hono, no shared code
with v1) and ran it in parallel for four months while the data
pipeline talked to both. By the time we touched real consumer traffic,
v2's response for every production query had already been replayed
against v1's at scale, and we knew which fields disagreed and by how
much.
Three mechanics made that work.
Mechanic 1: a parallel codebase, not a refactor
The temptation with a strangler-fig migration is to keep one foot in the old codebase. Share types, share clients, share resolver glue, all to "make the migration easier." Every time I've seen that done, the old code drags the new code into its design constraints.
We didn't share. v2 was a new repository with a new HTTP framework
(hono), a new GraphQL server (graphql-yoga v5, fetch-based to
align with hono), a new runtime validation layer (arktype for
DynamoDB record shapes), an SDL-first typing strategy (gql.tada +
@graphql-codegen over rootSchema.graphql), and a new context
model: eight DataLoaders instantiated per-request in the yoga
context factory: listings, sellers, external customers, finance and
insurance, exclusive offers, OCS info, OCS twin listings, price
evaluation.
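To make the per-request part concrete, here is a minimal sketch of such a context factory. TinyLoader is a hypothetical stand-in for the dataloader package so the example runs without dependencies, and the listings and sellers data sources are invented for illustration; the real service wires eight loaders into the yoga context.

```typescript
// Hypothetical stand-in for the `dataloader` package: collects the keys
// requested in the same tick and resolves them with a single batch call.
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (error: unknown) => void;
}

class TinyLoader<K, V> {
  private readonly batchFn: BatchFn<K, V>;
  private cache = new Map<K, Promise<V>>();
  private queue: Pending<K, V>[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const promise = new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // The first key queued schedules one flush as a microtask.
      if (this.queue.length === 1) void Promise.resolve().then(() => this.flush());
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      batch.forEach((item, i) => item.resolve(values[i] as V));
    } catch (error) {
      batch.forEach((item) => item.reject(error));
    }
  }
}

// Illustrative record shapes and data sources (assumptions, not the real ones).
interface Listing { id: string; title: string }
interface Seller { id: string; name: string }
interface DataSources {
  getListings: BatchFn<string, Listing>;
  getSellers: BatchFn<string, Seller>;
}

// The returned function runs once per request, so loader caches are never
// shared between two users or two requests.
function createContextFactory(sources: DataSources) {
  return () => ({
    loaders: {
      listings: new TinyLoader(sources.getListings),
      sellers: new TinyLoader(sources.getSellers),
    },
  });
}
```

The point of the factory shape is that every request starts with empty caches: batching and deduplication happen within a request, and nothing stale survives past it.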
The first foundation PR landed mid-November. The first resolver at v1
parity (listings via DataLoader) landed a week later. From there it
was a month of one-resolver-per-PR: vehicle taxonomy, price history,
OCS twin listings, exclusive offers, seals, translations, dealer
cache, price info, cost model, SuperDeal, vehicle engine, financing,
seller, search results (searchByFilters), ranking, equipment,
media, location, Toguru toggles + userData. Each PR carried a
"should match between v1 and v2" test. The list is long because the
API surface is wide; the invariant on every PR was small: behave
like v1.
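A parity test of that kind can be sketched roughly as below. The Executor type, expectParity, and stableStringify are hypothetical names for illustration; the real tests presumably called the actual v1 and v2 services rather than stubs.

```typescript
type Variables = Record<string, unknown>;
// Anything that can run a query and return the JSON response body.
type Executor = (query: string, variables: Variables) => Promise<unknown>;

// Serialise with sorted object keys so key order alone never fails a test.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([key, item]) => `${JSON.stringify(key)}:${stableStringify(item)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value === undefined ? null : value);
}

// Runs the same operation against both services and throws on any difference.
async function expectParity(
  v1: Executor,
  v2: Executor,
  query: string,
  variables: Variables = {},
): Promise<void> {
  const [legacy, next] = await Promise.all([v1(query, variables), v2(query, variables)]);
  const expected = stableStringify(legacy);
  const actual = stableStringify(next);
  if (expected !== actual) {
    throw new Error(`v1/v2 mismatch for query:\n${query}\nv1: ${expected}\nv2: ${actual}`);
  }
}
```

Treating v1's response as the expected value keeps the test honest about who the oracle is: the assertion never encodes what the team believes the answer should be, only what v1 actually returns.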
The cost of a parallel codebase is real. You re-implement plumbing. You burn calendar time before shipping a single user-visible thing. You discover that some "obvious" v1 behaviors were never specified anywhere; they were the resolver code. The benefit, paid back later, is that v2's design is shaped by what v2 needs, not by what v1 had.
Mechanic 2: a shadow-diff pipeline as the confidence engine
Resolver-by-resolver parity tests catch what you remember to test. The unknowns (production query shapes nobody on the team has written down, long-tail clients with weird selection sets, locale-specific quirks) only show up against real traffic.
So in mid-January we built the diffing pipeline. It does one thing:
every production request that hits v2 is also replayed against v1
asynchronously, the responses are diffed with jsondiffpatch, and
both diff size and diff percentage are emitted as Datadog metrics.
The mechanic in code is small:
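    // app.ts: clone the request before yoga consumes its body
    const cloned = diffingEnabled ? request.clone() : null;
    const response = await yoga.fetch(request, env);
    if (cloned) {
      // Clone the response now; once it is returned its body may already be read
      const responseForDiff = response.clone();
      // Deferring with setTimeout keeps the diff off the user's critical path
      setTimeout(() => {
        // Assumes performQueryDiff returns a promise; swallow its failures
        performQueryDiff(cloned, responseForDiff).catch(() => {});
      }, 0);
    }
    return response;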
performQueryDiff pulls query + variables from the cloned
request, calls fetchFromLegacyService, normalises a few known-
inconsistent fields (the adTargeting JSON string was the worst
offender), runs the diff with object-hash array matching, emits
graph_v1_v2_diff.diff_size and graph_v1_v2_diff.diff_percentage,
and optionally persists the diff to S3 keyed by content hash so
repeats deduplicate.
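The normalise-then-diff step can be sketched without any dependencies. This is not the jsondiffpatch-based implementation: it swaps in a hand-rolled leaf-by-leaf comparison, matches arrays by index rather than by object hash, and assumes two things the post does not state: that the adTargeting problem was key order inside the JSON string, and that diff size means "number of differing leaf paths" with the percentage taken over all leaf paths.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively sort object keys so two equivalent objects serialise identically.
function sortKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => sortKeys(item));
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    for (const [key, item] of entries) sorted[key] = sortKeys(item);
    return sorted;
  }
  return value;
}

// Assumption: adTargeting is a JSON string whose key order differs between services.
function canonicalJsonString(raw: string): string {
  try {
    return JSON.stringify(sortKeys(JSON.parse(raw) as Json));
  } catch {
    return raw; // not valid JSON, leave it untouched
  }
}

function normalise(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => normalise(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, item] of Object.entries(value)) {
      out[key] = key === "adTargeting" && typeof item === "string"
        ? canonicalJsonString(item)
        : normalise(item);
    }
    return out;
  }
  return value;
}

// Flatten a document into "path -> serialised leaf" pairs.
function flatten(value: Json, path = "$", out = new Map<string, string>()): Map<string, string> {
  if (Array.isArray(value)) {
    if (value.length === 0) out.set(path, "[]");
    value.forEach((item, i) => flatten(item, `${path}[${i}]`, out));
  } else if (value !== null && typeof value === "object") {
    const entries = Object.entries(value);
    if (entries.length === 0) out.set(path, "{}");
    for (const [key, item] of entries) flatten(item, `${path}.${key}`, out);
  } else {
    out.set(path, JSON.stringify(value));
  }
  return out;
}

// Hypothetical metric definitions: count of differing leaves, and their share of all leaves.
function diffMetrics(v1: Json, v2: Json): { diffSize: number; diffPercentage: number } {
  const a = flatten(normalise(v1));
  const b = flatten(normalise(v2));
  const paths = new Set(Array.from(a.keys()).concat(Array.from(b.keys())));
  let diffSize = 0;
  paths.forEach((p) => {
    if (a.get(p) !== b.get(p)) diffSize += 1;
  });
  const diffPercentage = paths.size === 0 ? 0 : (diffSize / paths.size) * 100;
  return { diffSize, diffPercentage };
}
```

Normalising before diffing is what keeps the metric meaningful: a field that differs only in serialisation would otherwise show up as a permanent non-zero baseline and hide the real regressions.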
Two Toguru toggles control it independently:
- enable-query-diffing-v2: turns the pipeline on
- enable-query-diffing-v2-save-s3: controls S3 storage
Splitting them mattered more than I expected. The metrics were cheap;
the S3 writes were the noisy thing. We ran with metrics on 24/7 and
S3 on for windows when we wanted to inspect. A small web UI at
/tools/diffs browses the persisted diffs. Most days the team didn't
open it; the days they did, it was the first thing they reached for.
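As a rough sketch of how the two toggles and the content-hash key fit together: the names planDiff and diffObjectKey, the diffs/ key prefix, and the choice of SHA-256 are all assumptions for illustration, as is the rule that S3 persistence only happens while the pipeline itself is on.

```typescript
import { createHash } from "node:crypto";

interface DiffToggles {
  diffingEnabled: boolean; // enable-query-diffing-v2
  saveToS3: boolean; // enable-query-diffing-v2-save-s3
}

// Metrics follow the main toggle; persistence needs both toggles on
// (assumption: there is nothing to store when no diff is computed).
function planDiff(toggles: DiffToggles): { emitMetrics: boolean; persist: boolean } {
  return {
    emitMetrics: toggles.diffingEnabled,
    persist: toggles.diffingEnabled && toggles.saveToS3,
  };
}

// Identical diffs hash to the same key, so a repeated diff overwrites the
// existing object instead of creating a new one.
function diffObjectKey(serialisedDiff: string): string {
  const digest = createHash("sha256").update(serialisedDiff).digest("hex");
  return `diffs/${digest}.json`;
}
```

Keying by content rather than by request ID is what makes a hot query that always produces the same diff cost one object, not one object per request.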
The diffing pipeline is the part of this migration I'd lift wholesale into a future project. It's the reason "should we cut traffic over" became a question with a numeric answer instead of a feel.
Mechanic 3: schema drift on two rails
A v1/v2 strangler fig has a quiet failure mode: v1's schema keeps evolving while v2 is being built. You think you have parity; you don't, because someone added a field on a Tuesday.
We caught this with two parallel mechanisms.
Hourly drift monitoring. A scheduled job introspects the live v1
schema and diffs it against v2's rootSchema.graphql. Drift posts to
Slack. It doesn't fail anything; it's a pure signal. The job is
boring because the PR gate below catches almost everything; the
hourly run is the safety net for changes that bypass the PR gate
(hotfixes, manual deploys, anything weird).
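The core of such a drift check is a set difference over type and field names. The sketch below assumes both schemas have already been reduced to a plain "type name to field names" map; the real job would get there by introspecting v1 and parsing rootSchema.graphql, which is omitted here. SchemaShape, detectDrift, and slackMessage are hypothetical names.

```typescript
// Type name -> field names, for both schemas.
type SchemaShape = Record<string, string[]>;

interface Drift {
  missingInV2: string[]; // present in live v1, absent from v2
  extraInV2: string[]; // present in v2 only
}

function fieldPaths(shape: SchemaShape): Set<string> {
  const paths = new Set<string>();
  for (const [typeName, fields] of Object.entries(shape)) {
    for (const field of fields) paths.add(`${typeName}.${field}`);
  }
  return paths;
}

function detectDrift(v1: SchemaShape, v2: SchemaShape): Drift {
  const inV1 = fieldPaths(v1);
  const inV2 = fieldPaths(v2);
  return {
    missingInV2: Array.from(inV1).filter((p) => !inV2.has(p)).sort(),
    extraInV2: Array.from(inV2).filter((p) => !inV1.has(p)).sort(),
  };
}

// Returns null when there is nothing to report, so the job stays silent.
function slackMessage(drift: Drift): string | null {
  if (drift.missingInV2.length === 0 && drift.extraInV2.length === 0) return null;
  const lines = ["Schema drift between v1 and v2:"];
  drift.missingInV2.forEach((p) => lines.push(`  in v1 only: ${p}`));
  drift.extraInV2.forEach((p) => lines.push(`  in v2 only: ${p}`));
  return lines.join("\n");
}
```

Because the job only reports, returning null on a clean run matters more than it looks: a signal channel that posts "no drift" every hour gets muted within a week.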
PR validation that blocks merge. Any PR to main that touches
the v2 schema runs a comparison against the main branch and posts a
comment classifying every change as additive, breaking, or dangerous.
Breaking and dangerous changes block merge until someone with the
right scope acknowledges them.
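The classification itself can be a small lookup table. The change kinds and their severities below are assumptions that loosely follow the usual GraphQL distinction (removals break clients; additions such as a new enum value are "dangerous" because they can break clients that switch exhaustively); the acknowledgement mechanism is modelled as a plain set of acknowledged paths.

```typescript
type Severity = "additive" | "breaking" | "dangerous";

type ChangeKind =
  | "FIELD_ADDED"
  | "TYPE_ADDED"
  | "FIELD_REMOVED"
  | "TYPE_REMOVED"
  | "FIELD_TYPE_CHANGED"
  | "REQUIRED_ARG_ADDED"
  | "ENUM_VALUE_ADDED"
  | "OPTIONAL_ARG_ADDED";

interface SchemaChange {
  kind: ChangeKind;
  path: string; // e.g. "Listing.price"
}

// Hypothetical severity table; tune it to what your consumers actually tolerate.
const SEVERITY: Record<ChangeKind, Severity> = {
  FIELD_ADDED: "additive",
  TYPE_ADDED: "additive",
  FIELD_REMOVED: "breaking",
  TYPE_REMOVED: "breaking",
  FIELD_TYPE_CHANGED: "breaking",
  REQUIRED_ARG_ADDED: "breaking",
  ENUM_VALUE_ADDED: "dangerous",
  OPTIONAL_ARG_ADDED: "dangerous",
};

function classify(changes: SchemaChange[]): Array<SchemaChange & { severity: Severity }> {
  return changes.map((change) => ({ ...change, severity: SEVERITY[change.kind] }));
}

// Merge is blocked while any non-additive change has not been acknowledged.
function mergeBlocked(changes: SchemaChange[], acknowledged: ReadonlySet<string>): boolean {
  return classify(changes).some(
    (change) => change.severity !== "additive" && !acknowledged.has(change.path),
  );
}
```

Keeping the table exhaustive over ChangeKind means a newly detected kind of change fails to compile until someone decides how severe it is, which is the same discipline the gate enforces on the schema.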
Two rails because the failure modes are different. The PR gate stops people from shipping breaking changes inside the project; the hourly job catches changes that happen outside it.
What I'd do differently
The diffing pipeline would land in week one, not week ten. We
spent six weeks building resolvers in parallel without anything
production-shaped to validate against. The parity tests were useful
but bounded; the diff pipeline, pointed at shadow traffic (even
traffic shaped from staging), would have surfaced the adTargeting
normalisation, the price-info field mismatches, and a handful of
locale quirks two months earlier. Build the confidence engine before
you build the thing it has confidence in.
Custom yoga plugins should ship alongside their first use, not
later. useOperationCounter, useOperationErrorCounter, the
structured AppError with the expected flag, and the response logger
middleware all landed in February, after three months of
resolvers without observability matching v1's. By the time they were
in, we'd already debugged a couple of incidents on partial data. They
should have been the third or fourth PR, not the thirtieth. Leaving
a framework means recreating the defaults it gave you; budget for that
on day one or you spend it later under load.
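For a sense of how small these pieces are, here is a sketch of a structured error with an expected flag and a counter that uses it. Only the names AppError and expected come from the project; the code field, the OperationErrorCounter class, and the meaning given to expected (a failure the client caused, which should not alert anyone) are assumptions, and the yoga plugin wiring is left out.

```typescript
class AppError extends Error {
  readonly code: string;
  readonly expected: boolean;

  constructor(message: string, options: { code: string; expected: boolean }) {
    super(message);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "AppError";
    this.code = options.code;
    this.expected = options.expected;
  }
}

// Hypothetical in-memory counter; a real plugin would emit a metric instead.
class OperationErrorCounter {
  private counts = new Map<string, number>();

  record(operationName: string, error: unknown): void {
    // Anything that is not an AppError marked expected counts as unexpected.
    const expected = error instanceof AppError && error.expected;
    const key = `${operationName}:${expected ? "expected" : "unexpected"}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  get(operationName: string, kind: "expected" | "unexpected"): number {
    return this.counts.get(`${operationName}:${kind}`) ?? 0;
  }
}
```

Splitting the count on the flag is what makes the metric usable for alerting: a spike in unexpected errors is an incident, a spike in expected ones is usually a client sending bad input.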
Be stricter about which v1 quirks v2 inherits. Some v1 responses had inconsistencies the original team had filed under "clients tolerate it." We diffed those into v2 because v1 was the oracle. By month three we were maintaining "deliberate non-parity" notes: fields where v2 was correct and v1 was the bug. A cleaner policy from day one would have been: parity is the default; intentional deviations get documented with a link to the consumer who needs to notice.
If you're considering a similar migration and want to compare notes, the contact page has the easiest channels.