The cleanest way to migrate a GraphQL server is not to migrate it.
The graph in question was a public read API (listings, search, pricing, dealer cache) serving the marketplace through web, native iOS, native Android, partner integrations, and an internal marketplace product. The framework had drifted out of support, the resolver layer had grown organic warts on top of its abstractions, and an in-place upgrade meant either a maintenance window we couldn't take or a freeze we couldn't enforce.
So I planned for neither. We built graphql-api-v2 as an
entirely new service (graphql-yoga v5 + hono, no shared code with
v1) and ran it in parallel for four months while DynamoDB and the
upstream APIs fed both. By the time we touched real consumer traffic,
every production query that hit v2 had been replayed against v1
asynchronously and diffed field-by-field. We knew which fields disagreed
and by how much before we shifted a single percent of traffic.
The starting state
graphql-api-v1 was a legacy GraphQL service whose resolver
layer had ossified around earlier framework idioms. It read from a
DynamoDB classified-data table (via assume-role) and called out to a
constellation of internal APIs: inventory service, finance and insurance,
exclusive offers, price evaluation, dealer cache (Lambda + S3), and a
sprawl of vehicle-taxonomy lookups.
                    ┌──────────────────┐
    web / native /  │                  │
    funnels ──────► │  graphql-api-v1  │  ← live in production
                    └─────────┬────────┘
                              │
                  ┌───────────┴───────────┐
                  ▼                       ▼
             ┌──────────┐            ┌──────────┐
             │ DynamoDB │            │ upstream │
             │ (classif)│            │ services │
             └──────────┘            └──────────┘
Drift had set in along three axes:
- Framework: the GraphQL server was on a stack with no clean upgrade path; conventions inherited from earlier idioms had calcified into hand-rolled glue.
- Type safety: types were generated, but the SDL-to-resolver path had acquired enough manual stitching to make adding a field a coordination exercise.
- Observability: expected errors (listing-not-found) and real errors looked the same in metrics, which meant on-call paged for both.
Constraints
| Constraint | Implication |
|---|---|
| Live public API | Web, native iOS and Android, partner integrations, internal funnels. No maintenance window on offer. |
| Field-level parity | Consumers depended on resolver outputs to the field; even a JSON-serialisation difference could break a downstream parser. |
| No shared release cadence | The consumer integrations didn't coordinate releases with us. The API had to behave consistently or roll back instantly. |
| Confidence at scale | A test suite couldn't enumerate the production query shapes. Confidence had to come from real traffic. |
Approach
Phase 1: foundation (mid-November 2025)
Week one was scaffolding: DynamoDB access via assume-role, runtime
record validation with arktype, an SDL-first schema loaded from
rootSchema.graphql via yoga's createSchema(), type generation with
gql.tada + @graphql-codegen, and the first DataLoader-based
resolver. The architectural choices that mattered later were made here:
- graphql-yoga v5 as the GraphQL server: Web Standards fetch() alignment with hono, and fewer of the batteries-included assumptions the legacy stack had outgrown.
- SDL-first: the schema lives in one file, rootSchema.graphql, loaded by createSchema(). No annotation-driven dance.
- DataLoader instantiated per-request in the yoga context factory. By steady state there were eight: listingLoader, sellerDetailsLoader, externalCustomerLoader, financingLoader, promotionLoader, stockInfoLoader, twinListingLoader, priceEvaluationLoader.
The first comparison tool, a schema diff between the live DynamoDB record shape and v1's GraphQL types, landed alongside the first resolver. It surfaced shape gaps before they became runtime surprises.
Phase 2: resolver-by-resolver parity (mid-November to mid-December)
A month of one-resolver-per-PR, each carrying a "should match between v1 and v2" test against a recorded v1 fixture. Domains shipped, in
order: listings, price history, vehicle taxonomy (make / model / body /
fuel / engine), inventory and twin listings, exclusive offers, seals,
translations, dealer cache, price info, cost model, featured promotions,
financing and insurance, seller (with vendor contact data), search
results (searchByFilters), ranking and tracking parameters,
vehicle equipment (200+ types), webpage URL, media, price evaluation,
location, feature toggles + userData forwarding.
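A parity test of this kind can be sketched roughly as follows. This is an illustration under assumptions, not the project's test code: the fixture contents, resolveListingV2, and the hand-rolled deepEqual are all hypothetical stand-ins for a recorded v1 response, a v2 resolver, and a test framework's equality matcher.

```typescript
// Hypothetical fixture: a response recorded from v1 for one query.
const v1Fixture = {
  listing: { id: "123", priceHistory: [{ amount: 19900, currency: "EUR" }] },
};

// Hypothetical v2 resolver under test (a real one would read from DynamoDB).
function resolveListingV2(id: string) {
  return { listing: { id, priceHistory: [{ currency: "EUR", amount: 19900 }] } };
}

// Structural equality: key order is ignored, array order is not.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every(
    (key) =>
      key in b &&
      deepEqual((a as Record<string, unknown>)[key], (b as Record<string, unknown>)[key]),
  );
}

// "should match between v1 and v2"
const parityHolds = deepEqual(resolveListingV2("123"), v1Fixture);
```

A fixture only proves parity for the query shapes someone thought to record, which is why these tests were a floor rather than the final source of confidence.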
Supporting infrastructure landed alongside: a token-caching Lambda for
M2M identity tokens stored in AWS SSM, a staging deploy pipeline, and
an env-validation layer pinned via arktype.
Phase 3: DataLoader optimisation (early December)
Once basic resolvers worked, round trips were the next thing to attack.
Direct API calls became batched DataLoader fetches:
financingLoader minified the outgoing GraphQL via gqlmin
and computed cache keys ahead of time; promotionLoader fetched
ahead of selection-set evaluation; stockInfoLoader consolidated 404
handling and added an isBusinessListing() guard; valuationHistoryLoader
batched cleanly with no upstream changes.
By the end of this phase, a query with twenty listings made a predictable number of upstream calls, one per loader, instead of twenty-times-N.
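To make the batching effect concrete, here is a minimal sketch. It uses a hand-rolled stand-in (TinyLoader) rather than the dataloader package the service relied on, and fetchStockInfo with its record shape is a hypothetical upstream invented for illustration.

```typescript
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;
type Pending<K, V> = { key: K; resolve: (value: V) => void; reject: (error: unknown) => void };

// Minimal stand-in for a DataLoader: keys requested in the same tick are
// collected and resolved with a single batch call.
class TinyLoader<K, V> {
  private readonly batchFn: BatchFn<K, V>;
  private queue: Pending<K, V>[] = [];
  private readonly cache = new Map<K, Promise<V>>();

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const promise = new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // The first key queued in this tick schedules the one flush.
      if (this.queue.length === 1) {
        Promise.resolve().then(() => this.flush());
      }
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      batch.forEach((item, index) => item.resolve(values[index]));
    } catch (error) {
      batch.forEach((item) => item.reject(error));
    }
  }
}

// Hypothetical upstream; the counter makes the number of calls observable.
let upstreamCalls = 0;
async function fetchStockInfo(ids: readonly string[]): Promise<{ id: string; inStock: boolean }[]> {
  upstreamCalls += 1;
  return ids.map((id) => ({ id, inStock: true }));
}

// Per-request factory: every request gets a fresh loader and a fresh cache.
function createLoaders() {
  return { stockInfoLoader: new TinyLoader(fetchStockInfo) };
}

// Twenty listings resolved in one tick; returns the upstream calls they cost.
async function resolveTwentyListings(): Promise<number> {
  const before = upstreamCalls;
  const { stockInfoLoader } = createLoaders();
  const ids = Array.from({ length: 20 }, (_, index) => `listing-${index}`);
  await Promise.all(ids.map((id) => stockInfoLoader.load(id)));
  return upstreamCalls - before;
}
```

Creating the loaders per request keeps the cache from leaking data between callers, at the cost of never sharing a cache hit across requests.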
Phase 4: shadow-diff pipeline (mid-January 2026)
This was the confidence engine.
Every production request that hit v2 was replayed against v1
asynchronously, the responses were diffed with jsondiffpatch, and
both diff size and diff percentage were emitted as Datadog metrics.
The mechanic in code is small:
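    // app.ts: clone the request before yoga consumes its body
    const cloned = diffingEnabled ? request.clone() : null;
    const response = await yoga.fetch(request, env);
    if (cloned) {
      // Clone the response now, while its body is still unread;
      // clone() throws once the body has been read.
      const responseForDiff = response.clone();
      // setTimeout defers the diff to a later task, so it never delays the user's response
      setTimeout(() => performQueryDiff(cloned, responseForDiff), 0);
    }
    return response;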
performQueryDiff extracts query + variables, calls
fetchFromLegacyService, normalises a few known-inconsistent
fields (the adTargeting JSON string was the worst offender), runs
the diff with object-hash array matching, emits
graph_v1_v2_diff.diff_size and graph_v1_v2_diff.diff_percentage,
and optionally persists the diff to S3 keyed by content hash so
repeats deduplicate.
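The source does not spell out how the two metrics are derived from the jsondiffpatch delta, so the sketch below is one plausible reading under stated assumptions: it replaces jsondiffpatch with a naive leaf-by-leaf comparison (arrays matched by index, not by object hash), defines diff size as the number of differing leaf paths and diff percentage as differing paths over all paths, and uses a small FNV-1a hash as a stand-in for whatever content hash keys the S3 objects. All function names here are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Flatten a JSON value into "dotted.path" -> serialised leaf value.
function flatten(value: Json, path = "", out = new Map<string, string>()): Map<string, string> {
  if (value !== null && typeof value === "object") {
    const entries: [string, Json][] = Array.isArray(value)
      ? value.map((item, index): [string, Json] => [String(index), item])
      : Object.entries(value);
    // Keep empty containers visible so {} versus [] still counts as a difference.
    if (entries.length === 0) out.set(path, Array.isArray(value) ? "[]" : "{}");
    for (const [key, child] of entries) flatten(child, path ? `${path}.${key}` : key, out);
  } else {
    out.set(path, JSON.stringify(value));
  }
  return out;
}

interface ChangedLeaf { path: string; v1?: string; v2?: string }
interface DiffMetrics { diffSize: number; diffPercentage: number; changed: ChangedLeaf[] }

function computeDiffMetrics(v1: Json, v2: Json): DiffMetrics {
  const left = flatten(v1);
  const right = flatten(v2);
  const paths = Array.from(
    new Set([...Array.from(left.keys()), ...Array.from(right.keys())]),
  ).sort();
  const changed = paths
    .filter((path) => left.get(path) !== right.get(path))
    .map((path) => ({ path, v1: left.get(path), v2: right.get(path) }));
  const diffPercentage = paths.length === 0 ? 0 : (changed.length / paths.length) * 100;
  return { diffSize: changed.length, diffPercentage, changed };
}

// FNV-1a, 32-bit: a deterministic stand-in for a real content hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Identical diffs map to the same object key, so repeated writes overwrite each other.
function diffObjectKey(metrics: DiffMetrics): string {
  return `${fnv1a(JSON.stringify(metrics.changed))}.json`;
}

const v1Response: Json = { data: { listing: { id: "1", price: 100, title: "A" } } };
const v2Response: Json = { data: { listing: { id: "1", price: 101, title: "A" } } };
const metrics = computeDiffMetrics(v1Response, v2Response);
```

Keying by a hash of the diff rather than of the request means a thousand requests that all disagree in the same way produce one stored object, which keeps the inspection UI readable.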
Two feature toggles control the pipeline independently:
- enable-query-diffing-v2: the diff itself
- enable-query-diffing-v2-save-s3: the S3 write
Splitting them mattered more than I expected. The metrics were cheap;
the S3 writes were the noisy thing. We ran with metrics on 24/7 and
S3 on for windows when we wanted to inspect. A small UI at
/tools/diffs browsed the persisted diffs.
Phase 5: schema drift on two rails (late January)
A v1/v2 strangler fig has a quiet failure mode: v1's schema keeps evolving while v2 is being built. We caught this with two parallel mechanisms:
- Hourly drift monitoring (schema-drift-monitoring.yaml) introspects the live v1 schema and diffs it against v2's rootSchema.graphql. Drift posts to Slack: pure signal, it doesn't fail anything.
- PR validation that blocks merge (schema-pr-validation.yaml) compares PR schema changes against the main branch and posts a comment classifying each change as additive, breaking, or dangerous. Breaking and dangerous changes block merge.
Two rails because the failure modes are different. The PR gate stops people from shipping breaking changes inside the project; the hourly job catches changes that happen outside it.
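The classification step can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden toy, not the workflow's actual logic: SchemaShape flattens a schema to field and enum maps, every field type change is treated as breaking, and only added enum values count as dangerous. A real gate would more likely lean on a schema-aware tool; graphql-js, for instance, ships findBreakingChanges and findDangerousChanges.

```typescript
type Severity = "additive" | "dangerous" | "breaking";

// Hypothetical flattened view of a schema.
interface SchemaShape {
  types: Record<string, Record<string, string>>; // type name -> field name -> field type
  enums: Record<string, string[]>; // enum name -> values
}

interface Change { severity: Severity; message: string }

function classifyChanges(base: SchemaShape, head: SchemaShape): Change[] {
  const changes: Change[] = [];
  for (const [typeName, baseFields] of Object.entries(base.types)) {
    const headFields = head.types[typeName];
    if (!headFields) {
      changes.push({ severity: "breaking", message: `type ${typeName} removed` });
      continue;
    }
    for (const [field, fieldType] of Object.entries(baseFields)) {
      if (!(field in headFields)) {
        changes.push({ severity: "breaking", message: `field ${typeName}.${field} removed` });
      } else if (headFields[field] !== fieldType) {
        // Simplification: any type change is treated as breaking.
        changes.push({ severity: "breaking", message: `field ${typeName}.${field} changed type` });
      }
    }
    for (const field of Object.keys(headFields)) {
      if (!(field in baseFields)) {
        changes.push({ severity: "additive", message: `field ${typeName}.${field} added` });
      }
    }
  }
  for (const typeName of Object.keys(head.types)) {
    if (!(typeName in base.types)) {
      changes.push({ severity: "additive", message: `type ${typeName} added` });
    }
  }
  for (const [enumName, baseValues] of Object.entries(base.enums)) {
    const headValues = head.enums[enumName] ?? [];
    for (const value of baseValues) {
      if (!headValues.includes(value)) {
        changes.push({ severity: "breaking", message: `enum value ${enumName}.${value} removed` });
      }
    }
    for (const value of headValues) {
      // New enum values can surprise clients that switch exhaustively.
      if (!baseValues.includes(value)) {
        changes.push({ severity: "dangerous", message: `enum value ${enumName}.${value} added` });
      }
    }
  }
  return changes;
}

// The PR gate: anything that is not purely additive blocks the merge.
function blocksMerge(changes: Change[]): boolean {
  return changes.some((change) => change.severity !== "additive");
}

const baseSchema: SchemaShape = {
  types: { Listing: { id: "ID!", price: "Int" } },
  enums: { FuelType: ["PETROL", "DIESEL"] },
};
const headSchema: SchemaShape = {
  types: { Listing: { id: "ID!", title: "String" } },
  enums: { FuelType: ["PETROL", "DIESEL", "ELECTRIC"] },
};
const schemaChanges = classifyChanges(baseSchema, headSchema);
```

The same classifier can serve both rails: the hourly job reports every change it finds, while the PR check additionally evaluates blocksMerge.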
Phase 6: observability and hardening (February 2026)
Custom yoga plugins replaced what the old stack had given us out of the box:
- A custom yoga logger pipes yoga's logs through pino for Datadog structured logging.
- useOperationCounter tracks graphql_operations_total by operation name (introspection excluded).
- useOperationErrorCounter is a global error handler via useErrorHandler; it parses the document AST to extract locale and skips expected errors.
- listingNotFoundError is a structured AppError with an expected boolean flag and a StatsD metric tagged by operationName + locale, so listing-not-found stops looking like a real error.
- A response-logger middleware logs 4xx/5xx from the GraphQL endpoint.
- A DynamoDB projection toggle A/B-tests full-document reads against projected reads.
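The expected-versus-real split comes down to one flag checked at the point where errors are counted. A minimal sketch, assuming an in-memory counter in place of StatsD and hypothetical metric names (only graphql_operations_total appears in the real service as far as this write-up states):

```typescript
// Structured error carrying an explicit "expected" flag.
class AppError extends Error {
  readonly code: string;
  readonly expected: boolean;

  constructor(message: string, code: string, expected: boolean) {
    super(message);
    this.name = "AppError";
    this.code = code;
    this.expected = expected;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

function listingNotFoundError(listingId: string): AppError {
  return new AppError(`Listing ${listingId} not found`, "LISTING_NOT_FOUND", true);
}

interface ErrorTags { operationName: string; locale: string }

// In-memory stand-in for a StatsD client.
const counters = new Map<string, number>();
function increment(metric: string, tags: ErrorTags): void {
  const key = `${metric}|${tags.operationName}|${tags.locale}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Expected errors get their own metric, so the error counter that pages
// on-call only moves for failures nobody planned for.
function recordError(error: unknown, tags: ErrorTags): void {
  const expected = error instanceof AppError && error.expected;
  increment(expected ? "graphql_expected_errors_total" : "graphql_operation_errors_total", tags);
}

const tags: ErrorTags = { operationName: "GetListing", locale: "de-DE" };
recordError(listingNotFoundError("42"), tags);
recordError(new Error("upstream timeout"), tags);
```

Putting the flag on the error object keeps the classification next to the code that throws, so the counting plugin never has to match on message strings.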
Phase 7: public exposure and auth (March 2026)
The repository was restructured into a Yarn-workspace + Turborepo
monorepo (apps/graphql-api/). CloudFront distributions for staging
and production made v2 publicly addressable. Auth middleware enforced
basic auth on *.api.<platform>.com for known clients:
marketplace-web (the web frontend), web-frontend,
ios-app, android-app, and the partner integrations.
Phase 8: progressive cutover (ongoing)
Traffic migration is controlled by feature toggles. The
enable-graphql-api-v2-shadow-traffic toggle lets v1 route shadow
traffic to v2 selectively. Each consumer cohort moves on its own
cadence, with the diff metrics as the gate.
The hard parts
adTargeting JSON-string normalisation. v1 returned a JSON-encoded
string with key ordering that shifted between requests. v2's encoder
ordered keys deterministically, which the diff pipeline reported as a
"difference" on every single response. We added a normaliser to the
diff pipeline before the comparison, but in hindsight, two days were
lost to chasing what looked like real divergence.
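A normaliser for this kind of field can be small. The sketch below shows one way to do it, assuming the field is a JSON-encoded string: parse it, sort object keys recursively, and re-serialise, so two strings that differ only in key order compare equal. The function names and the sample keys are hypothetical.

```typescript
// Recursively rebuild a parsed JSON value with object keys in sorted order.
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === "object") {
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeys((value as Record<string, unknown>)[key]);
    }
    return sorted;
  }
  return value;
}

// Canonicalise a JSON-encoded string; leave anything unparseable untouched
// so a malformed value still shows up in the diff.
function normaliseAdTargeting(raw: string): string {
  try {
    return JSON.stringify(sortKeys(JSON.parse(raw)));
  } catch {
    return raw;
  }
}

const fromV1 = normaliseAdTargeting('{"b":1,"a":{"d":2,"c":3}}');
const fromV2 = normaliseAdTargeting('{"a":{"c":3,"d":2},"b":1}');
```

Applying the same function to both sides before diffing keeps the normalisation out of the service itself, so consumers keep receiving exactly what each version produces.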
Field-shape ambiguity in price-info. v1's price-info resolver returned slightly different shapes depending on listing type. The behaviour was undocumented; the consumers depended on it. Reproducing it in v2 required reading the v1 resolver code line by line; there was no spec.
Custom yoga plugins arrived late. We added the operation counter,
error counter, and structured AppError in February, after three
months of resolvers without observability matching v1's. By the time
they landed we'd already debugged a couple of incidents on partial
data.
Outcome
| Metric | Before | After |
|---|---|---|
| v2 production response coverage | n/a | 100% diffed |
| Resolvers at v1 parity | 0 | 20+ |
| Per-request DataLoader rationalisation | ad-hoc | 8 per request |
| Schema drift detection | None | Hourly + PR gate |
| Cutover events required | n/a | 0 |
| Consumer-visible regressions during shift | n/a | 0 |
What I'd do differently
The diffing pipeline would land in week one, not week ten. Six
weeks of resolver building without anything production-shaped to
validate against meant the parity tests carried more weight than they
deserved. The diff pipeline pointed at shadow-shaped staging data
would have surfaced the adTargeting normalisation, the price-info
mismatches, and a handful of locale quirks two months earlier. Build
the confidence engine before you build the thing it has confidence in.
Custom yoga plugins should ship alongside their first use, not later. Leaving the framework's defaults behind means recreating them: operation counting, structured error handling, expected-vs-real classification. Budget for that on day one or you spend it later under load.
Be stricter about which v1 quirks v2 inherits. Some v1 responses had inconsistencies the original team had filed under "clients tolerate it." We diffed those into v2 because v1 was the oracle. By month three we were maintaining "deliberate non-parity" notes, fields where v2 was correct and v1 was the bug. A cleaner policy from day one would have been: parity is the default; intentional deviations get documented with a link to the consumer who needs to notice.
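One way such a policy could be encoded is as a small registry that the diff pipeline consults, so documented deviations stop counting against parity while everything else still does. This is a hypothetical sketch: the Deviation shape, the example path, and the consumer reference are invented for illustration.

```typescript
// A documented, intentional difference between v1 and v2.
interface Deviation {
  path: string; // dotted path of the field where v2 deliberately differs
  reason: string; // why v2 is considered correct
  consumer: string; // who needs to notice (hypothetical reference format)
}

const deliberateNonParity: Deviation[] = [
  {
    path: "data.listing.exampleField",
    reason: "v1 returns an inconsistent value here; v2 returns the corrected one",
    consumer: "example-consumer-team",
  },
];

// Parity is the default: only paths with a documented deviation are excused.
function unexplainedDiffs(changedPaths: string[], deviations: Deviation[]): string[] {
  const documented = new Set(deviations.map((deviation) => deviation.path));
  return changedPaths.filter((path) => !documented.has(path));
}

const remaining = unexplainedDiffs(
  ["data.listing.exampleField", "data.listing.price"],
  deliberateNonParity,
);
```

Requiring a reason and a consumer on every entry makes each exception a reviewable decision instead of a silent drift from the oracle.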