Replacing a public GraphQL API without a cutover, Apoorv Mittal

The cleanest way to migrate a GraphQL server is not to migrate it.

The graph in question was a public read API (listings, search, pricing, dealer cache) serving the marketplace through web, native iOS, native Android, partner integrations, and an internal marketplace product. The framework had drifted out of support, the resolver layer had grown organic warts on top of its abstractions, and an in-place upgrade meant either a maintenance window we couldn't take or a freeze we couldn't enforce.

So I planned for neither. We built graphql-api-v2 as an entirely new service (graphql-yoga v5 + hono, no shared code with v1) and ran it in parallel for four months while DynamoDB and the upstream APIs fed both. By the time we touched real consumer traffic, every production query that hit v2 had been replayed against v1 asynchronously and diffed field-by-field. We knew which fields disagreed and by how much before we shifted a single percent of traffic.

The starting state

graphql-api-v1 was a legacy GraphQL service whose resolver layer had ossified around earlier framework idioms. It read from a DynamoDB classified-data table (via assume-role) and called out to a constellation of internal APIs: inventory service, finance and insurance, exclusive offers, price evaluation, dealer cache (Lambda + S3), and a sprawl of vehicle-taxonomy lookups.

                    ┌──────────────────┐
   web / native /   │                  │
   funnels  ──────► │  graphql-api-v1  │  ← live in production
                    └─────────┬────────┘
                              │
                  ┌───────────┴───────────┐
                  ▼                       ▼
            ┌──────────┐            ┌──────────┐
            │ DynamoDB │            │ upstream │
            │ (classif)│            │ services │
            └──────────┘            └──────────┘

Drift had set in along three axes:

Framework: the GraphQL server was on a stack with no clean upgrade path; conventions inherited from earlier idioms had calcified into hand-rolled glue.
Type safety: types were generated, but the SDL-to-resolver path had acquired enough manual stitching to make adding a field a coordination exercise.
Observability: expected errors (listing-not-found) and real errors looked the same in metrics, which meant on-call paged for both.

Constraints

Constraint	Implication
Live public API	Web, native iOS and Android, partner integrations, internal funnels. No maintenance window on offer.
Field-level parity	Consumers depended on resolver outputs to the field; even a JSON-serialisation difference could break a downstream parser.
No shared release cadence	The consumer integrations didn't coordinate releases with us. The API had to behave consistently or roll back instantly.
Confidence at scale	A test suite couldn't enumerate the production query shapes. Confidence had to come from real traffic.

Approach

Phase 1: foundation (mid-November 2025)

Week one was scaffolding: DynamoDB access via assume-role, runtime record validation with arktype, an SDL-first schema loaded from rootSchema.graphql via yoga's createSchema(), type generation with gql.tada + @graphql-codegen, and the first DataLoader-based resolver. The architectural choices that mattered later were made here:

graphql-yoga v5 as the GraphQL server: Web Standards fetch() alignment with hono, and fewer of the batteries-included assumptions the legacy stack had outgrown.
SDL-first: the schema lives in one file, rootSchema.graphql, loaded by createSchema(). No annotation-driven dance.
DataLoader instantiated per-request in the yoga context factory. By steady state there were eight: listingLoader, sellerDetailsLoader, externalCustomerLoader, financingLoader, promotionLoader, stockInfoLoader, twinListingLoader, priceEvaluationLoader.

The first comparison tool, a schema diff between the live DynamoDB record shape and v1's GraphQL types, landed alongside the first resolver. It surfaced shape gaps before they became runtime surprises.
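
A shape check of this kind can be sketched as a set comparison between the keys seen on a stored record and the fields a GraphQL type declares. This is an illustration only: findShapeGaps, the sample record, and the field list are hypothetical names and data, and the real tool presumably also compared types and nested shapes.

```typescript
// Hypothetical sketch: compare top-level record keys with declared GraphQL fields.
type ShapeGaps = {
  missingInRecord: string[]; // declared in the schema, absent from the record
  unmappedInSchema: string[]; // present on the record, not exposed by the schema
};

function findShapeGaps(
  record: Record<string, unknown>,
  schemaFields: readonly string[],
): ShapeGaps {
  const recordKeys = new Set(Object.keys(record));
  const declared = new Set(schemaFields);
  return {
    missingInRecord: schemaFields.filter((f) => !recordKeys.has(f)).sort(),
    unmappedInSchema: Array.from(recordKeys).filter((k) => !declared.has(k)).sort(),
  };
}

// Example data, invented for illustration.
const sampleRecord = { id: "abc", price: 19900, mileageKm: 42000 };
const listingFields = ["id", "price", "firstRegistration"];

const gaps = findShapeGaps(sampleRecord, listingFields);
console.log(gaps);
```

Fields in missingInRecord are the ones a resolver would return as null or fail on; fields in unmappedInSchema are data nobody has exposed yet.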

Phase 2: resolver-by-resolver parity (mid-November to mid-December)

A month of one-resolver-per-PR, each carrying a "should match between v1 and v2" test against a recorded v1 fixture. Domains shipped, in order: listings, price history, vehicle taxonomy (make / model / body / fuel / engine), inventory and twin listings, exclusive offers, seals, translations, dealer cache, price info, cost model, featured promotions, financing and insurance, seller (with vendor contact data), search results (searchByFilters), ranking and tracking parameters, vehicle equipment (200+ types), webpage URL, media, price evaluation, location, feature toggles + userData forwarding.

Supporting infrastructure landed alongside: a token-caching Lambda for M2M identity tokens stored in AWS SSM, a staging deploy pipeline, and an env-validation layer pinned via arktype.

Phase 3: DataLoader optimisation (early December)

Once basic resolvers worked, round trips were the next thing to attack. Direct API calls became batched DataLoader fetches: financingLoader minified the outgoing GraphQL via gqlmin and computed cache keys ahead of time; promotionLoader fetched ahead of selection-set evaluation; stockInfoLoader consolidated 404 handling and added an isBusinessListing() guard; valuationHistoryLoader batched cleanly with no upstream changes.

By the end of this phase, a query with twenty listings made a predictable number of upstream calls, one per loader, instead of twenty-times-N.
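
The batching effect can be shown with a small dependency-free stand-in for DataLoader. TinyLoader, fetchFinancingBatch, and createLoaders are hypothetical names for this sketch; the real service used the DataLoader package, which follows the same collect-then-dispatch idea with more options.

```typescript
// Minimal stand-in for a batching loader: keys requested in the same tick
// are collected and sent to the batch function in a single call.
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;
type Pending<K, V> = { key: K; resolve: (value: V) => void; reject: (error: unknown) => void };

class TinyLoader<K, V> {
  private batchFn: BatchFn<K, V>;
  private queue: Pending<K, V>[] = [];
  private cache = new Map<K, Promise<V>>();

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const promise = new Promise<V>((resolve, reject) => {
      // The first key queued in a tick schedules one flush for the whole batch.
      if (this.queue.length === 0) queueMicrotask(() => this.flush());
      this.queue.push({ key, resolve, reject });
    });
    this.cache.set(key, promise);
    return promise;
  }

  private flush(): void {
    const batch = this.queue;
    this.queue = [];
    this.batchFn(batch.map((item) => item.key)).then(
      (values) => batch.forEach((item, i) => item.resolve(values[i] as V)),
      (error) => batch.forEach((item) => item.reject(error)),
    );
  }
}

let upstreamCalls = 0;

// Hypothetical upstream: one call returns financing data for many listing ids.
async function fetchFinancingBatch(ids: readonly string[]): Promise<string[]> {
  upstreamCalls += 1;
  return ids.map((id) => `financing:${id}`);
}

// Loaders are created per request so cached values never leak between requests.
function createLoaders() {
  return { financingLoader: new TinyLoader<string, string>(fetchFinancingBatch) };
}

async function resolveTwentyListings(): Promise<string[]> {
  const { financingLoader } = createLoaders();
  const ids = Array.from({ length: 20 }, (_, i) => `listing-${i}`);
  return Promise.all(ids.map((id) => financingLoader.load(id)));
}

const batchDemo = resolveTwentyListings().then((results) => {
  console.log(results.length, "results from", upstreamCalls, "upstream call(s)");
  return results;
});
```

Twenty load() calls made while resolving one query collapse into a single upstream request, which is the "one per loader" behaviour described above.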

Phase 4: shadow-diff pipeline (mid-January 2026)

This was the confidence engine.

Every production request that hit v2 was replayed against v1 asynchronously, the responses were diffed with jsondiffpatch, and both diff size and diff percentage were emitted as Datadog metrics.

The mechanic in code is small:

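// app.ts: clone the request before yoga consumes its body
const cloned = diffingEnabled ? request.clone() : null;
const response = await yoga.fetch(request, env);

if (cloned) {
  // Clone the response now, while its body is still unread; once the
  // server starts streaming it to the client, clone() would throw.
  const responseForDiff = response.clone();
  // Defer the diff to a later task so it never blocks the user's response
  setTimeout(() => performQueryDiff(cloned, responseForDiff), 0);
}

return response;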

performQueryDiff extracts query + variables, calls fetchFromLegacyService, normalises a few known-inconsistent fields (the adTargeting JSON string was the worst offender), runs the diff with object-hash array matching, emits graph_v1_v2_diff.diff_size and graph_v1_v2_diff.diff_percentage, and optionally persists the diff to S3 keyed by content hash so repeats deduplicate.
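
How the two metrics fall out of a pair of responses can be sketched without jsondiffpatch. The definitions here are assumptions, not the pipeline's actual formulas: diff size is taken as the number of leaf paths whose values differ, and diff percentage as that count over all leaf paths seen on either side. Arrays are compared by index rather than by object hash, and flattenLeaves and diffMetrics are hypothetical names.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Flatten a JSON value into "path -> serialised leaf" entries.
function flattenLeaves(
  value: Json,
  path = "",
  out = new Map<string, string>(),
): Map<string, string> {
  if (value !== null && typeof value === "object") {
    const entries: [string, Json][] = Array.isArray(value)
      ? value.map((item, i): [string, Json] => [String(i), item])
      : Object.entries(value);
    // Keep empty containers visible so {} versus [] still counts as a difference.
    if (entries.length === 0) out.set(path, Array.isArray(value) ? "[]" : "{}");
    for (const [key, child] of entries) flattenLeaves(child, `${path}/${key}`, out);
  } else {
    out.set(path, JSON.stringify(value));
  }
  return out;
}

function diffMetrics(v1: Json, v2: Json): { diffSize: number; diffPercentage: number } {
  const left = flattenLeaves(v1);
  const right = flattenLeaves(v2);
  const paths = new Set(Array.from(left.keys()).concat(Array.from(right.keys())));
  let diffSize = 0;
  paths.forEach((p) => {
    // A path missing on one side yields undefined and therefore counts as different.
    if (left.get(p) !== right.get(p)) diffSize += 1;
  });
  const diffPercentage = paths.size === 0 ? 0 : (diffSize / paths.size) * 100;
  return { diffSize, diffPercentage };
}

// Invented sample responses: only the price differs.
const v1Response: Json = { data: { listing: { id: "1", price: 19900, seals: ["a", "b"] } } };
const v2Response: Json = { data: { listing: { id: "1", price: 19950, seals: ["a", "b"] } } };

const sampleMetrics = diffMetrics(v1Response, v2Response);
console.log(sampleMetrics);
```

With four leaf paths and one mismatch, this sample reports a diff size of 1 and a diff percentage of 25. A percentage normalises for response size, which is what makes it usable as a gate across very different queries.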

Two feature toggles control the pipeline independently:

enable-query-diffing-v2: the diff itself
enable-query-diffing-v2-save-s3: the S3 write

Splitting them mattered more than I expected. The metrics were cheap; the S3 writes were the noisy thing. We ran with metrics on 24/7 and S3 on for windows when we wanted to inspect. A small UI at /tools/diffs browsed the persisted diffs.

Phase 5: schema drift on two rails (late January)

A v1/v2 strangler fig has a quiet failure mode: v1's schema keeps evolving while v2 is being built. We caught this with two parallel mechanisms:

Hourly drift monitoring (schema-drift-monitoring.yaml) introspects the live v1 schema and diffs it against v2's rootSchema.graphql. Drift posts to Slack: pure signal, it doesn't fail anything.
PR validation that blocks merge (schema-pr-validation.yaml) compares PR schema changes against the main branch and posts a comment classifying each change as additive, breaking, or dangerous. Breaking and dangerous changes block merge.

Two rails because the failure modes are different. The PR gate stops people from shipping breaking changes inside the project; the hourly job catches changes that happen outside it.
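
The three-way classification behind the PR gate can be sketched over a deliberately simplified schema model. SchemaModel, classifyChanges, and blocksMerge are hypothetical names and the rules are an assumption. Treating an added enum value as dangerous mirrors how graphql-js categorises it, while every field type change is treated as breaking here for simplicity.

```typescript
type ChangeLevel = "additive" | "dangerous" | "breaking";
type SchemaChange = { level: ChangeLevel; message: string };

// Simplified model (an assumption): object types map field names to type
// strings, enums map to their list of values.
type SchemaModel = {
  types: Record<string, Record<string, string>>;
  enums: Record<string, string[]>;
};

function classifyChanges(base: SchemaModel, head: SchemaModel): SchemaChange[] {
  const changes: SchemaChange[] = [];
  const add = (level: ChangeLevel, message: string) => changes.push({ level, message });

  for (const [typeName, before] of Object.entries(base.types)) {
    const after = head.types[typeName];
    if (after === undefined) {
      add("breaking", `type ${typeName} removed`);
      continue;
    }
    for (const [field, fieldType] of Object.entries(before)) {
      if (!(field in after)) add("breaking", `${typeName}.${field} removed`);
      else if (after[field] !== fieldType) add("breaking", `${typeName}.${field} changed type`);
    }
    for (const field of Object.keys(after)) {
      if (!(field in before)) add("additive", `${typeName}.${field} added`);
    }
  }
  for (const typeName of Object.keys(head.types)) {
    if (!(typeName in base.types)) add("additive", `type ${typeName} added`);
  }
  for (const [enumName, before] of Object.entries(base.enums)) {
    const after = head.enums[enumName] ?? [];
    for (const value of before) {
      if (after.indexOf(value) === -1) add("breaking", `${enumName}.${value} removed`);
    }
    for (const value of after) {
      // Clients with exhaustive switches may not handle a new value.
      if (before.indexOf(value) === -1) add("dangerous", `${enumName}.${value} added`);
    }
  }
  return changes;
}

// Breaking and dangerous changes block the merge; additive ones pass.
function blocksMerge(changes: SchemaChange[]): boolean {
  return changes.some((change) => change.level !== "additive");
}

// Invented schemas for illustration.
const mainSchema: SchemaModel = {
  types: { Listing: { id: "ID!", price: "Int" } },
  enums: { FuelType: ["PETROL", "DIESEL"] },
};
const prSchema: SchemaModel = {
  types: { Listing: { id: "ID!", priceInfo: "PriceInfo" } },
  enums: { FuelType: ["PETROL", "DIESEL", "ELECTRIC"] },
};

const prChanges = classifyChanges(mainSchema, prSchema);
console.log(prChanges, "blocks merge:", blocksMerge(prChanges));
```

The same classifier can serve both rails: the hourly job reports its output to Slack, and the PR check turns it into a pass or fail.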

Phase 6: observability and hardening (February 2026)

Custom yoga plugins replaced what the old stack had given us out of the box:

A custom yoga logger pipes yoga's logs through pino for Datadog structured logging.
useOperationCounter tracks graphql_operations_total by operation name (introspection excluded).
useOperationErrorCounter is a global error handler via useErrorHandler; it parses the document AST to extract locale and skips expected errors.
listingNotFoundError is a structured AppError with an expected boolean flag and a StatsD metric tagged by operationName + locale, so listing-not-found stops looking like a real error.
A response-logger middleware logs 4xx/5xx from the GraphQL endpoint.
A DynamoDB projection toggle A/B-tests full-document reads against projected reads.
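
The expected-versus-real split at the heart of these plugins can be sketched without yoga. The AppError shape follows the description above, but the metric names, the in-memory counter standing in for StatsD, and recordOperationError are hypothetical.

```typescript
// Structured error carrying an "expected" flag, as described above.
class AppError extends Error {
  readonly code: string;
  readonly expected: boolean;

  constructor(message: string, code: string, expected: boolean) {
    super(message);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "AppError";
    this.code = code;
    this.expected = expected;
  }
}

function listingNotFoundError(listingId: string): AppError {
  return new AppError(`Listing ${listingId} not found`, "LISTING_NOT_FOUND", true);
}

// In-memory stand-in for a StatsD client, keyed by metric name and sorted tags.
const counters = new Map<string, number>();

function increment(metric: string, tags: Record<string, string>): void {
  const tagList = Object.entries(tags)
    .map(([tagName, tagValue]) => `${tagName}:${tagValue}`)
    .sort()
    .join(",");
  const key = `${metric}|${tagList}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Expected errors get their own metric and never reach the paging counter.
function recordOperationError(error: unknown, operationName: string, locale: string): void {
  const tags = { operationName, locale };
  if (error instanceof AppError && error.expected) {
    increment("graphql_expected_errors_total", tags); // hypothetical metric name
    return;
  }
  increment("graphql_operation_errors_total", tags); // hypothetical metric name
}

recordOperationError(listingNotFoundError("123"), "GetListing", "de-DE");
recordOperationError(new Error("upstream timeout"), "GetListing", "de-DE");
console.log(Array.from(counters.entries()));
```

Alerting can then key off the real-error counter alone, while the expected-error counter stays available for dashboards.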

Phase 7: public exposure and auth (March 2026)

The repository was restructured into a Yarn-workspace + Turborepo monorepo (apps/graphql-api/). CloudFront distributions for staging and production made v2 publicly addressable. Auth middleware enforced basic auth on *.api.<platform>.com for known clients: marketplace-web (the web frontend), web-frontend, ios-app, android-app, and the partner integrations.

Phase 8: progressive cutover (ongoing)

Traffic migration is controlled by feature toggles. The enable-graphql-api-v2-shadow-traffic toggle lets v1 route shadow traffic to v2 selectively. Each consumer cohort moves on its own cadence, with the diff metrics as the gate.
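
A gate of this kind can be sketched as a pure function from a cohort's current traffic share and a window of diff metrics to its next share. Every number below (the rollout steps, the diff budget, the minimum sample size) is a hypothetical placeholder rather than a value from the project, and nextStep is an invented name.

```typescript
type CohortState = { cohort: string; v2TrafficPercent: number };
type DiffWindow = { requests: number; meanDiffPercentage: number };

// Hypothetical placeholders, not values from the project.
const ROLLOUT_STEPS = [0, 1, 5, 25, 50, 100];
const MAX_MEAN_DIFF_PERCENTAGE = 0.5;
const MIN_REQUESTS_IN_WINDOW = 1000;

function nextStep(state: CohortState, window: DiffWindow): CohortState {
  const enoughData = window.requests >= MIN_REQUESTS_IN_WINDOW;
  const withinBudget = window.meanDiffPercentage <= MAX_MEAN_DIFF_PERCENTAGE;
  // Hold position on thin or noisy data; the gate only ever advances on evidence.
  if (!enoughData || !withinBudget) return state;
  const next = ROLLOUT_STEPS.find((step) => step > state.v2TrafficPercent);
  return { ...state, v2TrafficPercent: next ?? state.v2TrafficPercent };
}

const iosCohort: CohortState = { cohort: "ios-app", v2TrafficPercent: 5 };
const advanced = nextStep(iosCohort, { requests: 5000, meanDiffPercentage: 0.1 });
const held = nextStep(iosCohort, { requests: 5000, meanDiffPercentage: 2.0 });
console.log(advanced.v2TrafficPercent, held.v2TrafficPercent);
```

Because each cohort carries its own state, a noisy partner integration can stay at a low share while the web frontend moves ahead.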

The hard parts

adTargeting JSON-string normalisation. v1 returned a JSON-encoded string with key ordering that shifted between requests. v2's encoder ordered keys deterministically, which the diff pipeline reported as a "difference" on every single response. We added a normaliser to the diff pipeline before the comparison, but in hindsight, two days were lost to chasing what looked like real divergence.
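
A normaliser of this kind can be as small as parsing the string and re-serialising it with keys sorted at every level. This is a minimal sketch with hypothetical function names; the payloads are invented and the real normaliser may have handled more cases.

```typescript
// Recursively rebuild objects with keys in sorted order; arrays keep their order.
function sortKeysDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => sortKeysDeep(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) sorted[key] = sortKeysDeep(source[key]);
    return sorted;
  }
  return value;
}

// Normalise a JSON-encoded string so key order no longer affects comparison.
function normaliseJsonString(raw: string): string {
  try {
    return JSON.stringify(sortKeysDeep(JSON.parse(raw)));
  } catch {
    return raw; // not valid JSON: leave it for the diff to report as-is
  }
}

// Invented payloads: same content, different key order.
const fromV1 = '{"b":1,"a":{"d":2,"c":3}}';
const fromV2 = '{"a":{"c":3,"d":2},"b":1}';

console.log(normaliseJsonString(fromV1) === normaliseJsonString(fromV2));
```

Applied to both sides before the comparison, this leaves only genuine value differences in the diff.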

Field-shape ambiguity in price-info. v1's price-info resolver returned slightly different shapes depending on listing type. The behaviour was undocumented; the consumers depended on it. Reproducing it in v2 required reading the v1 resolver code line by line; there was no spec.

Custom yoga plugins arrived late. We added the operation counter, error counter, and structured AppError in February, after three months of resolvers without observability matching v1's. By the time they landed we'd already debugged a couple of incidents on partial data.

Outcome

Metric	Before	After
v2 production response coverage	n/a	100% diffed
Resolvers at v1 parity	0	20+
Per-request DataLoader rationalisation	ad-hoc	8 per request
Schema drift detection	None	Hourly + PR gate
Cutover events required	n/a	0
Consumer-visible regressions during shift	n/a	0

What I'd do differently

The diffing pipeline would land in week one, not week ten. Six weeks of resolver building without anything production-shaped to validate against meant the parity tests carried more weight than they deserved. The diff pipeline pointed at shadow-shaped staging data would have surfaced the adTargeting normalisation, the price-info mismatches, and a handful of locale quirks two months earlier. Build the confidence engine before you build the thing it has confidence in.

Custom yoga plugins should ship alongside their first use, not later. Leaving the framework defaults behind means recreating them: operation counting, structured error handling, expected-vs-real classification. Budget for that on day one or you spend it later under load.

Be stricter about which v1 quirks v2 inherits. Some v1 responses had inconsistencies the original team had filed under "clients tolerate it." We diffed those into v2 because v1 was the oracle. By month three we were maintaining "deliberate non-parity" notes, fields where v2 was correct and v1 was the bug. A cleaner policy from day one would have been: parity is the default; intentional deviations get documented with a link to the consumer who needs to notice.
