Built v2 as a parallel service instead of upgrading v1 in place, Apoorv Mittal

Context

A public GraphQL read API on a stack drifting out of support. The resolver layer had calcified around earlier framework idioms; adding a field had become a coordination exercise. There were fifty-plus consumers (web, native iOS, native Android, partner integrations, and an internal marketplace product), none coordinating releases with us.

The team wanted to upgrade the framework. Every in-place plan I sketched ended in a freeze, a maintenance window, or both. The previous attempt (eighteen months earlier, before my time) had stalled for the same reason.

The two options I weighed

Option	Risk profile	Cost
In-place upgrade with feature flags	High: every consumer is one rollback away from breaking; framework upgrade and resolver-shape changes are coupled	Lower in plumbing; very high in coordination
Strangler-fig: build v2 in a new repo, run in parallel, cut over per consumer behind toggles	Low at every step: v1 keeps serving traffic; cutover is reversible by toggle	High in plumbing: re-implement DataLoaders, context, validation by hand

The in-place option had two non-obvious risks. First: the framework upgrade and the consumer-visible behaviour were coupled. Flipping the framework also re-shaped resolver outputs in subtle ways (serialisation order, error envelopes, default field handling). Every consumer was implicitly betting that v2's outputs matched v1's, with no way to verify before the flip. Second: every previous in-place upgrade in the team's history had stalled because someone needed a freeze that nobody could grant.

The decision

Build v2 in a separate repository, on a different framework (graphql-yoga + hono), as a sibling service to v1. No shared code with v1. Run in parallel for as long as it takes to validate. Cut over per consumer cohort behind feature toggles, with diff metrics as the gate.
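As an illustration only, the per-cohort gate can be sketched as a small pure function: a cohort is routed to v2 only when its toggle is on and its observed v1↔v2 mismatch rate is under a threshold. Everything named here (CohortState, GatePolicy, chooseBackend, and the threshold values) is a hypothetical stand-in, not the production code.

```typescript
type Backend = "v1" | "v2";

// Hypothetical per-cohort state: the operator toggle plus shadow-diff counters.
interface CohortState {
  toggleOn: boolean;         // feature toggle for this consumer cohort
  comparedResponses: number; // shadow comparisons observed for the cohort
  mismatches: number;        // comparisons where v1 and v2 disagreed
}

// Hypothetical gate policy; the real thresholds are not stated in this write-up.
interface GatePolicy {
  minSamples: number;      // do not trust a rate computed from too few samples
  maxMismatchRate: number; // highest tolerated share of diverging responses
}

function chooseBackend(state: CohortState | undefined, policy: GatePolicy): Backend {
  // Unknown cohorts and cohorts with the toggle off always stay on v1.
  if (!state || !state.toggleOn) return "v1";
  // Too little evidence: stay on v1 even though the toggle is on.
  if (state.comparedResponses < policy.minSamples) return "v1";
  const mismatchRate = state.mismatches / state.comparedResponses;
  // Flipping the toggle off (or a rising mismatch rate) reverts to v1.
  return mismatchRate <= policy.maxMismatchRate ? "v2" : "v1";
}

// Usage with assumed numbers.
const policy: GatePolicy = { minSamples: 1000, maxMismatchRate: 0.001 };
const partnerCohort: CohortState = { toggleOn: true, comparedResponses: 10000, mismatches: 5 };
console.log(chooseBackend(partnerCohort, policy));
```

Keeping the decision a pure function of toggle and metrics is one way to make the cutover reversible: turning the toggle off sends the cohort back to v1 without a deploy.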

I reached this by inverting the risk question. Instead of "how do we make an in-place upgrade safe?" I asked "what would have to be true for a parallel build to be wrong?" The only honest answer was the plumbing cost. I could carry that.

What I gave up

Re-implementing every DataLoader, every context binding, every runtime validation, by hand, in a new codebase, while v1 kept serving traffic. Six weeks of resolver-by-resolver porting before any of it was production-shaped. A build budget that didn't translate to visible feature progress for a quarter.

The cleaner-looking alternative (sharing types or clients between v1 and v2 to "make the migration easier") is the version of strangler-fig that fails. Every time I've seen it tried, the old code's design constraints leak into the new code. I enforced zero shared code from day one.

What played out

The framework upgrade and the consumer migration decoupled cleanly. v2 could be wrong in dozens of ways without any consumer noticing, because v1 was still serving traffic. Once the shadow-diff pipeline landed in month three, the divergences it surfaced were the kind no parity test would have caught: the adTargeting JSON-string key-ordering quirk, the price-info field-shape variance, a handful of locale-specific edge cases.
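A key-ordering quirk inside a JSON-encoded string shows why a byte-level comparison is not enough for a shadow diff: the two payloads can be semantically equal and still differ as text. The sketch below is one way to compare two responses structurally. It assumes responses are plain JSON and that some fields (adTargeting here) carry a JSON document encoded as a string. The function names and the sample payloads are hypothetical, not the pipeline itself.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isObject(v: Json): v is JsonObject {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Serialise with sorted object keys so key order never counts as a difference.
function canonical(value: Json): string {
  if (Array.isArray(value)) return "[" + value.map((v) => canonical(v)).join(",") + "]";
  if (isObject(value)) {
    const obj = value;
    const parts = Object.keys(obj).sort().map((k) => JSON.stringify(k) + ":" + canonical(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Parse a JSON-encoded string field; fall back to the raw string if it is not JSON.
function parseOrRaw(s: string): Json {
  try {
    return JSON.parse(s) as Json;
  } catch {
    return s;
  }
}

// Return the paths at which the v1 and v2 responses differ.
// Arrays and scalars are compared as a whole; objects are compared key by key.
function diffPaths(v1: Json, v2: Json, jsonStringFields: string[], path = "$"): string[] {
  if (isObject(v1) && isObject(v2)) {
    const left = v1;
    const right = v2;
    const keys = Object.keys(left).concat(Object.keys(right).filter((k) => !(k in left)));
    let out: string[] = [];
    for (const key of keys) {
      const child = `${path}.${key}`;
      if (!(key in left) || !(key in right)) {
        out.push(child); // field present on one side only
        continue;
      }
      let a = left[key];
      let b = right[key];
      if (typeof a === "string" && typeof b === "string" && jsonStringFields.indexOf(key) !== -1) {
        a = parseOrRaw(a);
        b = parseOrRaw(b);
      }
      out = out.concat(diffPaths(a, b, jsonStringFields, child));
    }
    return out;
  }
  return canonical(v1) === canonical(v2) ? [] : [path];
}

// Hypothetical payloads: same adTargeting content in a different key order,
// and a price field whose shape differs (number vs string).
const fromV1: Json = { adTargeting: '{"b":1,"a":2}', price: { amount: 10 } };
const fromV2: Json = { adTargeting: '{"a":2,"b":1}', price: { amount: "10" } };
console.log(diffPaths(fromV1, fromV2, ["adTargeting"])); // ["$.price.amount"]
```

Whether a key-order difference should be normalised away or reported is a policy choice; a consumer that compares the raw string would still see it, so the list of normalised fields is worth keeping explicit.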

The plumbing cost was real. Phase 1 produced no user-visible feature work for a quarter. It was the kind of decision a less experienced team might have rolled back inside the first month. I'd defend it again.

What I'd do differently

Build the diff pipeline in week one, not week ten. The strangler-fig framing was correct, but the confidence engine (the v1↔v2 diff) should land before the resolvers, not after. Six weeks of resolver work without anything production-shaped to validate against meant the parity tests carried more weight than they deserved. Pointing the diff at shadow-shaped staging data from day one would have surfaced the worst divergences two months earlier.
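To make "surface the worst divergences" concrete, here is a minimal sketch of the aggregation step a day-one diff could feed: given the diverging field paths found for each sampled request, rank the paths by how many requests they affect. The input shape, the rankDivergences name, and the sample paths are assumptions for illustration.

```typescript
interface DivergenceCount {
  path: string;  // field path that differed between v1 and v2
  count: number; // number of sampled requests in which it differed
  share: number; // count divided by the total number of sampled requests
}

// samples[i] holds the diverging paths found for sampled request i
// (an empty array means v1 and v2 agreed on that request).
function rankDivergences(samples: string[][]): DivergenceCount[] {
  const counts = new Map<string, number>();
  for (const paths of samples) {
    // Count each path at most once per request.
    const unique = paths.filter((p, i) => paths.indexOf(p) === i);
    for (const p of unique) counts.set(p, (counts.get(p) ?? 0) + 1);
  }
  const ranked: DivergenceCount[] = [];
  counts.forEach((count, path) => ranked.push({ path, count, share: count / samples.length }));
  // Most frequent first; ties broken alphabetically for stable output.
  return ranked.sort((a, b) => b.count - a.count || a.path.localeCompare(b.path));
}

// Hypothetical staging sample of four requests.
const staged = [
  ["$.price.amount"],
  ["$.price.amount", "$.adTargeting"],
  [],
  ["$.price.amount", "$.price.amount"],
];
console.log(rankDivergences(staged));
```

A ranking like this gives the resolver porting a priority order from the first week, instead of leaving that job to parity tests alone.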
