Migrating a service with feature flags

Context #

I recently encountered a situation where I had to migrate some logic and production data from a legacy service (and database) into a new one.

I had already extracted all relevant code into a new service, ensuring its API was 100% compatible with the old service, then deployed it all the way to production: now I needed a strategy to migrate its data.

Data migrations #

Years ago I wrote about database migrations with AWS DMS, and that remains a solid approach for long-lasting migrations when:

- you cannot hope to update all your consumers quickly,
- the volume of data does not allow you to dump it manually, or
- you cannot afford downtime of any sort (read or write).

The situation I recently encountered had different constraints:

- the new service had to become, for business reasons, the source of truth within days (the sooner the better)
- read downtime was not acceptable, but 1-2 hours of planned write downtime were
- there were only a few tens of thousands of rows to migrate, distributed over a few relational tables

Migration stages #

I started out by defining a set of migration stages to carry out the migration to the new service while respecting the above constraints.

---
title: Migration stages
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  Off --> Read-only
  Read-only --> Off
  Read-only --> Shadow
  Shadow --> Read-only
  Shadow --> Live
  Live --> Shadow
  Live --> Completed
  Completed --> Live
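The diagram only allows moving one stage forward or one stage back. A minimal sketch of how that rule could be encoded, assuming the stage names from the diagram are used verbatim as values (the `Stage` enum and `can_transition` helper are hypothetical names, not part of any library):

```python
from enum import Enum


class Stage(str, Enum):
    # Members are declared in rollout order, matching the diagram.
    OFF = "Off"
    READ_ONLY = "Read-only"
    SHADOW = "Shadow"
    LIVE = "Live"
    COMPLETED = "Completed"


# Enum iteration follows definition order, so this is the rollout order.
ORDER = list(Stage)


def can_transition(current: Stage, target: Stage) -> bool:
    """Return True if `target` is exactly one step before or after `current`."""
    return abs(ORDER.index(current) - ORDER.index(target)) == 1
```

A guard like this could run before the flag value is changed, so that nobody jumps from Off straight to Live by mistake.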

Off #

The old (= legacy) service behaves as usual…

---
title: Client connecting to the legacy service
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: "sm-circ" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client --> LS
  LS -.-> Client
  LS --> LD
  LD -.-> LS

… while the new one proxies all requests to the old one under the hood.

---
title: Client connecting to the new service
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: "sm-circ" }
  NS@{ label: "New Service" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client --> NS
  NS -.-> Client
  NS --> LS
  LS -.-> NS
  LS --> LD
  LD -.-> LS

The source of truth remains the legacy service’s database (legacy database). The new database is not used at all, at this point.

We update all consumers and producers to use the new service: since it exposes the exact same API as the old service, this should in principle only require configuration changes (e.g. updating the URL they point to).

Read-only #

The new service allows read requests exactly as in the previous stage…

---
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: sm-circ }
  NS@{ label: "New Service" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client -- read --> NS
  NS -.-> Client
  NS -- read --> LS
  LS -.-> NS
  LS -- read --> LD
  LD -.-> LS

… but blocks write requests, returning an error to clients.

---
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: sm-circ }
  NS@{ label: "New Service" }
  A --> Client
  Client -- write --> NS
  NS -. error .-> Client
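To make the write-blocking behaviour concrete, here is a rough sketch. It assumes an HTTP API where reads map to `GET`/`HEAD`/`OPTIONS`, and it picks a 503 status for rejected writes; both are assumptions for illustration, as are the `handle_request` and `forward` names.

```python
from typing import Callable, Tuple

# Assumption: these HTTP methods never modify data in the service's API.
READ_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})

Response = Tuple[int, str]


def handle_request(stage: str, method: str, forward: Callable[[], Response]) -> Response:
    """Reject writes while in Read-only; otherwise hand the request over to `forward`."""
    if stage == "Read-only" and method.upper() not in READ_METHODS:
        # Assumption: 503 signals a temporary, planned outage to clients.
        return 503, "Writes are temporarily disabled during a planned migration"
    return forward()
```

Here `forward` stands in for whatever proxies the request to the legacy service; it is never invoked for a blocked write.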

This is when the data migration happens:

- we copy all data from the legacy database into the new one, and
- we ensure that they are consistent (e.g. using validation scripts)
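As an illustration of what such a validation script might look like, the sketch below compares one table across two databases row by row. It uses `sqlite3` purely so the example is self-contained, and it assumes that both tables share the same column order, that rows are unique (e.g. thanks to a primary key), and that the table fits in memory, which is plausible for a few tens of thousands of rows. `fetch_rows` and `diff_table` are hypothetical helper names.

```python
import sqlite3
from typing import Dict, Set


def fetch_rows(conn: sqlite3.Connection, table: str) -> Set[tuple]:
    # The table name is interpolated into the SQL: only pass trusted, hard-coded names.
    return set(conn.execute(f"SELECT * FROM {table}").fetchall())


def diff_table(legacy: sqlite3.Connection, new: sqlite3.Connection, table: str) -> Dict[str, Set[tuple]]:
    """Return the rows that exist on one side only; two empty sets mean the table matches."""
    legacy_rows = fetch_rows(legacy, table)
    new_rows = fetch_rows(new, table)
    return {
        "missing_in_new": legacy_rows - new_rows,
        "unexpected_in_new": new_rows - legacy_rows,
    }
```

A row that differs in any column shows up on both sides of the diff, once with its legacy values and once with its new ones.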

Shadow #

The new service writes to both the legacy and new databases, to keep them in sync, but only reads from the legacy one: this is to verify that the write logic in the new service works, without any impact for clients.

Here it’s important to ensure that any errors when writing to the new database alert developers (they signal an issue that must be solved before going to the next stage), but do not impact the customer experience.

---
title: Write flow
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: sm-circ }
  NS@{ label: "New Service" }
  ND@{ shape: cyl, label: "New Database" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client -- write --> NS
  NS -.-> Client
  NS -- write --> ND
  ND -.-> NS
  NS -- write --> LS
  LS -.-> NS
  LS -- write --> LD
  LD -.-> LS

---
title: Read flow
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: sm-circ }
  NS@{ label: "New Service" }
  ND@{ shape: cyl, label: "New Database" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client -- read --> NS
  NS -.-> Client
  NS -- read --> LS
  NS -- (not used) --> ND
  ND -. (not used) .-> NS
  LS -.-> NS
  LS -- read --> LD
  LD -.-> LS
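The key property of the Shadow write path is that a failure on the new database is reported to developers but never reaches the client. A minimal sketch of that idea follows; it assumes the legacy write happens first (the diagrams do not prescribe an order) and that alerting is wired to error logs. `shadow_write`, `write_legacy` and `write_new` are hypothetical names.

```python
import logging
from typing import Callable

logger = logging.getLogger("migration")


def shadow_write(
    record: dict,
    write_legacy: Callable[[dict], None],
    write_new: Callable[[dict], None],
) -> None:
    # The legacy database is still the source of truth:
    # if this write fails, the exception propagates to the client as usual.
    write_legacy(record)
    try:
        write_new(record)
    except Exception:
        # Logged with its stack trace so that developers get alerted
        # (assumption: alerts are triggered from error logs), then swallowed.
        logger.exception("Shadow write to the new database failed")
```

Every swallowed error leaves the two databases out of sync for that record, so each one needs to be investigated (and the data reconciled) before moving on to Live.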

Live #

The new service still writes to both the legacy and new databases, but now reads from the new one.

This stage helps to catch any issues in the read logic while preserving consistency across databases, allowing us to quickly go back to the previous stage by simply changing the feature flag value back to Shadow.

---
title: Write flow (nothing changed)
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: sm-circ }
  NS@{ label: "New Service" }
  ND@{ shape: cyl, label: "New Database" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client -- write --> NS
  NS -.-> Client
  NS -- write --> ND
  ND -.-> NS
  NS -- write --> LS
  LS -.-> NS
  LS -- write --> LD
  LD -.-> LS

---
title: Read flow (reads from the new database)
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: sm-circ }
  NS@{ label: "New Service" }
  ND@{ shape: cyl, label: "New Database" }
  LS@{ label: "Legacy Service" }
  LD@{ shape: cyl, label: "Legacy Database" }
  A --> Client
  Client -- read --> NS
  NS -.-> Client
  NS -- read --> ND
  ND -.-> NS
  NS -- (not used) --> LS
  LS -. (not used) .-> NS
  LS -- read --> LD
  LD -.-> LS
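Rolling back from Live to Shadow is only instant if the stage is looked up on every request rather than cached at startup. A small sketch of a read path built that way, where `get_stage`, `read_legacy` and `read_new` are hypothetical callables standing in for the flag lookup and the two data sources:

```python
from typing import Callable, Optional

Reader = Callable[[str], Optional[dict]]


def read(key: str, get_stage: Callable[[], str], read_legacy: Reader, read_new: Reader) -> Optional[dict]:
    """Serve a read from the data source that matches the current stage."""
    # The stage is evaluated per request, so flipping the flag back to
    # "Shadow" sends the very next read to the legacy service again.
    if get_stage() in ("Live", "Completed"):
        return read_new(key)
    return read_legacy(key)
```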

Completed #

The new service now reads from and writes to the new database only, which becomes the source of truth and starts diverging from the legacy one. The legacy service now returns an error to clients trying to use it.

---
config:
  look: handDrawn
  theme: neutral
---
flowchart LR;
  A@{ shape: "sm-circ" }
  NS@{ label: "New Service" }
  ND@{ shape: cyl, label: "New Database" }
  A --> Client
  Client --> NS
  NS -.-> Client
  NS --> ND
  ND -.-> NS

Once we confirm that there are no errors, we can decommission the legacy service and database.
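Putting the five stages side by side, the new service's behaviour boils down to a small lookup table. The sketch below is one possible way to write it down; the `Routing` type and the `"legacy"`/`"new"` labels are made up for illustration.

```python
from typing import Dict, NamedTuple, Tuple


class Routing(NamedTuple):
    read_from: str             # where the new service reads from
    write_to: Tuple[str, ...]  # where the new service writes to (empty = writes rejected)


ROUTING: Dict[str, Routing] = {
    "Off": Routing(read_from="legacy", write_to=("legacy",)),
    "Read-only": Routing(read_from="legacy", write_to=()),
    "Shadow": Routing(read_from="legacy", write_to=("legacy", "new")),
    "Live": Routing(read_from="new", write_to=("legacy", "new")),
    "Completed": Routing(read_from="new", write_to=("new",)),
}
```

Each stage changes one thing compared to its neighbour, which is what makes every step easy to reason about and to revert.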

Feature flags #

In its most basic form, a feature flag can be seen as a conditional statement whose evaluation can be changed while the service is running (see what are feature flags). What I needed for my purposes was a feature flag evaluating, at any point in time, to a string value selected from a set (= the migration stages).

I updated both the legacy and new services to change their behaviour according to the feature flag’s value (= the current migration stage): this enabled me to move between stages without being slowed down by deployments.
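A sketch of how a service might evaluate such a string-valued flag defensively. The `fetch` callable is a stand-in for whatever feature flag provider is in use (no specific product's API is implied), and falling back to the last known good value when the provider fails or returns something unexpected is a design assumption of this example.

```python
from typing import Callable

STAGES = ("Off", "Read-only", "Shadow", "Live", "Completed")


class StageFlag:
    """Wraps a flag lookup and only ever returns a valid migration stage."""

    def __init__(self, fetch: Callable[[], str], initial: str = "Off") -> None:
        if initial not in STAGES:
            raise ValueError(f"unknown stage: {initial!r}")
        self._fetch = fetch
        self._last_good = initial

    def value(self) -> str:
        try:
            raw = self._fetch()
        except Exception:
            # Provider unreachable: keep behaving as in the last known stage.
            return self._last_good
        if raw in STAGES:
            self._last_good = raw
        # Unknown values are ignored rather than silently mapped to a default stage.
        return self._last_good
```

Sticking to the last known stage avoids, for example, a flag outage during Completed sending writes back to the legacy database.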

Considerations #

The advantage of the solution I just described is that it’s trivial (and fast) to get back to a “safe state” in case of issues. This is thanks both to the use of feature flags and to the fact that, from Read-only up to and including Live, both databases are in sync (assuming no bugs or other failures, of course).