software engineer

progressive migrations & long tails

Posted at — Sep 26, 2020

You need to switch out the database that you’re operating on. Maybe new requirements mean you need a different type of database. Maybe you’re rearchitecting because your main database cluster can’t take yet another service hammering it.

Okay, but there’s a problem - you can’t have your system face any downtime. Maybe this is because you work for Uber Eats. And if Uber Eats went down for maintenance, millennials would starve. How are we going to get this done?

Progressive migrations

The solution to your problem: progressive migrations. Put additional logic on the path that accesses the data you want to migrate. This logic introduces a new data access mode: on each access, it first tries to grab the data from the new datastore. If the data hasn’t been migrated yet, it’s read from the existing datastore and copied over to the new datastore before the request is serviced.
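Here is a minimal sketch of that read path, with plain dicts standing in for the two datastores. The names `ProgressiveReader`, `old_store` and `new_store` are made up for illustration; a real version would wrap your actual database clients and would also have to decide how writes are handled.

```python
class ProgressiveReader:
    """Read path that migrates each record the first time it is accessed."""

    def __init__(self, old_store, new_store):
        # Hypothetical stand-ins: any mapping-like objects work here.
        self.old_store = old_store
        self.new_store = new_store

    def get(self, key):
        # 1. Try the new datastore first.
        if key in self.new_store:
            return self.new_store[key]
        # 2. Not migrated yet: fall back to the existing datastore.
        if key not in self.old_store:
            return None
        value = self.old_store[key]
        # 3. Move it over before servicing the request.
        self.new_store[key] = value
        return value


old_store = {"user:1": {"name": "Ada"}, "user:2": {"name": "Grace"}}
new_store = {}
reader = ProgressiveReader(old_store, new_store)

print(reader.get("user:1"))  # served from the old store, then copied
print(sorted(new_store))     # "user:1" now lives in the new store
```

The caller never knows which datastore answered, which is what lets the migration happen underneath live traffic.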

Over time, the new datastore will fill up with the data. Once everything has been migrated over, we can switch the data access mode from “progressive migration” to “new datastore”. Boom, no downtime.
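One way to picture the mode switch is as a flag that the read path consults. The sketch below is an assumption about how you might wire it up, not a prescription: `AccessMode`, `read` and `next_mode` are hypothetical names, and dicts again stand in for the datastores.

```python
from enum import Enum


class AccessMode(Enum):
    PROGRESSIVE_MIGRATION = "progressive migration"
    NEW_DATASTORE = "new datastore"


def read(key, mode, old_store, new_store):
    """Serve a read according to the current access mode."""
    if mode is AccessMode.PROGRESSIVE_MIGRATION and key not in new_store:
        # Lazily copy the record over on first access.
        if key in old_store:
            new_store[key] = old_store[key]
    # In NEW_DATASTORE mode the old store is never touched.
    return new_store.get(key)


def next_mode(old_store, new_store):
    """Only allow the cutover once nothing is left unmigrated."""
    unmigrated = sum(1 for key in old_store if key not in new_store)
    if unmigrated == 0:
        return AccessMode.NEW_DATASTORE
    return AccessMode.PROGRESSIVE_MIGRATION


old_store = {"a": 1, "b": 2}
new_store = {}
mode = AccessMode.PROGRESSIVE_MIGRATION

read("a", mode, old_store, new_store)
print(next_mode(old_store, new_store).value)  # "b" is still unmigrated
read("b", mode, old_store, new_store)
print(next_mode(old_store, new_store).value)  # everything has moved over
```

In practice the flag would probably live in config or a feature-flag service so it can be flipped without a deploy.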

Long tails

In computer systems, data access frequencies tend to follow the Pareto distribution. That is to say, a minority of entities (users) are responsible for the vast majority of data accesses. This is more commonly known as the 80/20 rule.


As a result of this distribution, most of our data is rarely accessed and may not be touched for a long time. This manifests as a long tail of yet-to-be-migrated data. You’ll have to keep this in mind when implementing this strategy.
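To get a feel for the long tail, here is a small simulation. It is a toy model under loud assumptions: 1,000 entities, 5,000 accesses, and access weights proportional to 1/rank as a stand-in for a heavy-tailed distribution. None of those numbers come from a real system.

```python
import random


def simulate_burndown(num_entities=1000, num_accesses=5000, seed=42):
    """Return the count of unmigrated entities after each access."""
    rng = random.Random(seed)
    # Assumed skew: the entity at rank r is accessed in proportion to 1/r.
    weights = [1.0 / rank for rank in range(1, num_entities + 1)]
    accesses = rng.choices(range(num_entities), weights=weights, k=num_accesses)

    migrated = set()
    remaining = []
    for entity in accesses:
        migrated.add(entity)  # the first access triggers the migration
        remaining.append(num_entities - len(migrated))
    return remaining


burndown = simulate_burndown()
for step in (0, 999, 2499, 4999):
    print(f"after {step + 1:>4} accesses: {burndown[step]} unmigrated")
```

The hot entities migrate almost immediately, the curve flattens, and a stubborn chunk of rarely-touched entities is still sitting in the old datastore at the end.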

After the bulk of the migration has been done and the burndown of unmigrated data has slowed or halted, bust out Emacs & write a Perl script to hit the rest of the data & trigger the latent migrations. You’ll actually need two scripts: the first finds the entities that haven’t been migrated yet, and the second runs through them and triggers their migration in a manner that doesn’t overload your system.
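The two scripts could look roughly like this. It is sketched in Python rather than Perl, and everything here is hypothetical: `find_unmigrated`, `backfill`, the batch size and the pause are placeholders you would tune against your own system, and dicts stand in for the datastores.

```python
import time


def find_unmigrated(old_store, new_store):
    """Script one: list the entities that haven't been migrated yet."""
    return [key for key in old_store if key not in new_store]


def backfill(keys, migrate_one, batch_size=100, pause_seconds=1.0):
    """Script two: trigger migrations in small batches, pausing between them."""
    migrated = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            migrate_one(key)  # e.g. perform a read through the migrating path
            migrated += 1
        # Crude throttle so the backfill doesn't hammer either datastore.
        if start + batch_size < len(keys):
            time.sleep(pause_seconds)
    return migrated


old_store = {f"user:{i}": i for i in range(10)}
new_store = {"user:0": 0, "user:1": 1}  # the hot entities already moved


def migrate_one(key):
    new_store[key] = old_store[key]


pending = find_unmigrated(old_store, new_store)
count = backfill(pending, migrate_one, batch_size=3, pause_seconds=0)
print(count, "entities backfilled;", len(find_unmigrated(old_store, new_store)), "left")
```

A fixed sleep between batches is the simplest possible throttle. Backing off based on datastore latency or error rates would be a natural next step.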

If you had a graph that measured the count of unmigrated entities over time, it would look something like this: