On Database Migrations in MongoDB

The database schema is something that tends to evolve along the code. That means, it’ll change rather rapidly at the beginning, then stay the same for a relatively long time, and finally get adjusted only from time to time. While some changes will be both easy and fast to implement (e.g., adding a field), some won’t (e.g., extracting an array field into a separate collection¹).

In a way, the nature of the project decides whether or not you need a proper migration workflow. Most MVPs don’t need one, as they are most likely fine with purging the database every time even a change is made. At the same time, designing and maintaining one is a good practice, as it builds confidence for when it’s actually needed.

Background

In this text, I’ll focus on one of our projects, as I think it’s rather unusual. It’s somehow special, as it’s been in production since late 2014 (I’m leading it since late 2015), and it never got a “stop the world” rewrite. Instead, we work on it daily, and change things step by step, doing a few deploys a week on average.

The database schema has changed significantly, and so has the stack. Its core is a Meteor application, using a MongoDB database as the only source of truth. It exposes both REST and GraphQL APIs, and – of course – both are expected to stay backward-compatible after the database schema changes.

At the time of writing, there were 264 migrations, including the one that went live today. The first one is dated 17 November 2015, meaning that we do a migration roughly every 9 days. (I have never actually checked that… That’s a lot!)

Schema

MongoDB calls itself now a “schemaless database”. I’m pretty sure that a couple of years back, they called it a “schema-free database” (this Stack Overflow question also suggests that), and I think it was slightly better. You can enforce a schema, but you don’t have to. (It’s important, as most projects do that, at least for the majority of their data².)

There are basically two ways of doing that: in the code or in the database. The former is more general, as it allows you to implement anything, including complex conditions, like “this field cannot be greater than this other field from a 3rd party API”. It’s often automated by an ORM (or ODM in MongoDB’s case) like mongoose, or a Meteor-specific package, like aldeed:collection2³.

The latter leaves the code as it is, and implements the validation inside the database. I haven’t seen it in the wild a lot, which is surprising, as it’s possible since 3.2 (late 2015). The idea is simple: define a condition (query) that every document has to meet (match). As 3.6 (late 2017) introduced the $jsonSchema query operator⁴, you can effectively have a (limited) JSON schema validation handled in the database.

Of course, it would be best to use both. The former allows you to catch the error “before it happens” and let the user know what’s the problem immediately, even before making a database request. The latter makes a great safety net, as even if someone operates on the database manually (they shouldn’t), the validation will still be there⁵.

In the project we’ll discuss here, we only do the former, using one of Vazco’s packages: universe:collection. The latter was considered for a while, but we never actually did it; mainly because people with MongoDB experience are not used to it and because it’s tricky to update it. (We do it in other projects, though.)

Synchronization

Every migration has to happen at least once and at most once. It’s easy when you have a single instance of the application, and harder when there are twenty of them (and twice as many while deploying). To cope with that, one of the containers is elected as the primary⁶ and runs the migrations.

The migration consists of two parts: a unique name and a function. The name is used to check whether it was run or not (we store that in the database too). Finally, once the migration finishes, we mark it as done in the database (failures are not saved), and run the next one if needed.

Sometimes the migration scheme involves not one but two functions – one for going forward (usually called up) and another for reverting it if needed (down). Some popular frameworks, like Ruby on Rails, allow it and can even automatically deduce the latter.

We’re not doing that, so there’s a problem you may have already spotted: what happens if the container dies while the migration is still running? Well, we have to do migrations ready for that.

Error recovery

Obviously, a migration can go wrong for a multitude of reasons; I’ll give some examples in the next paragraphs. Ideally, every migration code we run should be idempotent, i.e., “it can be run multiple times without changing the results”. Of course, it’s not always possible (or feasible); then we should try making it at least resumable. (Most idempotent operations are resumable, but not all.)

Here’s an example of a migration that is both idempotent and resumable. That means it’s fine for it to fail, as we can run it again without breaking anything⁷.

A few other examples would be field removal, typo correction (e.g., rejceted to rejected), or update in one field based on the other (e.g., make sure that the ownerId is in the memberIds list).

Now, let’s take a migration that involves some destructive (hence irreversible) action but is still resumable. (Again, a real-life example.)

In here, at least three things can go wrong. First of all, the first aggregation itself can fail by exceeding the BSON limit of 16 megabytes. We made sure it’ll fit, but if it didn’t, we’d have to do it in batches (e.g., by adding a $limit stage before, and increasing it gradually). Next, the database may fail while handling Posts.bulkWrite for any reason. If it happens, everything is fine, as the duplicates are still there and we can restart the migration without consequences. Lastly, the same goes if Category.deleteMany would fail.

However, it wouldn’t be the case if we swap the last two operations. That’s because if any of them fails, some duplicated categories are already removed, and it’s impossible to tell, which one is which – the only thing we have left is the categoryIds in the Posts collection⁸.

Another approach here would be to handle duplicates one by one. The only problem I see with this approach is that it potentially involves many more database operations, as every duplicate performs two of them.

Not always feasible

Let’s take an example that has no clear solution for making it correct without spending a lot of extra time on the implementation. At some point, our system introduced a way for 3rd parties to integrate with us – we exposed webhooks.

As we wanted to deliver it fast, we simply added a list of URLs to notify when a certain operation happened (there was only one). It got released, worked well, and we wanted to implement more events. As we were already reworking it, we wanted to add two more features: optional secret for security (passed in a header) and auditing (log who changed what and when).

This migration has two major flaws. The first one is rather obvious – if any of the Webhooks.createConfig calls fail, we will end up with multiple configurations for the same URL (after restart). The second is a little more nuanced and is not really visible here, in the migration itself.

You see, we run all of the migrations when the app starts; at this time, it already handles API calls while the migration is running. It may be the case, that the new code, based on the Webhooks collection and not the webhookUrls field, won’t trigger a webhook, as the data is not migrated yet.

Of course, it is possible to handle it properly. For example, we could deploy some intermediate version that includes the migration and code that would handle both sources of configuration. Then, after the migration is done, get rid of the unnecessary code.

We decided not to do that, as we saw that there weren’t any webhooks called during the night and we simply deployed the migration then. That’s only one of the cases where it was better to implement a non-ideal migration faster and handle it with care rather than spending more time doing it the right way.

What else can go wrong

The above examples show, that the data consistency can be severely affected when a migration fails. It can also make the app behave unexpectedly, e.g., skip a few webhooks. But what else can happen?

It can make the container unavailable by maxing its CPU usage or even break it, by causing an out-of-memory error. Similarly, it can max the CPU, RAM, or disk usage of the database, taking the entire app down (or at least slowing it down for an extended period). I don’t have any examples here, as we always try them on rather large datasets locally and/or on test environments before releasing – we spot them soon enough.

Finally, it can impact the app indirectly. Here’s one that took down the app for several hours in less than a minute after deployment.

Looks innocent, right? That’s because it is. The database alone handled it fast (millions of documents in less than 40s), but the oplog got flooded, breaking all services using it⁹, including the Meteor itself.

The way how Meteor implements real-time communication is by listening to all database operations and checking if any should be pushed to the clients. Luckily, there’s a solution that solves it completely at a cost of owning a Redis instance (less than $5 a month): cultofcoders:redis-oplog.

Closing thoughts

Migrations are not easy, but creating a viable solution for your own project gives a lot of confidence and allows you to fix things as you go. It doesn’t mean you should implement new features carelessly! It’s all about making it possible to start small and scale when needed.

Keep in mind, that the above workflow works for us, but may not be the best fit for you. We also have projects where it’s handled differently, but these deserve separate posts. It also varies between the technologies, but most of the approaches are based on some general truths.

If you’re familiar with SQL and not MongoDB, then treat the field as a column and a collection as a table. The difference is that fields don’t have to be defined upfront, i.e., you can add as many as you want of any type and you’ll be just fine. (Whether or not it is a good idea is a separate topic…)

Sure, there are cases where the data is not structured, like the e-commerce metadata or logs. However, it’s often limited to one or two “black box fields”; everything else remains well-defined (e.g., IDs, timestamps).

It’s important to remember that neither an ORM/ODM nor a Meteor-specific package will cover bulk operations, like the $merge or $out operators, as well as the bulkWrite operation.

⁴

The only problem with using validators in MongoDB pre-5.0 was that the error message didn’t say what was the problem. Now the errors are there, but the docs say they’re meant for the users and “should not be relied upon in scripts”.

⁵

It’s still possible to bypass the validation if you have certain permissions set.

⁶

Primary election is a common topic in distributed systems. As we needed a very basic solution, we rolled out our own a while ago: process-custodian. It stores the instance information in a MongoDB collection and leverages a TTL index for correctness. (We use it in production for years, although I’m not confident about the implementation.)

⁷

It may look like an artificial case, but this is actually something we needed a few times. It’s common to make flags optional in MongoDB, i.e., treat missing fields in the same way as the ones with explicitly set false. That’s natural, as in most languages casting null (or undefined) to boolean makes false. But then you’d like to create a partial index and you find out that it supports $exists: true, but doesn’t support $exists: false.

⁸

If we’d contain all of the database operations in a single transaction, then the order of these doesn’t really matter. A transaction would also make all updates “appear” at once, instead of popping out one by one. They come at some runtime cost, and it’s not a good idea to update 90% of documents in a single one as it may effectively lock the database for a while (or lead to transient errors, that require a retry). Make sure to check the “Production Considerations” page.

⁹

I think the most common external services using oplog are all sorts of BI tools because these often work only with SQL databases and MongoDB integrations are based on one-way data synchronization. (There’s also the official MongoDB BI Connector, but it’s far from perfect and I wouldn’t recommend using it.)

Table of contents

Intro