@radekmie

On Oplog Replacement in Meteor

By Radosław Miernik · Published on · Comment on Meteor Forum, Reddit


Intro

Before I started working with Meteor back in 2015, basically every single web app I had ever worked with was either entirely static or refreshed periodically (e.g., using AJAX). Meteor was different – every database change was available in the browser instantly, moments after it happened.

The best part is that it takes almost no code to make it work. Publish a subset of the database on the server, then subscribe to it on the client, and that’s it; both the client and the server code are as simple as a single MongoDB query. Thanks to the isomorphic (universal) code, it can be the exact same query on both ends.
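For illustration, here’s a minimal sketch of what that looks like; the collection, the publication name, and the query are made up for this example:

```js
// shared/collections.js – isomorphic code, loaded on both the client and the server
import { Mongo } from 'meteor/mongo';
export const Tasks = new Mongo.Collection('tasks');

// server/publications.js – publish a subset of the database
import { Meteor } from 'meteor/meteor';
Meteor.publish('tasks.open', function () {
  return Tasks.find({ done: false });
});

// client/main.js – subscribe, then run the exact same query locally
Meteor.subscribe('tasks.open');
const openTasks = Tasks.find({ done: false }).fetch();
```

Everything the publication sends (and every later change) shows up in the client-side collection automatically.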

As you may have guessed, it comes with a cost. Not a problem per se but rather a non-trivial performance penalty you must keep in mind while scaling your app. Let’s dive into how it works and what we can do about it.

So, how does Meteor do it?

Overview

We’ll dissect this whole real-time functionality piece by piece, but let’s start with a top-level summary of how Meteor does it:

  1. Client initializes a session.
  2. Client subscribes to a publication, i.e., sends an “I want to be notified about this now” request to the server.
  3. Server executes the initial query, storing the results in memory (so-called mergebox; more on that later), and sends them to the client.
  4. Server sets up a “database watcher” that will react to all data changes. They will be synced to the mergebox and pushed to the client.
  5. Client unsubscribes, i.e., sends an “I no longer want to be notified” request.
  6. Server closes the database watcher and removes the data from memory.

The above communication is defined in the Distributed Data Protocol (DDP), the default for client-server communication in Meteor. It’s a stateful protocol using a WebSocket1.
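To make it a bit more concrete, here is roughly the shape of the DDP messages involved (the values are illustrative):

```js
// Client asks to be notified about a publication:
const subscribe = { msg: 'sub', id: 'abc1', name: 'tasks.open', params: [] };

// Server replies with the initial results, then keeps pushing updates:
const added   = { msg: 'added', collection: 'tasks', id: 'T1', fields: { done: false } };
const ready   = { msg: 'ready', subs: ['abc1'] };
const changed = { msg: 'changed', collection: 'tasks', id: 'T1', fields: { done: true } };

// Client no longer wants to be notified:
const unsubscribe = { msg: 'unsub', id: 'abc1' };
```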

The exact details of the protocol are not really essential for this article; what is important is that we maintain real-time communication between the database, server, and finally – client (browser).

Mergebox

Have you heard of mergebox? You must be a seasoned Meteor user then! Funnily enough, this name never occurs in the code, only in the documentation and comments. The idea behind it is simple: Meteor stores the published state in memory to minimize the data sent to the browser. Yes, we trade the server memory (and hence computing time) for reduced network traffic2.

How does it impact the performance? I’d say it scales worse than linearly with the amount of published data. In other words, publishing twice as much data is more than twice as expensive. And, of course, publishing a megabyte of data takes more than a megabyte of your server memory.
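Here’s a highly simplified sketch of the idea – it’s not Meteor’s actual implementation, just the gist of it:

```js
// Per client session: remember which fields of which documents were already sent,
// and push only the difference over DDP.
const mergebox = new Map(); // documentId -> fields the client already has

function pushChange(collection, id, fields, send) {
  const known = mergebox.get(id) ?? {};
  const diff = {};
  for (const [key, value] of Object.entries(fields)) {
    if (JSON.stringify(known[key]) !== JSON.stringify(value)) {
      diff[key] = value;
    }
  }
  if (Object.keys(diff).length > 0) {
    mergebox.set(id, { ...known, ...diff });
    send({ msg: 'changed', collection, id, fields: diff });
  }
}
```

The memory cost comes from that Map: the server keeps a copy of all the data currently published to each connected client.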

MongoDB Oplog

First of all, the server must know that something happened in the database. Fortunately, MongoDB has a solution just for that: the operation log, or oplog. It’s a special (capped) collection where all operations are stored; it’s the backbone of replication, i.e., synchronization between database nodes3.

To make use of this data, we need to somehow listen to it in real time. Once again, MongoDB has a solution: tailable cursors4. Once we set one up, our application will be notified as soon as anything happens.
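Here’s a sketch of what tailing the oplog looks like with the Node.js MongoDB driver; lastSeenTimestamp and handleOplogEntry are placeholders, and Meteor’s real implementation (in the mongo package) is considerably more involved:

```js
import { MongoClient } from 'mongodb';

const client = await MongoClient.connect('mongodb://localhost:27017/?replicaSet=rs0');
const oplog = client.db('local').collection('oplog.rs');

const cursor = oplog.find(
  { ts: { $gt: lastSeenTimestamp } },  // resume after the last entry we processed
  { tailable: true, awaitData: true }, // keep the cursor open and wait for new entries
);

for await (const entry of cursor) {
  // entry.op is the operation ('i', 'u', 'd', ...), entry.ns is "database.collection"
  handleOplogEntry(entry);
}
```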

With mergebox and a tailable cursor of the oplog in hand, we have to analyze every single database operation (see _handleOplogEntrySteadyOrFetching). It looks complex, but it’s really just routing of the database changes.

However, there’s a problem – the oplog is not designed to be consumed by third parties, and as such, it’s not documented. Meteor, amongst others, is basically reverse-engineering the actual implementation. And it does change between MongoDB versions, e.g., in version 6.0 (see oplogV2V1Converter).

Problem

Once we conquer the oplog and can handle all the edge cases, we can finally ship it to production. We won’t publish a lot of data, we’ll make sure we have enough memory reserved (and autoscaling in place!), and we should be good, at least for a while. Right…?

Nope. You see, there’s a catch – the oplog lists all of the database operations, even the ones nobody is interested in. Let’s say we publish only a handful of documents from one tiny collection. But now another, utterly unrelated collection is getting flooded with millions of changes. The server has to go through all of them, as they may be interesting to someone.

What’s worse, every server has to go through every oplog message. That’s not just bad – that’s “Meteor does not scale” bad. Even if we scoped it and went through only the potentially interesting messages, this “oplog flood” scenario could still happen, and it would hit us hard.

Solution

The idea of listening only to “potentially interesting” changes is good. Sure, we can still be flooded with related changes, but then we are at least grinding CPUs for a reason. We can solve this by adding an additional layer between the database and the servers to handle it.

And that’s how cultofcoders:redis-oplog was born. It replaces observing the database with listening to Redis, using its built-in support for pub/sub (publish and subscribe messaging). We then only need to listen to two kinds of channels:

  1. ${collection}, when subscribed by a generic query.
  2. ${collection}::${id}, when subscribed by ID.

We have to push the changes there too, but that’s also simple – every change is published to both of the above, along with the operation (insert, update, remove). Even the document ID alone is enough, as we can still get the latest version from the database.
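Roughly, the publishing side boils down to something like this (the actual cultofcoders:redis-oplog message format has a few more fields; the shape below is only illustrative):

```js
import { createClient } from 'redis';

const redis = createClient();
await redis.connect();

// Called after every insert/update/remove performed by the app.
async function notify(collection, id, event /* 'i' | 'u' | 'r' */) {
  const message = JSON.stringify({ e: event, d: { _id: id } });
  await redis.publish(collection, message);             // for generic-query subscribers
  await redis.publish(`${collection}::${id}`, message); // for by-ID subscribers
}
```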

Even though we still have to query the database after hearing about a change from Redis, it became a game changer for Meteor applications. It literally saved dozens of projects I worked on. And all of that was built by the community.

Solution, again

Can we do better? The above solution looks great, but something has to push the data to Redis. And as long as we know what to push, it’s okay. But as soon as you let others operate on your database, you’re doomed – you have to take care of it yourself. It’s a hassle, trust me.

Can we, perhaps, automate it? The Meteor community, once again, came up with a solution: oplogtoredis. It’s a simple program that listens to the oplog and pushes the changes to Redis in a cultofcoders:redis-oplog-compatible format. We have used it in production for years now, and I feel confident saying both are battle-tested.

Can we go even further? I’m glad you asked!

Gotta go fast

We’ve been hit with the oplog flood problem recently (again). Even with both cultofcoders:redis-oplog and oplogtoredis, a single “trivial” migration affecting millions of the most subscribed documents caused a significant slowdown of the system for an extended period.

Remember how I said we do have to query the database when a notification appears in Redis? Let’s look at the protectAgainstRaceConditions option. It says it will offer us “extreme speed” but may be problematic when you have “very large documents”. What does it do?

It publishes whole documents to Redis. Instead of pushing the notifications and then querying the database for the results, we push the actual changes. That’s why it offers better performance (no database lookup) but may cause problems with large documents (more data in Redis).
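Conceptually, the only difference is in what travels through Redis; again, the shapes below are illustrative, not the exact redis-oplog format:

```js
// Default mode: only a notification is pushed, and every interested server
// re-fetches the document from MongoDB afterwards.
const notification = { e: 'u', d: { _id: 'T1' } };

// With protectAgainstRaceConditions disabled: the whole changed document is
// pushed, so no database lookup is needed on the receiving side.
const fullDocument = { e: 'u', d: { _id: 'T1', title: 'Buy milk', done: true } };
```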

But hey, it doesn’t work with oplogtoredis…

Make it happen

What if it did? That would mean an external tool listens to the database and pushes the changes to Redis. And not only the changes – the entire documents. And that’s not what the oplog was built for – we’d need to build another mergebox-like structure for that!

But MongoDB has yet another feature: Change Streams. The Meteor community went crazy about them a while back, me included (you can find more details in meteor/meteor#11842). We can use them to get the latest documents and push them to Redis with virtually no overhead5. And I did.

That’s how changelog-to-redis was born. It’s a single-file Rust program that creates a change stream watcher and connects it with a Redis publisher. For performance reasons, there’s also a buffer, but that’s it. Oh, and it does support an ID-only mode, replicating the oplogtoredis behavior almost one to one6.
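changelog-to-redis itself is written in Rust, but the core mechanism can be sketched in a few lines of Node.js (illustrative only; the database name and message shape are made up):

```js
import { MongoClient } from 'mongodb';
import { createClient } from 'redis';

const mongo = await MongoClient.connect('mongodb://localhost:27017/?replicaSet=rs0');
const redis = createClient();
await redis.connect();

// Watch the whole database and ask MongoDB for the full document after each change.
const stream = mongo.db('app').watch([], { fullDocument: 'updateLookup' });

for await (const change of stream) {
  if (!change.documentKey) continue; // skip non-document events (e.g., collection drops)
  const collection = change.ns.coll;
  const id = change.documentKey._id;
  const message = JSON.stringify({ op: change.operationType, doc: change.fullDocument });
  await redis.publish(collection, message);
  await redis.publish(`${collection}::${id}`, message);
}
```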

It’s still at an early stage – there’s (almost) zero documentation, no automated tests, and only a single project currently using it. But as it passed an extensive E2E test suite from one app, I believe it’s good enough to give it a try. In the coming days, we plan to use it on all non-production environments instead of oplogtoredis and see how it performs.

Closing thoughts

That’s a lot of engineering effort to make Meteor scale. What if all of the above could be simplified? What if Meteor could replace the oplog tailing with change streams? That’s something I want to look into in the future.

Before changelog-to-redis was created, I was a little skeptical about this idea. Every change still has to be analyzed by all of the interested servers. But we could eliminate the mergebox completely and simply forward the changes from the database to the clients. Or do something smarter that I haven’t come up with yet.

Will it work? I have no idea. Will I give it a try? Definitely!

1

Thanks to SockJS, it can fall back to other transports (e.g., polling with XHR) while maintaining the same API. It’s not a strict necessity today, when WebSockets are supported in virtually all browsers and most of us have a solid internet connection, but it was a blessing a few (or ten) years back.

2

Since Meteor 2.4, we have other options. The default one (SERVER_MERGE) still holds everything in memory, but we can do the complete opposite and send everything (NO_MERGE_NO_HISTORY). Somewhere in between is NO_MERGE, which keeps track of additions and removals. For details, see the documentation.
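If I read the Meteor documentation right, a strategy can be picked per publication, roughly like this (the publication name is illustrative; check the docs for the exact API):

```js
import { Meteor } from 'meteor/meteor';
import { DDPServer } from 'meteor/ddp-server';

// Stop keeping a server-side copy of what this publication sends; only track
// additions and removals (NO_MERGE), trading memory for more data on the wire.
Meteor.server.setPublicationStrategy(
  'tasks.open',
  DDPServer.publicationStrategies.NO_MERGE,
);
```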

3

Other databases maintain a similar structure, though sometimes for other purposes. For example, PostgreSQL has a Write-Ahead Log (WAL). And while it serves a different purpose (data integrity, not replication), it would serve us too.

4

We could manage it without tailable cursors by refetching an offset-based query in a “get oplog entries newer than the last one we saw” manner. The performance would be similar, but it wouldn’t be real-time and would instead rely on some arbitrary refresh rate (polling interval).

5

While the MongoDB documentation on pre- and post-images says they have a high cost, I can’t say how high. We’ve been running that in production for over two years at this point, and we didn’t notice any issues.

6

Almost one to one, as oplogtoredis calculates the affected fields. However, this feature has been broken since MongoDB 6.0 due to the oplog format change, so I decided not to implement it at all. The app’s behavior didn’t change.