
On How Our OpenSearch Just Died

By Radosław Miernik

Intro

Have you ever had a chance to use a computer while it ran out of disk space? I only did a couple of times back when I was twelve years old, but I still vividly remember the crazy things happening – programs not starting or just crashing, loading states appearing out of thin air, and eventually, an inevitable shutdown. Luckily, booting from a pen drive and freeing some space worked.

And what if there were no computer to plug into? That’s the whole point of the cloud most of us are heavily invested in, right? It’s not a problem! Managed services like AWS OpenSearch can be scaled, restarted, and reconfigured with a few clicks.

Or so I thought.

Background

Let’s set the scene: a managed OpenSearch cluster with only a single r7g.large.search node (or r6g, I can’t remember now). We updated it rather frequently, even though the new versions brought nothing interesting for us. Everything had been working flawlessly for a couple of years.

As for the traffic, there was a constant influx of small documents (<1 kB on average) getting created and updated (no deletes). To be precise, the index was replicated from one of our MongoDB collections using monstache.
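
For the curious: the replication itself is fully handled by monstache, so the snippet below is only a rough Python sketch of the idea – tail the MongoDB change stream and mirror every insert or update into OpenSearch. The connection strings, collection, and index names are made up.

    from opensearchpy import OpenSearch
    from pymongo import MongoClient

    # Hypothetical names and endpoints -- the real ones live in monstache's config.
    mongo = MongoClient("mongodb://localhost:27017")
    search = OpenSearch(hosts=["https://localhost:9200"])
    collection = mongo["app"]["documents"]

    # Watch inserts and updates (we never delete) and mirror them 1:1.
    with collection.watch(full_document="updateLookup") as stream:
        for change in stream:
            document = change["fullDocument"]
            search.index(
                index="documents",
                id=str(document.pop("_id")),
                body=document,
            )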

The entire dataset was slowly but steadily growing over the years. From time to time, we saw spikes (e.g., when adding a new field), but all of them were expected and planned. The planning was needed not so much because of the data size itself, but because of the reindexing cost (it slows down the search).

The incident

One of the new features required adding a new field – nothing crazy, we do that rather often. This one was slightly special, as the migration calculating it for the existing documents was expected to take 2-3 hours (it did).

I started the migration and monitored the system for two more hours to make sure the extra database load didn’t slow down the rest of the system too much and, of course, to look out for potential problems. Everything was great: no performance issues, no errors, nothing worrying. So I clocked out.

The next day, I saw that the search was in a… really weird state. On the one hand, indexing didn’t work (constant 50x HTTP errors) and the cluster was in red status. On the other, searching worked for some customers, returning only partial results. For others, the results were empty.

You guessed it – we ran out of disk space. Literally two minutes later, a cluster update was already scheduled, with the only goal of throwing in more disk.
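
If you ever want to check this yourself, the standard cluster APIs are enough. Here’s a minimal sketch – the endpoint is a placeholder and I’m assuming you already have access to the domain – that reads the cluster health and the per-node disk allocation:

    import requests

    # Placeholder endpoint -- use your own domain (and authentication).
    ENDPOINT = "https://search-example.eu-west-1.es.amazonaws.com"

    # Overall status (green / yellow / red) and the number of unassigned shards.
    health = requests.get(f"{ENDPOINT}/_cluster/health").json()
    print(health["status"], health["unassigned_shards"])

    # Per-node disk usage; the disk watermarks kick in long before 100%.
    allocation = requests.get(
        f"{ENDPOINT}/_cat/allocation?v&h=node,disk.used,disk.avail,disk.percent"
    )
    print(allocation.text)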

The problem

I waited half an hour. All changes to OpenSearch clusters always took more than a while, so I assumed it was still in progress. An hour later, the update was still going. Around that time, I started writing down the alternatives.

Two hours in, we’re done waiting – we’ll set up a new cluster and scrap the old one. It’s not a problem, as the data is replicated from MongoDB, and we can always resync it if needed. (Even though full recovery will take a while.) In the meantime, the app was reconfigured to use a MongoDB-based search – the results were less accurate, but at least complete.
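
The fallback itself is nothing fancy. I won’t share our actual code, but a MongoDB text index gets you most of the way there. A rough sketch, with hypothetical field and collection names:

    from pymongo import MongoClient, TEXT

    collection = MongoClient("mongodb://localhost:27017")["app"]["documents"]

    # A one-off text index on the searchable fields (hypothetical names).
    collection.create_index([("title", TEXT), ("description", TEXT)])

    def search(query, limit=20):
        # $text is less accurate than a proper search engine, but it is complete.
        return list(
            collection.find(
                {"$text": {"$search": query}},
                {"score": {"$meta": "textScore"}},
            )
            .sort([("score", {"$meta": "textScore"})])
            .limit(limit)
        )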

Four hours after scheduling the reconfiguration, the old cluster was still stuck in the updating process. For some reason, the new cluster also took about an hour or an hour and a half to set up. We were (again) done waiting – let’s delete the first one. And you guessed it: that got stuck too. It hung for about three hours and then magically disappeared. Finally, we were done waiting.

Closing thoughts

First of all, yes, we had it coming – having zero disk-related alerts was a rookie mistake. This time, we created them right away. And sure, I should have monitored the disk during the migration as well.
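
In case you want to avoid our mistake: AWS OpenSearch Service reports a FreeStorageSpace metric to CloudWatch, and a single alarm on it is all it takes. A minimal boto3 sketch, assuming an existing SNS topic for notifications – the domain name, account ID, topic ARN, and threshold are all placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alert when the minimum free storage drops below roughly 20 GB.
    cloudwatch.put_metric_alarm(
        AlarmName="opensearch-free-storage-low",
        Namespace="AWS/ES",  # OpenSearch Service metrics still live in this namespace.
        MetricName="FreeStorageSpace",
        Dimensions=[
            {"Name": "DomainName", "Value": "my-domain"},   # placeholder
            {"Name": "ClientId", "Value": "123456789012"},  # AWS account ID
        ],
        Statistic="Minimum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=20 * 1024,  # FreeStorageSpace is reported in megabytes
        ComparisonOperator="LessThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alerts"],  # placeholder
    )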

Second of all… What the hell, AWS!? I know we messed up big time, but I doubt the cluster would have updated even if we had waited longer. I’m still a big fan of managed services and the cloud in general, but days like this are tough.

I wonder whether this stuck cluster actually got deleted or whether it simply disappeared from our account… I guess we’ll never know.

Let’s hope it did.