On How We Moved to Kubernetes

By Radosław Miernik · Published on 25 February 2025 · Comment on Reddit

Intro

Have you heard of Kubernetes (also known as k8s)? Until a few months back, I knew it existed and that it was like infrastructure’s holy grail. It has to cover the basics, like auto-scaling and load balancing or automated rollbacks… And then there are millions of tools to build on top of it.

As we recently migrated our deployment from AWS Elastic Container Service (ECS) to AWS Elastic Kubernetes Service (EKS; managed Kubernetes cluster), I wanted to share some tips. It also feels nice to do that on the 10th anniversary of “Kubernetes: The Future of Cloud Hosting” MeteorHack’s blog post.

Please keep in mind that a Kubernetes cluster is an extremely complex beast, and I’m pretty far from being able to explain all the “whys” you may have. Our amazing DevOps Engineer managed to make it work, and I’m really happy with the current setup. Both because the app performs better at a lower cost and because I learned a lot along the way.

Motivation

Let’s start with our production ECS setup:

The Apollo Router cluster.
The main application cluster, split into two services (API and WebSocket). The idea is to split the load due to a drastically different characteristic; API traffic is CPU-bound, and WebSocket traffic is memory-bound.
The “extra apps” cluster, i.e., monstache (syncing data from MongoDB to ElasticSearch in real-time) and changestream-to-redis with a tiny Redis instance (for cultofcoders:redis-oplog).

Then, for the non-production environments, take the above and multiply by seven (some of the services were shared, but that only makes it worse). All of it was managed using AWS CDK, i.e., we defined almost all infrastructure parts in TypeScript (some were done using Terraform a long time ago).

With ECS, you decide on what your clusters are hosted on: AWS Fargate or Amazon EC2. In short, the former works on abstract hardware defined by the number of CPU cores and gigabytes of RAM you need, and in the latter, you have to choose from the myriad of EC2 instance types.

We went with Fargate, as it was easier to adjust when needed. As a cost-reducing measure, we used the spot launch type, i.e., highly discounted compute that cannot be taken for granted. It’s not as bad as it seems: ~65% lower cost on average at the cost of the containers restarting whenever AWS needs them.

So what was wrong? Tuesdays. And I’m serious! On Tuesdays, around noon our time, almost all of our instances were interrupted (i.e., AWS needed them). And what’s worse, we usually get “Capacity is unavailable at this time. Please try again later or in a different availability zone” error too. What does it mean?

It means your app is down. There are two containers running, even though you requested more than twenty. Yes, we tried switching back to the standard, non-spot instances. Yes, we tried adjusting the CPU-to-RAM ratio (maxing out RAM helped a little). Yes, we were using all availability zones.

And yes, your clients don’t care.

Motivation (again)

Before we started migrating the app, we had a different goal in mind. As our end-to-end test suite grew, so did the time it took. We were already running it in parallel, but increasing the parallelization also increased the overhead (some work is needed on every worker).

At some point, we moved from the GitHub’s Standard Runners to the “larger runners” (still hosted by GitHub). Obviously, it helped a lot, but it came at a price (quite literally). It was fine for the time being, but when our bill reached a certain level, we decided to do something about it.

Self-hosted runners felt like an ideal solution, but the machines would run idle at night or on the weekends, right? Alternatively, we’d have to configure auto-scaling somehow… But it sounds like a lot of work on something our customers won’t really see or benefit from.

That was a great point to start our Kubernetes journey – self-hosting the Actions Runner Controller. It’s a small utility service that listens to your CI runs and manages the nodes (servers) as needed.

Cluster overview

For starters, here’s the list of things running in our cluster:

Actions Runner Controller, along with its runners (when needed).
Argo CD, which serves as a cluster configuration frontend for some of the services. It also manages deployment – all you need to do is commit and push the desired version to a repo.
Cert Manager for, well, managing certificates.
OAuth2 Proxy for easy authorization of internal services using OAuth2.
External Secrets for direct access to the AWS Secrets Manager within the resource definitions.
Flux for continuous delivery of everything that’s not deployed with Argo (including Argo itself).
Grafana + Loki + Prometheus (with its Node Exporter) + fluentbit as our monitoring stack. We’re also looking into integrating tracing into our app, so expect Jaeger¹ to appear here soon, too.
Ingress NGINX Controller as a reverse proxy and a load balancer.
Karpenter, which provisions nodes in real-time. Its superpower is that it can check the pricing of the instances and choose the cheapest ones to cover your cluster’s needs. Amazing thing.
Keda for advanced auto-scaling, e.g., based on AWS SQS queue size. Don’t mistake it with the NGINX Ingress Controller!
Spegel for Docker registry caching.

And, of course, the application and its services, just like on ECS. That’s a lot, huh? It sure is! And trust me – it takes a lot of configuration to get it up and running. Then double that to rightsize all these services.

This shows how much managed platforms like Galaxy or Heroku actually hide from you (at a cost). In our case, it was worth the time spent on it and will pay off in less than a year (even with no growth in traffic).

The Big Switch

Migrating the non-production environments was easy since we were fine with some downtime. First, we started the Kubernetes services, then switched the DNS, and then switched off the ECS ones. In theory, this is a really safe, zero-downtime switch. And it was for non-production environments.

In production, we had a lot of issues from a rather unexpected side. There were two issues: firewall and instance type. The former is obvious: a restrictive ModSecurity configuration was an honorable but not wise choice. In total, we had to disable only a handful of rules, mostly related to incoming payloads².

The latter took us a few days of tweaking. You see, in ECS, all you can do is say “I need X CPUs and Y GBs of RAM”. We did the same with Karpenter, and it did exactly that (while optimizing the cost). The problem is that not all CPUs are equal, and it will impact your app’s performance. I highly recommend checking out Vantage’s Instances table for a nice summary.

In hindsight, we should have kept the old cluster for longer to transition the traffic more steadily. Maybe that’d buy us enough time to sort it out without such a severe downtime. We also should have started with a less restrictive firewall. Next time, we’ll all know it.

Tweaking period

Once we calmed everything down, we started tweaking. We have literally thousands of knobs to ~~play with~~ adjust! But in practice, it’s more like snakes and ladders – some will improve your app’s performance by 20% while others will take it down (seemingly for no reason).

One of the benefits of our new infrastructure is that I can easily check the history of our entire configuration. I did, and I was surprised that we did more than a hundred changes within the first month. Of course, they got smaller and smaller over time, but you get the idea.

On the other hand, rightsizing is really rewarding – seeing your cloud provider bill and the response times go down at the same time is an amazing feeling.

A quick round of things we did miss before The Big Switch:

Configure Karpenter’s disruption to prevent severe changes in high time.
Configure VPC peering with MongoDB Atlas. We had it on ECS but forgot it for the new VPCs, resulting in slightly longer network round trips and a higher data transfer bill.
Fix typos in the environmental variables. Yeah, it happens to all of us.
Increase the upper limit of auto-scaling. We started with a really low one for non-production environments and blindly copied it for production.
Switch from container_memory_usage_bytes (i.e., app’s memory including evictable caches) to container_memory_working_set_bytes (i.e., what will cause the out-of-memory error) for auto-scaling. I highly recommend this blog post for more details.

What about Meteor?

Now that you know how we got here, let’s talk about Meteor-specific stuff. First of all, the app has to have a Docker image, as that’s what we operate on in Kubernetes. But that’s basically given nowadays, so I’ll skip the details.

With a Docker image in hand, you could create a Kubernetes service directly, but I’d rather recommend basing off a Helm chart. In short, Helm takes care of some boilerplate and will make it easier to extend your service configuration with auto-scaling and similar.

That’s it, you’re done – you can deploy your app to a Kubernetes cluster, just like any other service. There’s really nothing special about Meteor or Node.js here – it works just like every piece of software you can deploy there.

“But Meteor needs sticky sessions!” If your browser doesn’t support WebSockets, SockJS falls back to XHR requests (fetch predecessor). In the past, when WebSockets were not widely supported, it was the only way. But if you don’t expect your clients to use unsupported browsers, then you’ll be just fine with DISABLE_SOCKJS=1 and without sticky sessions³.

What I recommend is using @meteorjs/ddp-graceful-shutdown (or doing the same manually). When the container receives a SIGTERM, it will disconnect users in batches instead of all at once. It helps with smoothing out the traffic spike when scaling down or during deployment.

Another suggestion is to separate the WebSocket containers from the rest. In our case, we have them separated on the ingress level: all /websocket and /sockjs (just in case) traffic goes to one group, and everything else (GraphQL API, REST API, etc.) goes to the other. (We also disabled cron jobs on the former.) As I said earlier, these traffics are completely different, and it helps to keep their metrics separate (as well as scale them independently). All this with one codebase and one Docker image.

Memory allocators

Most Node.js developers don’t know what memory allocator they’re using in their app. Even if they do, they rarely experiment with alternatives. And oh boy, there is plenty to choose from: jemalloc, mimalloc, tcmalloc…

While the API containers were well-behaved, the WebSocket containers never really reclaimed memory. With an application this complex, it’s really hard to pinpoint why. I investigated memory snapshots from production⁴, but nothing was really standing out… Because there was no memory leak to begin with.

Freeing up memory takes some CPU, so it’s smart not to do that eagerly, right? It only happens when the memory is running low (i.e., you’re close to the default limit of 2GB or whatever you set in max-old-space-size). But then it’s often too late – ~80% of CPU will be spent on garbage collection, and your app will become unresponsive. (That sometimes happened to us in ECS.)

My past experience with jemalloc suggested it would solve this problem by itself, and… It just did. Like, we deployed it with the default configuration and our average memory usage dropped by almost 20%. But more importantly, it went down when the traffic did, allowing the app to scale down for the night.

If you never tried it, do give it a try. We’re using Alpine Linux in our Docker image, and it was as easy as installing it (RUN apk add --no-cache jemalloc) and then preloading (ENV LD_PRELOAD=/usr/lib/libjemalloc.so.2). Different distros have similar configuration.

Closing thoughts

That was a long one! Well, like our infrastructure switch. Overall, it took us four months (not full-time, of course). First one was for the basic setup and CI. Next one for the non-production environments. Third was for tweaking it and preparing for The Big Switch. And the last one was tweaking it afterward.

Would I do it again? Definitely! Both the cost and performance are better than we expected. I know we could have done it better, but that’s past now. I’m also happy with how much I learned from it – it was nice to feel completely lost.

Now let’s work on those spans…

I know there’s also Tempo from Grafana, but we’ll most likely stick to what we have in-house experience with. I plan to write a blog post on tracing, too – will share the decision there.

Did you know that the Accept-Charset is deprecated? We didn’t, but our firewall did. And it didn’t like it.

RFC 9110: Note: Accept-Charset is deprecated because UTF-8 has become nearly ubiquitous and sending a detailed list of user-preferred charsets wastes bandwidth, increases latency, and makes passive fingerprinting far too easy (Section 17.13). Most general-purpose user agents do not send Accept-Charset unless specifically configured to do so.

With XHRs, sticky sessions are a must, as that’s the only way to make sure every request will reach the same server.

⁴

Did you know that you can attach a debugger to a running Node.js process without the --inspect flag? The SIGUSR1 signal does exactly that.