By Radosław Miernik
A few months back, in On How We Moved to Kubernetes, I mentioned that we were looking into implementing distributed tracing in our app. Now that we’ve managed to do so, I’d like to walk you through our setup and explain why I’d recommend it to everyone (and as soon as possible).
Let me hook you with the most immediate benefit: we have a much easier time debugging API requests that involve multiple services; I’ll explain why below. At the same time, we reduced the number of log lines by roughly 85%.
First of all, let’s start with the basics: meet OpenTelemetry. If you have heard of OpenCensus or OpenTracing and are wondering which one to choose, don’t worry; there’s no decision to make, as the first one is a product of merging the other two a couple of years back [1, 2, 3].
In short, OpenTelemetry is an observability framework and toolkit. It consists of a lot of specifications (e.g., a communication protocol), conventions (e.g., what attributes a GraphQL server should add), libraries (e.g., @opentelemetry/api for Node.js), and tools (e.g., the OpenTelemetry Collector).
It sounds like a lot, I know. Luckily, you don’t need to go through all of it at once! For now, remember this: there are hundreds of libraries, tools, and conventions, but most of them just work, straight out of the box. It’s entirely up to you how much information you want to collect, so consider almost everything to be opt-in. (We started back in February and still tweak it every week.)
Let’s start with the basics: we’d like to collect traces from a single service and be able to see them somewhere. But what are traces, really? Here’s an analogy that worked for my teammate: a log is a bunch of information with a timestamp (i.e., the time it was logged), a span is a log with a duration (or, equivalently, an “end” timestamp), and a trace is a collection of nested spans. Here’s an example:
The above trace contains 31 spans across three services (names in bold). At first glance, we can see that the router service executed an HTTP request against core.server, and then another one against marketplace.pms. It’s clear where the time is spent and how much overhead is added at each step.
Furthermore, if we collapse the prisma:client:operation spans representing a database call (at the bottom), it’s clear that they were executed in parallel, unlike the HTTP requests.
Now, how do we know what exactly happened? Well, that’s where the span attributes come in. Let’s see inside one of the prisma:engine:db_query spans:
We can see the executed SQL in the db.statement attribute. Every span can have as many attributes as needed (though too many may not be a good idea performance-wise), each storing a boolean, a number, a string, or an array of these.
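Adding your own attributes is a one-liner with @opentelemetry/api. Here’s a minimal sketch; the attribute names are made up for illustration:

import { trace } from '@opentelemetry/api';

// Attach custom attributes to whatever span is currently active (if any).
// Values can be booleans, numbers, strings, or arrays of these.
trace.getActiveSpan()?.setAttributes({
  'app.tenant.id': 'acme',
  'app.cache.hit': false,
  'app.retry.count': 2,
});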
How is that any better than going through the logs of these services?
Sure, some traces are massive (like, tens of thousands of spans), but that also means something: maybe the business process should be split into multiple ones, or at least traced independently. On the other hand, it’s satisfying to scroll through one and see how your app performs under such load.
If you’re already sold and would like to give it a try… Where’s the catch? If you like the above, well, there’s none – that’s just the default configuration. It’s a combination of Apollo Router, Prisma, and @opentelemetry/sdk-node. We were pleasantly surprised that the tools we’re using required only some minor changes, and they started tracing everything automatically.
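For reference, a minimal Node.js setup with @opentelemetry/sdk-node looks roughly like this – a sketch, not our exact configuration; the service name and Collector URL are placeholders:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Register the SDK before the rest of the app loads, so the
// auto-instrumentations can patch the relevant modules.
const sdk = new NodeSDK({
  serviceName: 'core.server',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();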
Where do we send them, though? In the previous text, I mentioned Jaeger and that we have some experience with it. However, we decided to go all-in with Grafana’s stack and went with Tempo instead – that’s also where all the above screenshots are from. (There are plenty more Grafana tools to use!)
However, as we instrumented more and more services, we wanted to get rid of some of the span attributes we didn’t need (e.g., network.protocol.version). That’s why we added the OpenTelemetry Collector¹. It aggregates everything, applies all sorts of filters and transforms, and then sends batches to Tempo.
Of course, the deeper you dig, the more customization you need. What we did so far: we disabled quite a lot of the built-in Node.js instrumentation (it was too verbose for our taste), pruned a lot of attributes from the rest (38 attributes in total), filtered out entire spans (e.g., the OPTIONS requests), and marked some of the errored spans as OK (e.g., if they failed during the Automatic Persisted Query “handshake” flow). Again, I know it’s a lot, but the point is that you can adjust it to your needs, whatever these are.
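To give you a feel for the SDK side of that tuning, here’s a hypothetical sketch – the instrumentation names and hooks below are illustrative, and the same trimming can also happen in the Collector instead:

import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Passed to the NodeSDK's `instrumentations` option (see the snippet above).
const instrumentations = getNodeAutoInstrumentations({
  // Some built-ins are far too verbose to keep around.
  '@opentelemetry/instrumentation-fs': { enabled: false },
  '@opentelemetry/instrumentation-net': { enabled: false },
  // Skip spans for CORS preflight requests entirely.
  '@opentelemetry/instrumentation-http': {
    ignoreIncomingRequestHook: (request) => request.method === 'OPTIONS',
  },
});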
There were only two things that didn’t go that smoothly:
The AWS Lambda integration. We had some serious issues with setting up the layer properly, as it only sometimes sent the traces. Luckily, building the layer from source with some configuration tweaks on top fixed it.
We’re stuck on Meteor 2.16. The problem is that Fibers mess with @opentelemetry/context-async-hooks, making it lose context at almost every async operation, leading to orphaned spans. If you’re in the same position, feel free to copy the code below.
// It's not exported by default.
import { AbstractAsyncHooksContextManager } from '@opentelemetry/context-async-hooks/build/src/AbstractAsyncHooksContextManager';
// @ts-expect-error No types.
import Fibers from 'fibers';

export class FiberContextManager extends AbstractAsyncHooksContextManager {
  // Store the active Context on the current Fiber, then implement active(),
  // with(), enable(), and disable() on top of it.
}
Traces are sent and accessible in Tempo, nice! Can we do more with them, though? Tempo’s metrics generator is the answer, as it generates metrics (duh) based on the traces. Metrics are simply numeric values captured over time, so something we can put on a chart.
While it only calculates the spans’ duration and count over time, it’s definitely enough. As each metric can be annotated with labels, you can chart the number of all spans vs errored ones, or the p50/p95/p99 duration of spans with certain names (i.e., operations). With custom labels, you can break down these numbers by tenant, version, or a feature flag.
It can also generate a service graph, presenting the topology of your system. It’s nice, though in our case it wasn’t that interesting, as we have a “main” service talking to everything else (that’s a consequence of using a federated graph).
What do we do with these metrics, then? Well, all sorts of things: error details (where and what fails), sent webhook statistics (counts by type/tenant/URL), GraphQL and REST API timings (heatmaps). I really like the idea of generating these out of traces, as we don’t have to measure them separately – neither based on logs nor with explicit in-app tracking.
The last step of our tracing journey would be getting rid of at least some logs. We found that it’s much simpler to wrap a piece of code in a dedicated span with custom attributes, as such a span is automatically correlated with the ongoing API request. It also provides more information by default, thanks to its duration.
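A minimal sketch of what such a wrapper can look like with @opentelemetry/api – the tracer, span, and attribute names are made up, and the webhook itself is just an example:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('webhooks');

// Wrap an outgoing webhook in a dedicated span; it automatically becomes a
// child of whatever span is active for the ongoing API request.
export async function sendWebhook(url: string, payload: unknown) {
  return tracer.startActiveSpan('webhook.send', { attributes: { 'webhook.url': url } }, async (span) => {
    try {
      const response = await fetch(url, { method: 'POST', body: JSON.stringify(payload) });
      span.setAttribute('webhook.status_code', response.status);
      return response;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}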
However, there are still some cases where you’d like to log something. In such cases, I highly recommend logging the trace ID as well – it’ll make the later investigation trivial. And Tempo can link to it automatically, too.
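Getting the trace ID of the ongoing request is, again, a one-liner with @opentelemetry/api (the log call itself is just an example):

import { trace } from '@opentelemetry/api';

// Include the trace ID in the log line, so the trace can be found from the
// log (and vice versa).
const traceId = trace.getActiveSpan()?.spanContext().traceId;
console.warn('Webhook delivery failed', { traceId });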
As I already mentioned, we reduced the number of logs by roughly 85%. That’s because previously, we logged some metadata about all GraphQL and REST API requests, as well as the sent webhooks. Overall, we store slightly more information (byte-wise), but the debugging experience is definitely worth it.
The world of observability is vast, and I’m new to it, too! But, my dear reader, given the benefits, I will evangelize it almost as hard as peer review (yes, not everyone’s doing it). The fact that it took just a couple of weeks to set up in a decade-old app proves that everyone can start with it relatively fast.
We will continue adding more spans, attributes, links, span events… Whatever is needed to make our lives easier. Am I worried about adding too much or not adding enough? Well, we had the same dilemmas with logs, so I’m good.
Now excuse me, I have some traces to review.
¹ Yes, I said we’re all-in with Grafana and that we use OpenTelemetry Collector instead of Grafana Alloy in one paragraph. That’s because we learned about it only after we had the former already running. However, as Alloy combines both Prometheus and Collector in one service, we may revisit it in the future.