@radekmieBy Radosław Miernik · Published on · Comment on Reddit
While working on my PhD, I had to implement an interpreter of the language we’re working on. At the very core of it are finite automata and finite state machines, as well as tons of operations on them. In practice, it’s a directed graph with operations on the edges (assignments, comparisons, etc.).
As I’m most familiar with JavaScript, I implemented the first of the interpreter version in it. It allowed me to have a working version rather fast; as a bonus, it works in the browser, so it’s much more accessible to other people. However, at some point, the performance was not good enough to work interactively.
Because of that, I decided to implement the same interpreter in a low-level language. As the automata definition was a simple JSON object, I decided to go with Rust I was already familiar with. Thanks to Serde, I had a 1-to-1 matching between the TypeScript and Rust structures within minutes.
In this text, I’ll focus on the performance of one operation that had problems in JavaScript – checking reachability. In the future, I plan to dive more into the whole project and what the rest of the code looks like.
Shall we?
We’ll need a couple of definitions first. An Automaton (graph) is a collection of Edges. Every Edge connects two Nodes with a label. It has only one, rather intuitive operation: add_edge. The exact representation of it is not relevant to the algorithm, so we can use whatever we find suitable.Example automaton.
There’s also State, which defines the state of the automaton we’re in. Most importantly, there’s the position of it, which points to one of the Nodes. It has one crucial operation – next_states – which is used to iterate over the following states, i.e., the states that can be reached from the current one. That means it has to check all of the edges that connect the position with other nodes and check whether it’s allowed in the current state.
An example usage of next_states is the dreaded is_reachable function, which checks, whether it’s possible to traverse the graph to a specified Node. It’s a recursive operation, potentially traversing the entire graph. In practice, it’s only a small subset of it, but it’s called very, very often.
Let’s get to coding and basic type definitions. They’re the same for all versions so I won’t duplicate them in each. It’s a simplified version, but enough for the rest of the examples to run.
// `pub` means this type can be imported in other modules. The module
// system in Rust is slightly different, but think of it as of `export`
// in JavaScript.
// There's no information stored in the node, so a number is enough.
pub type Node = usize;
// Directed pair of nodes.
// A state of the automaton. In reality, there's much more data here.
// The `derive` attribute here generates a `default` function that
// creates an empty state, i.e., a state that is in `0`.
Now, let’s start with the most straightforward implementation I can imagine. A one I believe anyone who has never worked with Rust would implement.
As you can see, it’s only a couple of lines of code – nothing crazy, nothing “smart” your future self could ponder about. But as you see, there’s a very obvious performance problem there – we’re constantly switching back and forth between Vec and Iterator.
In Rust, Iterator is not a type of object, but rather a trait; sort of an interface in the TypeScript world. That means there’s some object there, but all we know about it is that we can iterate over it. We could skip the transformation, but then next_states would have to return something we’re not so sure of. Of course, it’s possible, but instead, let’s create our custom object we’ll implement the Iterator trait for.
To implement a trait, we need something to implement it for. Let’s create one that could be one’s first guess – a struct that stores the list of Edges we have to go through. Having done that, we’ll implement the Iterator trait and update the rest of the code accordingly.
It’s a very straightforward (and naive) wrap of our previous implementation into an iterator object. Definitely not a performant one (we’ll get to that later), but it allows us to improve on the iterator independently from the rest.
I said before, that the problem with the first implementation was constant transforming between Vectors and Iterators. It’s still there, as we have to .collect() them, i.e., create new Vectors when passing them down to the StateIterator object. Can we do better? Instead, we could reference the entire edges list in there and only store the index we’re still checking.
Is it better? Intuitively yes, because there are fewer Vectors created here. And as we know, fewer objects mean fewer operations, and that leads to a better performance. What’s next?
Even without checking the implementation of .get we can guess, the more the index grows, the more operation it needs. Can we somehow eliminate it?
Enter slices! A slice is a view into a contiguous sequence (of some memory). The API of it allows us to peek at the elements there and advance it (i.e., point to a later element).
Is it better? We’ll check the performance later, but intuitively it should be better, as there’s one field less in the StateIterator struct, and we’re no longer .geting the far elements but rather always only the first one.
What else could we do? We still have a problem: if there are hundreds of thousands of Edges there, every iterator has to go through all of them. That could be eliminated by storing them differently. Like, in a…
As I said at the beginning, the Automaton is built once, and then we iterate over it a lot. If we make the next_states faster at the cost of the add_edge being slower, it’s most likely going to be a huge win.
Let’s do that then – Automaton should store the Edges in a way that’ll allow us to get the outgoing ones fast. Let’s use a BTreeMap, i.e., a tree-based key-value store1. A key will be the lhs of an Edge, and the value will be a Vector of the outgoing Edges.
use BTreeMap;
Intuitively, it should be much faster, as the creation of the StateIterator scales with the number of unique Nodes and not Edges. Additionally, the iteration itself is now direct, i.e., there’s no need to check if the lhs equals our position. That should be fast… I think.
Finally, we’ve arrived at what seems a sensible and performant approach. Of course, we’re responsible people, so instead of relying solely on a gut feeling, we’ll create a replicable benchmarking suite to compare the implementations. To do that, we’ll use the Criterion package2.
// Configure Criterion.
criterion_group!;
criterion_main!;
Having done that, we have to add the dependency and basic configuration in our Cargo.toml, and we’re good to go. Let’s run cargo benchmark, chart our results3 and see how each version performs on my M2 MacBook Air.
Well… That’s not something you expected, huh? If you think about it, the sheer number of copies makes the creation of iterators extremely slow. Actually, it was so slow I lost patience while waiting for it to finish and decided to skip it entirely. Let’s remove all but the first point of version 2 and see the chart again.
Much better! First of all, we see that versions 1 and 2 perform similarly when the branching factor is 1. That makes perfect sense, as next_states always creates a vector of length 1, and is_reachable immediately consumes it.
Secondly, versions 3 and 4 are basically identical in terms of performance. Advancing a slice is as simple as moving the pointer, so version 3 tends to be marginally slower as it has to index it from the start at every step instead (and it’s visible on the chart).
Lastly, version 5 outperforms all of them. That’s because the cost of navigation in such a small tree is negligible (the biggest automaton I’ve worked with so far had a few thousand states). As a result, the performance of creating an iterator scales with the branching factor and not the number of edges4. Please note that it’s formatted slightly differently, according to the standard Table with the complete results.
Version Branching Min Avg Max 1 1 2.627µs 2.629µs 2.631µs 2 1 2.656µs 2.659µs 2.660µs 3 1 881.4ns 882.7ns 884.0ns 4 1 830.8ns 831.9ns 833.1ns 5 1 618.3ns 618.7ns 619.1ns 1 2 4.839µs 4.851µs 4.863µs 2 2 20.63ms 20.66ms 20.70ms 3 2 1.488µs 1.490µs 1.492µs 4 2 1.395µs 1.395µs 1.396µs 5 2 615.6ns 616.6ns 617.5ns 1 3 5.895µs 5.912µs 5.930µs 2 3 949.4ms 949.9ms 950.5ms 3 3 2.041µs 2.048µs 2.056µs 4 3 2.033µs 2.035µs 2.038µs 5 3 612.9ns 614.1ns 615.4ns 1 4 7.216µs 7.244µs 7.274µs 2 4 4.8150s 4.8300s 4.8580s 3 4 2.553µs 2.555µs 2.556µs 4 4 2.491µs 2.493µs 2.494µs 5 4 611.6ns 612.7ns 613.9ns 1 5 10.68µs 10.71µs 10.75µs 2 5 ??? ??? ??? 3 5 3.084µs 3.087µs 3.090µs 4 5 2.990µs 2.991µs 2.992µs 5 5 610.2ns 611.0ns 611.9ns Entire source code, including the benchmark.
cargo fmt rules. It also passes the cargo clippy linter. Rust v1.65.0.use ;
pub type Node = usize;
criterion_group!;
criterion_main!;
That’s a wrap. It was extremely interesting and educative to me. I didn’t use Rust’s slices that much, and here they really shined. The final performance is more than enough, and I’m happy with it for now.
If you’re interested, the complete interpreter in Rust is between 5 and 20 times faster than the JavaScript one. There are definitely more improvements to come, but for now, the performance is fine.
Now, back to work!
We could also use HashMap. It doesn’t matter for us API-wise, but it turned out to be slower, at least in my small-scale tests. I guess it’d be beneficial if the number of Nodes were much higher or the Node itself was bigger.
I’d rather use the built-in #[bench] attribute, but it remains unstable. On the other hand, Criterion provides far more detailed feedback and provides tools to measure how code changes impact the benchmarks.
I wanted to style the charts differently as well as inline them into the post, so I used Vega instead of Criterion’s charts. It’s a powerful tool that allows you to “code” your charts, and I strongly recommend it!
As you can see, the time actually goes down with the branching factor, which is at least weird, but I was able to consistently reproduce it. At first, I thought that’s because the shortest path is shorter, but that shouldn’t matter, as the iteration is depth-first and always finds the longest path.