By Radosław Miernik · Published on
While working on my PhD, I had to implement an interpreter of the language we’re working on. At the very core of it are finite automata and finite state machines, as well as tons of operations on them. In practice, it’s a directed graph with operations on the edges (assignments, comparisons, etc.).
Because of that, I decided to implement the same interpreter in a low-level language. As the automata definition was a simple JSON object, I decided to go with Rust, which I was already familiar with. Thanks to Serde, I had a 1-to-1 mapping between the TypeScript and Rust structures within minutes.
We’ll need a couple of definitions first. An `Automaton` (graph) is a collection of `Edge`s. An `Edge` connects two `Node`s with a label. The `Automaton` has only one, rather intuitive operation: `add_edge`. Its exact representation is not relevant to the algorithm, so we can use whatever we find suitable.
Finally, there’s the `State`, which defines the state of the automaton we’re in. Most importantly, there’s its `position`, which points to one of the `Node`s. It has one crucial operation – `next_states` – which is used to iterate over the following states, i.e., the states that can be reached from the current one. That means it has to check all of the edges that connect the `position` with other nodes and verify whether each is allowed in the current state.
An example usage of `next_states` is the dreaded `is_reachable` function, which checks whether it’s possible to traverse the graph to a specified `Node`. It’s a recursive operation, potentially traversing the entire graph. In practice, it visits only a small subset of it, but it’s called very, very often.
Let’s get to coding, starting with the basic type definitions. They’re the same for all versions, so I won’t duplicate them in each. It’s a simplified version, but enough for the rest of the examples to run.
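The original definitions aren’t reproduced in this text, so here’s a minimal sketch of what they could look like (`Node = usize` and the names `add_edge`, `lhs`, and `position` come from the article; everything else, including the omission of labels, is my simplification):

```rust
// Simplified core types; the edge label and its operations are omitted,
// as they don't affect the iteration logic discussed below.
pub type Node = usize;

pub struct Edge {
    pub lhs: Node,
    pub rhs: Node,
}

pub struct Automaton {
    pub edges: Vec<Edge>,
}

impl Automaton {
    pub fn new() -> Self {
        Automaton { edges: Vec::new() }
    }

    // The only operation on the automaton itself.
    pub fn add_edge(&mut self, lhs: Node, rhs: Node) {
        self.edges.push(Edge { lhs, rhs });
    }
}

pub struct State {
    pub position: Node,
}

fn main() {
    let mut automaton = Automaton::new();
    automaton.add_edge(0, 1);
    automaton.add_edge(1, 2);
    println!("edges: {}", automaton.edges.len());
}
```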
Now, let’s start with the most straightforward implementation I can imagine. One I believe anyone who has never worked with Rust would write.
As you can see, it’s only a couple of lines of code – nothing crazy, nothing “smart” your future self would have to puzzle over. But as you see, there’s a very obvious performance problem there – we’re constantly switching back and forth between `Vector`s and `Iterator`s.
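A sketch of such a first version, with the simplified types repeated so the snippet is self-contained (the recursion assumes an acyclic graph; a real implementation would also guard against cycles):

```rust
pub type Node = usize;

pub struct Edge { pub lhs: Node, pub rhs: Node }
pub struct Automaton { pub edges: Vec<Edge> }
pub struct State { pub position: Node }

impl Automaton {
    // Version 1: filter the edges, then collect them into a brand-new Vec.
    pub fn next_states(&self, state: &State) -> Vec<State> {
        self.edges
            .iter()
            .filter(|edge| edge.lhs == state.position)
            .map(|edge| State { position: edge.rhs })
            .collect()
    }

    // Depth-first reachability, calling next_states over and over.
    pub fn is_reachable(&self, state: &State, target: Node) -> bool {
        state.position == target
            || self
                .next_states(state)
                .iter()
                .any(|next| self.is_reachable(next, target))
    }
}

fn main() {
    let automaton = Automaton {
        edges: vec![
            Edge { lhs: 0, rhs: 1 },
            Edge { lhs: 1, rhs: 2 },
            Edge { lhs: 1, rhs: 3 },
        ],
    };
    println!("{}", automaton.is_reachable(&State { position: 0 }, 3));
}
```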
An `Iterator` is not a type of object, but rather a trait – sort of an interface in the TypeScript world. That means there’s some object there, but all we know about it is that we can iterate over it. We could skip the transformation, but then `next_states` would have to return something we’re not so sure of. Of course, it’s possible, but instead, let’s create a custom object we’ll implement the `Iterator` trait for.
To implement a trait, we need something to implement it for. Let’s create one that could be one’s first guess – a struct that stores the list of `Edge`s we have to go through. Having done that, we’ll implement the `Iterator` trait and update the rest of the code accordingly.
It’s a very straightforward (and naive) wrap of our previous implementation into an iterator object. Definitely not a performant one (we’ll get to that later), but it allows us to improve on the iterator independently from the rest.
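A sketch of this naive wrapper (again with the simplified, label-free types; field names are my guesses):

```rust
pub type Node = usize;

#[derive(Clone)]
pub struct Edge { pub lhs: Node, pub rhs: Node }
pub struct Automaton { pub edges: Vec<Edge> }
pub struct State { pub position: Node }

// Version 2: an iterator object owning a freshly collected Vec of edges.
pub struct StateIterator {
    edges: Vec<Edge>, // an owned copy -- this is the expensive part
    index: usize,
}

impl Iterator for StateIterator {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        let edge = self.edges.get(self.index)?;
        self.index += 1;
        Some(State { position: edge.rhs })
    }
}

impl Automaton {
    pub fn next_states(&self, state: &State) -> StateIterator {
        StateIterator {
            // Still filtering and collecting into a new Vec, as before.
            edges: self
                .edges
                .iter()
                .filter(|edge| edge.lhs == state.position)
                .cloned()
                .collect(),
            index: 0,
        }
    }

    pub fn is_reachable(&self, state: &State, target: Node) -> bool {
        state.position == target
            || self
                .next_states(state)
                .any(|next| self.is_reachable(&next, target))
    }
}

fn main() {
    let automaton = Automaton {
        edges: vec![Edge { lhs: 0, rhs: 1 }, Edge { lhs: 1, rhs: 2 }],
    };
    println!("{}", automaton.is_reachable(&State { position: 0 }, 2));
}
```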
I said before that the problem with the first implementation was the constant transformation between `Iterator`s. It’s still there, as we have to `.collect()` them, i.e., create new `Vector`s, when passing them down to the `StateIterator` object. Can we do better? Instead, we could reference the entire `edges` list in there and only store the index we’re currently checking.
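A possible shape of this index-based version (a sketch; the simplified types are repeated for completeness):

```rust
pub type Node = usize;

pub struct Edge { pub lhs: Node, pub rhs: Node }
pub struct Automaton { pub edges: Vec<Edge> }
pub struct State { pub position: Node }

// Version 3: borrow the automaton's edge list and remember an index into it.
pub struct StateIterator<'a> {
    edges: &'a Vec<Edge>,
    index: usize,
    position: Node,
}

impl<'a> Iterator for StateIterator<'a> {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        // Filtering now happens lazily, one edge at a time.
        while let Some(edge) = self.edges.get(self.index) {
            self.index += 1;
            if edge.lhs == self.position {
                return Some(State { position: edge.rhs });
            }
        }
        None
    }
}

impl Automaton {
    pub fn next_states(&self, state: &State) -> StateIterator<'_> {
        // No Vec is created here anymore.
        StateIterator { edges: &self.edges, index: 0, position: state.position }
    }

    pub fn is_reachable(&self, state: &State, target: Node) -> bool {
        state.position == target
            || self
                .next_states(state)
                .any(|next| self.is_reachable(&next, target))
    }
}

fn main() {
    let automaton = Automaton {
        edges: vec![Edge { lhs: 0, rhs: 1 }, Edge { lhs: 1, rhs: 2 }],
    };
    println!("{}", automaton.is_reachable(&State { position: 0 }, 2));
}
```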
Is it better? Intuitively yes, because fewer `Vector`s are created here. And as we know, fewer objects mean fewer operations, and that leads to better performance. What’s next? Even without checking the implementation of `.get`, we can guess that the more the index grows, the more operations it needs. Can we somehow eliminate it?
A `slice` is a view into a contiguous sequence (of some memory). Its API allows us to peek at the elements and to advance the slice (i.e., make it point to a later element).
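A slice-based sketch, using the standard library’s `split_first` to advance (the actual implementation may differ):

```rust
pub type Node = usize;

pub struct Edge { pub lhs: Node, pub rhs: Node }
pub struct Automaton { pub edges: Vec<Edge> }
pub struct State { pub position: Node }

// Version 4: hold a slice and advance it, instead of a growing index.
pub struct StateIterator<'a> {
    edges: &'a [Edge],
    position: Node,
}

impl<'a> Iterator for StateIterator<'a> {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        while let Some((edge, rest)) = self.edges.split_first() {
            self.edges = rest; // advance the slice by one element
            if edge.lhs == self.position {
                return Some(State { position: edge.rhs });
            }
        }
        None
    }
}

impl Automaton {
    pub fn next_states(&self, state: &State) -> StateIterator<'_> {
        StateIterator { edges: &self.edges, position: state.position }
    }

    pub fn is_reachable(&self, state: &State, target: Node) -> bool {
        state.position == target
            || self
                .next_states(state)
                .any(|next| self.is_reachable(&next, target))
    }
}

fn main() {
    let automaton = Automaton {
        edges: vec![Edge { lhs: 0, rhs: 1 }, Edge { lhs: 1, rhs: 2 }],
    };
    println!("{}", automaton.is_reachable(&State { position: 0 }, 2));
}
```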
Is it better? We’ll check the performance later, but intuitively it should be, as there’s one field fewer in the `StateIterator` struct, and we’re no longer `.get`ting far-away elements but always only the first one.
What else could we do? We still have a problem: if there are hundreds of thousands of
Edges there, every iterator has to go through all of them. That could be eliminated by storing them differently. Like, in a…
As I said at the beginning, the `Automaton` is built once, and then we iterate over it a lot. If we make `next_states` faster at the cost of `add_edge` being slower, it’s most likely going to be a huge win.
Let’s do that then – the `Automaton` should store the `Edge`s in a way that’ll allow us to get the outgoing ones fast. Let’s use a `BTreeMap`, i.e., a tree-based key-value store1. The key will be the `lhs` of an `Edge`, and the value will be a `Vector` of the outgoing `Edge`s.
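A sketch of this `BTreeMap`-based version (simplified, label-free types; the exact iterator shape here is my choice, not necessarily the original one):

```rust
use std::collections::BTreeMap;

pub type Node = usize;

pub struct Edge { pub lhs: Node, pub rhs: Node }
pub struct State { pub position: Node }

// Version 5: index the edges by their source node at insertion time.
pub struct Automaton {
    pub edges: BTreeMap<Node, Vec<Edge>>,
}

impl Automaton {
    pub fn new() -> Self {
        Automaton { edges: BTreeMap::new() }
    }

    // add_edge got slower (a map lookup), but we build the automaton once.
    pub fn add_edge(&mut self, lhs: Node, rhs: Node) {
        self.edges.entry(lhs).or_default().push(Edge { lhs, rhs });
    }

    // next_states no longer scans every edge -- only the outgoing ones.
    pub fn next_states<'a>(&'a self, state: &State) -> impl Iterator<Item = State> + 'a {
        self.edges
            .get(&state.position)
            .into_iter()
            .flatten()
            .map(|edge| State { position: edge.rhs })
    }

    pub fn is_reachable(&self, state: &State, target: Node) -> bool {
        state.position == target
            || self
                .next_states(state)
                .any(|next| self.is_reachable(&next, target))
    }
}

fn main() {
    let mut automaton = Automaton::new();
    automaton.add_edge(0, 1);
    automaton.add_edge(1, 2);
    println!("{}", automaton.is_reachable(&State { position: 0 }, 2));
}
```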
Intuitively, it should be much faster, as the creation of the `StateIterator` now scales with the number of unique `Node`s and not `Edge`s. Additionally, the iteration itself is now direct, i.e., there’s no need to check if the `lhs` equals our `position`. That should be fast… I think.
Finally, we’ve arrived at what seems a sensible and performant approach. Of course, we’re responsible people, so instead of relying solely on a gut feeling, we’ll create a replicable benchmarking suite to compare the implementations. To do that, we’ll use the Criterion package2.
```rust
// Configure Criterion.
criterion_group!(…);
criterion_main!(…);
```
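The full Criterion setup isn’t reproduced in this text. As a dependency-free illustration of what such a micro-benchmark measures, here’s a std-only timing sketch of version 1 (Criterion additionally warms up, takes many samples, and reports proper statistics):

```rust
use std::time::Instant;

pub type Node = usize;
pub struct Edge { pub lhs: Node, pub rhs: Node }
pub struct Automaton { pub edges: Vec<Edge> }
pub struct State { pub position: Node }

impl Automaton {
    // Version 1 again, as the measured subject.
    pub fn is_reachable(&self, state: &State, target: Node) -> bool {
        state.position == target
            || self
                .edges
                .iter()
                .filter(|edge| edge.lhs == state.position)
                .any(|edge| self.is_reachable(&State { position: edge.rhs }, target))
    }
}

fn main() {
    // A simple chain 0 -> 1 -> ... -> 200 (branching factor 1).
    let automaton = Automaton {
        edges: (0..200).map(|n| Edge { lhs: n, rhs: n + 1 }).collect(),
    };

    let iterations: u32 = 50;
    let start = Instant::now();
    for _ in 0..iterations {
        assert!(automaton.is_reachable(&State { position: 0 }, 200));
    }
    println!("avg: {:?}", start.elapsed() / iterations);
}
```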
Having done that, we have to add the dependency and basic configuration to our `Cargo.toml`, and we’re good to go. Let’s run `cargo bench`, chart our results3, and see how each version performs on my M2 MacBook Air.
Well… That’s not something you expected, huh? If you think about it, the sheer number of copies makes the creation of iterators extremely slow. Actually, it was so slow I lost patience while waiting for it to finish and decided to skip it entirely. Let’s remove all but the first point of version 2 and see the chart again.
Much better! First of all, we see that versions 1 and 2 perform similarly when the branching factor is 1. That makes perfect sense, as `next_states` always creates a vector of length 1, and `is_reachable` immediately consumes it.
Secondly, versions 3 and 4 are basically identical in terms of performance. Advancing a `slice` is as simple as moving a pointer, so version 3 tends to be marginally slower, as it has to index the list from the start at every step instead (and it’s visible on the chart).
Lastly, version 5 outperforms all of them. That’s because the cost of navigation in such a small tree is negligible (the biggest automaton I’ve worked with so far had a few thousand states). As a result, the performance of creating an iterator scales with the branching factor and not the number of edges4.
Table with the complete results:

| Version | Branching | Min | Avg | Max |
|--------:|----------:|--------:|--------:|--------:|
| 1 | 1 | 2.627µs | 2.629µs | 2.631µs |
| 2 | 1 | 2.656µs | 2.659µs | 2.660µs |
| 3 | 1 | 881.4ns | 882.7ns | 884.0ns |
| 4 | 1 | 830.8ns | 831.9ns | 833.1ns |
| 5 | 1 | 618.3ns | 618.7ns | 619.1ns |
| 1 | 2 | 4.839µs | 4.851µs | 4.863µs |
| 2 | 2 | 20.63ms | 20.66ms | 20.70ms |
| 3 | 2 | 1.488µs | 1.490µs | 1.492µs |
| 4 | 2 | 1.395µs | 1.395µs | 1.396µs |
| 5 | 2 | 615.6ns | 616.6ns | 617.5ns |
| 1 | 3 | 5.895µs | 5.912µs | 5.930µs |
| 2 | 3 | 949.4ms | 949.9ms | 950.5ms |
| 3 | 3 | 2.041µs | 2.048µs | 2.056µs |
| 4 | 3 | 2.033µs | 2.035µs | 2.038µs |
| 5 | 3 | 612.9ns | 614.1ns | 615.4ns |
| 1 | 4 | 7.216µs | 7.244µs | 7.274µs |
| 2 | 4 | 4.8150s | 4.8300s | 4.8580s |
| 3 | 4 | 2.553µs | 2.555µs | 2.556µs |
| 4 | 4 | 2.491µs | 2.493µs | 2.494µs |
| 5 | 4 | 611.6ns | 612.7ns | 613.9ns |
| 1 | 5 | 10.68µs | 10.71µs | 10.75µs |
| 2 | 5 | ??? | ??? | ??? |
| 3 | 5 | 3.084µs | 3.087µs | 3.090µs |
| 4 | 5 | 2.990µs | 2.991µs | 2.992µs |
| 5 | 5 | 610.2ns | 611.0ns | 611.9ns |
Entire source code, including the benchmark. Please note that it’s formatted slightly differently, according to the standard `cargo fmt` rules. It also passes the `cargo clippy` linter. Rust v1.65.0.
That’s a wrap. It was extremely interesting and educational for me. I hadn’t used Rust’s slices that much before, and here they really shined. The final performance is more than enough, and I’m happy with it for now.
Now, back to work!
We could also use a `HashMap`. It doesn’t matter for us API-wise, but it turned out to be slower, at least in my small-scale tests. I guess it’d be beneficial if the number of `Node`s were much higher or the `Node` itself was bigger.
I’d rather use the built-in `#[bench]` attribute, but it remains unstable. On the other hand, Criterion provides far more detailed feedback, as well as tools to measure how code changes impact the benchmarks.
I wanted to style the charts differently as well as inline them into the post, so I used Vega instead of Criterion’s charts. It’s a powerful tool that allows you to “code” your charts, and I strongly recommend it!
As you can see, the time actually goes down as the branching factor grows, which is at least weird, but I was able to reproduce it consistently. At first, I thought it was because the shortest path is shorter, but that shouldn’t matter, as the iteration is depth-first and always finds the longest path.