@radekmie
By Radosław Miernik
While working on my PhD, I had to implement an interpreter for the language we’re working on. At its very core are finite automata and finite state machines, as well as tons of operations on them. In practice, it’s a directed graph with operations on the edges (assignments, comparisons, etc.).
As I’m most familiar with JavaScript, I implemented the first version of the interpreter in it. It allowed me to have a working version rather fast; as a bonus, it works in the browser, so it’s much more accessible to other people. However, at some point, the performance was no longer good enough to work interactively.
Because of that, I decided to implement the same interpreter in a low-level language. As the automaton definition was a simple JSON object, I decided to go with Rust, which I was already familiar with. Thanks to Serde, I had a 1-to-1 mapping between the TypeScript and Rust structures within minutes.
In this text, I’ll focus on the performance of one operation that had problems in JavaScript – checking reachability. In the future, I plan to dive more into the whole project and what the rest of the code looks like.
Shall we?
We’ll need a couple of definitions first. An `Automaton` (graph) is a collection of `Edge`s. Every `Edge` connects two `Node`s with a label. It has only one, rather intuitive operation: `add_edge`. The exact representation of it is not relevant to the algorithm, so we can use whatever we find suitable.

Example automaton.
There’s also `State`, which defines the state of the automaton we’re in. Most importantly, there’s the `position` of it, which points to one of the `Node`s. It has one crucial operation – `next_states` – which is used to iterate over the following states, i.e., the states that can be reached from the current one. That means it has to go through all of the edges that connect the `position` with other nodes and check whether the transition is allowed in the current state.
An example usage of `next_states` is the dreaded `is_reachable` function, which checks whether it’s possible to traverse the graph to a specified `Node`. It’s a recursive operation, potentially traversing the entire graph. In practice, it traverses only a small subset of it, but it’s called very, very often.
Let’s get to coding, starting with the basic type definitions. They’re the same for all versions, so I won’t duplicate them in each. It’s a simplified version, but enough for the rest of the examples to run.
// `pub` means this type can be imported in other modules. The module
// system in Rust is slightly different, but think of it as of `export`
// in JavaScript.
// There's no information stored in the node, so a number is enough.
pub type Node = usize;
// Directed pair of nodes.
#[derive(Clone, Copy)]
pub struct Edge {
    pub lhs: Node,
    pub rhs: Node,
}
// A state of the automaton. In reality, there's much more data here.
// The `derive` attribute here generates a `default` function that
// creates an empty state, i.e., a state that is in `0`.
#[derive(Default)]
pub struct State {
    pub position: Node,
}
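The `Automaton` itself can start as a flat list of edges (a minimal sketch – version 5 will store them differently):

// For versions 1 through 4, a plain `Vec` of edges is enough.
#[derive(Default)]
pub struct Automaton {
    pub edges: Vec<Edge>,
}

impl Automaton {
    // Building the automaton is trivial: append the edge at the end.
    pub fn add_edge(&mut self, lhs: Node, rhs: Node) {
        self.edges.push(Edge { lhs, rhs });
    }
}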
Now, let’s start with the most straightforward implementation I can imagine – one I believe anyone who has never worked with Rust would write.
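Something along these lines (a simplified sketch – to keep it short, it assumes the automaton is acyclic, so there’s no guard against cycles):

impl State {
    // Collect all states reachable in one step into a brand new `Vec`.
    pub fn next_states(&self, automaton: &Automaton) -> Vec<State> {
        automaton
            .edges
            .iter()
            .filter(|edge| edge.lhs == self.position)
            .map(|edge| State { position: edge.rhs })
            .collect()
    }

    // Recursive depth-first traversal (assuming no cycles).
    pub fn is_reachable(&self, automaton: &Automaton, target: Node) -> bool {
        self.position == target
            || self
                .next_states(automaton)
                .iter()
                .any(|state| state.is_reachable(automaton, target))
    }
}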
As you can see, it’s only a couple of lines of code – nothing crazy, nothing “smart” your future self would have to ponder over. But there’s a very obvious performance problem here – we’re constantly switching back and forth between `Vec` and `Iterator`.
In Rust, `Iterator` is not a type of object, but rather a trait; sort of an interface in the TypeScript world. That means there’s some object there, but all we know about it is that we can iterate over it. We could skip the transformation, but then `next_states` would have to return something we’re not so sure of. Of course, it’s possible, but instead, let’s create a custom object we’ll implement the `Iterator` trait for.
To implement a trait, we need something to implement it for. Let’s create one that could be one’s first guess – a struct that stores the list of `Edge`s we have to go through. Having done that, we’ll implement the `Iterator` trait and update the rest of the code accordingly.
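For example (a sketch; the exact field layout is my guess):

// Owns its own `Vec` of the edges it still has to go through.
pub struct StateIterator {
    edges: Vec<Edge>,
    position: Node,
}

impl Iterator for StateIterator {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        // Take the edges from the back; the order doesn't matter here.
        while let Some(edge) = self.edges.pop() {
            if edge.lhs == self.position {
                return Some(State { position: edge.rhs });
            }
        }
        None
    }
}

impl State {
    pub fn next_states(&self, automaton: &Automaton) -> StateIterator {
        StateIterator {
            // The expensive part: a fresh copy of the entire edge list.
            edges: automaton.edges.iter().cloned().collect(),
            position: self.position,
        }
    }

    // `is_reachable` now consumes the iterator directly (still assuming
    // an acyclic automaton).
    pub fn is_reachable(&self, automaton: &Automaton, target: Node) -> bool {
        self.position == target
            || self
                .next_states(automaton)
                .any(|state| state.is_reachable(automaton, target))
    }
}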
It’s a very straightforward (and naive) wrap of our previous implementation into an iterator object. Definitely not a performant one (we’ll get to that later), but it allows us to improve on the iterator independently from the rest.
I said before that the problem with the first implementation was the constant transformation between `Vec`tors and `Iterator`s. It’s still there, as we have to `.collect()` them, i.e., create new `Vec`tors when passing them down to the `StateIterator` object. Can we do better? Instead of copying, we could reference the entire `edges` list in there and only store the index of the element we’re currently checking.
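Roughly like this (again, a sketch; `is_reachable` stays the same):

// Borrows the automaton's edge list and remembers how far we've got.
pub struct StateIterator<'a> {
    edges: &'a Vec<Edge>,
    index: usize,
    position: Node,
}

impl<'a> Iterator for StateIterator<'a> {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        // `.get` returns `None` once `index` runs past the last edge.
        while let Some(edge) = self.edges.get(self.index) {
            self.index += 1;
            if edge.lhs == self.position {
                return Some(State { position: edge.rhs });
            }
        }
        None
    }
}

impl State {
    pub fn next_states<'a>(&self, automaton: &'a Automaton) -> StateIterator<'a> {
        StateIterator {
            edges: &automaton.edges,
            index: 0,
            position: self.position,
        }
    }
}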
Is it better? Intuitively yes, because fewer `Vec`tors are created here. And as we know, fewer objects mean fewer operations, and that leads to better performance. What’s next?
Even without checking the implementation of `.get`, we can guess that the more the index grows, the more operations it needs. Can we somehow eliminate it?
Enter `slice`s! A `slice` is a view into a contiguous sequence (of some memory). Its API allows us to peek at the elements and to advance the view (i.e., make it point to a later element).
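With a slice, the iterator could look like this (a sketch using `split_first`, which returns the first element together with the rest of the slice):

// Holds the edges we haven't checked yet; advancing shrinks the slice.
pub struct StateIterator<'a> {
    edges: &'a [Edge],
    position: Node,
}

impl<'a> Iterator for StateIterator<'a> {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        while let Some((edge, rest)) = self.edges.split_first() {
            self.edges = rest; // Advance the view to the later elements.
            if edge.lhs == self.position {
                return Some(State { position: edge.rhs });
            }
        }
        None
    }
}

impl State {
    pub fn next_states<'a>(&self, automaton: &'a Automaton) -> StateIterator<'a> {
        StateIterator {
            edges: &automaton.edges,
            position: self.position,
        }
    }
}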
Is it better? We’ll check the performance later, but intuitively it should be, as there’s one field fewer in the `StateIterator` struct, and we’re no longer `.get`ting far-away elements but always only the first one.
What else could we do? We still have a problem: if there are hundreds of thousands of `Edge`s, every iterator has to go through all of them. That could be eliminated by storing them differently. Like, in a…
As I said at the beginning, the `Automaton` is built once, and then we iterate over it a lot. If we make `next_states` faster at the cost of `add_edge` being slower, it’s most likely going to be a huge win.
Let’s do that then – the `Automaton` should store the `Edge`s in a way that’ll allow us to get the outgoing ones fast. Let’s use a `BTreeMap`, i.e., a tree-based key-value store [1]. A key will be the `lhs` of an `Edge`, and the value will be a `Vec`tor of the outgoing `Edge`s.
use std::collections::BTreeMap;
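// A sketch of the new `Automaton` (the exact shape here is my
// reconstruction). The outgoing `Edge`s are grouped by their `lhs`.
#[derive(Default)]
pub struct Automaton {
    edges: BTreeMap<Node, Vec<Edge>>,
}

impl Automaton {
    // Slightly slower than pushing to a flat list, but we build once.
    pub fn add_edge(&mut self, lhs: Node, rhs: Node) {
        self.edges.entry(lhs).or_default().push(Edge { lhs, rhs });
    }
}

// Every edge here already matches our `position`, so there's nothing
// left to filter, i.e., the iteration is direct.
pub struct StateIterator<'a> {
    edges: &'a [Edge],
}

impl<'a> Iterator for StateIterator<'a> {
    type Item = State;

    fn next(&mut self) -> Option<State> {
        let (edge, rest) = self.edges.split_first()?;
        self.edges = rest;
        Some(State { position: edge.rhs })
    }
}

impl State {
    // Creating the iterator is now a single tree lookup.
    pub fn next_states<'a>(&self, automaton: &'a Automaton) -> StateIterator<'a> {
        StateIterator {
            edges: automaton
                .edges
                .get(&self.position)
                .map_or(&[], |edges| edges.as_slice()),
        }
    }
}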
Intuitively, it should be much faster, as the creation of the `StateIterator` scales with the number of unique `Node`s and not `Edge`s. Additionally, the iteration itself is now direct, i.e., there’s no need to check whether the `lhs` equals our `position`. That should be fast… I think.
Finally, we’ve arrived at what seems like a sensible and performant approach. Of course, we’re responsible people, so instead of relying solely on a gut feeling, we’ll create a replicable benchmarking suite to compare the implementations. To do that, we’ll use the Criterion package [2].
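A minimal benchmark function could look something like this (the chain-shaped automaton and all of the names here are mine, purely for illustration; it exercises version 1’s `is_reachable`):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_is_reachable(c: &mut Criterion) {
    // Hypothetical setup: a plain chain of 1,000 nodes (branching 1).
    let mut automaton = Automaton::default();
    for node in 0..1_000 {
        automaton.add_edge(node, node + 1);
    }

    c.bench_function("is_reachable", |b| {
        b.iter(|| State::default().is_reachable(&automaton, 1_000))
    });
}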
// Configure Criterion.
criterion_group!(benches, bench_is_reachable);
criterion_main!(benches);
Having done that, we have to add the dependency and basic configuration to our `Cargo.toml`, and we’re good to go. Let’s run `cargo bench`, chart our results [3], and see how each version performs on my M2 MacBook Air.
Well… That’s not something you expected, huh? If you think about it, the sheer number of copies makes the creation of iterators extremely slow. Actually, it was so slow I lost patience while waiting for it to finish and decided to skip it entirely. Let’s remove all but the first point of version 2 and see the chart again.
Much better! First of all, we see that versions 1 and 2 perform similarly when the branching factor is 1. That makes perfect sense, as `next_states` always creates a vector of length 1, and `is_reachable` immediately consumes it.
Secondly, versions 3 and 4 are basically identical in terms of performance. Advancing a `slice` is as simple as moving a pointer, while version 3 has to index the vector from the start at every step instead, so it tends to be marginally slower (and it’s visible on the chart).
Lastly, version 5 outperforms all of them. That’s because the cost of navigation in such a small tree is negligible (the biggest automaton I’ve worked with so far had a few thousand states). As a result, the performance of creating an iterator scales with the branching factor and not the number of edges [4].

Table with the complete results. Please note that it’s formatted slightly differently, according to the standard.
Version  Branching  Min       Avg       Max
1        1          2.627µs   2.629µs   2.631µs
2        1          2.656µs   2.659µs   2.660µs
3        1          881.4ns   882.7ns   884.0ns
4        1          830.8ns   831.9ns   833.1ns
5        1          618.3ns   618.7ns   619.1ns
1        2          4.839µs   4.851µs   4.863µs
2        2          20.63ms   20.66ms   20.70ms
3        2          1.488µs   1.490µs   1.492µs
4        2          1.395µs   1.395µs   1.396µs
5        2          615.6ns   616.6ns   617.5ns
1        3          5.895µs   5.912µs   5.930µs
2        3          949.4ms   949.9ms   950.5ms
3        3          2.041µs   2.048µs   2.056µs
4        3          2.033µs   2.035µs   2.038µs
5        3          612.9ns   614.1ns   615.4ns
1        4          7.216µs   7.244µs   7.274µs
2        4          4.8150s   4.8300s   4.8580s
3        4          2.553µs   2.555µs   2.556µs
4        4          2.491µs   2.493µs   2.494µs
5        4          611.6ns   612.7ns   613.9ns
1        5          10.68µs   10.71µs   10.75µs
2        5          ???       ???       ???
3        5          3.084µs   3.087µs   3.090µs
4        5          2.990µs   2.991µs   2.992µs
5        5          610.2ns   611.0ns   611.9ns

Entire source code, including the benchmark.
It follows the `cargo fmt` rules. It also passes the `cargo clippy` linter. Rust v1.65.0.
That’s a wrap. It was extremely interesting and educational to me. I hadn’t used Rust’s `slice`s that much before, and here they really shone. The final performance is more than enough, and I’m happy with it for now.
If you’re interested, the complete interpreter in Rust is between 5 and 20 times faster than the JavaScript one. There are definitely more improvements to come, but for now, the performance is fine.
Now, back to work!
[1] We could also use `HashMap`. It doesn’t matter for us API-wise, but it turned out to be slower, at least in my small-scale tests. I guess it’d be beneficial if the number of `Node`s were much higher or the `Node` itself was bigger.
[2] I’d rather use the built-in `#[bench]` attribute, but it remains unstable. On the other hand, Criterion gives far more detailed feedback and provides tools to measure how code changes impact the benchmarks.
[3] I wanted to style the charts differently as well as inline them into the post, so I used Vega instead of Criterion’s charts. It’s a powerful tool that allows you to “code” your charts, and I strongly recommend it!
[4] As you can see, the time actually goes down with the branching factor, which is at least weird, but I was able to consistently reproduce it. At first, I thought it was because the shortest path is shorter, but that shouldn’t matter, as the iteration is depth-first and always finds the longest path.