By Radosław Miernik · Published on
Whenever I introduce a regular expression – “regex” for short – in a project, there’s always someone who is not fond of this idea. I heard “write-only” and “unmaintainable” way too many times. But are they really?
I don’t think that my daily programming practice is far from a “standard” one, whatever this means. And I do use regexes daily – while searching through the code, working with
git, or even Google Sheets. As I just checked, more than half of my Sublime Text search history actually contains a regex.
Just like any other piece of technology, there’s some theory behind it, multiple standards, and a ton of tools to make our lives easier. In this text, I’ll go through most of it to hopefully make you less afraid of them!
Classically, regular expressions are elementary and are equivalent to regular grammars. Both are used to define a set of strings. Given an alphabet Σ (a set of characters), the definition goes like this:

- `∅` denotes an empty set.
- `ε` denotes a set with only the “empty” string.
- `x` denotes a set containing only the one-character string “x” (for any x from Σ).
- `AB` denotes a set of all strings obtained by concatenating any string from `A` with any string from `B`.
- `A|B` denotes a set union of `A` and `B`.
- `A*` denotes a set containing `ε`, `A`, `AA`, `AAA`, and so on, i.e., any number of concatenated strings from `A`.
I hope this didn’t scare you! I don’t want to jump into detail here nor force you to prove anything using these. My point here is that this definition is extremely concise. There is an empty string, single characters, and only three operations.
Such regular expressions allow us to do basic things like “any binary number” (`0|1(0|1)*`)1. Of course, we can define all kinds of “helpers” to make them handier, like “at least one `A`” (`A+ = AA*`) or “at most one `A`” (`A? = ε|A`).
How does one actually use them? One way is to use Thompson’s construction algorithm to transform the regex into an NFA. Such an automaton can be used as it is or can be determinized into a DFA for simplicity2. The DFA itself can be optimized for the sake of performance through minimization.
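To make this concrete, here’s a minimal sketch in JavaScript of a hand-built DFA for the “any binary number” regex from above (the state names and the table encoding are mine):

```javascript
// A hand-built DFA for the regex `0|1(0|1)*` (binary numbers
// without leading zeros). State names are my own.
const transitions = {
  start: { '0': 'zero', '1': 'ones' }, // the first character decides
  zero: {},                            // a lone "0"; no valid continuation
  ones: { '0': 'ones', '1': 'ones' },  // "1" followed by any digits
};
const accepting = new Set(['zero', 'ones']);

function matches(input) {
  let state = 'start';
  for (const character of input) {
    state = transitions[state][character];
    if (state === undefined) return false; // no transition: reject
  }
  return accepting.has(state);
}

console.log(matches('0'));    // true
console.log(matches('1010')); // true
console.log(matches('0110')); // false (leading zero)
```

Each input character causes exactly one table lookup, which is precisely why a DFA runs in linear time.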
I think that it’s good to understand the theory behind them, at least on a basic level, as it allows to sensibly reason about the more powerful (i.e., “real world”) regex implementations. And speaking of which…
As usual with standards, there are way too many. POSIX alone defines three of them, with different capabilities. All three are probably the most popular ones overall, as most POSIX programs like `sed` use them. Practically speaking, they are enough and fine.
Some languages have their own standards. For example, Lua has a relatively simple one and explains that it’s because of the complexity of the POSIX ones. I worked with Lua quite a lot, and I have to admit that it’s clear and more than enough for most cases. It has, however, an interesting extension: the `%b` operator that matches only balanced strings (e.g., `%b()` matches balanced parentheses).
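JavaScript regexes have no `%b` counterpart; a plain function gives a rough idea of what it matches (a sketch, the function name is mine):

```javascript
// Checks whether `input` starts with a balanced `(`...`)` pair and
// returns the matched substring, mimicking what Lua's `%b()` matches.
function matchBalanced(input, open = '(', close = ')') {
  if (input[0] !== open) return null;
  let depth = 0;
  for (let index = 0; index < input.length; index++) {
    if (input[index] === open) depth++;
    else if (input[index] === close) depth--;
    if (depth === 0) return input.slice(0, index + 1);
  }
  return null; // unbalanced
}

console.log(matchBalanced('(a(b)c)d')); // '(a(b)c)'
console.log(matchBalanced('(a(b'));     // null
```

The depth counter is exactly what classical regular expressions cannot express – balanced parentheses are the textbook example of a non-regular language.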
But are all of these features actually needed? I’d argue that yes, they are. Most of them allow us to offload the complexity of the problem into a single regex. Of course, it forces the programmer to learn (sort of) another language, but in many cases, it’s simply worth it.
The first group consists of the ones used for parsing some well-defined string format coming from the outside (directly from the user or through an API). Whether it’s a signed decimal number in GraphQL (`/^[+-]?\d+(?:\.\d+)?$/`) or a time range (`/^\s*(\d\d?:\d\d)\s*-\s*(\d\d?:\d\d)\s*$/`) – they most likely have little to no performance impact, as the input data is usually small5.
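For instance, the time range pattern above already extracts both endpoints through its two capturing groups – a sketch of how it could be used (the function name is mine):

```javascript
// Parses a time range like "9:00 - 17:30" into its two endpoints,
// using the capturing groups of the regex from the text.
const TIME_RANGE = /^\s*(\d\d?:\d\d)\s*-\s*(\d\d?:\d\d)\s*$/;

function parseTimeRange(text) {
  const match = TIME_RANGE.exec(text);
  if (!match) return null;
  const [, from, to] = match;
  return { from, to };
}

console.log(parseTimeRange(' 9:00 - 17:30 ')); // { from: '9:00', to: '17:30' }
console.log(parseTimeRange('9:00-'));          // null
```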
The second group is the ones that parse a really small DSL, which is there only to make the code easier to read. For example, SparrowQL requires a list of all possible connections between your MongoDB collections. Every connection has four string fields: `from`, `to`, `local`, and `foreign`, that sort of match the `$lookup` operator from MongoDB. Now, instead of maintaining a long list of objects, I’d rather have strings like `from.local -> to.foreign`. The “parser” has only eight lines, including one regex (`/^(.*?)\.(.*?) -> (.*?)\.(.*?)$/`).
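A sketch of such a “parser” built around that regex (the function name and error handling are mine, not the SparrowQL source):

```javascript
// Turns "users.organizationId -> organizations._id" into an object
// with the four connection fields, using the regex from the text.
const CONNECTION = /^(.*?)\.(.*?) -> (.*?)\.(.*?)$/;

function parseConnection(text) {
  const match = CONNECTION.exec(text);
  if (!match) throw new Error(`Invalid connection: "${text}"`);
  const [, from, local, to, foreign] = match;
  return { from, local, to, foreign };
}

console.log(parseConnection('users.organizationId -> organizations._id'));
// { from: 'users', local: 'organizationId', to: 'organizations', foreign: '_id' }
```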
Similar DSLs are great for all kinds of test data – text configuration is easier than potentially large objects. They have a downside, though – there’s no autocomplete nor typechecking6. They also tend to require more test cases, as the code coverage of a regex remains unclear.
The third and final group would be the ones that replace hand-crafted string algorithms. For example, instead of doing a series of `.split()`s and `.join()`s, we can use `.replace()` with a regex and a replacing function. It may sound unnecessary, but getting rid of intermediate arrays may lead to a significant performance boost.
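As an illustration (my own example, not from any particular codebase), here’s a snake_case-to-camelCase conversion written both ways:

```javascript
// Both functions do the same thing; the regex version skips the
// intermediate arrays created by `.split()` and `.map()`.
function camelizeWithArrays(text) {
  const [head, ...tail] = text.split('_');
  return head + tail.map((word) => word[0].toUpperCase() + word.slice(1)).join('');
}

function camelizeWithRegex(text) {
  return text.replace(/_(\w)/g, (_, letter) => letter.toUpperCase());
}

console.log(camelizeWithArrays('foo_bar_baz')); // 'fooBarBaz'
console.log(camelizeWithRegex('foo_bar_baz'));  // 'fooBarBaz'
```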
Recently, I implemented a fairly complex feature in
uniforms. At first, I was sure that this would require a full-blown parser with a ton of additional code. But after some time, it turned out that the problem was entirely solvable using a regex. A fairly big and complex one… Let’s see what’s inside!
// This regular expression splits the string into three parts:
// `prefix` is a dotted name, e.g., `object.nested.2.field` at the
//   front (hence prefix). It covers most standard use cases.
// `subscript` is a `["..."]` subscript after the prefix. The content
//   within should be escaped by the user, e.g., `["\\""]`.
// `rest` is anything following the subscript. The leading dot (`.`)
//   is stripped (`.a` -> `a`) if there is one. It is empty if
//   `subscript` is empty.
//
// All three parts can be empty!
The comment alone should be enough for most people to know what to expect. However, I wanted to take this opportunity and describe in detail what this “monstrosity” does. I’ll also dive into the performance of it later.
The `prefix` part is straightforward: it’s any number of “safe characters” (only a dot (`.`) and square brackets (`[`, `]`) are considered unsafe; the final regex is `[^.[\]]*`) followed by any number of the exact same safe characters prefixed with a literal dot (`(?:\.[^.[\]]+)*`; this part is enclosed in a non-capturing group to keep the actual match small).
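Assembled and anchored on its own (the anchors are mine, added for testing), the `prefix` pattern behaves like this:

```javascript
// The `prefix` pattern from the text, anchored so it must match the
// whole string: safe characters, then dot-prefixed safe segments.
const PREFIX = /^[^.[\]]*(?:\.[^.[\]]+)*$/;

console.log(PREFIX.test('object.nested.2.field')); // true
console.log(PREFIX.test('object'));                // true
console.log(PREFIX.test('object..field'));         // false (empty segment)
console.log(PREFIX.test('object[0]'));             // false (brackets are unsafe)
```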
subscript part is challenging, as we have to deal with escaping. The first version (as seen above) relied on a lookbehind to do so (we’ll get to that in a minute). Unfortunately, Safari doesn’t support lookbehinds, so we had to get rid of it. Let’s discuss how both versions (before and after) actually work.
Both start with a literal subscript opening (`\["`) followed by a non-capturing group (`(?:...)`) of characters other than a double quote (`[^"]`) or an escaped double quote (`\"`). The entire group is repeated as few times as needed (`*?`).
Then there’s the only difference: handling of the final double quote (`"]` is fine; `\"]` is not). The first version uses a lookbehind to make sure that the final double quote is not escaped, i.e., not preceded by a backslash (`(?<!\\)"]`). The new version is more verbose – it checks whether the last character is different than a backslash (`[^\\]`). Of course, this breaks with empty subscripts (`[""]`), so the entire content of the new version is enclosed in one additional optional non-capturing group (`(?:...)?`). Here’s a diff of the subscript part:
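Assembled from the description above, the two versions of the `subscript` part look roughly like this (my reconstruction, not the verbatim uniforms source; the anchors are added for testing):

```javascript
// Reconstructed from the description – not the verbatim uniforms
// source. Anchored so each pattern must match a whole subscript.
const OLD_SUBSCRIPT = /^\["(?:[^"]|\\")*?(?<!\\)"\]$/;    // lookbehind version
const NEW_SUBSCRIPT = /^\["(?:(?:[^"]|\\")*?[^\\])?"\]$/; // Safari-friendly version

// Both versions agree on these inputs:
console.log(NEW_SUBSCRIPT.test('[""]'));    // true (empty subscript)
console.log(NEW_SUBSCRIPT.test('["a b"]')); // true
console.log(NEW_SUBSCRIPT.test('["\\""]')); // true (escaped quote inside)
console.log(NEW_SUBSCRIPT.test('["\\"]'));  // false (unterminated quote)
```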
The `rest` part is an optional dot (`\.?`) followed by any number of all characters (`(.*)`). We do remove the dot after the subscript, but it’s optional to support multiple subscripts (e.g., `["a"]["b"]`).
Additionally, the actual regex has a couple of capturing groups (`(...)`) to extract the matched parts. Technically, all of the groups could be capturing, but it slightly affects the performance (the difference wasn’t measurable during my tests, though) and forces the programmer to skip the unwanted matches.
To ensure everything works as expected, I have created an extensive test suite. Interestingly enough, there are quite a few tests with “incorrect” inputs, as we agreed on the “garbage in, garbage out” approach for this logic.
When it comes to the performance, this new regex-based implementation turned out not to be much slower than the approach we’ve used before (`.split('.')`) while introducing an important feature; on actual data, it was ~7% slower. The regex matching alone is expected to be fast, as we do have quite a few “anchoring” characters – dots in the first part and the subscript itself – and there’s no backtracking required to match the string7.
Other tools, like this one, can simulate the execution of the matching process. It’s both helpful and fun to see how fast the automatons explode in size when the alphabet is larger or the regex is longer.
It’s also possible to generate matching strings for a given regex, like here. Keep in mind that it may be extremely slow for some regexes. Similarly, you can generate regexes based on actual input, e.g., numbers.
I hope that you no longer find regular expressions intimidating. Don’t worry, it’ll take time to feel comfortable using them more; I assure you it’s worth it. Maybe you’ll get better at searching through the logs? Or you’ll get rid of this nasty piece of code you wrote years ago?
Can you find a place for regexes in your life? I know I did.
This pattern represents all binary numbers without leading zeros. It’d be slightly different if we’d like to allow leading zeroes: `(0|1)(0|1)*`.
An NFA can be simulated in O(mn) time, where n is the size of the input and m is the size of the regex. A DFA runs in linear time (O(n)), so it seems obviously better. However, the determinization of the NFA itself may be exponential, making it not so obvious. There are other practical approaches, though.
It’s hard to say how expressive PCREs are in terms of complexity, but they are definitely “higher” than the classical ones I presented in the first section (i.e., more expressive). However, it’s not a practical problem.
Backreferences alone are an extremely powerful feature – they actually make PCRE matching NP-hard. One example from my find-and-replace history is `(\w+) = \1 \+`. It matches assignments that can be replaced with a `+=` operator (e.g., `foo = foo + 2` can be replaced with `foo += 2`).
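A sketch of that find-and-replace as a `.replace()` call (the replacement string and sample code are mine):

```javascript
// Rewrites `foo = foo + 2`-style assignments into `foo += 2`, using
// a backreference (`\1`) to ensure both sides name the same variable.
const code = 'foo = foo + 2;\nbar = baz + 2;';

const rewritten = code.replace(/(\w+) = \1 \+/g, '$1 +=');
console.log(rewritten); // 'foo += 2;\nbar = baz + 2;'
```

Note that the second line is untouched – `bar` and `baz` differ, so the backreference does not match.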
As the Unix philosophy says:
n is usually small.
I believe that TypeScript could handle it with template literal types, inference, and a lot of extra code. However, I don’t think it’s worth it (in most cases).
There are regular expressions that require a potentially exponential number of steps to test a string, like `(a|a)+$`. Unfortunately, similar patterns are often the result of a naive implementation. (There’s an ESLint plugin for that.) It may also be an attack vector when the regex comes from the user.