
> Nope. The regex library is tightly coupled with searching strings. That's an intentional design decision. Making a regex engine like this have a generic alphabet is a total non-starter.

How can this be true? I'm not trying to start any kind of religious war, but if you had a C library based on char vectors, you could globally replace char with short or long and it would still work, pointers included, except in any places where you relied on the knowledge that a char is implemented as 8 bits.

Or if you're saying "but I rely on the string class": is it somehow impossible to write a UTF-32 based string class? Would that string class be required to suppress anything that isn't a valid code point at this particular time? I know professors like to teach that relying on a character being stored as an 8-bit number is undefined behavior, but it sure is nice to know you have an 8-bit-clean data type, where 0 can in some contexts be considered a terminator, and UTF-8 is considered a special case on top of that.

I'm just trying to grok the type of impossibility you're talking about here. Yes, I could read the code instead of asking :)



I answered basically this question about seven years ago[1]. Re-reading that now, it's actually still quite a good answer.

> but if you had a C library based on char vectors, you could globally replace char with short or long and it would still work, pointers included, except in any places where you relied on the knowledge that a char is implemented as 8 bits.

Right, and that last sentence is critical. Another way to say "rely on char being 8 bits" is "rely on an alphabet size of no more than 256."

The regex crate and all of its non-standard-library dependencies (of which I wrote all of them) correspond to about 100K source lines of code. Everything from the concrete syntax on up to the matching engines themselves would need to be parameterized over the alphabet. You can't just replace `u8` (unsigned 8-bit integer) with any other type and hope that it works, because the entire regex crate makes use of the fact that it's searching a sequence of `u8` values specifically. It doesn't treat `u8` as some opaque type with integer-like operations. The literal architecture of the code embeds assumptions about it, such as the fact that every possible value can be enumerated as [0, 1, 2, ..., 255]. The answer I wrote seven years ago touches on this. A concrete place where this occurs is in the lazy DFA (described in the blog). Very loosely, the logical representation of an individual state looks like this:

    struct State {
        next: [*const State; 256],
    }
The `u8` type appears nowhere there. Instead, the length of `next` is hard-coded to the total number of unique possible elements of the `u8` type. One could of course define this as part of some generic alphabet type, but this entire representational choice is only feasible in the first place because the alphabet size of `u8` is so small. What's the alphabet size of `u16`? Oops. Now your logical representation is:

    struct State {
        next: [*const State; 65536],
    }
You can see how that would be quarrelsome, right?

Of course, you could use something like `HashMap<u16, *State>` instead. But holy moly, you really do not want to do that when searching a sequence of `u8` values, because a hashmap lookup for every byte in a string would absolutely trash performance. So now your generic alphabet interface needs to know about DFA state representation in memory, so that when you use `u8` you get the fast version and when you use anything else you get the slow-but-feasible version.
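To make the dense-vs-sparse tradeoff concrete, here's a minimal sketch with hypothetical `DenseState`/`SparseState` types (using state IDs instead of raw pointers, and not the crate's actual layout) showing what a single transition costs under each representation:

```rust
use std::collections::HashMap;

// Dense representation: feasible only because u8 has 256 possible values.
// A transition is a single array index: an offset plus a load.
struct DenseState {
    next: [u32; 256], // next state IDs, one slot per possible byte
}

impl DenseState {
    fn step(&self, input: u8) -> u32 {
        self.next[input as usize]
    }
}

// Sparse representation for a large alphabet such as u16 code units:
// now every transition pays for a hash lookup.
struct SparseState {
    next: HashMap<u16, u32>,
    dead: u32, // fallback state when no transition exists
}

impl SparseState {
    fn step(&self, input: u16) -> u32 {
        *self.next.get(&input).unwrap_or(&self.dead)
    }
}

fn main() {
    let dense = DenseState { next: [0; 256] };
    assert_eq!(dense.step(b'a'), 0);

    let mut next = HashMap::new();
    next.insert(0x00e9u16, 1); // 'é' as a UTF-16 code unit
    let sparse = SparseState { next, dead: 0 };
    assert_eq!(sparse.step(0x00e9), 1);
    assert_eq!(sparse.step(0x0041), 0);
}
```

The dense `step` compiles down to a bounds check and a load; the sparse `step` involves hashing and probing on every single input element of the search.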

And then there's the whole UTF-8 automata stuff and various Unicode-aware algorithms sprinkled in at various places. This thread is talking about a generic alphabet, and that goes well beyond Unicode. So now all that Unicode stuff needs to be gated behind your generic alphabet interface.

At some point, you realize this is incredibly impractical, so if you want to expose a generic alphabet API, you do something like, "with a `u8` alphabet, go do all this fast stuff, but with any other alphabet, just do a basic simplistic generic regex engine." But that basic simplistic regex engine is going to wind up sharing very little with the `u8` specific stuff. And thus you get to my conclusion that it has no place in a general purpose regex engine designed for searching strings. For these niche use cases, you should just go build something for what you need and specialize for that. If you want a generic alphabet, then use that as your design criteria from the start and build for it. You'll wind up in a completely different (and likely much simpler) spot than where the regex crate is.
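For a sense of what that simpler, alphabet-generic spot might look like, here is a toy sketch (not anything from the regex crate) of an anchored matcher generic over any `T: PartialEq`. Notice how little it shares with a byte-oriented engine: no dense transition tables, no literal optimizations, no SIMD.

```rust
// A toy regex AST generic over an arbitrary alphabet T.
enum Re<T> {
    Lit(T),
    Any,
    Seq(Vec<Re<T>>),
    Alt(Box<Re<T>>, Box<Re<T>>),
    Star(Box<Re<T>>),
}

// Anchored full match: does `re` match all of `input`?
fn matches<T: PartialEq>(re: &Re<T>, input: &[T]) -> bool {
    // Collect every position the regex could end at, starting from `pos`.
    fn go<T: PartialEq>(re: &Re<T>, input: &[T], pos: usize, out: &mut Vec<usize>) {
        match re {
            Re::Lit(t) => {
                if input.get(pos) == Some(t) {
                    out.push(pos + 1);
                }
            }
            Re::Any => {
                if pos < input.len() {
                    out.push(pos + 1);
                }
            }
            Re::Seq(rs) => {
                let mut poses = vec![pos];
                for r in rs {
                    let mut next = Vec::new();
                    for &p in &poses {
                        go(r, input, p, &mut next);
                    }
                    next.sort();
                    next.dedup();
                    poses = next;
                }
                out.extend(poses);
            }
            Re::Alt(a, b) => {
                go(a, input, pos, out);
                go(b, input, pos, out);
            }
            Re::Star(r) => {
                let mut seen = vec![pos];
                let mut frontier = vec![pos];
                while let Some(p) = frontier.pop() {
                    let mut next = Vec::new();
                    go(r, input, p, &mut next);
                    for q in next {
                        // only advance forward, so empty inner matches can't loop
                        if q > p && !seen.contains(&q) {
                            seen.push(q);
                            frontier.push(q);
                        }
                    }
                }
                out.extend(seen);
            }
        }
    }
    let mut ends = Vec::new();
    go(re, input, 0, &mut ends);
    ends.contains(&input.len())
}

fn main() {
    // (a|b)c* over chars
    let re = Re::Seq(vec![
        Re::Alt(Box::new(Re::Lit('a')), Box::new(Re::Lit('b'))),
        Re::Star(Box::new(Re::Lit('c'))),
    ]);
    assert!(matches(&re, &"acc".chars().collect::<Vec<_>>()));
    assert!(!matches(&re, &"xc".chars().collect::<Vec<_>>()));
    // The same engine works over u32 "tokens" with no changes.
    let toks = Re::Seq(vec![Re::Lit(7u32), Re::Any]);
    assert!(matches(&toks, &[7, 123456]));
}
```

This is an order of magnitude simpler than a production engine precisely because it assumes nothing about the alphabet, and correspondingly it leaves all of the byte-specific performance on the table.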

Basically, this line of thinking is a sort of "disease of abstraction" that is common in programming. We take what we think about the theory of regular languages and maybe some cute use cases and think, "well why not just make it generic!" But the thing is, coupling is generally how you make things fast, and making something more general usually involves de-coupling. So they are actually and usually exclusionary goals. But it takes a lot of domain knowledge to actually see this.

There's also a whole mess of code that uses low level vector instructions for accelerating searches. Thousands of lines of code are devoted to this, and are just yet another part of the regex crate that doesn't really lend itself to being easily generic over arbitrary alphabets. You could probably write different versions of each of these algorithms for u8, u16 and u32 types (and that would in turn require using different vector operations because of the lane size differences), but they certainly aren't generalizable to arbitrary alphabets.
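As an illustration of how byte-specific these tricks are, here's a SWAR (SIMD-within-a-register) single-byte search sketch, not taken from the crate: the bit constants assume 8-bit lanes packed into a `u64`, and they don't transfer to wider lanes or arbitrary alphabets without redoing the whole scheme.

```rust
// Find the first occurrence of `needle` by scanning 8 bytes at a time
// with u64 arithmetic (the classic "has zero byte" bit trick).
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101010101010101; // 0x01 in every 8-bit lane
    const HI: u64 = 0x8080808080808080; // 0x80 in every 8-bit lane
    let broadcast = (needle as u64).wrapping_mul(LO);
    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        let x = word ^ broadcast; // a lane becomes zero where needle matched
        if x.wrapping_sub(LO) & !x & HI != 0 {
            // some lane matched; pinpoint it with a scalar scan of this chunk
            return chunk.iter().position(|&b| b == needle).map(|i| offset + i);
        }
        offset += 8;
    }
    // scalar scan of the trailing partial chunk
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}

fn main() {
    assert_eq!(find_byte(b"hello world", b'w'), Some(6));
    assert_eq!(find_byte(b"abcdefgh", b'h'), Some(7));
    assert_eq!(find_byte(b"abc", b'z'), None);
}
```

Every constant here encodes "lanes are 8 bits wide"; a u16 variant needs different masks and, on real vector hardware, different instructions entirely.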

> I know professors like to teach that relying on a character being stored as an 8-bit number is undefined behavior, but it sure is nice to know you have an 8-bit-clean data type, where 0 can in some contexts be considered a terminator, and UTF-8 is considered a special case on top of that.

I don't think regex engines have been tightly coupled to NUL terminated C strings for quite some time. Not even PCRE2 is. It's just a sequence of bytes and a length. That's it.

There does exist a gaggle of general purpose regex engines that, instead of working on UTF-8, work on UTF-16 code units. These are the Javascript, Java and .NET regex engines. None of them work on anything other than `u16` values. All of them are backtrackers. (Well, .NET does have a non-backtracking regex engine. I don't know its internals and how it arranges things with respect to alphabet size.)

PCRE2 does expose compile time configuration knobs for supporting UTF-8, UTF-16 and UTF-32. I'm not an expert on its internals, but PCRE2 is a backtracker and backtrackers tend to care less about the alphabet size than automata based engines such as the regex crate and RE2. But I'd guess that PCRE2 likely has some UTF-8 specific optimizations inside of it. They just don't show up as much (I assume) in fundamental aspects of its data structure design by virtue of its approach to regex searching.

Still, just supporting UTF-8, UTF-16 and UTF-32 is a very different thing than supporting generic alphabets.

> I'm just trying to grok the type of impossibility you're talking about here. Yes, I could read the code instead of asking :)

It's really all about coupling and being able to make assumptions that the coupling grants you. If you remove the coupling---and that's what abstraction over arbitrary alphabets actually is---then you have to either remove the assumptions or push those assumptions into the abstraction. And whether that's even possible at all depends on the nature of how those assumptions are used in the chosen representations.

This is why Unicode is such a beastly thing to support in a regex engine. It runs roughshod all over ASCII assumptions that are extremely convenient in contexts like this. Something like `\w` that is ASCII aware can be implemented with a single DFA state. But as soon as you make `\w` Unicode-aware, you're now talking about hundreds of DFA states when your alphabet is `u8`. You could switch to making your alphabet be the space of all Unicode codepoints (and thus collapse your DFA back down into a single state), but now your state representation basically demands to use a sparse representation which in turn requires more computational resources for a lookup (whether it's a hashmap or a linear/binary search). If your alphabet size is small enough to use a dense representation, then your lookup is literally just a pointer offset and dereference. That operation is categorically different than, say, a hash lookup and there's really nothing you can do to change that.
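A sketch of the ASCII side of this: one dense 256-entry table answers ASCII `\w` membership with a single index, which is exactly what "a single DFA state" buys you. (The table below is illustrative, not the crate's actual representation.) Unicode-aware `\w` over `u8` can't be one table, because word codepoints span multi-byte UTF-8 sequences that take many states to track.

```rust
// ASCII-aware \w as one dense table: 256 entries, one per possible byte.
// Membership is a single array index, the same cost as one DFA transition.
fn ascii_word_table() -> [bool; 256] {
    let mut t = [false; 256];
    for b in 0u8..=255 {
        t[b as usize] = b.is_ascii_alphanumeric() || b == b'_';
    }
    t
}

fn main() {
    let t = ascii_word_table();
    assert!(t[b'a' as usize]);
    assert!(t[b'Z' as usize]);
    assert!(t[b'_' as usize]);
    assert!(!t[b' ' as usize]);
    // 'é' encodes as 0xC3 0xA9 in UTF-8: neither byte alone tells you
    // whether the codepoint is a word character, so a Unicode-aware \w
    // over u8 needs extra states to remember partial sequences.
    assert!(!t[0xC3]);
}
```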

The bottom line here is that if you want a regex engine on a generic alphabet, then you probably don't care about perf. (As folks in this thread have come out and said.) And if you don't care about perf, it turns out you don't need something like the regex crate at all. You'd be able to build something a lot simpler. By at least an order of magnitude.

Why hasn't someone built this "simpler" library? Well, they have. I've linked to at least one elsewhere in this thread. It just turns out that they are often not productionized and don't catch on. My thesis for why is, as I've said elsewhere in this thread, that the utility of this sort of generic support is often overstated. I could be wrong. Maybe someone just really needs to do the work to make a great library. I doubt it, but it's not really a question that I want to put the work into answering.

[1]: https://old.reddit.com/r/rust/comments/4a8pbv/the_regex_crat...


The extent to which this is infeasible is the extent to which Rust is not good enough, sigh.

E.g. it would be very nice to derive Unicode regexes on u8 (as you say) from the serial composition `user_regex . unicode_regex`. Fusing that composition into a single automaton is a handy generic thing to have in the library bag of tricks.


That's a ridiculous conclusion to draw IMO.


I dunno, I believe in zero cost abstractions. Call it a white whale, but I am still chasing it.

I am not disagreeing with your assessment, I am saying a better world is possible where these hurdles are not so insurmountable.


I of course believe a better world is possible. That's why I went out and did something that nobody had ever done before: I produced a documented and sprawling API of regex engine internals as reusable abstractions in a separately versioned library. Rust is the only reason I was able to even consider doing that in the first place.

What you're talking about is categorically different. Your idea is so out there that I can't even begin to dream up a programming language that would resolve all of my concerns and challenges with building a general purpose regex library with Unicode support that is one of the fastest in the world, supports arbitrary alphabets and is actually something that I can maintain with reasonablish compile times.

IMO, I'm the optimist here and you're the pessimist. From my perspective, I see and celebrate what Rust has allowed me to accomplish. But from your perspective, what you see (or what you comment on, to be more precise) is what Rust has supposedly held me back from accomplishing. (And even that is something of a stretch, because it isn't at all clear to me that programming language expressivity and the concept of "zero cost abstractions" is arbitrarily powerful in the first place. See, for example, the premise of the 2014 film "Lucy".)


So I am a big fan of your work. It is really admirable what you've done, this latest push recently. Having the stabler `regex` vs more experimental/open-ended/user-assembly-required `regex-automata` is fantastic! I wish many more libraries did this, and I will happily spread around your blog post in arguing that specific ones should do so.

I wouldn't describe the difference between our views here as optimism vs pessimism. I like Rust decently enough to use it a fair amount and contribute to it somewhat. But I'm hungry for something more. It's fine if you aren't.

> But the thing is, coupling is generally how you make things fast, and making something more general usually involves de-coupling. So they are actually and usually exclusionary goals. But it takes a lot of domain knowledge to actually see this.

The biggest difference is that I don't believe this is true. The interface may be extremely gnarly and hard to express, but I do believe the interface can always in principle be sliced. It may not be practical (the interface is far more complicated than the thing before cutting) but it is still in principle possible.

---------------------

Let me put it this way, imagine some alternative world Unicode which was pathological in all the same ways (case, variable width encodings, all the other things I am forgetting). Imagine if someone forked your work to handle that. Clearly there would be lots of small things different --- it would be hard to abstract over --- but the broad brushstrokes would be the same?

I am not interested in supporting things which are not like Unicode; agreed that if you stray too far towards large alphabets etc., all the algorithms one would want would be different anyway. But supporting things like Unicode, or strictly easier than Unicode, should be doable in some language, I think.


> The biggest difference is that I don't believe this is true. The interface may be extremely gnarly and hard to express, but I do believe the interface can always in principle be sliced. It may not be practical (the interface is far more complicated than the thing before cutting) but it is still in principle possible.

Well "practicality" matters. And I used weasel words such as "generally."

It's one thing to make concrete suggestions about how to actually achieve the thing you're looking for. But when you're way out in right field picking daisies, it just comes across (to me) as unproductive.

> but the broad brushstrokes would be the same?

This is basically impossible to answer. I could just as well say that the "broad brushstrokes" are the same between the `regex` and `regex-lite` crates.



