Welcome to Community Server Sign in | Join | Help

Regex: Speed Kills

As an update on my attempt to find out more about balancing groups, I did manage to get a balancing group example with tags going earlier this evening. Barely. More relevantly I twiddled and twiddled until I came up with the right regex to do what had originally made me wonder about balancing groups in the first place.

In a nutshell, I wanted to be able to search for a term and upon finding it replace it with a valid substition. More accurately, I wanted to be able to throw some tags around this search term. Pretty straightforward, but I wanted to do it this using html (potentially tag soup really) and there were a couple constraints. The substitution wouldn't happen if the search term was located within a set of html anchor tags, and it wouldn't happen inside a tag of any kind itself.

The regex ended up being a single regex that matched the search text, and if it was inside of any tags, it also captured the names of those tags. It ended up being a little bit jinky, but basically you can take the Regex.Match.Group for the tag names, walk the captures and make sure none of them is an anchor and then make the replacement. This worked okay, but it was a long regex with a backreference and an alternation--it was actually the one of the longer regexes I've written, but some of the ones over at RegExLib are so wide they go off the edge of my screen. [1]

Scott sent me some other code that used a trio of regexes to find the start of the next anchor tag, perform any replacements, skip past the anchor tag, and repeat the process. This was more of a scanning approach with multiple expressions but it failed a couple of the test cases I was using--mainly it had issues with tags nested inside anchors.

Now I expected this approach to be faster and initially it was. When I added a backreference to compensate for the nested tag issues, it ended up slower than the single-regex-walk-the-captures approach.

I knew a straight scan was going to be much faster, so I wrote the code to do a stack-based lookahead type of straight scan. It took a few hundred loc but it ended up being much more manageable than the regular expression approaches when I wanted to add different replacement/scope options. In other words, more specific to task, but more flexible within task scope.

And naturally it was, in fact, a whole lot faster. The straight scan approach to search and replace ended up being about 10+ times faster than the regex-based approaches. It wasn't really much of a surprise, but it was a useful reminder that, despite their flexibility and power, regular expressions come do pack a fair performance cost.

Add this to some xml parsing I was doing the other day where an XmlTextReader outperformed xpath dom pulls by a similar margin and it's starting to feel like the week of the firehose. Which reminds me, it'd be nice if there were a free and easy way to make little light state machines for these kinds of things.

Respect the firehose.

[1] Every time I go back to RegExLib (which is pretty useful when you have a common pattern to match), I keep wanting to see some more logically complex expressions. It seems like most of the ones up there are geared towards specific format matching or scenarios--I'd like to see some of the experts busting loose with some crazy zero-width-asserting-maybe-I'm-backreferences-maybe-I-don't-feel-like-it expressions.

Published Wednesday, August 27, 2003 1:43 AM by grant

Comments

Friday, September 26, 2003 4:35 AM by grant

# re: Regex: Speed Kills

Grant, here's something that I came up with recently that does a similar task.

http://weblogs.asp.net/dneimke/posts/26454.aspx
Friday, August 13, 2004 6:42 PM by grant

# re: Regex: Speed Kills

Grant, can you post your slow regex? I'm having trouble with nested balancing groups and an example, slow though it may be, would help.
Tuesday, August 17, 2004 4:45 PM by grant

# re: Regex: Speed Kills

Sorry Jon, I took a look for it, but I'm pretty sure I flushed the original prototype once it got incorporated into the main .Text trunk.
Anonymous comments are disabled