Poor performance of regular expressions?

Posted by Eyal Schneider on May 21, 2009 The java.util.regex package was added to Java in version 1.4. It is a very powerful tool and one needs to become a master to use it correctly. Even when a regular expression is true, it can be very slow if not written intelligently. Continue reading if you want to understand the cause of the problems, or scroll to the end of the page where you will find 10 useful tips for improving the performance of regular expressions in Java.

Is it really that slow?

Let's say we want to select only lines containing the sequence of characters "a" and "b". The correct solution might be: (a*b*)* However, if you run the expression with, for example, the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” , it will take several minutes before it finishes and reports no matches! Of course, the best regex in this case would be: (a|b)* This takes less than a millisecond on my machine with the same string. Clearly there is a performance issue here.

Why is this happening?

Like most regexp engines, Java uses an NFA (Non-Deterministic Finite Automata) approach. The engine scans the regex components one by one and advances through the input string accordingly. And he can go back to the beginning to find appropriate alternatives if he reaches a “dead end”. Alternative results are obtained by using regular structures such as quantifiers ( *, +, ? ) and alternations (eg a|b|c|d ). This research technique is called backtracking. In the terrible example above, the engine will actually look through ALL series decompositions of the symbol "a" into smaller series until it realizes that there are no matches. This example shows how the backtracking algorithm can result in an exponential time estimate (depending on the length of the input string). This also shows an important property of NFA: there will always be worst cases that almost match the pattern. If a match is found, the search stops. The other main approach for use in regex is DFA (Deterministic Finite Automaton). In this approach, the regular expression actually builds an automaton that is used to traverse the input strings character by character without backtracking. This gives linear time to the entire input, regardless of the complexity of the regular expression. Instead of sequentially scanning a string for matches (as in NFA), DFA simulates parallel scanning. So why do Java (and .NET, Perl, Python, Ruby, PHP, etc.) use NKA and not DKA which has much better behavior? The reason is that NKA has a number of significant advantages:

Compiles faster and requires much less memory
Allows some useful features (see Sun's tutorial for details ):

Group Capture and Backlinks
Positional check
Extended Quantifiers (Greedy and Lazy)

It is important to note that the popular terms NKA and DKA are imprecise when used in the context of regular expressions. In theory, these two models have the same computing power. This means that you cannot write a regular expression in one automata model that would be impossible to express in another. In practice, there is a need for more capabilities so that the two types of implementation diverge in semantics. NKA engines provide more flexibility making them superior to DKA in computing power. Due to the speed of DFA and the unique features of NFA, there are 2 more “prefabricated” ways to implement regular expressions. Some implementations use both types (eg GNU egrep, which selects a specific engine at runtime), and some have managed to implement a truly hybrid version (eg Tcl regexps) with all the benefits.

Adviсe

The following are some tips on how to avoid regex efficiency problems in Java. Many of them are aimed at reducing returns.

1) Pre-compilation

Trite, but worth mentioning. If you use the regexp more than once, be sure to compile it in advance:

// компиляция p = Pattern.compile(regex, flags); … // использование Matcher a = p.matcher(input);

2) Lazy Quantifiers vs. Greedy Quantifiers

By default, quantifiers ( * + ? ) are greedy. This means they start matching with the longest possible sequence and then gradually work back if necessary. If you know in advance that the matches will usually be short, you should use lazy quantifiers. They start with the smallest match and move further if necessary. Let's say we want to find only lines that match the sequence "hello". The regular .*hello.* will do everything right, but if we know that "hello" usually appears closer to the beginning of the text, then .*?hello.* will work faster on average.

3) Use super-greedy quantifiers where possible

Unlike lazy quantifiers, which affect performance but do not affect regular behavior, super-greedy quantifiers can actually change the meaning of a regular expression. When *+ is used instead of * , the first match will be greedy (that is, the largest possible as if it were just *), but there will be no fallback if it fails, even if this causes the entire search to fail. When might this be useful? Let's say we need to find text in quotes. The regular \"[^\"]*\" will work fine. However, it will make unnecessary indentation in negative cases (for example, “bla bla bla). Using \"[^\"]*+\" will eliminate rollbacks without changing meaning of the expression. Independent grouping achieves the same effect and gives even more control (see Sun's tutorial ).

4) Avoid group capture

Any expression in parentheses is considered a group by default. This has a small impact on performance. Make your groups "uncapturable" whenever possible by starting them with (?: instead of ( .

5) Use interleaving wisely

When interleaving is used (eg Paul|Jane|Chris ), the order in which the engine tries to match the options is the same as the order in which they appear. You can take advantage of this feature and place the most common options closer to the beginning. This will improve the average positive response time.

6) Avoid ambiguity

Write regexps in such a way as to minimize the number of different matches in the input string. For example: the regular expression (a*b*)* given at the beginning of the article allows the string "aabb" to be interpreted in too many ways: (a2b2) (a1)(a1)(b1)(b1) (a2)(b2) (a1)(a1b2) etc… Regexp (a|b)*, on the other hand, only interprets unique combinations positively. This is very important to reduce returns in near- match cases.

7) Preview

Preview allows you to add restrictions on sequences to the left/right of the current position. In particular, with a negative lookahead, you can search for lines that do not contain some sequence (what would we do without this!). How can this help increase productivity? Let's say we want to take the URL from the link tag. Consider the following regexp: a .* href=(\S*).*/ For regular tags, this expression will only match the address if the text contains the "href" attribute (\S is used for all characters except delimiters) . But on some unusual tags, for example, a rollback will occur. For example: “a href= href=href=…. href=something.” The following regexp will prevent this from happening when replacing the “.*” in an expression with something that does not match “href”: a ((?!href).)* href=(\S*)((?!href).)*/

8) Specify the length

Java contains a regexp optimizer that checks the length of the input string against the minimum and maximum lengths obtained from the regular expression. This allows you to stop searching immediately in some cases. To assist this mechanism, the number of repetitions should be specified whenever possible (for example, [01]{6} matches all binary strings six characters long).

9) Select identical lines

Sometimes strings that are the same are hidden inside groups or alternatives: (hello|hell|heel) This expression can be simplified to: he(llo|ll|el) By doing this we give the regexp optimizer more information.

10) Test your regexp

It may be wise to test the regular expression first when it will be used in a performance-critical application. Write a micro-benchmark that tests your expression on various input data. Be sure to test on data of varying lengths, and also data that closely matches your sample.

Links:

http://java.sun.com/docs/books/tutorial/essential/regex/index.html http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page =1 http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics -of-Expression-Processing/