Let's continue our study of regular expressions. In this article we will look at predefined character classes as well as quantification (searching for sequences).
Predefined Character Classes
The Pattern class API contains predefined character classes that offer convenient shortcuts for commonly used regular expressions. In this table, the constructs in the left column are shorthand representations of the expressions in the right column. For example, \d means a digit (0-9), and \w means a word character (any uppercase or lowercase letter, an underscore, or a digit). Use predefined character classes whenever possible: they make your code easier to read and errors easier to fix. Constructs that start with a backslash are called escaped constructs. In previous articles we already discussed escaping special characters with a backslash, or with the \Q and \E markers, so that they are treated as ordinary characters. If you use an escaped construct inside a string literal, you need to escape the backslash itself for the code to compile.
private final String REGEX = "\\d"; // a digit
In this example, \d is the regular expression; the additional backslash is necessary for the program to compile. Our test program reads regular expressions directly from the console, so no additional backslash is needed there. The test runs for this section demonstrate the use of predefined character classes. In the first three, the regular expression is simply "." (the dot special character), which means any character, so the search succeeds in all cases. The other examples use predefined character classes, whose meanings we discussed in the table above.
Quantifiers
Quantifiers let you specify how many times a character or construct must occur for a match. Let's take a closer look at how greedy, reluctant (lazy), and possessive quantifiers work. At first glance it may seem that the quantifiers X?, X?? and X?+ do the same thing: "X occurs once or not at all." There are subtle differences in how these quantifiers are implemented, which we will look at below.
Zero-Length Matches
Let's start with the greedy quantifiers. Let's write three different regular expressions: the letter "a" followed by the special character ?, * or +. Let's see what happens when we test these regular expressions on an empty string. The search succeeds in the first two cases, because the expressions a? and a* allow the character "a" to be missing from the string. Also note that the start index and the end index of the match are the same (0). Since the input string has no length, the program finds "nothing" :) at index 0. This is called a zero-length match. Such matches occur in several cases: when the input string is empty, at the beginning of the input string, after the last character of the string, or between any two characters of the string. Zero-length matches are easy to spot: they start and end at the same position.
Let's explore zero-length matches with a few more examples. If we change the input string to the single character "a", we observe an interesting effect. All three quantifiers find the character "a", but the first two, which allow the character to be absent, also find a zero-length match at index 1, after the last character of the string. This happens because the test program treats the input "a" as a string and keeps scanning it until there are no more matches. Depending on the quantifier used, the program will or will not find "nothing" at the end of the string.
Now let's change the input string to a sequence of five letters "a". The regular expression a? finds a separate match for each letter in the string, followed by a zero-length match at index 5. The expression a* finds two matches: the whole sequence "aaaaa" and a zero-length match at index 5. And finally, the regular expression a+ finds only the sequence "aaaaa", without finding "nothing" :)
What happens if the input string contains different characters? Take "ababaaaab", for example. The character "b" is at indices 1, 3, and 8, and the program finds zero-length matches at these positions. The regular expression a? pays no attention to "b"; it simply looks for the presence (or absence) of the character "a". If the quantifier allows "a" to be absent, every character in the string other than "a" shows up as a zero-length match.
To find sequences of a given length, specify the length in curly braces. The regular expression a{3} searches for a sequence of three "a" characters. In the first test nothing is found, because the string does not contain enough "a" characters. The second string contains exactly three, and the program finds them. The third test also finds a match at the beginning of the string; whatever follows the third character is irrelevant to that match. If the pattern occurs again after that point, there are several matches.
To specify only a minimum sequence length, add a comma after the number:
Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.
In this example, the program finds only one match, because the entire string meets the minimum length requirement of three "a" characters.
Finally, you can also set a maximum sequence length by adding an upper bound after the comma. In the test for this case, the first match ends on the sixth character. The second match consists of the characters after the sixth one, because they satisfy the minimum length requirement. If the string were one character shorter, there would be no second match.
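The quantifier behaviour described in this section can be replayed without the interactive test program. The sketch below is a minimal stand-in under stated assumptions: the class name and the helper findAll are hypothetical, and the bounded pattern a{3,6} is inferred from the description of the last test (minimum of three, first match ending on the sixth character), not quoted from it. Each match is reported as "text"[start,end), so a zero-length match is any entry whose start and end are equal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Describes every match as "text"[start,end); start == end means zero length.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add("\"" + matcher.group() + "\"["
                    + matcher.start() + "," + matcher.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        // Each row is {regex, input}; the inputs follow the article's tests.
        String[][] cases = {
            {"a?", ""}, {"a*", ""}, {"a+", ""},
            {"a?", "a"}, {"a*", "aaaaa"}, {"a?", "ababaaaab"},
            {"a{3}", "aaaaaaaaa"}, {"a{3,}", "aaaaaaaaa"},
            {"a{3,6}", "aaaaaaaaa"}, // assumed pattern for the maximum-length test
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " on \"" + c[1] + "\" -> " + findAll(c[0], c[1]));
        }
    }
}
```

Reporting indices as a half-open range mirrors the "starting at index ... and ending at index ..." wording of the console output, where the end index is one past the last matched character.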