
Basics of regular expressions in Java. Part 3

Published in the Random EN group
Let's continue our study of regular expressions. In this article we will look at predefined character classes and at quantifiers, which match repeated sequences.

Predefined Character Classes

The Pattern class API contains predefined character classes that offer convenient shortcuts for commonly used regular expressions. Each construct is shorthand for a longer expression. For example, \d means a digit ([0-9]), and \w means a word character: any uppercase or lowercase letter, digit, or underscore ([a-zA-Z_0-9]). Use predefined character classes whenever possible. They make your code easier to read and help avoid errors caused by malformed character classes.

Constructs starting with a backslash are called escaped constructs. In previous articles, we have already talked about escaping special characters with a backslash, or with \Q and \E, so that they are treated as ordinary characters. If you use a backslash inside a Java string literal, you need to escape the backslash itself for the code to compile.
private final String REGEX = "\\d"; // a digit
In this example, \d is the regular expression; the additional backslash is needed for the program to compile. Our test program reads regular expressions directly from the console, so no extra backslash is needed there. When the regular expression is simply "." (the dot metacharacter), it means any character (by default, any character except a line terminator), so the search succeeds on every input that contains at least one such character. The other predefined character classes behave as described above.
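The lost console output can be approximated with a short program. The following is a minimal sketch, not the article's original test program: the class name `PredefinedClassesDemo`, the helper `findAll`, and the sample input `"a1 _Z"` are assumptions made for illustration. It reports each match as `text@start-end`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name, helper, and input are illustrative only.
class PredefinedClassesDemo {

    // Collects every match of regex in input as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a1 _Z";
        // In Java source code every regex backslash is written twice.
        System.out.println("\\d " + findAll("\\d", input)); // digits
        System.out.println("\\D " + findAll("\\D", input)); // non-digits
        System.out.println("\\w " + findAll("\\w", input)); // word characters
        System.out.println("\\W " + findAll("\\W", input)); // non-word characters
        System.out.println("\\s " + findAll("\\s", input)); // whitespace
        System.out.println(".  " + findAll(".", input));    // any character
    }
}
```

With this input, \d should match only "1", while \w should match "a", "1", "_", and "Z" but not the space.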

Quantifiers

Quantifiers allow you to specify how many occurrences to match. Let's take a closer look at how greedy, reluctant (lazy), and possessive quantifiers work. At first glance it may seem that the quantifiers X?, X?? and X?+ all do the same thing: "X is present once or not at all." There are subtle differences in how they match, which we will look at below.

Zero length matches

Let's start with the greedy quantifiers. Take three regular expressions: the letter "a" followed by the special character ?, * or +. Tested against an empty input string, a? and a* both succeed, because they allow the character "a" to be absent, while a+ finds nothing. Note that the start and end index of the match are the same (0). Since the input string has no length, the program finds "nothing" :) at the first position. This case is called a zero-length match. Such matches can occur when the input string is empty, at the beginning of the input string, after its last character, or between any two characters. Zero-length matches are easy to spot: they start and end at the same position.

Let's explore zero-length matches with a few more examples. Change the input string to the single character "a" and an interesting effect appears: all three quantifiers find the character "a", but the first two, which allow the character to be absent, also find a zero-length match at index 1, after the last character of the string. This happens because the matcher treats "a" as occupying the cell between index 0 and index 1, and the program keeps searching until there are no more matches. Depending on the quantifier used, it will or will not find "nothing" at the end of the string.

Now change the input string to a sequence of five letters, "aaaaa". The expression a? finds a match for each letter separately, followed by a zero-length match at index 5. The expression a* finds two matches: the whole sequence "aaaaa" and a zero-length match at index 5.
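The zero-length matches described so far can be reproduced with a small sketch. The class name `ZeroLengthDemo` and the helper `findAll` are hypothetical and stand in for the article's console test program; each match is printed as `text@start-end`, so an empty match at index n appears as `@n-n`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not the article's original test program.
class ZeroLengthDemo {

    // Returns every match as "text@start-end"; an empty match shows as "@n-n".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] regexes = {"a?", "a*", "a+"};
        String[] inputs = {"", "a", "aaaaa"};
        // Try each greedy quantifier on each input string.
        for (String input : inputs) {
            for (String regex : regexes) {
                System.out.println(regex + " on \"" + input + "\": "
                        + findAll(regex, input));
            }
        }
    }
}
```

Matches whose start and end index are equal are the zero-length ones; a+ should never produce them.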
And finally, the regular expression a+ finds only the sequence of "a" characters in "aaaaa", without finding "nothing" :)

What happens if the input string contains different characters, for example "ababaaaab"? The character "b" is at indices 1, 3, and 8, and a? reports zero-length matches at these positions (plus one more at index 9, the end of the string). The expression a? pays no attention to "b"; it simply checks for the presence or absence of the character "a". If the quantifier allows "a" to be absent, every character in the string other than "a" shows up as a zero-length match.

To find sequences of a given length, specify the length in curly braces. The regular expression a{3} searches for a sequence of exactly three "a" characters. If the input has fewer than three a's, nothing is found. An input of exactly three is matched in full. A slightly longer input also yields a match at the beginning of the string, but what remains after the third character is too short to satisfy the expression; if enough characters remain, there will be several matches.

To specify the minimum sequence length, use:
Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.
In this example, the program finds only one match: once the minimum of three "a" characters is met, the greedy quantifier takes every remaining "a" as well. Finally, you can set a maximum sequence length by adding an upper bound, for example a{3,6}. On the same nine-character input, the first match ends after the sixth character. The second match consists of the characters after the sixth, because they still satisfy the minimum length requirement. If the string were one character shorter, there would be no second match.
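The bounded quantifiers can be compared side by side with a short sketch. The class name `BoundedQuantifierDemo` and the helper `findAll` are hypothetical, and the upper bound in a{3,6} is an assumed example chosen to fit the description above; matches are printed as `text@start-end`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; inputs and bounds are chosen for illustration.
class BoundedQuantifierDemo {

    // Returns every match as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa"; // nine "a" characters

        // Exactly three: too short, exact, one left over, three full groups.
        System.out.println(findAll("a{3}", "aa"));
        System.out.println(findAll("a{3}", "aaa"));
        System.out.println(findAll("a{3}", "aaaa"));
        System.out.println(findAll("a{3}", nine));

        // At least three: the greedy quantifier takes the whole run.
        System.out.println(findAll("a{3,}", nine));

        // Between three and six: a six-character match, then the remaining three.
        System.out.println(findAll("a{3,6}", nine));

        // One character shorter: the two left over are below the minimum.
        System.out.println(findAll("a{3,6}", "aaaaaaaa"));
    }
}
```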

Using character groups and classes with quantifiers

Up to this point, we have tested quantifiers on strings containing the same character. A quantifier applies only to the single character immediately before it, so the regular expression abc+ means "ab" followed by "c" one or more times. It does not mean "abc" one or more times. But quantifiers can also be attached to groups and character classes, such as [abc]+ (a or b or c, one or more times) or (abc)+ ("abc" one or more times).

Let's find the group of characters "dog" three times in a row. The expression (dog){3} finds a match, because the quantifier applies to the whole group. If you remove the parentheses, the quantifier {3} applies only to the letter "g". You can also use quantifiers with character classes: in [abc]{3} the quantifier {3} applies to the character class in brackets, while in abc{3} it applies only to the character "c".
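The effect of attaching a quantifier to a group or a character class can be checked with a sketch like the one below. The class name `GroupQuantifierDemo`, the helper `findAll`, and the input strings are assumptions made for illustration, not the article's original test data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class GroupQuantifierDemo {

    // Returns every match as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String dogs = "dogdogdogdogdogdog";
        // {3} applies to the whole group "dog".
        System.out.println(findAll("(dog){3}", dogs));
        // {3} applies only to "g": needs "doggg", which never occurs.
        System.out.println(findAll("dog{3}", dogs));

        String letters = "abccabaaaccbbbc";
        // {3} applies to the class: any three characters from a, b, c.
        System.out.println(findAll("[abc]{3}", letters));
        // {3} applies only to "c": needs "abccc", which never occurs.
        System.out.println(findAll("abc{3}", letters));
    }
}
```

The two variants without parentheses or brackets should find nothing on these inputs, which makes the scope of the quantifier easy to see.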

Differences between greedy, reluctant and possessive quantifiers

There are subtle differences between greedy, reluctant (lazy), and possessive quantifiers. Greedy quantifiers are so named because they try to find the longest possible match: the matcher first tries to "eat" the entire input string; if the overall match fails, it backs off by one character and tries again, repeating until a match is found or no characters remain. Reluctant quantifiers do the opposite: they start at the beginning of the input and consume one character at a time until they find a match. Finally, possessive quantifiers consume the entire input once and never back off, unlike greedy ones.

For demonstration, we will use the string xfooxxxxxxfoo. The first example, .*foo, uses the greedy quantifier .* to find any character, zero or more times, followed by the characters "f" "o" "o". Since the quantifier is greedy, the match contains the entire string. A greedy quantifier does not report the earlier "foo" as a separate match: .* first consumes the whole string and then backs off only as far as needed for the final "foo" to match, so the single match runs from index 0 to 13.

The second example, .*?foo, is reluctant and starts from the beginning of the string, adding character by character. The matcher begins by consuming "nothing", but since "foo" is not at the beginning of the string, it consumes the character "x", after which the first match is found between indices 0 and 4. The search continues to the end of the string, and the second match is found between indices 4 and 13.

The third example, .*+foo, finds no matches because the quantifier is possessive. The expression .*+ "eats" the entire string and gives nothing back, leaving nothing for "foo". Use a possessive quantifier when you want to take everything without ever backing off; when a match cannot be found immediately, it can be faster than the equivalent greedy quantifier.

That's all! Link to source: Basics of regular expressions in Java. Part 3
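For readers who want to reproduce the greedy, reluctant, and possessive comparison, here is a minimal sketch. The class name `QuantifierModesDemo` and the helper `findAll` are hypothetical; only the input string xfooxxxxxxfoo and the three expressions come from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not the article's original test program.
class QuantifierModesDemo {

    // Returns every match as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: consumes everything, then backs off to fit the last "foo".
        System.out.println("greedy     " + findAll(".*foo", input));
        // Reluctant: grows one character at a time, so it stops at each "foo".
        System.out.println("reluctant  " + findAll(".*?foo", input));
        // Possessive: consumes everything and never backs off, so "foo" fails.
        System.out.println("possessive " + findAll(".*+foo", input));
    }
}
```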