
Regular expressions in Java, part 2

Published in the Random EN group
We present a translation of a short guide to regular expressions in Java, written by Jeff Friesen for the JavaWorld website. For ease of reading, we have divided the article into several parts. Previous part: Regular Expressions in Java, Part 1

Merging multiple ranges

You can merge multiple ranges into a single character class by placing them side by side. For example, the class [a-zA-Z] matches all Latin alphabetic characters in lower or upper case.

Combining Character Classes

A character class union consists of several nested character classes and matches all characters that belong to the resulting union. For example, the class [a-d[m-p]] matches the characters from a to d and from m to p. Consider the following example:
java RegexDemo [ab[c-e]] abcdef
This example finds the characters a, b, c, d, and e in abcdef:
regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
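The range merge and the union above can also be tried directly against java.util.regex. The following is a minimal sketch, not the article's RegexDemo: the class name UnionDemo and the helper findAll are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class UnionDemo {
    // Collects every substring of input that matches regex, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Two ranges placed side by side in one class.
        System.out.println(findAll("[a-zA-Z]", "a1Z-"));      // [a, Z]
        // Union of the characters a, b and the nested range c-e.
        System.out.println(findAll("[ab[c-e]]", "abcdef"));   // [a, b, c, d, e]
    }
}
```

Because f belongs to neither part of the union, it is the only character of abcdef that is skipped.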

Character class intersection

The intersection of character classes consists of the characters common to all nested classes and matches only those common characters. For example, the class [a-z&&[d-f]] matches the characters d, e, and f. Consider the following example:
java RegexDemo "[aeiouy&&[y]]" party
Note that on my Windows operating system, double quotes are required because the command shell treats & as a command separator. This example finds only the character y in party:
regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4

Subtracting character classes

Character class subtraction matches all characters except those contained in the nested negated character classes. For example, the class [a-z&&[^m-p]] matches the characters from a to l and from q to z. Consider the following example:
java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg
This example finds the characters d and f in abcdefg:
regex = [a-f&&[^a-c]&&[^e]]
input = abcdefg
Found [d] starting at 3 and ending at 3
Found [f] starting at 5 and ending at 5
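Intersection and subtraction can be checked the same way from Java code. This is a small sketch under the assumption that a plain java.util.regex loop is an acceptable stand-in for RegexDemo; the class IntersectionDemo and the method matchedChars are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class IntersectionDemo {
    // Concatenates all matches of regex found in input.
    static String matchedChars(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Intersection: only y is both in aeiouy and in [y].
        System.out.println(matchedChars("[aeiouy&&[y]]", "party"));          // y
        // Subtraction: a-f without a-c and without e leaves d and f.
        System.out.println(matchedChars("[a-f&&[^a-c]&&[^e]]", "abcdefg"));  // df
        // Subtraction: a-z without m-p.
        System.out.println(matchedChars("[a-z&&[^m-p]]", "lmnopq"));         // lq
    }
}
```

No shell quoting is needed here because the & characters are inside a Java string literal rather than on a command line.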

Predefined Character Classes

Some character classes appear frequently enough in regular expressions to justify shorthand notation. The Pattern class offers predefined character classes as such abbreviations. You can use them to simplify your regular expressions and minimize syntax errors. There are several categories of predefined character classes: standard, POSIX, java.lang.Character, and Unicode properties such as script, block, category, and binary. The following list shows only the standard category:
  • \d: A digit. Equivalent to [0-9].
  • \D: A non-digit character. Equivalent to [^0-9].
  • \s: A whitespace character. Equivalent to [ \t\n\x0B\f\r].
  • \S: A non-whitespace character. Equivalent to [^\s].
  • \w: A word character. Equivalent to [a-zA-Z_0-9].
  • \W: A non-word character. Equivalent to [^\w].
The following example uses the predefined character class \w to find all word characters in the input text:
java RegexDemo \w "aZ.8 _"
Look closely at the following execution results, which show that the period and space characters are not considered word characters:
regex = \w
input = aZ.8 _
Found [a] starting at 0 and ending at 0
Found [Z] starting at 1 and ending at 1
Found [8] starting at 3 and ending at 3
Found [_] starting at 5 and ending at 5
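The same predefined classes can be used from Java source code. Keep in mind that a backslash inside a Java string literal must itself be escaped, so \w is written as "\\w". The sketch below is illustrative only; PredefinedDemo and matchedChars are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class PredefinedDemo {
    // Concatenates all matches of regex found in input.
    static String matchedChars(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "aZ.8 _";
        System.out.println(matchedChars("\\w", input));               // aZ8_
        System.out.println(matchedChars("\\d", input));               // 8
        // \W matches exactly what \w rejects: the period and the space.
        System.out.println("[" + matchedChars("\\W", input) + "]");   // [. ]
    }
}
```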
Line separators
The SDK documentation for the Pattern class describes the dot metacharacter as a predefined character class that matches any character except line separators (one- or two-character sequences that mark the end of a line). The exception is dotall mode (which we'll discuss later), in which the dot also matches line separators. The Pattern class recognizes the following line separators:
  • carriage return character (\r);
  • newline (line feed) character (\n);
  • a carriage return character immediately followed by a newline character (\r\n);
  • next line character (\u0085);
  • line separator character (\u2028);
  • paragraph separator character (\u2029).
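To see how the dot treats line separators, one can count its matches with and without dotall mode. This is a minimal sketch that assumes the embedded flag (?s) to switch dotall mode on; DotDemo and countMatches are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class DotDemo {
    // Counts how many matches of regex the matcher finds in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Six characters: a, CR, LF, b, the Unicode line separator, c.
        String input = "a\r\nb\u2028c";
        System.out.println(countMatches(".", input));      // 3: only a, b, and c
        System.out.println(countMatches("(?s).", input));  // 6: every character
    }
}
```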

Capturing groups

A capturing group saves the characters it matches for later use during pattern matching. This construct is a sequence of characters enclosed in the parenthesis metacharacters ( ). All characters within a capturing group are treated as a single unit during pattern matching. For example, the capturing group (Java) combines the letters J, a, v, and a into a single unit. This capturing group finds all occurrences of Java in the input text. With each match, the previously saved Java characters are replaced by those of the next match.
Capturing groups can be nested within other capturing groups. For example, in the regular expression (Java( language)), the group ( language) is nested inside the outer group (Java( language)). Each nested or non-nested capturing group is assigned a number, starting from 1, and numbering proceeds from left to right by opening parenthesis. In the previous example, (Java( language)) is capturing group number 1 and ( language) is capturing group number 2. In the regular expression (a)(b), (a) is capturing group number 1 and (b) is capturing group number 2.
Matches saved by capturing groups can later be accessed through backreferences. A backreference is written as a backslash followed by a digit that corresponds to the number of a capturing group, and it refers to the characters in the text captured by that group. When the matcher encounters a backreference, it looks up the saved match of the group with that number and then tries to match the same characters again at the current position. The following example uses a backreference to find a grammatical error in text:
java RegexDemo "(Java( language)\2)" "The Java language language"
This example uses the regular expression (Java( language)\2) to find a grammatical error, the duplicated word language immediately following Java, in the input text The Java language language.
This regular expression specifies two capturing groups: number 1, (Java( language)\2), which matches Java language language, and number 2, ( language), which matches a space character followed by language. The backreference \2 recalls the saved match of group number 2, so that the matcher can search for a second occurrence of a space followed by language immediately after the first occurrence. The results of running RegexDemo are as follows:
regex = (Java( language)\2)
input = The Java language language
Found [Java language language] starting at 4 and ending at 25
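In Java code, the text saved by each capturing group is available through Matcher.group(int), where group 0 is the whole match. The following sketch shows the numbering for the regular expression above; BackrefDemo and firstMatchGroups are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class BackrefDemo {
    // Returns groups 0..groupCount() of the first match, or null if nothing matches.
    static String[] firstMatchGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String regex = "(Java( language)\\2)";
        String[] g = firstMatchGroups(regex, "The Java language language");
        System.out.println("[" + g[1] + "]");  // [Java language language]
        System.out.println("[" + g[2] + "]");  // [ language]
        // Without the duplicated word, the backreference \2 cannot match.
        System.out.println(firstMatchGroups(regex, "The Java language") == null);  // true
    }
}
```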

Boundary matchers

Sometimes you need to match a pattern at the beginning of a line, at word boundaries, at the end of the text, and so on. You can do this with one of the Pattern class's boundary matchers, which are regular expression constructs that look for matches in the following locations:
  • ^: Start of line;
  • $: End of line;
  • \b: Word boundary;
  • \B: Non-word boundary;
  • \A: Start of text;
  • \G: End of previous match;
  • \Z: End of text, not counting the final line separator (if present);
  • \z: End of text.
The following example uses the ^ boundary matcher to find lines that begin with The, followed by zero or more word characters:
java RegexDemo "^The\w*" Therefore
The ^ character specifies that the first three characters of the input text must match the consecutive pattern characters T, h, and e, which can be followed by any number of word characters. Here is the result of the execution:
regex = ^The\w*
input = Therefore
Found [Therefore] starting at 0 and ending at 8
What happens if you change the command line to java RegexDemo "^The\w*" " Therefore"? No match will be found because Therefore is preceded by a space character in the input text.
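Both outcomes can be reproduced in a few lines of Java. The sketch below also adds a \b example that does not appear in the article, so treat it as an assumption-level illustration; BoundaryDemo and indexOfMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class BoundaryDemo {
    // Returns the start index of the first match of regex in input, or -1 if there is none.
    static int indexOfMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfMatch("^The\\w*", "Therefore"));    // 0
        // The leading space prevents a match at the start of the text.
        System.out.println(indexOfMatch("^The\\w*", " Therefore"));   // -1
        // \b requires a word boundary on both sides, so the cat inside concat is skipped.
        System.out.println(indexOfMatch("\\bcat\\b", "concat cat"));  // 7
    }
}
```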

Zero length matches

Sometimes, when working with boundary matchers, you will encounter zero-length matches. A zero-length match is a match that does not contain any characters. Such matches can occur in empty input text, at the beginning of the input text, after the last character of the input text, and between any two characters of the input text. Zero-length matches are easy to recognize because they always start and end at the same position. Consider the following example:
java RegexDemo \b\b "Java is"
This example searches for two consecutive word boundaries, and the results look like this:
regex = \b\b
input = Java is
Found [] starting at 0 and ending at -1
Found [] starting at 4 and ending at 3
Found [] starting at 5 and ending at 4
Found [] starting at 7 and ending at 6
We see several zero-length matches in the results. The ending positions here are one less than the starting positions because the RegexDemo source code in Listing 1 reports end() - 1.
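Matcher.end() returns the offset just after the last matched character, so for a zero-length match start() and end() are equal; the values one less than the start in the output above come only from the end() - 1 adjustment. A minimal sketch that relies on this property follows (ZeroLengthDemo and zeroLengthStarts are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class ZeroLengthDemo {
    // Returns the start index of every zero-length match, that is, start() == end().
    static List<Integer> zeroLengthStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Word boundaries lie before J, after a, before i, and after s.
        System.out.println(zeroLengthStarts("\\b\\b", "Java is"));  // [0, 4, 5, 7]
    }
}
```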

Quantifiers

A quantifier is a regular expression construct that explicitly or implicitly associates a numeric value with a pattern. This numeric value determines how many times to match the pattern. Quantifiers are divided into greedy, lazy, and super-greedy (the Pattern documentation calls the last two reluctant and possessive):
  • A greedy quantifier (?, *, or +) tries to find the longest match. You can specify X? to find one or no occurrences of X, X* to find zero or more occurrences of X, X+ to find one or more occurrences of X, X{n} to find exactly n occurrences of X, X{n,} to find at least n (and possibly more) occurrences of X, and X{n,m} to find at least n but not more than m occurrences of X.
  • A lazy quantifier (??, *?, or +?) tries to find the shortest match. You can specify X?? to find one or no occurrences of X, X*? to find zero or more occurrences of X, X+? to find one or more occurrences of X, X{n}? to find exactly n occurrences of X, X{n,}? to find at least n (and possibly more) occurrences of X, and X{n,m}? to find at least n but not more than m occurrences of X.
  • A super-greedy quantifier (?+, *+, or ++) is similar to a greedy quantifier, except that it makes only one attempt to find the longest match, while a greedy quantifier can make multiple attempts. You can specify X?+ to find one or no occurrences of X, X*+ to find zero or more occurrences of X, X++ to find one or more occurrences of X, X{n}+ to find exactly n occurrences of X, X{n,}+ to find at least n (and possibly more) occurrences of X, and X{n,m}+ to find at least n but not more than m occurrences of X.
The following example illustrates the use of a greedy quantifier:
java RegexDemo .*ox "fox box pox"
Here are the results:
regex = .*ox
input = fox box pox
Found [fox box pox] starting at 0 and ending at 10
The greedy quantifier (.*) finds the longest sequence of characters ending in ox. It first consumes the entire input text and then backs off until the remaining ox matches, which here happens at the very end of the input text. Consider now the lazy quantifier:
java RegexDemo .*?ox "fox box pox"
Its results:
regex = .*?ox
input = fox box pox
Found [fox] starting at 0 and ending at 2
Found [ box] starting at 3 and ending at 6
Found [ pox] starting at 7 and ending at 10
The lazy quantifier (.*?) finds the shortest sequence of characters ending in ox. It starts with an empty string and consumes one character at a time until ox matches. The matcher then continues from that point until the input text is exhausted. Finally, let's look at the super-greedy quantifier:
java RegexDemo .*+ox "fox box pox"
And here are its results:
regex = .*+ox
input = fox box pox
The super-greedy quantifier (.*+) does not find any matches because it consumes all the input text and leaves nothing to match the ox at the end of the regular expression. Unlike the greedy quantifier, the super-greedy quantifier does not back off.
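The three behaviors are easy to compare side by side in code. This is a small sketch rather than the article's listing; QuantifierDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class QuantifierDemo {
    // Collects every substring of input that matches regex, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "fox box pox";
        System.out.println(findAll(".*ox", input));   // greedy: [fox box pox]
        System.out.println(findAll(".*?ox", input));  // lazy: [fox,  box,  pox]
        System.out.println(findAll(".*+ox", input));  // super-greedy: []
    }
}
```

Note that the second and third lazy matches begin with the space that precedes box and pox.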

Zero length matches

Sometimes, when working with quantifiers, you will encounter zero-length matches. For example, the following greedy quantifier produces multiple zero-length matches:
java RegexDemo a? abaa
The results of running this example:
regex = a?
input = abaa
Found [a] starting at 0 and ending at 0
Found [] starting at 1 and ending at 0
Found [a] starting at 2 and ending at 2
Found [a] starting at 3 and ending at 3
Found [] starting at 4 and ending at 3
There are five matches in the execution results. Although the first, third, and fourth are expected (they correspond to the positions of the three letters a in abaa), the second and fifth may surprise you. They seem to indicate that a? matches b and the end of the text, but this is not the case. The regular expression a? does not look for b or for the end of the text. It looks for the presence or absence of a. When a? does not find a, it reports the absence as a zero-length match.
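The five matches can be listed from Java together with their start positions. The sketch below formats each match as start:text, a format chosen only for this illustration; OptionalDemo and describeMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class OptionalDemo {
    // Describes each match as "start:text"; a zero-length match has empty text.
    static List<String> describeMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Positions 1 (before b) and 4 (end of text) yield zero-length matches.
        System.out.println(describeMatches("a?", "abaa"));  // [0:a, 1:, 2:a, 3:a, 4:]
    }
}
```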

Nested flag expressions

Matchers make some default assumptions that can be overridden when compiling the regular expression into a pattern, a topic discussed later in this series. A regular expression can also override any of these defaults itself by using a nested flag expression. This construct consists of parentheses around a question mark (?) followed by a lowercase Latin letter. The Pattern class understands the following nested flag expressions:
  • (?i): Enables case-insensitive pattern matching. For example, with the command java RegexDemo (?i)tree Treehouse, the character sequence Tree matches the pattern tree. The default is case-sensitive matching.
  • (?x): Allows whitespace characters and comments starting with the # metacharacter within the pattern. The matcher ignores both. For example, with java RegexDemo ".at(?x)#match hat, cat, and so on" matter, the character sequence mat matches the pattern .at. By default, whitespace characters and comments are not allowed, and the matcher treats them as characters to be matched.
  • (?s): Enables dotall mode, in which the dot metacharacter matches line separators in addition to any other character. For example, the command java RegexDemo (?s). \n finds a newline character. The default is the opposite of dotall mode: line separators are not matched. For example, the command java RegexDemo . \n does not find a newline character.
  • (?m): Enables multiline mode, in which ^ matches the beginning and $ the end of each line. For example, java RegexDemo "(?m)^abc$" abc\nabc finds both abc sequences in the input text. By default, single-line mode is used: ^ matches the beginning of the entire input text and $ matches its end. For example, java RegexDemo "^abc$" abc\nabc reports that there are no matches.
  • (?u): Enables Unicode-aware case folding. When used together with (?i), this flag allows case-insensitive matching in accordance with the Unicode standard. By default, case-insensitive matching assumes that only US-ASCII characters are being matched.
  • (?d): Enables Unix lines mode, in which the matcher recognizes only the \n line separator in the context of the ., ^, and $ metacharacters. By default, Unix lines mode is off: the matcher recognizes all line separators in the context of these metacharacters.
Nested flag expressions resemble capturing groups because their characters are surrounded by parenthesis metacharacters. Unlike capturing groups, nested flag expressions are an example of non-capturing groups, which are regular expression constructs that do not capture text characters. They are defined as sequences of characters surrounded by parenthesis metacharacters.
Specifying Multiple Nested Flag Expressions
You can specify multiple nested flag expressions in a regular expression either by placing them side by side ((?m)(?i)) or by placing the letters that define them one after another ((?mi)).
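Several of the flags above, including a combined one, can be exercised from Java by counting matches. This is a sketch under the assumption that the input strings contain a real newline character, written as \n in a Java string literal; FlagDemo and countMatches are hypothetical names, and the (?x) pattern is a variation on the article's example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not the article's RegexDemo.
final class FlagDemo {
    // Counts how many matches of regex the matcher finds in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("tree", "Treehouse"));         // 0: case-sensitive by default
        System.out.println(countMatches("(?i)tree", "Treehouse"));     // 1
        System.out.println(countMatches("^abc$", "abc\nabc"));         // 0: single-line mode
        System.out.println(countMatches("(?m)^abc$", "abc\nabc"));     // 2
        System.out.println(countMatches("(?mi)^ABC$", "abc\nabc"));    // 2: two flags combined
        // In comments mode, the spaces and the trailing # comment are ignored.
        System.out.println(countMatches("(?x) .at # any character, then at", "matter"));  // 1
    }
}
```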

Conclusion

As you've probably realized by now, regular expressions are extremely useful and become even more useful as you master the nuances of their syntax. So far I've introduced you to the basics of regular expressions and the Pattern class. In the following parts of this series, we'll look deeper into the Regex API and explore the methods of the Pattern, Matcher, and PatternSyntaxException classes. I'll also show you two practical applications of the Regex API that you can immediately use in your programs.