This is a translation of a short guide to regular expressions in Java, written by Jeff Friesen for the
JavaWorld website. For ease of reading, we have divided the article into several parts.
Regular Expressions in Java, Part 1
Merging multiple ranges
You can merge multiple ranges into a single character class by placing them side by side. For example, the class [a-zA-Z] matches all Latin alphabetic characters in lower or upper case.
Combining Character Classes
A character class union consists of several nested character classes and matches all the characters in the resulting union. For example, the class [a-d[m-p]] matches the characters from a to d and from m to p. Consider the following example:
java RegexDemo [ab[c-e]] abcdef
This example finds the characters a, b, c, d and e, each of which has a match in abcdef:
regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
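The same union can also be checked directly against the java.util.regex API, without the RegexDemo tool. The following is only a minimal sketch: the class name UnionDemo and the helper findAll are illustrative names and do not come from the original article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnionDemo {
    // Hypothetical helper: collects every substring of input matched by regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Union of the literal characters a, b and the nested range c-e.
        System.out.println(findAll("[ab[c-e]]", "abcdef")); // [a, b, c, d, e]
        // Union of two ranges: only a and m fall inside a-d or m-p.
        System.out.println(findAll("[a-d[m-p]]", "a1m9z")); // [a, m]
        // Ranges placed side by side behave like a union as well.
        System.out.println(findAll("[a-zA-Z]", "x7Y"));     // [x, Y]
    }
}
```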
Character class intersection
The intersection of character classes consists of the characters common to all nested classes and matches only those common characters. For example, the class [a-z&&[d-f]] matches the characters d, e and f. Consider the following example:
java RegexDemo "[aeiouy&&[y]]" party
Note that on my Windows operating system, double quotes are required because the command shell treats & as a command separator. This example finds only the character y, which has a match in party:
regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4
Subtracting character classes
A character class subtraction consists of all characters except those contained in the nested character classes, and matches only those remaining characters. For example, the class [a-z&&[^m-p]] matches the characters from a to l and from q to z. Consider the following example:
java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg
This example finds the characters d and f, each of which has a match in abcdefg:
regex = [a-f&&[^a-c]&&[^e]]
input = abcdefg
Found [d] starting at 3 and ending at 3
Found [f] starting at 5 and ending at 5
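Intersection and subtraction are easy to explore by testing characters one at a time against a class. The sketch below does that; the class name ClassSetOpsDemo and the helper keep are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class ClassSetOpsDemo {
    // Hypothetical helper: returns the characters of input, in order,
    // that the given character class accepts.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Intersection: only characters present in both classes survive.
        System.out.println(keep("[a-z&&[d-f]]", "abcdefgh"));       // def
        System.out.println(keep("[aeiouy&&[y]]", "party"));         // y
        // Subtraction: intersect with a negated class to remove characters.
        System.out.println(keep("[a-z&&[^m-p]]", "klmnopqr"));      // klqr
        System.out.println(keep("[a-f&&[^a-c]&&[^e]]", "abcdefg")); // df
    }
}
```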
Predefined Character Classes
Some character classes appear frequently enough in regular expressions to justify a shorthand notation. The Pattern class offers predefined character classes as such abbreviations. You can use them to simplify your regular expressions and minimize syntax errors. There are several categories of predefined character classes: standard, POSIX, java.lang.Character, and Unicode properties such as script, block, category, and binary. The following list shows only the standard category:
\d: A digit. Equivalent to [0-9].
\D: A non-digit character. Equivalent to [^0-9].
\s: A whitespace character. Equivalent to [ \t\n\x0B\f\r].
\S: A non-whitespace character. Equivalent to [^\s].
\w: A word character. Equivalent to [a-zA-Z_0-9].
\W: A non-word character. Equivalent to [^\w].
The following example uses the predefined character class \w to find all word characters in the input text:
java RegexDemo \w "aZ.8 _"
Look closely at the following execution results, which show that period and space characters are not considered word characters:
regex = \w
input = aZ.8 _
Found [a] starting at 0 and ending at 0
Found [Z] starting at 1 and ending at 1
Found [8] starting at 3 and ending at 3
Found [_] starting at 5 and ending at 5
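When these shorthand classes are written inside a Java string literal rather than on the command line, each backslash has to be doubled, because the Java compiler processes the string escape first. The sketch below illustrates this; PredefinedClassDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Hypothetical helper: collects every substring of input matched by regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "\\w" in Java source is the two-character regex \w.
        System.out.println(findAll("\\w", "aZ.8 _"));        // [a, Z, 8, _]
        // Runs of digits.
        System.out.println(findAll("\\d+", "tel 555 0199")); // [555, 0199]
        // Runs of non-whitespace characters.
        System.out.println(findAll("\\S+", "aZ.8 _"));       // [aZ.8, _]
    }
}
```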
Line separators
The SDK documentation for the Pattern class describes the dot metacharacter as a predefined character class that matches any character except line separators (one- or two-character sequences that mark the end of a line). The exception is dotall mode (which we'll discuss later), in which the dot also matches line separators. The Pattern class distinguishes the following line separators:
- the carriage return character (\r);
- the newline (line feed) character (\n);
- a carriage return character immediately followed by a newline character (\r\n);
- the next line character (\u0085);
- the line separator character (\u2028);
- the paragraph separator character (\u2029).
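The effect of line separators on the dot can be observed by counting how often the dot matches, first with default settings and then in dotall mode. This is a sketch only: LineSeparatorDemo and countDotMatches are hypothetical names, and dotall mode is switched on here through the Pattern.DOTALL compile flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSeparatorDemo {
    // Hypothetical helper: counts how many times the dot metacharacter
    // matches in input when the pattern is compiled with the given flags.
    static int countDotMatches(String input, int flags) {
        Matcher m = Pattern.compile(".", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The five single-character line separators: \r, \n, U+0085, U+2028, U+2029.
        String separators = "\r\n" + (char) 0x0085 + (char) 0x2028 + (char) 0x2029;
        String input = "a" + separators + "b";

        // By default the dot matches only a and b.
        System.out.println(countDotMatches(input, 0));              // 2
        // In dotall mode the dot matches every character.
        System.out.println(countDotMatches(input, Pattern.DOTALL)); // 7
    }
}
```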
Capturing groups
A capturing group saves the matched characters for later use during pattern matching. This construct is a sequence of characters enclosed in parenthesis metacharacters ( ( ) ). All characters within the capturing group are treated as a single unit during pattern matching. For example, the capturing group (Java) combines the letters J, a, v and a into a single unit. This capturing group finds all occurrences of the pattern Java in the input text. With each match, the previously stored Java characters are replaced by the next ones. Capturing groups can be nested within other capturing groups. For example, in the regular expression (Java( language)), the group ( language) is nested inside the outer group. Each nested or non-nested capturing group is assigned a number, starting from 1, and numbering proceeds from left to right. In the previous example, (Java( language)) is capturing group number 1 and ( language) is capturing group number 2. In the regular expression (a)(b), (a) is capturing group number 1 and (b) is capturing group number 2.
Matches stored by capturing groups can be accessed later using backreferences. A backreference is written as a backslash followed by a digit that identifies the number of the capturing group, and it lets you refer to the text captured by that group. When the matcher encounters a backreference, it looks up the stored match of the group with that number and then uses the characters of that match to attempt a further match. The following example shows the use of a backreference to find grammatical errors in text:
java RegexDemo "(Java( language)\2)" "The Java language language"
This example uses the regular expression (Java( language)\2) to find a grammatical error, the duplicated word language immediately following Java, in the input text "The Java language language". This regular expression specifies two capturing groups: number 1 is (Java( language)\2), corresponding to Java language language, and number 2 is ( language), corresponding to the space character followed by language. The backreference \2 recalls the stored match of group number 2, so that the matcher can search for a second occurrence of a space followed by language immediately after the first occurrence of a space and language. The results of the RegexDemo matcher are as follows:
regex = (Java( language)\2)
input = The Java language language
Found [Java language language] starting at 4 and ending at 25
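In Java code, the text stored by each capturing group is available through Matcher.group(int) after a successful match. The sketch below prints the groups for the example above; BackreferenceDemo and the helper groups are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Hypothetical helper: returns group 0 (the whole match) followed by every
    // capturing group of the first match, or an empty array if nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // "\\2" in Java source is the backreference \2.
        String[] g = groups("(Java( language)\\2)", "The Java language language");
        System.out.println("[" + g[1] + "]"); // [Java language language]
        System.out.println("[" + g[2] + "]"); // [ language]

        // Without the duplicated word the backreference cannot match.
        System.out.println(groups("(Java( language)\\2)", "The Java language").length); // 0
    }
}
```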
Boundary matchers
Sometimes you need to perform a pattern match at the beginning of a line, at word boundaries, at the end of text, and so on. You can do this by using one of the Pattern class's boundary matchers, which are regular expression constructs that look for matches at the following locations:
^: Start of line;
$: End of line;
\b: Word boundary;
\B: Non-word boundary;
\A: Start of text;
\G: End of previous match;
\Z: End of text, not counting the final line separator (if present);
\z: End of text.
The following example uses the ^ boundary matcher metacharacter to find lines that begin with The, followed by zero or more word characters:
java RegexDemo "^The\w*" Therefore
The ^ character specifies that the first three characters of the input text must match the consecutive pattern characters T, h and e, which can be followed by any number of word characters. Here is the result of the execution:
regex = ^The\w*
input = Therefore
Found [Therefore] starting at 0 and ending at 8
What happens if you change the command line to java RegexDemo "^The\w*" " Therefore"? No match will be found because Therefore is preceded by a space character in the input text.
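A few of the boundary matchers listed above can be tried out in code. The following sketch is illustrative only; BoundaryDemo and findAll are hypothetical names, and the inputs other than Therefore are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: collects every substring of input matched by regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // ^ anchors the match to the start of the input text.
        System.out.println(findAll("^The\\w*", "Therefore"));  // [Therefore]
        System.out.println(findAll("^The\\w*", " Therefore")); // []

        // \b requires a word boundary on both sides, so only the standalone
        // occurrences of cat match, not those inside concat or cats.
        System.out.println(findAll("\\bcat\\b", "cat concat cats cat")); // [cat, cat]

        // \Z ignores a final line separator, \z does not.
        System.out.println(findAll("end\\Z", "the end\n")); // [end]
        System.out.println(findAll("end\\z", "the end\n")); // []
    }
}
```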
Zero length matches
Sometimes, when working with boundary matchers, you will encounter zero-length matches. A zero-length match is a match that does not contain any characters. Such matches can occur in empty input text, at the beginning of the input text, after the last character of the input text, and between any two characters of the input text. Zero-length matches are easy to recognize because they always start and end at the same position. Consider the following example:
java RegexDemo \b\b "Java is"
This example searches for two consecutive word boundaries, and the results look like this:
regex = \b\b
input = Java is
Found [] starting at 0 and ending at -1
Found [] starting at 4 and ending at 3
Found [] starting at 5 and ending at 4
Found [] starting at 7 and ending at 6
We see several zero-length matches in the results. The ending positions here are one less than the starting positions because I specified end() - 1 in the RegexDemo source code in Listing 1.
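Matcher.end() returns the index just past the last matched character, so in code a zero-length match is one whose start() and end() are equal. The sketch below prints the raw positions for the same regular expression; ZeroLengthDemo and spans are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {
    // Hypothetical helper: returns "start-end" for each match, using the
    // exclusive end index reported by Matcher.end().
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // Word boundaries in "Java is" lie before J, after the last a,
        // before i, and after s.
        System.out.println(spans("\\b\\b", "Java is")); // [0-0, 4-4, 5-5, 7-7]
    }
}
```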
Quantifiers
A quantifier is a regular expression construct that explicitly or implicitly associates a pattern with a numeric value. This value determines how many times the pattern must match. Quantifiers are divided into greedy, lazy and super-greedy (also known as possessive):
- A greedy quantifier (?, * or +) tries to find the longest match. You can specify X? to find one or zero occurrences of X, X* to find zero or more occurrences of X, X+ to find one or more occurrences of X, X{n} to find exactly n occurrences of X, X{n,} to find at least n (and possibly more) occurrences of X, and X{n,m} to find at least n but not more than m occurrences of X.
- A lazy quantifier (??, *? or +?) tries to find the shortest match. You can specify X?? to find one or zero occurrences of X, X*? to find zero or more occurrences of X, X+? to find one or more occurrences of X, X{n}? to find exactly n occurrences of X, X{n,}? to find at least n (and possibly more) occurrences of X, and X{n,m}? to find at least n but not more than m occurrences of X.
- A super-greedy quantifier (?+, *+ or ++) is similar to a greedy quantifier, except that it makes only one attempt to find the longest match, while a greedy quantifier can make multiple attempts. You can specify X?+ to find one or zero occurrences of X, X*+ to find zero or more occurrences of X, X++ to find one or more occurrences of X, X{n}+ to find exactly n occurrences of X, X{n,}+ to find at least n (and possibly more) occurrences of X, and X{n,m}+ to find at least n but not more than m occurrences of X.
The following example illustrates the use of the greedy quantifier:
java RegexDemo .*ox "fox box pox"
Here are the results:
regex = .*ox
input = fox box pox
Found [fox box pox] starting at 0 and ending at 10
The greedy quantifier (.*) finds the longest sequence of characters ending in ox. It consumes the entire input text and then backs off until it detects that the input text ends with these characters. Consider now the lazy quantifier:
java RegexDemo .*?ox "fox box pox"
Its results:
regex = .*?ox
input = fox box pox
Found [fox] starting at 0 and ending at 2
Found [ box] starting at 3 and ending at 6
Found [ pox] starting at 7 and ending at 10
The lazy quantifier (.*?) finds the shortest sequence of characters ending in ox. It starts with an empty string and gradually consumes characters until it finds a match, and then continues in the same way until the input text is exhausted. Finally, let's look at the super-greedy quantifier:
java RegexDemo .*+ox "fox box pox"
And here are its results:
regex = .*+ox
input = fox box pox
The super-greedy quantifier (.*+) does not find any matches because it consumes all the input text, leaving nothing to match the ox at the end of the regular expression. Unlike the greedy quantifier, the super-greedy quantifier does not back off.
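The three behaviours can be compared side by side in a single program. This is a minimal sketch: QuantifierDemo and findAll are hypothetical names, and the a{2,3} lines are an additional illustration of bounded quantifiers that does not appear in the original article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: collects every substring of input matched by regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "fox box pox";
        // Greedy: takes everything, then backs off until ox matches.
        System.out.println(findAll(".*ox", input));  // [fox box pox]
        // Lazy: grows one character at a time, so three matches are found
        // (the second and third begin with a space).
        System.out.println(findAll(".*?ox", input)); // [fox,  box,  pox]
        // Super-greedy (possessive): takes everything and never backs off.
        System.out.println(findAll(".*+ox", input)); // []

        // Bounded quantifiers: greedy prefers three a's, lazy settles for two.
        System.out.println(findAll("a{2,3}", "aaaaa"));  // [aaa, aa]
        System.out.println(findAll("a{2,3}?", "aaaaa")); // [aa, aa]
    }
}
```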
Zero length matches
Sometimes when working with quantifiers you will encounter zero-length matches. For example, using the following greedy quantifier results in multiple zero-length matches:
java RegexDemo a? abaa
The results of running this example:
regex = a?
input = abaa
Found [a] starting at 0 and ending at 0
Found [] starting at 1 and ending at 0
Found [a] starting at 2 and ending at 2
Found [a] starting at 3 and ending at 3
Found [] starting at 4 and ending at 3
There are five matches in the execution results. Although the first, third and fourth are quite expected (they correspond to the positions of the three letters a in abaa), the second and fifth may surprise you. It might seem that they indicate that a matches b and the end of the text, but that is not the case. The regular expression a? does not look for b or for the end of the text. It looks for the presence or absence of a. When a? does not find a, it reports this as a zero-length match.
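A quantifier that permits zero occurrences is what makes these empty matches possible. The sketch below contrasts a? with a+, which requires at least one a and therefore never produces an empty match; OptionalMatchDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalMatchDemo {
    // Hypothetical helper: collects every substring of input matched by regex,
    // including empty strings for zero-length matches.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> optional = findAll("a?", "abaa");
        System.out.println(optional.size());           // 5
        // The matches at b and at the end of the text are empty strings.
        System.out.println(optional.get(1).isEmpty()); // true
        System.out.println(optional.get(4).isEmpty()); // true

        // a+ needs at least one a, so no zero-length matches appear.
        System.out.println(findAll("a+", "abaa"));     // [a, aa]
    }
}
```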
Nested flag expressions
Matchers make some default assumptions that can be overridden when compiling the regular expression into a pattern. We will discuss this issue later. A regular expression also lets you override any of the defaults using a nested flag expression. This regular expression construct is written as parenthesis metacharacters surrounding a question mark metacharacter (?) followed by a lowercase Latin letter. The Pattern class understands the following nested flag expressions:
(?i): Enables case-insensitive pattern matching. For example, with the command java RegexDemo (?i)tree Treehouse the character sequence Tree matches the pattern tree. The default is case-sensitive pattern matching.
(?x): Allows whitespace characters and comments starting with the # metacharacter within the pattern. The matcher ignores both. For example, for java RegexDemo ".at(?x)#match hat, cat, and so on" matter the character sequence mat matches the pattern .at. By default, whitespace characters and comments are not allowed, and the matcher treats them as characters that take part in the match.
(?s): Enables dotall mode, in which the dot metacharacter matches line separators in addition to any other character. For example, the command java RegexDemo (?s). \n finds a newline character. The default is the opposite of dotall mode: line separators are not matched. For example, the command java RegexDemo . \n does not find a newline character.
(?m): Activates multiline mode, in which ^ matches the beginning and $ the end of each line. For example, java RegexDemo "(?m)^abc$" abc\nabc finds both abc sequences in the input text. By default, single-line mode is used: ^ matches the beginning of the entire input text, and $ matches the end of it. For example, java RegexDemo "^abc$" abc\nabc reports that there are no matches.
(?u): Enables Unicode-aware case folding. This flag, when used in conjunction with (?i), allows case-insensitive pattern matching in accordance with the Unicode standard. By default, case-insensitive matching assumes US-ASCII characters only.
(?d): Enables Unix-style line mode, in which the matcher recognizes only the \n line separator in the context of the ., ^ and $ metacharacters. The default is non-Unix-style line mode: the matcher recognizes all line separators in the context of these metacharacters.
Nested flag expressions resemble capturing groups because their characters are surrounded by parenthesis metacharacters. Unlike capturing groups, nested flag expressions are an example of non-capturing groups, a regular expression construct that does not capture text characters. Non-capturing groups are defined as sequences of characters surrounded by parenthesis metacharacters.
Specifying Multiple Nested Flag Expressions
It is possible to specify multiple nested flag expressions in a regular expression either by placing them side by side ((?m)(?i)) or by writing the letters that define them one after another ((?mi)).
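The flag expressions above can be exercised with a small counting helper. This is a sketch under stated assumptions: FlagDemo and count are hypothetical names, and the last call uses the Pattern.MULTILINE and Pattern.CASE_INSENSITIVE compile flags as the programmatic counterpart of (?mi).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: counts the matches of regex in input when the
    // pattern is compiled with the given flags (0 means no flags).
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("tree", 0, "Treehouse"));      // 0
        System.out.println(count("(?i)tree", 0, "Treehouse"));  // 1

        System.out.println(count("^abc$", 0, "abc\nabc"));      // 0
        System.out.println(count("(?m)^abc$", 0, "abc\nabc"));  // 2

        System.out.println(count(".", 0, "\n"));                // 0
        System.out.println(count("(?s).", 0, "\n"));            // 1

        // Everything after (?x) starting at # is treated as a comment.
        System.out.println(count(".at(?x)#match hat, cat, and so on", 0, "matter")); // 1

        // Combined flags: inline as (?mi), or as compile flags.
        System.out.println(count("(?mi)^abc$", 0, "ABC\nabc")); // 2
        System.out.println(count("^abc$",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, "ABC\nabc")); // 2
    }
}
```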
Conclusion
As you've probably realized by now, regular expressions are extremely useful and become even more useful as you master the nuances of their syntax. So far I've introduced you to the basics of regular expressions and the Pattern class. In Part 2, we'll look deeper into the Regex API and explore the methods of the Pattern, Matcher and PatternSyntaxException classes. I'll also show you two practical applications of the Regex API that you can immediately use in your programs.