We present a translation of a short guide to regular expressions in the Java language, written by Jeff Friesen for the JavaWorld website. For ease of reading, we have divided the article into several parts.
Using the Regular Expression API in Java Programs to Recognize and Describe Patterns
Java's character and string data types provide low-level support for pattern matching, but using them for this purpose typically adds significant code complexity. Simpler and more performant code is obtained by using the Regex API ("Regular Expression API"). This tutorial will help you get started with regular expressions and the Regex API. We'll first take a general look at the three main classes in the java.util.regex package, and then look inside the Pattern class and explore its sophisticated pattern-matching constructs.
Note: The source code of the demo application from this article (created by Jeff Friesen for the JavaWorld site) is available for download.
What are regular expressions?
A regular expression (also known as a regex or regexp) is a string whose pattern describes a certain set of strings. The pattern determines which strings belong to the set. A pattern consists of literals and metacharacters, which are characters with a special meaning rather than a literal one. Pattern matching is the process of searching text for matches, that is, strings that match a regular expression's pattern. Java supports pattern matching through its Regex API. This API consists of three classes, Pattern, Matcher and PatternSyntaxException, all located in the java.util.regex package:
- Pattern objects, also called patterns, are compiled regular expressions.
- Matcher objects, or matchers, are engines that interpret patterns to find matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
- PatternSyntaxException objects describe invalid regular expression patterns.
Java also supports pattern matching through various methods of the java.lang.String class. For example, the boolean matches(String regex) method returns true only if the invoking string matches regex in its entirety.
Convenience methods
Behind the scenes, matches() and the other regex-oriented convenience methods of the String class are implemented in terms of the Regex API.
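To make the whole-string behavior of matches() concrete, here is a minimal sketch that performs the same check twice: once through the String convenience method and once directly through Pattern and Matcher. The class name MatchesSketch and its helper methods are illustrative only and are not part of the original article.

```java
import java.util.regex.Pattern;

class MatchesSketch {
    // Whole-string match via the String convenience method.
    static boolean viaString(String input, String regex) {
        return input.matches(regex);
    }

    // The same check expressed directly with the Regex API.
    static boolean viaPattern(String input, String regex) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaString("apple", "apple"));   // true
        System.out.println(viaString("applet", "apple"));  // false: the trailing t is not covered
        System.out.println(viaPattern("applet", "apple")); // false
    }
}
```

Both helpers require the regular expression to cover the entire input, which is why applet is rejected even though it contains apple.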
RegexDemo
I created the RegexDemo application to demonstrate Java regular expressions and various methods of the Pattern, Matcher and PatternSyntaxException classes. Below is the source code for this demo application.
Listing 1. Regular expression demonstration
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert each two-character sequence \n (backslash, n) in the input
      // text into a real newline character.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         // Report every match. end() is exclusive, so subtract 1 to print
         // the index of the last matched character.
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}
The first thing the main() method of the RegexDemo class does is validate its command line. It requires two arguments: the first is a regular expression, and the second is the input text to be searched for matches of that regular expression. You may need to use a newline character (\n) within the input text. On the command line, this can only be done by specifying the \ character followed by the n character. The main() method converts this character sequence to the Unicode value 10.
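The conversion can be checked in isolation. The following sketch applies the same replaceAll() call as Listing 1 to a string that contains a backslash followed by n; the class name NewlineArgSketch and the helper unescapeNewlines() are hypothetical names used only for illustration.

```java
class NewlineArgSketch {
    // Replaces every backslash-n pair with a real newline, as RegexDemo does.
    static String unescapeNewlines(String arg) {
        // Regex \\n (written "\\\\n" in Java source) matches a literal
        // backslash followed by n; the replacement is a single newline.
        return arg.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String raw = "one\\ntwo";                      // 8 characters: o n e \ n t w o
        String converted = unescapeNewlines(raw);
        System.out.println(raw.length());              // 8
        System.out.println(converted.length());        // 7
        System.out.println((int) converted.charAt(3)); // 10
    }
}
```

The four backslashes are needed because the backslash must be escaped once for the Java string literal and once more for the regular expression.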
The bulk of the RegexDemo code is enclosed in a try-catch construct. The try block first outputs the given regular expression and the input text, and then creates a Pattern object that stores the compiled regular expression (regular expressions are compiled to improve pattern matching performance). A matcher is obtained from the Pattern object and used to search repeatedly for matches until none remain. The catch block calls several PatternSyntaxException methods to retrieve useful information about the exception. This information is output to the standard error stream, one item per line. There is no need to know the details of how the code works yet: they will become clear when we study the API in the second part of the article. However, you must compile Listing 1. Take the code from Listing 1, and then type the following command at the command prompt to compile RegexDemo:
javac RegexDemo.java
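To see what the catch block reports without running the full demo, the following sketch compiles a deliberately invalid pattern and prints the same pieces of information. The class name SyntaxErrorSketch, the helper check(), and the sample pattern (a are assumptions chosen for illustration; the exact description text depends on the Java runtime.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorSketch {
    // Returns null when the regex compiles, otherwise the exception that was thrown.
    static PatternSyntaxException check(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException pse) {
            return pse;
        }
    }

    public static void main(String[] args) {
        PatternSyntaxException pse = check("(a"); // the group is never closed
        if (pse != null) {
            System.out.println("Description: " + pse.getDescription());
            System.out.println("Index: " + pse.getIndex());
            System.out.println("Incorrect pattern: " + pse.getPattern());
        }
    }
}
```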
The Pattern class and its constructs
The Pattern class, the first of the three classes that make up the Regex API, is a compiled representation of a regular expression. The SDK documentation for the Pattern class describes a variety of regular expression constructs, but if you don't actively use regular expressions, parts of this documentation may be confusing. What are quantifiers and what is the difference between greedy, reluctant and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I will answer these and other questions in the following sections.
Literal strings
The simplest regular expression construct is a literal string. For a match to succeed, some portion of the input text must match this construct's pattern. Consider the following example:
java RegexDemo apple applet
In this example, we try to find a match for the pattern apple in the input text applet. The following output shows the match that was found:
regex = apple
input = applet
Found [apple] starting at 0 and ending at 4
The output shows the regular expression and the input text, followed by a report that apple was found within applet. The starting and ending positions of the match are also given: 0 and 4, respectively. The starting position is the index of the first character of the match, and the ending position is the index of its last character. Now suppose we specify the following command line:
java RegexDemo apple crabapple
This time we get the following result, with different starting and ending positions:
regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8
In the opposite case, with applet as the regular expression and apple as the input text, no match is found. The entire regular expression must match, but here the input text does not contain a t after apple.
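The three cases above can be reproduced programmatically. This is a minimal sketch that uses Matcher.find() in the same way as Listing 1; the class LiteralFindSketch and its firstMatch() helper are illustrative names, not part of the original demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralFindSketch {
    // Returns {start, index of last matched character} for the first match,
    // or null when the regex does not occur in the input.
    static int[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] { m.start(), m.end() - 1 } : null;
    }

    public static void main(String[] args) {
        int[] a = firstMatch("apple", "applet");
        System.out.println(a[0] + ".." + a[1]);            // 0..4
        int[] b = firstMatch("apple", "crabapple");
        System.out.println(b[0] + ".." + b[1]);            // 4..8
        System.out.println(firstMatch("applet", "apple")); // null: no t after apple
    }
}
```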
Metacharacters
More interesting regular expression constructs combine literal characters with metacharacters. For example, in the regular expression a.b, the dot metacharacter (.) stands for any single character between a and b (by default, any character other than a line terminator). Consider the following example:
java RegexDemo .ox "The quick brown fox jumps over the lazy ox."
This example uses .ox as the regular expression and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that consist of any character followed by ox. It produces the following output:
regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41
The output shows two matches: fox and ox (with a leading space character). The . metacharacter matches f in the first case and a space in the second. What happens if we replace .ox with just the . metacharacter? That is, what output results from the following command line?
java RegexDemo . "The quick brown fox jumps over the lazy ox."
Because the dot metacharacter matches any character, RegexDemo outputs a match for every character of the input text, including the terminating period:
regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
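Both dot examples can be condensed into a short program. The sketch below collects every match into a list instead of printing positions; the class DotSketch and the helper allMatches() are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotSketch {
    // Collects the text of every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        // Two matches: "fox" and " ox" (with a leading space).
        System.out.println(allMatches(".ox", text));
        // A lone dot matches each of the 43 characters once.
        System.out.println(allMatches(".", text).size() == text.length()); // true
    }
}
```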
Quoting metacharacters
To specify . or any other metacharacter as a literal character in a regular expression construct, quote it in one of the following ways:
- Precede it with a backslash character.
- Place it between \Q and \E (for example, \Q.\E).
Remember to double each backslash (for example, \\. or \\Q.\\E) that appears in a string literal such as String regex = "\\.";. Do not double backslashes that appear as part of a command-line argument.
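The effect of quoting is easy to observe by counting matches. In this sketch, the class QuoteSketch and the helper count() are illustrative names; Pattern.quote() is a standard library method not mentioned in the article, which wraps its argument in \Q and \E for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteSketch {
    // Counts how many times regex matches within input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "a.b";
        System.out.println(count(".", text));                // 3: unquoted dot matches every character
        System.out.println(count("\\.", text));              // 1: backslash-quoted dot
        System.out.println(count("\\Q.\\E", text));          // 1: dot between \Q and \E
        System.out.println(count(Pattern.quote("."), text)); // 1: library-generated quoting
    }
}
```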
Character classes
Sometimes you need to limit the matches you're looking for to a specific set of characters. For example, you might search text for the vowels a, e, i, o and u, with each occurrence of a vowel counting as a match. Character classes help with such tasks: a character class specifies a set of characters between the square bracket metacharacters ([ ]). The Pattern class supports simple, inverted (negation), range, union, intersection, and subtraction character classes. We'll look at all of them now.
Simple Character Classes
A simple character class consists of characters placed side by side and matches only those characters. For example, the class [abc] matches the characters a, b and c. Consider the following example:
java RegexDemo [csw] cave
As the output shows, only the character c of the class has a match in cave:
regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0
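The vowel search mentioned at the start of this section is a natural use of a simple character class. The following sketch concatenates every matched character; the class SimpleClassSketch and the helper matchedChars() are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleClassSketch {
    // Concatenates every match of regex in input, in the order found.
    static String matchedChars(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(matchedChars("[csw]", "cave"));   // c
        System.out.println(matchedChars("[aeiou]", "cave")); // ae
    }
}
```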
Inverted character classes
An inverted (negation) character class begins with the ^ metacharacter and matches only those characters that it does not list. For example, the class [^abc] matches all characters except a, b and c. Consider the following example:
java RegexDemo "[^csw]" cave
Note that on my operating system (Windows) the double quotes are required because the shell treats ^ as an escape character. As the output shows, only the characters a, v and e are matched in cave:
regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
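A class and its inverted form split the input between them, which the String convenience method replaceAll() makes visible. This is only a sketch; the class NegationSketch and the helper remove() are names invented for the illustration.

```java
class NegationSketch {
    // Deletes every match of regex from input and returns what is left.
    static String remove(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Removing everything that is NOT c, s or w leaves only c.
        System.out.println(remove("[^csw]", "cave")); // c
        // Removing c, s and w leaves the characters matched by [^csw].
        System.out.println(remove("[csw]", "cave"));  // ave
    }
}
```

In Java source code no shell quoting is involved, so ^ needs no special treatment inside the string literal.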
Range character classes
A range character class consists of two characters separated by a hyphen (-). All characters from the one to the left of the hyphen through the one to the right belong to the range. For example, the range [a-z] matches all lowercase Latin letters, which is equivalent to the simple class [abcdefghijklmnopqrstuvwxyz]. Consider the following example:
java RegexDemo [a-c] clown
In this example, only the character c has a match in clown:
regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0
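The claimed equivalence between a range and the corresponding simple class can be spot-checked. The sketch below compares two patterns on every ASCII character, which is a limited check rather than a proof for all of Unicode; the class RangeSketch and the helper sameOnAscii() are hypothetical names.

```java
import java.util.regex.Pattern;

class RangeSketch {
    // Returns true if both regexes agree on every single-character ASCII string.
    static boolean sameOnAscii(String regexA, String regexB) {
        // Compile once and reuse the patterns for all 128 inputs.
        Pattern a = Pattern.compile(regexA);
        Pattern b = Pattern.compile(regexB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameOnAscii("[a-z]", "[abcdefghijklmnopqrstuvwxyz]")); // true
        System.out.println(sameOnAscii("[a-c]", "[abc]"));                        // true
        System.out.println(sameOnAscii("[a-c]", "[a-z]"));                        // false
    }
}
```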