Simplify common programming tasks with the Regex API
In the earlier parts of this article, you were introduced to regular expressions and the Regex API. You learned about the Pattern class and walked through examples that demonstrate regular expression constructs, from simple pattern matching with literal strings to more complex matching with ranges, boundary matchers, and quantifiers. This and subsequent parts cover what the earlier parts left out: the methods of the Pattern, Matcher, and PatternSyntaxException classes. You'll also explore two utilities that use regular expressions to simplify common programming problems. The first extracts comments from code for documentation. The second is a library of reusable code for performing lexical analysis, an essential component of assemblers, compilers, and similar software.
DOWNLOADING SOURCE CODE
You can get all the source code (created by Jeff Friesen for JavaWorld) for the demo applications in this article from here.
Learning the Regex API
Pattern, Matcher, and PatternSyntaxException are the three classes that make up the Regex API. Each of them provides methods that allow you to use regular expressions in your code.
Methods of the Pattern class
An instance of the Pattern class is a compiled regular expression, also known as a pattern. Regular expressions are compiled to improve the performance of pattern matching operations. The following static methods support compilation.
Pattern compile(String regex) compiles the contents of regex into an intermediate representation that is stored in a new Pattern object. The method returns a reference to that object on success, or throws PatternSyntaxException if it detects invalid regular expression syntax. Any Matcher object used by or returned from this Pattern object uses its default settings, such as case-sensitive search. For example, Pattern p = Pattern.compile("(?m)^\\."); creates a Pattern object that stores the compiled representation of a regular expression for matching lines that begin with a period character.
Pattern compile(String regex, int flags) does the same job as Pattern compile(String regex), but also takes flags into account: a set of flag constants combined with bitwise OR. The Pattern class declares the constants CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS, and UNIX_LINES, which can be combined using bitwise OR (for example, CASE_INSENSITIVE | DOTALL) and passed as the flags argument.
With the exception of CANON_EQ, LITERAL, and UNICODE_CHARACTER_CLASS, these constants are an alternative to the embedded flag expressions demonstrated in Part 1. If flags contains a bit value other than those defined by the Pattern class's constants, the method Pattern compile(String regex, int flags) throws java.lang.IllegalArgumentException. For example, Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE); is equivalent to the previous example: the constant Pattern.MULTILINE and the embedded flag expression (?m) do the same thing.
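As a rough illustration of that equivalence, the sketch below compiles the same expression with the embedded flag, with the flag constant, and with no flag at all, then records where each one matches. The class name MultilineFlagDemo, the helper matchStarts, and the sample text are invented for this example and do not come from the article's source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFlagDemo {
    // Hypothetical helper: collects the start offset of every match in the text.
    static List<Integer> matchStarts(Pattern p, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Assumed sample input: three lines, two of which begin with a period.
        String text = ".first\nsecond\n.third";

        Pattern embedded = Pattern.compile("(?m)^\\.");
        Pattern constant = Pattern.compile("^\\.", Pattern.MULTILINE);
        Pattern noFlag = Pattern.compile("^\\.");

        System.out.println(matchStarts(embedded, text)); // [0, 14]
        System.out.println(matchStarts(constant, text)); // [0, 14]
        // Without multiline mode, ^ matches only at the start of the whole text.
        System.out.println(matchStarts(noFlag, text));   // [0]
    }
}
```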
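Sometimes you need to retrieve the original regular expression string that was compiled into a Pattern object, along with the flags it uses. To do this, you can call the following methods:
String pattern() returns the original regular expression string that was compiled into the Pattern object.
int flags() returns the Pattern object's flags.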
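A minimal sketch of these two accessors follows. The class name PatternInfoDemo and the describe helper are hypothetical; the bit test against Pattern.MULTILINE is simply one way to check whether a particular flag was passed at compile time.

```java
import java.util.regex.Pattern;

class PatternInfoDemo {
    // Hypothetical helper: summarizes a compiled pattern using pattern() and flags().
    static String describe(Pattern p) {
        // flags() returns the bitwise OR of the flag constants, so test one bit at a time.
        boolean multiline = (p.flags() & Pattern.MULTILINE) != 0;
        return "regex=" + p.pattern() + ", multiline=" + multiline;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        System.out.println(describe(p)); // regex=^\., multiline=true

        Pattern q = Pattern.compile("abc");
        System.out.println(describe(q)); // regex=abc, multiline=false
    }
}
```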
After you create a Pattern object, you typically use it to obtain a Matcher object for performing pattern matching operations. The method Matcher matcher(CharSequence input) creates a Matcher object that searches the text input for matches of the Pattern object's pattern, and returns a reference to that Matcher. For example, Matcher m = p.matcher(args[1]); returns a Matcher for the Pattern object referenced by the variable p.
One-time search |
---|
The Pattern class method static boolean matches(String regex, CharSequence input) spares you from creating Pattern and Matcher objects for a one-time pattern search. It returns true if the entire input matches the pattern regex, and false otherwise. If the regular expression contains a syntax error, the method throws PatternSyntaxException. For example, System.out.println(Pattern.matches("[a-z[\\s]]*", "all lowercase letters and whitespace only")); prints true, confirming that the phrase all lowercase letters and whitespace only contains only lowercase letters and whitespace characters. |
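The sketch below shows one way the one-time search might be wrapped so that a malformed expression is reported rather than propagated. The class name OneOffMatchDemo, the check helper, and the unclosed character class used to provoke the exception are assumptions made for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class OneOffMatchDemo {
    // Hypothetical helper: runs a one-time match and reports a syntax error as text.
    static String check(String regex, String input) {
        try {
            // Pattern.matches compiles the regex and matches the ENTIRE input in one call.
            return String.valueOf(Pattern.matches(regex, input));
        } catch (PatternSyntaxException e) {
            return "invalid regex";
        }
    }

    public static void main(String[] args) {
        System.out.println(check("[a-z[\\s]]*", "all lowercase letters and whitespace only")); // true
        System.out.println(check("[a-z[\\s]]*", "Mixed Case"));                                // false
        // An unclosed character class is a syntax error.
        System.out.println(check("[a-z", "abc"));                                              // invalid regex
    }
}
```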
Splitting text
Most developers have at some point written code to break input text into its component parts, such as converting a text-based employee record into a set of fields. The Pattern class makes this tedious task more convenient with two text-splitting methods:
- The method String[] split(CharSequence text, int limit) splits text around matches of the Pattern object's pattern and returns the results in an array. Each array element is a text sequence that is separated from the next sequence by a pattern-matching text fragment (or by the end of the text). The array elements appear in the same order as in text.
  In this method, the number of array elements depends on the limit parameter, which also controls the number of matches that are searched for:
  - A positive value searches for at most limit-1 matches, and the array contains at most limit elements.
  - A negative value searches for all possible matches, and the array can have any length.
  - A value of zero searches for all possible matches, the array can have any length, and trailing empty strings are discarded.
- The method String[] split(CharSequence text) calls the previous method with 0 as the limit argument and returns the result of that call.
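The three limit cases above can be sketched as follows. The class name SplitLimitDemo, the splitOnComma helper, and the sample text (chosen to end in two empty fields) are invented for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: splits on commas with the given limit.
    static String[] splitOnComma(String text, int limit) {
        return COMMA.split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,"; // assumed sample input with two trailing empty fields

        // Positive limit: at most limit-1 matches, so at most 2 elements.
        System.out.println(Arrays.toString(splitOnComma(text, 2)));  // [a, b,c,,]
        // Negative limit: all matches, trailing empty strings are kept.
        System.out.println(Arrays.toString(splitOnComma(text, -1))); // [a, b, c, , ]
        // Zero limit: all matches, trailing empty strings are discarded.
        System.out.println(Arrays.toString(splitOnComma(text, 0)));  // [a, b, c]
        // The one-argument overload behaves like a limit of 0.
        System.out.println(Arrays.toString(COMMA.split(text)));      // [a, b, c]
    }
}
```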
The following code uses split(CharSequence text) to split an employee record into separate fields for name, age, postal address, and salary:
Pattern p = Pattern.compile(",\\s");
String[] fields = p.split("John Doe, 47, Hillsboro Road, 32000");
for (int i = 0; i < fields.length; i++)
   System.out.println(fields[i]);
The code above specifies a regular expression that matches a comma character immediately followed by a single whitespace character. Here are the results:
John Doe
47
Hillsboro Road
32000
Pattern predicates and the Streams API
Java 8 added the method Predicate<String> asPredicate() to the Pattern class. This method creates a predicate (a boolean-valued function) that is used to match the pattern. The following code snippet shows how to use it:
List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol", "java", "javascript", "perl", "python", "scala");
Pattern p = Pattern.compile("^c");
progLangs.stream().filter(p.asPredicate()).forEach(System.out::println);
This code creates a list of programming language names, then compiles a pattern that matches names starting with the letter c. The last line obtains a sequential stream with this list as its source. It installs a filter that uses the boolean function returned by asPredicate(), which returns true when a name begins with the letter c, and then iterates over the stream, printing the matching names to standard output. This last line is equivalent to the following ordinary loop, familiar from the RegexDemo application in Part 1:
for (String progLang: progLangs)
   if (p.matcher(progLang).find())
      System.out.println(progLang);
Matcher class methods
An instance of the Matcher class describes an engine that performs pattern matching operations on a character sequence by interpreting a Pattern object's compiled regular expression. Matcher objects support several kinds of pattern matching operations:
- The method boolean find() scans the input text for the next match. It starts scanning either at the beginning of the text or at the first character after the previous match. The latter is possible only if the previous call to this method returned true and the matcher has not been reset. In either case, the method returns true when the search succeeds. An example of this method can be found in RegexDemo from Part 1.
- The method boolean find(int start) resets the matcher and scans the text for the next match, starting at the position specified by start. It returns true when the search succeeds. For example, m.find(1); scans the text starting at position 1 (position 0 is skipped). If start is negative or greater than the length of the matcher's text, the method throws java.lang.IndexOutOfBoundsException.
- The method boolean matches() attempts to match the entire text against the pattern. It returns true only if all of the text matches. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.matches()); outputs false because the ! character is not a word character.
- The method boolean lookingAt() attempts to match the text against the pattern, starting at the beginning of the text. It returns true if a prefix of the text matches the pattern. Unlike matches(), it does not require the entire text to match. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.lookingAt()); outputs true because the beginning of the text abc! consists only of word characters.
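To compare the four operations side by side, here is a small sketch. The class MatchOperationsDemo and its helper methods are hypothetical, and the pattern \w+ is used instead of \w* so that empty matches do not obscure the differences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOperationsDemo {
    // True only if the entire text matches the regex.
    static boolean wholeTextMatches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // True if the text starts with a match; the rest of the text is ignored.
    static boolean prefixMatches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    // Returns the first match found at or after position start, or null if there is none.
    static String firstMatchFrom(String regex, String text, int start) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(start) ? m.group() : null;
    }

    // Counts matches by calling find() until it returns false.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(wholeTextMatches("\\w+", "abc!"));      // false
        System.out.println(prefixMatches("\\w+", "abc!"));         // true
        System.out.println(prefixMatches("\\w+", "!abc"));         // false
        System.out.println(countMatches("\\w+", "one two three")); // 3
        System.out.println(firstMatchFrom("\\w+", "one two", 1));  // ne
        System.out.println(firstMatchFrom("\\w+", "one two", 4));  // two
    }
}
```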
Unlike Pattern objects, Matcher objects retain state information. Sometimes you need to reset a matcher to clear this information after a pattern search has finished. The following methods reset a matcher:
- The method Matcher reset() resets the matcher's state, including its append position (which is set to 0). The next pattern matching operation starts at the beginning of the matcher's text. The method returns a reference to the current Matcher object. For example, m.reset(); resets the matcher referenced by m.
- The method Matcher reset(CharSequence text) resets the matcher's state and sets the matcher's text to text. The next pattern matching operation starts at the beginning of the new text. The method returns a reference to the current Matcher object. For example, m.reset("new text"); resets the matcher referenced by m and sets its text to "new text".
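The effect of both reset methods on a sequence of find() calls can be sketched as follows. The class name ResetDemo, the run method, and the sample texts are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Hypothetical walk-through: records what find() sees before and after each reset.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher("10 20 30");

        m.find();
        seen.add(m.group()); // "10"
        m.find();
        seen.add(m.group()); // "20": find() continued after the previous match

        m.reset();           // discard state; the next search starts at position 0
        m.find();
        seen.add(m.group()); // "10" again

        m.reset("7 8");      // discard state and switch to new text
        m.find();
        seen.add(m.group()); // "7"
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [10, 20, 10, 7]
    }
}
```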
Appending text
The matcher's append position marks the start of the portion of the matcher's text that will be appended to an object of type java.lang.StringBuffer. The following methods use this position:
- The method Matcher appendReplacement(StringBuffer sb, String replacement) reads characters from the matcher's text, starting at the append position, and appends them to the StringBuffer object referenced by sb. It stops reading after the last character that precedes the previous pattern match. Next, it appends the characters of the String object referenced by replacement to the StringBuffer (the replacement string may contain references to text sequences captured during the previous match; these are specified with a dollar sign ($) followed by the capture group number). Finally, it sets the matcher's append position to the index of the last matched character plus one, and returns a reference to the current matcher.
- The method StringBuffer appendTail(StringBuffer sb) appends the text from the append position through the end of the matcher's text to the StringBuffer object and returns a reference to that object. After the final call to appendReplacement(StringBuffer sb, String replacement), call appendTail(StringBuffer sb) to copy the remaining text to the StringBuffer.
The method Matcher appendReplacement(StringBuffer sb, String replacement) throws java.lang.IllegalStateException if the matcher has not yet found a match or if the previous match attempt failed. It throws IndexOutOfBoundsException if the replacement string refers to a capture group that does not exist in the pattern.
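Both failure modes can be reproduced with a short sketch. The class name AppendReplacementErrorsDemo, the tryAppend helper, and its findFirst switch are assumptions introduced only to trigger each case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementErrorsDemo {
    // Hypothetical helper: performs one append-and-replace step and reports
    // the exception type as text if the step fails.
    static String tryAppend(String regex, String text, String replacement, boolean findFirst) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuffer sb = new StringBuffer();
        try {
            if (findFirst) {
                m.find(); // establishes the match that appendReplacement relies on
            }
            m.appendReplacement(sb, replacement);
            m.appendTail(sb);
            return sb.toString();
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        }
    }

    public static void main(String[] args) {
        // Normal case: a match exists and group 1 is defined.
        System.out.println(tryAppend("(cat)", "one cat", "$1erpillar", true));  // one caterpillar
        // No match has been attempted yet.
        System.out.println(tryAppend("(cat)", "one cat", "$1erpillar", false)); // IllegalStateException
        // The replacement refers to group 2, but the pattern has only one group.
        System.out.println(tryAppend("(cat)", "one cat", "$2", true));          // IndexOutOfBoundsException
    }
}
```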
Captured groups |
---|
As you may recall from Part 1, a capture group is a sequence of characters enclosed between the parentheses metacharacters ( ( ) ). This construct saves the matched characters for later reuse during pattern matching. All characters in a capture group are treated as a single unit during pattern matching. |
The following code calls the appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb) methods to replace all occurrences of the character sequence cat in the source text with caterpillar:
Pattern p = Pattern.compile("(cat)");
Matcher m = p.matcher("one cat, two cats, or three cats on a fence");
StringBuffer sb = new StringBuffer();
while (m.find())
   m.appendReplacement(sb, "$1erpillar");
m.appendTail(sb);
System.out.println(sb);
Using a capture group and a reference to it in the replacement text tells the program to insert erpillar after each occurrence of cat. Executing this code produces the following result:
one caterpillar, two caterpillars, or three caterpillars on a fence
Replacing text
The Matcher class provides two text replacement methods that complement appendReplacement(StringBuffer sb, String replacement). With them, you can replace either the first match or all matches:
- The method String replaceFirst(String replacement) resets the matcher, creates a new String object, copies all characters of the matcher's text up to the first match to this string, appends the characters from replacement, copies the remaining characters to the string, and returns the String object (the replacement string may contain references to text sequences captured during the match, written as dollar signs followed by capture group numbers).
- The method String replaceAll(String replacement) works like String replaceFirst(String replacement), but replaces every match with the characters from replacement.
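The difference between the two methods, including the use of group references in the replacement string, might be sketched like this. The class name ReplaceDemo, the helper names, and the number-pair input are invented for the example.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    // Two capture groups: the numbers on either side of a hyphen.
    private static final Pattern PAIR = Pattern.compile("(\\d+)-(\\d+)");

    // Swaps the two numbers in the first pair only.
    static String swapFirst(String text) {
        return PAIR.matcher(text).replaceFirst("$2-$1");
    }

    // Swaps the two numbers in every pair.
    static String swapAll(String text) {
        return PAIR.matcher(text).replaceAll("$2-$1");
    }

    public static void main(String[] args) {
        String text = "1-2 and 3-4"; // assumed sample input
        System.out.println(swapFirst(text)); // 2-1 and 3-4
        System.out.println(swapAll(text));   // 2-1 and 4-3
    }
}
```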
The regular expression \s+ matches one or more whitespace characters in the input text. The following code uses this regular expression and calls the replaceAll(String replacement) method to remove duplicate whitespace:
Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher("Removing \t\t extra whitespace. ");
System.out.println(m.replaceAll(" "));
Here are the results:
Removing extra whitespace.