Regular expressions in Java, part 3

Published in the Random EN group
We present a translation of a short guide to regular expressions in Java, written by Jeff Friesen for the JavaWorld website. For ease of reading, we have divided the article into several parts. Previous parts: Regular Expressions in Java, Part 1 and Regular Expressions in Java, Part 2.

Simplify common programming tasks with the Regex API

In Parts 1 and 2 of this article, you were introduced to regular expressions and the Regex API. You learned about the Pattern class and walked through examples that demonstrate regular expression constructs, from simple pattern matching using literal strings to more complex matching using ranges, boundary matchers, and quantifiers. In this and subsequent parts we will cover topics not addressed so far and study the methods of the Pattern, Matcher, and PatternSyntaxException classes. You'll also learn about two utilities that use regular expressions to make common programming problems easier. The first extracts comments from code for documentation. The second is a library of reusable code designed to perform lexical analysis, an essential component of assemblers, compilers, and similar software.

DOWNLOADING SOURCE CODE

You can get all the source code (created by Jeff Friesen for JavaWorld) for the demo applications in this article from here.

Learning the Regex API

Pattern, Matcher, and PatternSyntaxException are the three classes that make up the Regex API. Each of them provides methods that let you use regular expressions in your code.

Methods of the Pattern class

An instance of the Pattern class is a compiled regular expression, also known as a pattern. Regular expressions are compiled to improve the performance of pattern matching operations. The following static methods support compilation.
  • Pattern compile(String regex) compiles the contents of regex into an intermediate representation that is stored in a new Pattern object. This method either returns a reference to that object if successful, or throws a PatternSyntaxException if invalid regular expression syntax is detected. Any Matcher object used by or returned from this Pattern object uses its default settings, such as case-sensitive search. As an example, the code snippet Pattern p = Pattern.compile("(?m)^\\."); creates a Pattern object that stores a compiled representation of a regular expression that matches lines beginning with a dot character.

  • Pattern compile(String regex, int flags) does the same job as Pattern compile(String regex), but also takes flags into account: a set of bit flag constants combined with bitwise OR. The Pattern class declares the constants CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS, and UNIX_LINES, which can be combined using bitwise OR (for example, CASE_INSENSITIVE | DOTALL) and passed as the flags argument.

    With the exception of CANON_EQ, LITERAL, and UNICODE_CHARACTER_CLASS, these constants are an alternative to the nested flag expressions demonstrated in Part 1. If flags contains a bit other than those defined by the Pattern class constants, the method Pattern compile(String regex, int flags) throws a java.lang.IllegalArgumentException. For example, Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE); is equivalent to the previous example, with the constant Pattern.MULTILINE and the nested flag expression (?m) doing the same thing.
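The two compile methods described above can be tried out with a minimal sketch like the one below. The class name CompileFlagsDemo, its helper methods, and the sample text are assumptions made for illustration; they are not part of the original article's source code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class: compares an embedded flag with a flag constant.
class CompileFlagsDemo {
    // Counts how many times the compiled pattern is found in text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns true if regex compiles, false if its syntax is invalid.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = ".first\nsecond\n.third";

        // Embedded flag expression and flag constant select the same mode.
        Pattern embedded = Pattern.compile("(?m)^\\.");
        Pattern flagged = Pattern.compile("^\\.", Pattern.MULTILINE);
        System.out.println(countMatches(embedded, text)); // 2
        System.out.println(countMatches(flagged, text));  // 2

        // Without MULTILINE, ^ matches only at the start of the input.
        System.out.println(countMatches(Pattern.compile("^\\."), text)); // 1

        // Flags are combined with bitwise OR.
        Pattern combined = Pattern.compile("HELLO.WORLD",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        System.out.println(combined.matcher("hello\nworld").matches()); // true

        // An unclosed group is a syntax error.
        System.out.println(isValid("(unclosed")); // false
    }
}
```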
Sometimes you need to retrieve the original regular expression string that was compiled into a Pattern object, along with the flags it uses. To do this, you can call the following methods:
  • String pattern() returns the original regular expression string compiled into the Pattern.

  • int flags() returns the Pattern object's flags.
After obtaining a Pattern object, you typically use it to obtain a Matcher object for performing pattern matching operations. The method Matcher matcher(CharSequence input) creates a Matcher object that searches the input text for matches of the Pattern object's pattern, and returns a reference to that Matcher. For example, Matcher m = p.matcher(args[1]); returns a Matcher for the Pattern object referenced by the variable p.
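As a rough illustration of pattern(), flags(), and matcher(), consider the sketch below. The class PatternInfoDemo and its describe helper are hypothetical names; the flag is tested with a bit mask rather than compared to an exact number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: inspects a compiled pattern and obtains a matcher.
class PatternInfoDemo {
    // Builds a short description from pattern() and flags().
    static String describe(Pattern p) {
        boolean multiline = (p.flags() & Pattern.MULTILINE) != 0;
        return p.pattern() + " multiline=" + multiline;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE);
        System.out.println(describe(p)); // ^\. multiline=true

        // matcher() binds the compiled pattern to a piece of input text.
        Matcher m = p.matcher(".hidden");
        System.out.println(m.find()); // true
    }
}
```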
One-time search
The Pattern class method static boolean matches(String regex, CharSequence input) spares you from explicitly creating Pattern and Matcher objects for a one-time pattern search. This method returns true if input matches the pattern regex; otherwise it returns false. If the regular expression contains a syntax error, the method throws a PatternSyntaxException. For example, System.out.println(Pattern.matches("[a-z[\\s]]*", "all lowercase letters and whitespace only")); prints true, confirming that the phrase all lowercase letters and whitespace only contains only lowercase letters and whitespace characters.
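The sketch below shows the one-time search next to the longer form it stands in for. It assumes a hypothetical class OneTimeMatchDemo with two helper methods; both apply the regular expression from the example above to the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: one-time matching with Pattern.matches.
class OneTimeMatchDemo {
    // One-time search: no Pattern or Matcher variables are needed.
    static boolean isLowercaseOrWhitespace(String input) {
        return Pattern.matches("[a-z[\\s]]*", input);
    }

    // The same check written out with explicit Pattern and Matcher objects.
    static boolean isLowercaseOrWhitespaceLongForm(String input) {
        return Pattern.compile("[a-z[\\s]]*").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLowercaseOrWhitespace("all lowercase letters")); // true
        System.out.println(isLowercaseOrWhitespace("Mixed Case"));            // false
        System.out.println(isLowercaseOrWhitespaceLongForm("Mixed Case"));    // false
    }
}
```

When the same regular expression is applied many times, compiling it once and reusing the Pattern object avoids recompiling it on every call.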
Splitting text

Most developers have at some point written code to break input text into its component parts, such as converting a text-based employee record into a set of fields. The Pattern class makes this tedious task easier with two text-splitting methods:
  • The method String[] split(CharSequence text, int limit) splits text around matches of the Pattern object's pattern and returns the results in an array. Each array element is a text sequence separated from the next sequence by a pattern-matching text fragment (or by the end of the text). The array elements are in the same order in which they appear in text.

    In this method, the number of array elements depends on the limit parameter, which also controls how many matches are searched for.

    • If the value is positive, no more than limit-1 matches are searched for, and the array contains no more than limit elements.
    • If the value is negative, all possible matches are searched for, and the array can have any length.
    • If the value is zero, all possible matches are searched for, the array can have any length, and trailing empty strings are discarded.

  • The method String[] split(CharSequence text) calls the previous method with 0 as the limit argument and returns the result of that call.
The following code uses the method split(CharSequence text) to split an employee record into separate name, age, postal address, and salary fields:
Pattern p = Pattern.compile(",\\s");
String[] fields = p.split("John Doe, 47, Hillsboro Road, 32000");
for (int i = 0; i < fields.length; i++)
   System.out.println(fields[i]);
The code above describes a regular expression that finds a comma character immediately followed by a single whitespace character. Here are the results of its execution:
John Doe
47
Hillsboro Road
32000
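The effect of the limit argument described above can be sketched as follows. The class name SplitLimitDemo and the sample text a,b,c,, (which ends with two empty fields) are assumptions chosen to make the difference between the three cases visible.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: shows how limit changes the result of split.
class SplitLimitDemo {
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits text around commas using the given limit.
    static String[] split(String text, int limit) {
        return COMMA.split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,";

        // Positive limit: at most limit-1 matches, at most limit elements.
        System.out.println(Arrays.toString(split(text, 2)));  // [a, b,c,,]

        // Negative limit: all matches, trailing empty strings are kept.
        System.out.println(split(text, -1).length);           // 5

        // Zero limit: all matches, trailing empty strings are discarded.
        System.out.println(Arrays.toString(split(text, 0)));  // [a, b, c]

        // split(text) behaves like split(text, 0).
        System.out.println(Arrays.toString(COMMA.split(text))); // [a, b, c]
    }
}
```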

Pattern predicates and the Streams API

In Java 8, the method Predicate<String> asPredicate() was added to the Pattern class. This method creates a predicate (a boolean-valued function) that is used to match the pattern. The use of this method is shown in the following code snippet:
List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol", "java", "javascript", "perl", "python", "scala");
Pattern p = Pattern.compile("^c");
progLangs.stream().filter(p.asPredicate()).forEach(System.out::println);
This code creates a list of programming language names, then compiles a pattern to find all names that start with the letter c. The last line of code above obtains a sequential stream with this list as its source. It installs a filter that uses the boolean function returned by asPredicate(), which returns true when a name begins with the letter c, and iterates over the stream, printing the matching names to standard output. This last line is equivalent to the following regular loop, familiar from the RegexDemo application in Part 1:
for (String progLang: progLangs)
   if (p.matcher(progLang).find())
      System.out.println(progLang);
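Because the predicate behaves like the find() call in the loop above, the ^ anchor in the pattern matters. The sketch below, which assumes a hypothetical class PredicateDemo and collects the results into a list instead of printing them, shows what changes when the anchor is dropped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class: filters a list with Pattern.asPredicate().
class PredicateDemo {
    static final List<String> LANGS = Arrays.asList("apl", "basic", "c", "c++",
            "c#", "cobol", "java", "javascript", "perl", "python", "scala");

    // Keeps the names in which the regular expression is found.
    static List<String> filter(String regex, List<String> names) {
        return names.stream()
                .filter(Pattern.compile(regex).asPredicate())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Anchored: only names that start with c.
        System.out.println(filter("^c", LANGS)); // [c, c++, c#, cobol]

        // Unanchored: every name that contains c anywhere.
        System.out.println(filter("c", LANGS).size()); // 7
    }
}
```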

Matcher class methods

An instance of the Matcher class describes an engine that performs pattern matching operations on a character sequence by interpreting a Pattern's compiled regular expression. Matcher objects support various kinds of pattern search operations:
  • The method boolean find() searches the input text for the next match. This method begins scanning either at the beginning of the text or at the first character after the previous match. The second option applies only if the previous call to this method returned true and the matcher has not been reset. In either case, the boolean value true is returned if the search succeeds. An example of this method can be found in the RegexDemo application from Part 1.

  • The method boolean find(int start) resets the matcher and searches the text for the next match. Scanning starts from the position specified by the start parameter. If the search succeeds, the boolean value true is returned. For example, m.find(1); scans the text starting from position 1 (position 0 is ignored). If start is negative or greater than the length of the matcher's text, the method throws a java.lang.IndexOutOfBoundsException.

  • The method boolean matches() attempts to match the entire text against the pattern. It returns true if the entire text matches the pattern. For example, the code Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.matches()); outputs false because the character ! is not a word character.

  • The method boolean lookingAt() attempts to match the text against the pattern, starting at the beginning of the text. This method returns true if a prefix of the text matches the pattern. Unlike matches(), the entire text does not have to match the pattern. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.lookingAt()); outputs true, since the beginning of the text abc! consists only of word characters.
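The four search methods above differ mainly in where a match is allowed to start and end. A minimal sketch of the differences follows; the class MatcherSearchDemo and its helpers are hypothetical, and the pattern \w+ is used instead of \w* so that empty matches do not obscure the results.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: contrasts find, find(int), matches and lookingAt.
class MatcherSearchDemo {
    // Returns the first match found at or after position start, or null.
    static String firstMatchFrom(String regex, String text, int start) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(start) ? m.group() : null;
    }

    // Counts matches; each find() call resumes after the previous match.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\w+");

        System.out.println(p.matcher("abc!").matches());   // false: ! is left over
        System.out.println(p.matcher("abc!").lookingAt()); // true: prefix abc matches
        System.out.println(p.matcher("!abc").lookingAt()); // false: must start at 0
        System.out.println(p.matcher("!abc").find());      // true: found at position 1

        System.out.println(firstMatchFrom("\\w+", "abc!", 1));        // bc
        System.out.println(countMatches("cat", "one cat, two cats")); // 2
    }
}
```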

Unlike Pattern objects, Matcher objects retain state information. Sometimes you may need to reset a matcher to clear this information after a pattern search has finished. The following methods reset the matcher:
  • The method Matcher reset() resets the matcher's state, including the append position (which is reset to 0). The next pattern search operation starts at the beginning of the matcher's text. The method returns a reference to the current Matcher object. For example, m.reset(); resets the matcher referenced by m.

  • The method Matcher reset(CharSequence text) resets the matcher's state and sets the matcher's text to text. The next pattern search operation starts at the beginning of the new text. The method returns a reference to the current Matcher object. For example, m.reset("new text"); resets the matcher referenced by m and sets its text to "new text".
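The following sketch illustrates both reset methods under simple assumptions: a hypothetical class ResetDemo records the text of each match so that the effect of resetting is visible in the order of the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows how reset() and reset(CharSequence) behave.
class ResetDemo {
    // Returns the next match of m, or null if there is none.
    static String next(Matcher m) {
        return m.find() ? m.group() : null;
    }

    static List<String> run() {
        List<String> seen = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher("10 20 30");

        seen.add(next(m)); // 10
        seen.add(next(m)); // 20: find() resumes after the previous match

        m.reset();         // back to the beginning of the same text
        seen.add(next(m)); // 10 again

        m.reset("7 8");    // same matcher, new text
        seen.add(next(m)); // 7
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [10, 20, 10, 7]
    }
}
```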

Adding text to the end

The matcher's append position marks the start of the matcher text that will be appended to an object of type java.lang.StringBuffer. The following methods use this position:
  • The method Matcher appendReplacement(StringBuffer sb, String replacement) reads characters of the matcher's text, starting at the append position, and appends them to the StringBuffer object referenced by the sb argument. It stops reading at the last character preceding the previous pattern match. Next, the method appends the characters of the String object referenced by the replacement argument to the StringBuffer (the replacement string may contain references to text sequences captured during the previous match; these are specified using the dollar sign ($) followed by the number of the captured group). Finally, the method sets the matcher's append position to the position of the last matched character plus one, and then returns a reference to the current matcher.

    The method Matcher appendReplacement(StringBuffer sb, String replacement) throws a java.lang.IllegalStateException if the matcher has not yet found a match or if a previous search attempt failed. It throws an IndexOutOfBoundsException if the replacement string refers to a capture group that does not exist in the pattern.

  • The method StringBuffer appendTail(StringBuffer sb) appends all remaining text, starting at the append position, to the StringBuffer object and returns a reference to that object. After the last call to appendReplacement(StringBuffer sb, String replacement), call appendTail(StringBuffer sb) to copy the remaining text to the StringBuffer object.

Captured groups
As you remember from Part 1, a capture group is a sequence of characters enclosed in parentheses metacharacters (( )). The purpose of this construct is to save the matched characters for later reuse during pattern matching. All characters in a capture group are treated as a single unit during the pattern search.
The following code calls the appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb) methods to replace all occurrences of the character sequence cat in the source text with caterpillar:
Pattern p = Pattern.compile("(cat)");
Matcher m = p.matcher("one cat, two cats, or three cats on a fence");
StringBuffer sb = new StringBuffer();
while (m.find())
   m.appendReplacement(sb, "$1erpillar");
m.appendTail(sb);
System.out.println(sb);
Using a captured group and a reference to it in the replacement text tells the program to insert erpillar after each occurrence of cat. The result of executing this code looks like this:
one caterpillar, two caterpillars, or three caterpillars on a fence
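Group references can also reorder captured text, and calling appendReplacement without a preceding successful find() fails. Both points are illustrated by the sketch below; the class AppendDemo, its method names, and the sample names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: reorders captured groups with appendReplacement.
class AppendDemo {
    // Rewrites each "Last, First" pair as "First Last".
    static String swapNames(String text) {
        Matcher m = Pattern.compile("(\\w+), (\\w+)").matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // $2 and $1 refer to the second and first captured groups.
            m.appendReplacement(sb, "$2 $1");
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    // Returns true if appendReplacement rejects a matcher with no match yet.
    static boolean failsBeforeFind() {
        Matcher m = Pattern.compile("x").matcher("y");
        try {
            m.appendReplacement(new StringBuffer(), "z");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(swapNames("Doe, John; Roe, Jane.")); // John Doe; Jane Roe.
        System.out.println(failsBeforeFind());                  // true
    }
}
```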

Replacing text

The Matcher class provides two methods for text replacement that complement appendReplacement(StringBuffer sb, String replacement). Using these methods, you can replace either the first match or all matches:
  • The method String replaceFirst(String replacement) resets the matcher, creates a new String object, copies all characters of the matcher's text up to the first match to this string, appends the characters of replacement to it, copies the remaining characters to the string, and returns the String object (the replacement string can contain references to text sequences captured during the previous match, written as dollar signs followed by captured group numbers).

  • The method String replaceAll(String replacement) works like String replaceFirst(String replacement), but replaces all matches with the characters of the replacement string.
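The difference between the two methods can be seen in a short sketch such as this one, which assumes a hypothetical class ReplaceDemo and reuses the cat pattern from the appendReplacement example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares replaceFirst with replaceAll.
class ReplaceDemo {
    private static final Pattern CAT = Pattern.compile("(cat)");

    // Replaces only the first occurrence of cat.
    static String first(String text) {
        return CAT.matcher(text).replaceFirst("$1erpillar");
    }

    // Replaces every occurrence of cat.
    static String all(String text) {
        return CAT.matcher(text).replaceAll("$1erpillar");
    }

    public static void main(String[] args) {
        String text = "one cat, two cats";
        System.out.println(first(text)); // one caterpillar, two cats
        System.out.println(all(text));   // one caterpillar, two caterpillars

        // Both methods reset the matcher first, so one matcher can serve both.
        Matcher m = CAT.matcher(text);
        m.find();
        System.out.println(m.replaceAll("dog")); // one dog, two dogs
    }
}
```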

The regular expression \s+ matches one or more whitespace characters in the input text. Below, we use this regular expression and call the replaceAll(String replacement) method to collapse runs of whitespace into single spaces:
Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher("Remove      \t\t extra spaces.   ");
System.out.println(m.replaceAll(" "));
Here are the results:
Remove extra spaces.
Continued in Regular Expressions in Java, Part 4 and Regular Expressions in Java, Part 5.