Regular Expressions in Java (RegEx)

Published in the Random EN group
Regular expressions are a topic that programmers, even experienced ones, often put off until later. However, most Java developers sooner or later have to deal with text processing, most often searching and editing text. Without regular expressions, compact and efficient text-processing code is hard to imagine. So let's stop postponing and deal with regular expressions right now. It is not such a difficult task.

What is a regular expression (RegEx)?

A regular expression (RegEx) is a pattern for searching for a string in text. In Java, the initial representation of this pattern is always a string, that is, an object of class String. However, not every string can be compiled into a regular expression, only one that follows the rules for writing regular expressions: the syntax described in the documentation of the java.util.regex.Pattern class. Regular expressions are written using letters and digits, as well as metacharacters, which are characters that have a special meaning in regular expression syntax. For example:
String regex = "java"; // pattern that matches the literal text "java"
String regex = "\\d{3}"; // pattern that matches three digits in a row

Creating Regular Expressions in Java

To create a RegEx in Java, you need to follow two simple steps:
  1. write it as a string, following the syntax of regular expressions;
  2. compile this string into a regular expression.
Working with regular expressions in any Java program starts with creating an object of the Pattern class. To do this, call one of the two static compile methods available in the class. The first method takes one argument, the regular expression as a string, and the second takes an additional parameter that switches on matching modes (flags):
public static Pattern compile(String regex)
public static Pattern compile(String regex, int flags)
The possible values of the flags parameter are defined in the Pattern class and are available as static constants. For example:
Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE); // matching against the pattern is case-insensitive
In essence, the Pattern class is a factory for compiled regular expressions. Under the hood, the compile method calls the private constructor of Pattern to create the compiled representation. Instances are created this way so that a pattern is an immutable object. The syntax of the regular expression is checked during creation. If the string contains errors, a PatternSyntaxException is thrown.
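The two steps, and the syntax check performed by compile, can be sketched as follows. This is a minimal illustration rather than a recommended utility: the class name CompileSketch and the helper isValidRegex are hypothetical names introduced only for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileSketch {
    // Hypothetical helper: true if the string compiles into a Pattern.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // The exception carries a description of the syntax error.
            return false;
        }
    }

    public static void main(String[] args) {
        // Step 1: write the regex as a string. Step 2: compile it, here with a flag.
        Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
        System.out.println(pattern.matcher("I like JAVA").find()); // true

        System.out.println(isValidRegex("\\d{3}")); // true
        System.out.println(isValidRegex("[abc"));   // false: the character class is never closed
    }
}
```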

Regular Expression Syntax

The syntax of regular expressions is based on the metacharacters <([{\^-=$!|]})?*+.> which can be combined with literal characters. Depending on their role, they can be divided into several groups:
1. Metacharacters for matching line or text boundaries
Metacharacter   Purpose
^               start of line
$               end of line
\b              word boundary
\B              not a word boundary
\A              start of input
\G              end of the previous match
\Z              end of input, except for a final line terminator, if any
\z              end of input
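A short sketch of how some of these boundary matchers behave. The sample strings and the helper name found are made up for illustration; note that without the MULTILINE flag, ^ and $ refer to the whole input rather than to individual lines.

```java
import java.util.regex.Pattern;

class BoundarySketch {
    // Hypothetical helper: true if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^Java", "Java is fun"));       // true
        System.out.println(found("fun$", "Java is fun"));        // true
        System.out.println(found("\\bcat\\b", "a cat sat"));     // true: "cat" is a separate word
        System.out.println(found("\\bcat\\b", "concatenate"));   // false: "cat" is inside a word
        System.out.println(found("\\Bcat\\B", "concatenate"));   // true
        // \Z tolerates one final line terminator, \z does not.
        System.out.println(found("end\\Z", "the end\n"));        // true
        System.out.println(found("end\\z", "the end\n"));        // false
    }
}
```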
2. Metacharacters for predefined character classes
Metacharacter   Purpose
\d              digit
\D              non-digit character
\s              whitespace character
\S              non-whitespace character
\w              alphanumeric character or underscore
\W              any character other than a letter, digit, or underscore
.               any character (by default, except line terminators)
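The predefined classes can be tried out with a few replacements. This is only a sketch on a made-up input string; the helper name replaceEach is hypothetical. It also shows that the dot does not match a line terminator unless the DOTALL flag is set.

```java
import java.util.regex.Pattern;

class CharClassSketch {
    // Hypothetical helper: replaces every match of the regex in the text.
    static String replaceEach(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "Room 42, floor_3";
        System.out.println(replaceEach("\\d", text, "#")); // Room ##, floor_#
        System.out.println(replaceEach("\\s", text, ""));  // Room42,floor_3
        System.out.println(replaceEach("\\W", text, ""));  // Room42floor_3

        // By default "." does not match a line terminator; DOTALL changes that.
        System.out.println(Pattern.compile("a.b").matcher("a\nb").matches());                 // false
        System.out.println(Pattern.compile("a.b", Pattern.DOTALL).matcher("a\nb").matches()); // true
    }
}
```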
3. Metacharacters for text formatting (control) characters
Metacharacter   Purpose
\t              tab character
\n              newline (line feed) character
\r              carriage return character
\f              form feed (new page) character
\u0085          next line character
\u2028          line separator character
\u2029          paragraph separator character
4. Metacharacters for grouping characters
Metacharacter   Purpose
[abc]           any of the listed characters (a, b, or c)
[^abc]          any character except those listed (not a, b, or c)
[a-zA-Z]        union of ranges (Latin letters from a to z in either case)
[a-d[m-p]]      union of characters (a to d or m to p)
[a-z&&[def]]    intersection of characters (the characters d, e, f)
[a-z&&[^bc]]    subtraction of characters (a to z except b and c, that is, a and d-z)
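To see union, intersection, and subtraction in action, the sketch below filters a sample string through each character class. The input string and the helper matchingChars are hypothetical and serve only as an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSetSketch {
    // Hypothetical helper: keeps only the characters of the text that match the class.
    static String matchingChars(String charClass, String text) {
        Matcher m = Pattern.compile(charClass).matcher(text);
        StringBuilder kept = new StringBuilder();
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String text = "abcdefmnopxyz";
        System.out.println(matchingChars("[^abc]", text));       // defmnopxyz
        System.out.println(matchingChars("[a-d[m-p]]", text));   // abcdmnop
        System.out.println(matchingChars("[a-z&&[def]]", text)); // def
        System.out.println(matchingChars("[a-z&&[^bc]]", text)); // adefmnopxyz
    }
}
```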
5. Metacharacters that specify the number of repetitions: quantifiers. A quantifier always follows a character or a group of characters.
Metacharacter   Purpose
?               zero or one time
*               zero or more times
+               one or more times
{n}             exactly n times
{n,}            n times or more
{n,m}           at least n times and at most m times
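A few whole-string checks illustrate each quantifier. The patterns and inputs are arbitrary examples, and fullMatch is a hypothetical wrapper around Pattern.matches, which requires the entire input to match.

```java
import java.util.regex.Pattern;

class QuantifierSketch {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));   // true: "u" zero times
        System.out.println(fullMatch("colou?r", "colour"));  // true: "u" one time
        System.out.println(fullMatch("colou?r", "colouur")); // false: "u" twice
        System.out.println(fullMatch("ab*", "a"));           // true: "b" zero times
        System.out.println(fullMatch("ab+", "a"));           // false: at least one "b" required
        System.out.println(fullMatch("\\d{3}", "123"));      // true
        System.out.println(fullMatch("\\d{3}", "12"));       // false
        System.out.println(fullMatch("\\d{2,}", "12345"));   // true
        System.out.println(fullMatch("\\d{2,4}", "12345"));  // false: five digits exceed the maximum
    }
}
```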

Greedy quantifier mode

A feature of quantifiers is that they can be used in different modes: greedy, super-greedy (called possessive in the Java documentation), and lazy (called reluctant there). Super-greedy mode is enabled by adding a "+" character after the quantifier, and lazy mode by adding a "?" character. For example:
"A.+a" // greedy mode
"A.++a" // super-greedy (possessive) mode
"A.+?a" // lazy mode
Let's use this pattern as an example to see how quantifiers work in the different modes. By default, a quantifier works in greedy mode: it looks for the longest possible match in the string. As a result of executing this code:
public static void main(String[] args) {
    String text = "Egor Alla Alexander";
    Pattern pattern = Pattern.compile("A.+a");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        System.out.println(text.substring(matcher.start(), matcher.end()));
    }
}
we get the following output:
Alla Alexa
The search for the pattern "A.+a" proceeds in the following sequence:
  1. The first character of the pattern is the Latin letter A. Matcher compares it with each character of the text, starting at position 0. Position 0 of our text holds the character E, so Matcher moves through the text character by character until it finds a match. In our example, this is the character at position 5.
  2. Once the first character of the pattern has matched, Matcher checks the second character of the pattern. In our case, this is the "." symbol, which stands for any character.

    Position 6 holds the letter l. Of course, it matches "any character".

  3. Matcher moves on to the next element of the pattern. In our pattern, the dot carries the "+" quantifier (".+"). Since "any character" may repeat one or more times, Matcher keeps taking the next character from the string for as long as it satisfies "any character", which in our example is until the end of the string (positions 7 to 18 of the text).

    In effect, Matcher captures the whole line to the end. This is precisely its "greed".

  4. Having reached the end of the text and finished with the "A.+" part of the pattern, Matcher starts checking the rest of the pattern, the letter a. Since there is no text left in the forward direction, the check proceeds backward, starting from the last character.
  5. Matcher "remembers" the number of repetitions of ".+" at which it reached the end of the text, so it reduces that number by one and checks whether the pattern now matches, repeating this until a match is found. In our example, this happens at the letter a at position 14, so the match is "Alla Alexa".
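The internal steps of the matcher cannot be observed directly, but their result can be inferred by putting ".+" into a capturing group. The sketch below is an illustration under that assumption; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedySketch {
    static final String TEXT = "Egor Alla Alexander";

    // Returns the text that ".+" holds in the final match, or null if nothing matched.
    static String capturedByDotPlus(String text) {
        Matcher m = Pattern.compile("A(.+)a").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Number of characters between the end of the ".+" capture and the end of the text,
    // or -1 if nothing matched.
    static int charsGivenBack(String text) {
        Matcher m = Pattern.compile("A(.+)a").matcher(text);
        return m.find() ? text.length() - m.end(1) : -1;
    }

    public static void main(String[] args) {
        // ".+" first runs to position 18, then releases characters until "a" can match.
        System.out.println(capturedByDotPlus(TEXT)); // lla Alex
        // "ander" (positions 14 to 18) was released; the final "a" then matched position 14.
        System.out.println(charsGivenBack(TEXT));    // 5
    }
}
```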

Super-greedy (possessive) quantifier mode

In super-greedy mode, the matcher works much as it does in greedy mode. The difference is that after capturing the text to the end of the line, it does not search backward. That is, the first three stages are the same as in greedy mode. After capturing the entire string, the matcher tries to match the remainder of the pattern, but there is no text left to match it against. In our example, running the main method with the pattern "A.++a" finds no matches.
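The sketch below confirms that "A.++a" finds nothing in the sample text. As an addition that is not part of the original example, it also shows a possessive pattern that does match: when the repeated class cannot consume the closing character ("[^a]" instead of "."), no backtracking is needed. The helper findAll is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveSketch {
    // Hypothetical helper: collects every match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        // ".++" swallows the rest of the text and never gives anything back.
        System.out.println(findAll("A.++a", text));    // []
        // "[^a]++" stops by itself in front of each "a", so the final "a" can match.
        System.out.println(findAll("A[^a]++a", text)); // [Alla, Alexa]
    }
}
```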

Lazy quantifier mode

  1. In this mode, as in greedy mode, the matcher first looks for a match for the first character of the pattern.
  2. Next, it looks for a match for the next pattern character, "any character".
  3. Unlike greedy mode, lazy mode looks for the shortest match in the text. So once the second pattern character (the dot) has matched the character at position 6 of the text, Matcher checks whether the text matches the rest of the pattern, the character "a".
  4. Since no match is found (position 7 of the text holds the character l), Matcher adds one more "any character" to the match, because the pattern allows one or more of them, and compares the pattern again with the text at positions 5 to 8.
  5. In our case, a match is found, but the end of the text has not been reached yet. Therefore, starting from position 9, the search begins again with the first character of the pattern, follows the same algorithm, and repeats until the end of the text.
Running the main method with the "A.+?a" pattern produces the following output:
Alla
Alexa
As you can see from our example, different quantifier modes applied to the same pattern give different results. Take this into account and choose the mode according to the result you need from the search.
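A typical situation where the choice of mode matters is extracting delimited fragments. The sketch below uses a made-up input with angle-bracket tags, which does not come from the article, and a hypothetical findAll helper to compare greedy and lazy matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyTagSketch {
    // Hypothetical helper: collects every match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "<b>bold</b>";
        // Greedy: ".+" runs to the last ">" in the text, giving one long match.
        System.out.println(findAll("<.+>", text));  // [<b>bold</b>]
        // Lazy: ".+?" stops at the nearest ">", giving each tag separately.
        System.out.println(findAll("<.+?>", text)); // [<b>, </b>]
    }
}
```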

Escaping characters in regular expressions

Since a regular expression in Java, or rather its initial representation, is specified as a string literal, you must follow the rules of the Java specification that apply to string literals. In particular, the backslash character "\" in string literals in Java source code is interpreted as an escape character, which tells the compiler that the character following it is special and must be treated in a special way. For example:
String s = "The root directory is \nWindows"; // puts "Windows" on a new line
String s = "The root directory is \u00A7Windows"; // inserts the section sign (§) before "Windows"
Therefore, in string literals that describe a regular expression and use the "\" character (for example, for metacharacters), the backslash must be doubled so that the Java compiler does not interpret it in its own way. For example:
String regex = "\\s"; // pattern that matches a whitespace character
String regex = "\"Windows\""; // pattern that matches the text "Windows", including the quotation marks
The double backslash must also be used to escape metacharacters when we want them to be treated as ordinary characters. For example:
String regex = "How\\?"; // pattern that matches the string "How?"
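The effect of escaping can be checked directly. The sketch below also uses Pattern.quote, a standard method not mentioned in the article, which escapes a whole string at once. The helper fullMatch and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeSketch {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("How\\?", "How?")); // true: "?" is a literal character
        System.out.println(fullMatch("How?", "How?"));   // false: "?" makes the "w" optional
        System.out.println(fullMatch("How?", "Ho"));     // true

        // Pattern.quote wraps the text in \Q...\E so that every character is literal.
        String literal = Pattern.quote("1+1=2");
        System.out.println(literal);                     // \Q1+1=2\E
        System.out.println(fullMatch(literal, "1+1=2")); // true
        System.out.println(fullMatch("1+1=2", "1+1=2")); // false: "+" is a quantifier here
    }
}
```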

Methods of the Pattern class

The Pattern class has other methods for working with regular expressions. String pattern() returns the original string representation of the regular expression from which the Pattern object was created:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.pattern()); // abc
static boolean matches(String regex, CharSequence input) checks the text passed in the input parameter against the regular expression passed in the regex parameter. It returns true if the entire text matches the pattern, and false otherwise. Example:
System.out.println(Pattern.matches("A.+a","Alla"));//true
System.out.println(Pattern.matches("A.+a","Egor Alla Alexander"));//false
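The second result above is false because matches requires the entire input to fit the pattern, whereas find looks for the pattern anywhere inside the text. A minimal sketch of the difference, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

class MatchesVsFindSketch {
    // Compiled once and reused; Pattern.matches would recompile the regex on every call.
    static final Pattern PATTERN = Pattern.compile("A.+a");

    static boolean wholeTextMatches(String text) {
        return PATTERN.matcher(text).matches();
    }

    static boolean containsMatch(String text) {
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        System.out.println(wholeTextMatches(text));   // false: the text starts with "Egor"
        System.out.println(containsMatch(text));      // true: "Alla Alexa" is inside the text
        System.out.println(wholeTextMatches("Alla")); // true
    }
}
```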
int flags() returns the value of the flags parameter that was set when the pattern was created, or 0 if this parameter was not set. Example:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.flags());// 0
Pattern pattern = Pattern.compile("abc",Pattern.CASE_INSENSITIVE);
System.out.println(pattern.flags());// 2
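Because each flag is a separate bit, several flags can be combined with the bitwise OR operator and tested individually afterwards. The sketch below assumes a multi-line sample text invented for this illustration; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class FlagsSketch {
    // Hypothetical helper: true if the pattern was compiled with the given flag.
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        String text = "I like\nJAVA\na lot";

        Pattern both = Pattern.compile("^java$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        System.out.println(both.flags());                      // 10, that is 2 | 8
        System.out.println(hasFlag(both, Pattern.MULTILINE));  // true
        System.out.println(both.matcher(text).find());         // true: "JAVA" is a whole line

        // Without MULTILINE, ^ and $ refer to the whole input, so nothing is found.
        Pattern caseOnly = Pattern.compile("^java$", Pattern.CASE_INSENSITIVE);
        System.out.println(caseOnly.matcher(text).find());     // false
    }
}
```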
String[] split(CharSequence text, int limit) splits the text passed as a parameter into an array of String elements, using matches of the pattern as delimiters. The limit parameter determines the maximum number of matches that are searched for in the text:
  • when limit > 0, at most limit-1 matches are searched for, so the array has at most limit elements;
  • when limit < 0, all matches in the text are searched for;
  • when limit = 0, all matches in the text are searched for, and empty strings at the end of the array are discarded.
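The difference between the three cases is easiest to see on an input that ends with delimiters. The comma-separated string below is a hypothetical input chosen for that reason, and splitByComma is an illustrative helper.

```java
import java.util.regex.Pattern;

class SplitLimitSketch {
    static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: splits the text on commas with the given limit.
    static String[] splitByComma(String text, int limit) {
        return COMMA.split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,";
        // limit > 0: one match is used, the rest of the text stays in the last element.
        System.out.println(splitByComma(text, 2).length);  // 2: "a" and "b,c,,"
        // limit = 0: all matches are used, trailing empty strings are dropped.
        System.out.println(splitByComma(text, 0).length);  // 3: "a", "b", "c"
        // limit < 0: all matches are used, trailing empty strings are kept.
        System.out.println(splitByComma(text, -1).length); // 5: "a", "b", "c", "", ""
    }
}
```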
Example:
public static void main(String[] args) {
    String text = "Egor Alla Anna";
    Pattern pattern = Pattern.compile("\\s");
    String[] strings = pattern.split(text,2);
    for (String s : strings) {
        System.out.println(s);
    }
    System.out.println("---------");
    String[] strings1 = pattern.split(text);
    for (String s : strings1) {
        System.out.println(s);
    }
}
Console output:
Egor
Alla Anna
---------
Egor
Alla
Anna
We will look at one more method of the class, the one that creates a Matcher object, below.

Methods of the Matcher class

Matcher is the class whose objects perform pattern matching. Matcher is the "search engine", the "engine" of regular expressions. To search, it needs two things: a search pattern and an "address" to search in. To create a Matcher object, the Pattern class provides the following method:
public Matcher matcher(CharSequence input)
The method takes as its argument the sequence of characters in which the search will be performed. This can be an object of any class that implements the CharSequence interface, so you can pass not only a String, but also a StringBuffer, StringBuilder, Segment, or CharBuffer. The search pattern is the Pattern object on which the matcher method is called. Example of creating a matcher:
Pattern p = Pattern.compile("a*b"); // compiles the regular expression into its internal representation
Matcher m = p.matcher("aaaaab"); // creates a search engine over the text "aaaaab" using the pattern "a*b"
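Since matcher accepts any CharSequence, the same pattern can be applied to a StringBuilder or a CharBuffer without converting them to String first. A minimal sketch, with hypothetical class and method names:

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

class CharSequenceSketch {
    static final Pattern PATTERN = Pattern.compile("a*b");

    // Hypothetical helper: accepts any CharSequence, not only String.
    static boolean fullMatch(CharSequence input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder("aaaaab");
        CharBuffer buffer = CharBuffer.wrap("aab");

        System.out.println(fullMatch("aaaaab")); // true: String
        System.out.println(fullMatch(builder));  // true: StringBuilder
        System.out.println(fullMatch(buffer));   // true: CharBuffer
        System.out.println(fullMatch("aaba"));   // false: extra "a" after the "b"
    }
}
```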
Now, with the help of our "search engine", we can search for matches, find the position of a match in the text, and replace text using the methods of the class. The boolean find() method looks for the next match of the pattern in the text. Using this method and a loop, you can analyze the entire text in an event-driven way (performing the necessary operations when an event occurs, namely when a match is found in the text). For example, the int start() and int end() methods of this class give the position of a match in the text, and the String replaceFirst(String replacement) and String replaceAll(String replacement) methods replace matches in the text with the replacement text. Example:
public static void main(String[] args) {
    String text = "Egor Alla Anna";
    Pattern pattern = Pattern.compile("A.+?a");

    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        int start=matcher.start();
        int end=matcher.end();
        System.out.println("Match found " + text.substring(start, end) + " from position " + start + " to " + (end - 1));
    }
    System.out.println(matcher.replaceFirst("Ira"));
    System.out.println(matcher.replaceAll("Olga"));
    System.out.println(text);
}
Program output:
Match found Alla from position 5 to 8
Match found Anna from position 10 to 13
Egor Ira Anna
Egor Olga Olga
Egor Alla Anna
As the example shows, the replaceFirst and replaceAll methods create a new String object: the original text in which matches of the pattern are replaced with the text passed to the method as an argument. The replaceFirst method replaces only the first match, and replaceAll replaces all matches in the text. The original text remains unchanged. The use of other Matcher class methods, as well as examples of regular expressions, can be found in this series of articles. The most frequent operations with regular expressions from the Pattern and Matcher classes are built into the String class. These are methods such as split, matches, replaceFirst, and replaceAll. But under the hood they use Pattern and Matcher. Therefore, if you need to replace text or compare strings in a program without writing extra code, use the methods of the String class. If you need advanced features, think about the Pattern and Matcher classes.
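The correspondence between the String shortcuts and the Pattern and Matcher classes can be sketched as follows. The class and method names are hypothetical; the regular expressions and the text are the ones used in the examples above.

```java
import java.util.regex.Pattern;

class StringShortcutSketch {
    // Replacement through the String convenience method.
    static String viaString(String text) {
        return text.replaceAll("A.+?a", "Olga");
    }

    // The same replacement written out with Pattern and Matcher.
    static String viaPattern(String text) {
        return Pattern.compile("A.+?a").matcher(text).replaceAll("Olga");
    }

    public static void main(String[] args) {
        String text = "Egor Alla Anna";
        System.out.println(viaString(text));                    // Egor Olga Olga
        System.out.println(viaPattern(text));                   // Egor Olga Olga
        System.out.println(text.replaceFirst("A.+?a", "Ira"));  // Egor Ira Anna
        System.out.println(text.split("\\s").length);           // 3
        System.out.println("Alla".matches("A.+a"));             // true
    }
}
```

If the same regular expression is applied many times, compiling it once into a Pattern and reusing that object avoids recompiling it on every call, which is what the String shortcuts do.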

Conclusion

A regular expression is described in a Java program as a string that follows the rules of regular expression syntax. When the code runs, Java compiles this string into a Pattern object and uses a Matcher object to find matches in the text. As I said at the beginning, regular expressions are very often put off until later because they are considered a difficult topic. However, once you understand the basics of the syntax, metacharacters, and escaping, and study some examples, they turn out to be much easier than they seem at first glance.