Regular Expressions in Java (RegEx)

Published in the Random EN group
Regular expressions are a topic that programmers, even experienced ones, often put off until later. However, most Java developers will sooner or later have to deal with text processing, most often with searching and editing text. Without regular expressions, compact and efficient text-processing code is hard to imagine. So stop putting it off and let's tackle regular expressions right now. It is not that difficult a task.

What is a regular expression (RegEx)?

A regular expression (RegEx) is a pattern for searching for a string in text. In Java, the initial representation of this pattern is always a string, that is, an object of the String class. However, not every string can be compiled into a regular expression, only those that follow regular expression syntax, which is documented for the java.util.regex.Pattern class. A regular expression is written using letters and digits as well as metacharacters: characters that have a special meaning in regular expression syntax. For example:
String regex = "java"; // pattern that matches the literal text "java"
String regex = "\\d{3}"; // pattern that matches three consecutive digits

Creating Regular Expressions in Java

To create a RegEx in Java, you need to follow two simple steps:
  1. write it as a string using regular expression syntax;
  2. compile this string into a regular expression;
Working with regular expressions in any Java program begins with creating a Pattern object. To do this, call one of the two static compile methods of the Pattern class. The first takes one argument, the regular expression as a string; the second takes an additional parameter with flags that change how the pattern is matched against text:
public static Pattern compile(String regex)
public static Pattern compile(String regex, int flags)
The possible values of the flags parameter are defined in the Pattern class as static constants. For example:
Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE); // matching against this pattern ignores case
Essentially, the compile method acts as a factory for regular expressions. Under the hood, it calls the private constructor of the Pattern class to create the compiled representation. Instances are created this way so that Pattern objects are immutable. The syntax of the regular expression is checked during compilation. If the string contains errors, a PatternSyntaxException is thrown.
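The two compile variants and the syntax check can be tried out with a small sketch. The class name CompileDemo and the helper isValidRegex are made up for this illustration; only Pattern, Matcher and PatternSyntaxException come from the standard library.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileDemo {

    // Hypothetical helper: returns true if the string compiles as a regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() says what is wrong with the pattern
            System.out.println("Invalid pattern: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        Pattern strict = Pattern.compile("java");
        Pattern relaxed = Pattern.compile("java", Pattern.CASE_INSENSITIVE);

        System.out.println(strict.matcher("JAVA").matches());  // false
        System.out.println(relaxed.matcher("JAVA").matches()); // true

        System.out.println(isValidRegex("\\d{3}")); // true
        System.out.println(isValidRegex("[abc"));   // false: the character class is never closed
    }
}
```

Catching PatternSyntaxException like this is mostly useful when the pattern comes from user input; for patterns hard-coded in the source, letting the exception surface early is usually fine.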

Regular expression syntax

Regular expression syntax is based on the symbols <([{\^-=$!|]})?*+.>, which can be combined with letters and digits. Depending on their role, they fall into several groups:
1. Metacharacters for matching line or text boundaries
Metacharacter  Purpose
^  start of line
$  end of line
\b  word boundary
\B  non-word boundary
\A  start of input
\G  end of the previous match
\Z  end of input, except for a final line terminator, if any
\z  end of input
2. Metacharacters for predefined character classes
Metacharacter  Purpose
\d  digit
\D  non-digit
\s  whitespace character
\S  non-whitespace character
\w  word character: a letter, a digit, or an underscore
\W  any character other than a letter, a digit, or an underscore
.  any character (by default, except line terminators)
3. Metacharacters for control and line-break characters
Metacharacter  Purpose
\t  tab character
\n  newline (line feed) character
\r  carriage return character
\f  form feed character
\u0085  next line character
\u2028  line separator character
\u2029  paragraph separator character
4. Metacharacters for character classes (sets of characters)
Metacharacter  Purpose
[abc]  any of the listed characters (a, b, or c)
[^abc]  any character except the listed ones (not a, b, or c)
[a-zA-Z]  union of ranges (Latin letters a to z, lowercase or uppercase)
[a-d[m-p]]  union of character classes (a to d, or m to p)
[a-z&&[def]]  intersection of character classes (d, e, or f)
[a-z&&[^bc]]  subtraction of characters (a to z except b and c, that is, a and d to z)
5. Metacharacters that specify the number of repetitions: quantifiers. A quantifier always follows a character or a group of characters.
Metacharacter  Purpose
?  once or not at all
*  zero or more times
+  one or more times
{n}  exactly n times
{n,}  n times or more
{n,m}  at least n times and at most m times
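A few of the metacharacters from the tables above can be combined into a short sketch. The class name SyntaxDemo and the three patterns are hypothetical examples chosen for illustration, not patterns taken from this article.

```java
import java.util.regex.Pattern;

class SyntaxDemo {

    // Hypothetical time format used only for illustration: two digits, a colon, two digits
    static final Pattern TIME = Pattern.compile("^\\d{2}:\\d{2}$");

    // An identifier-like word: a letter or underscore, then any number of word characters
    static final Pattern WORD = Pattern.compile("[a-zA-Z_]\\w*");

    // One or more lowercase letters that are not vowels: the range a-z minus aeiou
    static final Pattern CONSONANTS = Pattern.compile("[a-z&&[^aeiou]]+");

    public static void main(String[] args) {
        System.out.println(TIME.matcher("09:30").matches());        // true
        System.out.println(TIME.matcher("9:30").matches());         // false: only one digit before the colon
        System.out.println(WORD.matcher("_tmp1").matches());        // true
        System.out.println(WORD.matcher("1tmp").matches());         // false: starts with a digit
        System.out.println(CONSONANTS.matcher("rhythm").matches()); // true
        System.out.println(CONSONANTS.matcher("java").matches());   // false: contains the vowel a
    }
}
```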

Greedy quantifier mode

A special feature of quantifiers is that they can be used in three modes: greedy, super-greedy (called possessive in the Java documentation) and lazy (called reluctant there). Super-greedy mode is turned on by adding a "+" after the quantifier, and lazy mode by adding a "?". For example:
"A.+a" // greedy mode
"A.++a" // super-greedy mode
"A.+?a" // lazy mode
Using this pattern as an example, let's see how quantifiers work in the different modes. By default, a quantifier operates in greedy mode, which means that it looks for the longest possible match in the string. If we run this code:
public static void main(String[] args) {
    String text = "Egor Alla Alexander";
    Pattern pattern = Pattern.compile("A.+a");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        System.out.println(text.substring(matcher.start(), matcher.end()));
    }
}
we will get the following output:
Alla Alexa
The search for the pattern "A.+a" proceeds in the following sequence:
  1. The first character of the pattern is the letter A. Matcher compares it with each character of the text, starting at position 0. Position 0 of our text holds the character E, so Matcher moves through the text character by character until it finds a match. In our example, that is the character at position 5.
  2. After the first character of the pattern has been matched, Matcher checks the second character of the pattern. In our case this is ".", which stands for any character. Position 6 holds the letter l, which of course matches "any character".
  3. Matcher moves on to the next part of the pattern, which is the "+" quantifier applied to the dot. Since "any character" may repeat one or more times, Matcher takes the following characters of the text one by one and checks them against the pattern for as long as the "any character" condition holds. In our example that is until the end of the line (positions 7 to 18 of the text). In effect, Matcher captures the entire line to the end, and this is where its "greed" shows.
  4. After reaching the end of the text and finishing the "A.+" part of the pattern, Matcher starts checking the rest of the pattern: the letter a. Since there is no more text in the forward direction, the check goes backwards, starting from the last character.
  5. Matcher "remembers" the number of repetitions of ".+" at which it reached the end of the text, so it reduces that number by one and checks the pattern against the text again, repeating this until a match is found. In our example the first a found on the way back is at position 14, so the match covers positions 5 to 14: "Alla Alexa".

Super-greedy (possessive) quantifier mode

In super-greedy mode the matcher works much as it does in greedy mode. The difference is that once it has grabbed the text to the end of the line, it does not search backwards. The first three steps are therefore the same as in greedy mode. After capturing the entire string, the matcher tries to match the rest of the pattern, finds no text left for it, and gives up without releasing any of the captured characters. In our example, running the main method with the pattern "A.++a" finds no matches.
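A minimal sketch of this behaviour follows. The class PossessiveDemo, the helper findAll and the second pattern "A[^a]++a" are assumptions added for illustration. The second pattern shows that a super-greedy quantifier can still match when the repeated part cannot swallow the character that must follow it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {

    // Hypothetical helper: collects every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";

        // ".++" takes everything to the end of the text and never gives it back
        System.out.println(findAll("A.++a", text));    // []

        // "[^a]++" stops in front of the first a, so the final a can still match
        System.out.println(findAll("A[^a]++a", text)); // [Alla, Alexa]
    }
}
```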

Lazy quantifier mode

  1. In this mode, as in greedy mode, the matcher first looks for a match with the first character of the pattern.
  2. Next, it looks for a match with the next character of the pattern: any character.
  3. Unlike greedy mode, lazy mode looks for the shortest match in the text. So after matching the second character of the pattern (the dot) with the character at position 6 of the text, Matcher checks whether the text matches the rest of the pattern: the character "a".
  4. Since no match is found (position 7 of the text holds the character "l"), Matcher adds one more "any character" to the pattern, which is allowed because it is specified as one or more times, and again compares the pattern with the text at positions 5 to 8.
  5. This time a match is found, but the end of the text has not been reached yet. So from position 9 the search starts again with the first character of the pattern, follows the same algorithm, and repeats until the end of the text.
Running the main method with the pattern "A.+?a" gives the following result:
Alla
Alexa
As our example shows, different quantifier modes give different results for the same pattern: one long match in greedy mode, no matches in super-greedy mode, and two short matches in lazy mode. So take this feature into account and choose the mode that fits the result you need from the search.
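To see the three modes side by side, the same text can be run through all three patterns. This is a small sketch; the class QuantifierModes and the helper findAll are names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {

    // Hypothetical helper: collects every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        System.out.println("greedy:       " + findAll("A.+a", text));  // [Alla Alexa]
        System.out.println("super-greedy: " + findAll("A.++a", text)); // []
        System.out.println("lazy:         " + findAll("A.+?a", text)); // [Alla, Alexa]
    }
}
```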

Escaping characters in regular expressions

Since a regular expression in Java, or more precisely its initial representation, is specified using a string literal, we have to take into account the rules of the Java language that apply to string literals. In particular, the backslash character "\" in string literals in Java source code is an escape character: it tells the compiler that the character following it must be interpreted in a special way. For example:
String s = "The root directory is \nWindows"; // moves Windows to a new line
String s = "The root directory is \u00A7Windows"; // inserts a section sign (§) before Windows
Therefore, in string literals that describe a regular expression, the "\" character (used, for example, in metacharacters) must be doubled so that the Java compiler does not interpret it as the start of a string escape. For example:
String regex = "\\s"; // pattern that matches a whitespace character
String regex = "\"Windows\""; // pattern that matches the text "Windows", including the double quotes
A double backslash should also be used to escape special characters when we want them to be treated as ordinary characters. For example:
String regex = "How\\?"; // pattern that matches the text "How?"
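The effect of escaping can be checked with a short sketch. The class EscapingDemo and the helper matchesLiterally are hypothetical names. Pattern.quote is a standard method that escapes a whole string at once, which may be handier than doubling backslashes by hand when the text comes from outside the program.

```java
import java.util.regex.Pattern;

class EscapingDemo {

    // Hypothetical helper: treats every character of literal as an ordinary character.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped, the dot matches any character
        System.out.println(Pattern.matches("1.5", "1x5"));   // true
        // "\\." in Java source is the regex \. and matches only a literal dot
        System.out.println(Pattern.matches("1\\.5", "1x5")); // false
        System.out.println(Pattern.matches("1\\.5", "1.5")); // true

        // A literal backslash needs four backslashes in the regex string literal
        System.out.println(Pattern.matches("C:\\\\temp", "C:\\temp")); // true

        // Unescaped, "?" makes the w optional, so the question mark itself is not matched
        System.out.println(Pattern.matches("How?", "How?"));  // false
        System.out.println(matchesLiterally("How?", "How?")); // true
    }
}
```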

Methods of the Pattern class

The Pattern class has other methods for working with regular expressions:
String pattern() returns the original string representation of the regular expression from which the Pattern object was created:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.pattern()); // "abc"
static boolean matches(String regex, CharSequence input) checks the text passed in the input parameter against the regular expression passed in the regex parameter. It returns true if the entire text matches the pattern and false otherwise. Example:
System.out.println(Pattern.matches("A.+a", "Alla")); // true
System.out.println(Pattern.matches("A.+a", "Egor Alla Alexander")); // false
int flags() returns the value of the flags parameter that was set when the pattern was created, or 0 if this parameter was not set. Example:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.flags()); // 0
Pattern caseInsensitive = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
System.out.println(caseInsensitive.flags()); // 2
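Because flags is a bit mask, several flags can be combined with the bitwise OR operator, and flags() returns their sum. A minimal sketch, where the class FlagsDemo and the helper hasFlag are made-up names for this illustration:

```java
import java.util.regex.Pattern;

class FlagsDemo {

    // Hypothetical helper: checks whether a given flag bit is set on the pattern.
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("^java$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

        System.out.println(pattern.flags());                            // 10 (2 + 8)
        System.out.println(hasFlag(pattern, Pattern.CASE_INSENSITIVE)); // true
        System.out.println(hasFlag(pattern, Pattern.DOTALL));           // false

        // With MULTILINE, ^ and $ also match at line breaks inside the text
        System.out.println(pattern.matcher("C++\nJAVA\nGo").find());    // true
    }
}
```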
String[] split(CharSequence text, int limit) splits the text passed as a parameter into an array of String elements around the matches of the pattern. The limit parameter determines the maximum number of matches that are searched for in the text:
  • when limit > 0, at most limit-1 matches are searched for, so the array has at most limit elements;
  • when limit < 0, all matches in the text are searched for;
  • when limit = 0, all matches in the text are searched for, and empty strings at the end of the array are discarded.
Example:
public static void main(String[] args) {
    String text = "Egor Alla Anna";
    Pattern pattern = Pattern.compile("\\s");
    String[] strings = pattern.split(text,2);
    for (String s : strings) {
        System.out.println(s);
    }
    System.out.println("---------");
    String[] strings1 = pattern.split(text);
    for (String s : strings1) {
        System.out.println(s);
    }
}
Console output:
Egor
Alla Anna
---------
Egor
Alla
Anna
Another method of the class, the one that creates a Matcher object, is discussed below.
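The difference between the three kinds of limit is easiest to see on a text that ends with delimiters. In this sketch the class SplitLimitDemo, the COMMA pattern and the sample text "a,b,," are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {

    static final Pattern COMMA = Pattern.compile(",");

    public static void main(String[] args) {
        String text = "a,b,,";

        // limit > 0: at most limit - 1 splits, the rest of the text stays in the last element
        System.out.println(Arrays.toString(COMMA.split(text, 2)));  // [a, b,,]

        // limit < 0: every match is used and trailing empty strings are kept
        System.out.println(Arrays.toString(COMMA.split(text, -1))); // [a, b, , ]
        System.out.println(COMMA.split(text, -1).length);           // 4

        // limit = 0: every match is used and trailing empty strings are dropped
        System.out.println(Arrays.toString(COMMA.split(text, 0)));  // [a, b]
    }
}
```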

Matcher class methods

Matcher is the class whose objects search for patterns. A Matcher is the "search engine" of regular expressions. To search, it needs two things: a search pattern and a text to search in. To create a Matcher object, the Pattern class provides the following method:
public Matcher matcher(CharSequence input)
As its argument the method takes the character sequence in which the search will be performed, that is, an object of a class that implements the CharSequence interface. You can pass not only a String but also a StringBuffer, StringBuilder, Segment or CharBuffer. The search pattern is the Pattern object on which the matcher method is called. Example of creating a matcher:
Pattern p = Pattern.compile("a*b"); // compiled the regular expression into its internal representation
Matcher m = p.matcher("aaaaab"); // created a search engine for the text "aaaaab" using the pattern "a*b"
Now, with our "search engine", we can search for matches, find out the position of a match in the text, and replace text using the methods of the class. The boolean find() method searches for the next match of the pattern in the text. Using this method in a loop, you can process the entire text in an event-driven way (perform the necessary operations each time an event occurs, namely a match is found in the text). For example, the int start() and int end() methods give the position of the match in the text (end() returns the index just past the last matched character), and the String replaceFirst(String replacement) and String replaceAll(String replacement) methods replace matches in the text with the replacement text. Example:
public static void main(String[] args) {
    String text = "Egor Alla Anna";
    Pattern pattern = Pattern.compile("A.+?a");

    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        int start = matcher.start();
        int end = matcher.end(); // index just past the last matched character
        System.out.println("Match found: " + text.substring(start, end) + " from position " + start + " to " + (end - 1));
    }
    System.out.println(matcher.replaceFirst("Ira"));
    System.out.println(matcher.replaceAll("Olga"));
    System.out.println(text);
}
Program output:
Match found: Alla from position 5 to 8
Match found: Anna from position 10 to 13
Egor Ira Anna
Egor Olga Olga
Egor Alla Anna
The example shows that the replaceFirst and replaceAll methods create a new String object: a string that is the source text in which matches of the pattern are replaced with the text passed to the method as an argument. The replaceFirst method replaces only the first match, while replaceAll replaces all matches in the text. The original text remains unchanged. The use of other Matcher methods, as well as examples of regular expressions, can be found in this series of articles.
The most common regular expression operations from the Pattern and Matcher classes are built into the String class. These are methods such as split, matches, replaceFirst and replaceAll. Under the hood, they use Pattern and Matcher. So if you need to replace text or compare strings in a program without writing extra code, use the methods of the String class. If you need advanced capabilities, turn to the Pattern and Matcher classes.
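The String shortcuts mentioned above can be sketched as follows. The class StringShortcutsDemo, the WHITESPACE constant and the helper words are hypothetical names. Keeping a compiled Pattern in a constant is one possible design choice when the same expression is applied many times, since the pattern is then compiled only once.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class StringShortcutsDemo {

    // Compiled once and reused on every call of words()
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: splits text into words on any run of whitespace.
    static String[] words(String text) {
        return WHITESPACE.split(text.trim());
    }

    public static void main(String[] args) {
        String text = "Egor Alla Anna";

        // String.matches, like Pattern.matches, requires the whole string to match
        System.out.println(text.matches("A.+a"));                // false
        System.out.println(text.replaceFirst("A.+?a", "Ira"));   // Egor Ira Anna
        System.out.println(text.replaceAll("A.+?a", "Olga"));    // Egor Olga Olga
        System.out.println(Arrays.toString(text.split("\\s")));  // [Egor, Alla, Anna]

        System.out.println(Arrays.toString(words("  Egor   Alla Anna "))); // [Egor, Alla, Anna]
    }
}
```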

Conclusion

A regular expression is described in a Java program as a string that follows the rules of regular expression syntax. When the code runs, Java compiles this string into a Pattern object and uses a Matcher object to find matches in the text. As I said at the beginning, regular expressions are very often put off for later because they are considered a difficult topic. However, once you understand the basics of the syntax, metacharacters and escaping, and study some examples of regular expressions, they turn out to be much simpler than they seem at first glance.