What is a regular expression (RegEx)?
A regular expression (RegEx) is a pattern for searching for a string in text. In Java, the initial representation of this pattern is always a string, that is, an object of the String class. However, not every string can be compiled into a regular expression: only one that follows the rules for writing regular expressions, that is, the syntax described in the documentation of the java.util.regex.Pattern class. Regular expressions are written using alphabetic and numeric characters as well as metacharacters, which are characters that have a special meaning in regular expression syntax. For example:
String regex = "java"; // pattern for the string "java"
String regex = "\\d{3}"; // pattern for three digit characters
Creating Regular Expressions in Java
To create a RegEx in Java, you need to follow two simple steps:
- write it as a string, taking into account the syntax of regular expressions;
- compile this string into a regular expression.
Compilation produces a Pattern object. To create it, call one of the two static compile methods available in the Pattern class. The first method takes one argument, a string containing the regular expression. The second takes an additional parameter, flags, that switches on matching modes:
public static Pattern compile (String literal)
public static Pattern compile (String literal, int flags)
The possible values of the flags parameter are defined in the Pattern class and are available as static constants. For example:
Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE); // matching against the pattern is case-insensitive
Essentially, the Pattern class is a factory for compiled regular expressions. Under the hood, the compile method calls the private constructor of the Pattern class to create the compiled representation. Instances are created this way so that a Pattern is an immutable object. During creation, the regular expression is checked for syntax errors. If the string contains errors, a PatternSyntaxException is thrown.
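The two compile methods and the syntax check can be tried out with a minimal sketch like the one below. The class name CompileDemo and both helper methods are illustrative assumptions, not part of the original article; only Pattern, Matcher.find() and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CompileDemo {

    // Hypothetical helper: true if the string compiles, false on a syntax error.
    public static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Hypothetical helper: case-insensitive search for the regex anywhere in the text.
    public static boolean containsIgnoreCase(String regex, String text) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsIgnoreCase("java", "I like JAVA")); // true
        System.out.println(isValidRegex("\\d{3}"));                    // true
        System.out.println(isValidRegex("(abc"));                      // false: unclosed group
    }
}
```

PatternSyntaxException is unchecked, so catching it is only needed when the regular expression comes from outside the program, for example from user input.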
Regular Expression Syntax
The syntax of regular expressions is based on the characters <([{\^-=$!|]})?*+.>, which can be combined with literal characters. Depending on their role, they can be divided into several groups:
Boundary metacharacter | Purpose |
---|---|
^ | start of line |
$ | end of line |
\b | word boundary |
\B | not a word boundary |
\A | start of input |
\G | end of previous match |
\Z | end of input, except for a final line terminator, if any |
\z | end of input |
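As a rough illustration of the boundary metacharacters, the sketch below counts matches of several patterns in the same text. The class name BoundaryDemo, the countMatches helper and the sample text are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {

    // Hypothetical helper: counts how many times the regex is found in the text.
    public static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "java javascript java";
        System.out.println(countMatches("java", text));       // 3: also found inside "javascript"
        System.out.println(countMatches("\\bjava\\b", text)); // 2: whole words only
        System.out.println(countMatches("^java", text));      // 1: only at the start
        System.out.println(countMatches("java$", text));      // 1: only at the end
    }
}
```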
Predefined character class | Purpose |
---|---|
\d | digit |
\D | non-digit character |
\s | whitespace character |
\S | non-whitespace character |
\w | alphanumeric character or underscore |
\W | any character other than alphabetic, numeric, or underscore |
. | any character (line terminators only when the DOTALL flag is set) |
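The predefined classes are often used for simple text clean-up. The following is a minimal sketch under that assumption; the class name PredefinedClassDemo, its helpers and the sample strings are illustrative and do not come from the original article.

```java
import java.util.regex.Pattern;

public class PredefinedClassDemo {

    private static final Pattern NON_DIGITS = Pattern.compile("\\D");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Removes every character that is not a digit.
    public static String digitsOnly(String text) {
        return NON_DIGITS.matcher(text).replaceAll("");
    }

    // Trims the text and replaces each run of whitespace with a single space.
    public static String collapseWhitespace(String text) {
        return WHITESPACE_RUN.matcher(text.trim()).replaceAll(" ");
    }

    // True if the whole text consists of letters, digits and underscores.
    public static boolean isWord(String text) {
        return Pattern.matches("\\w+", text);
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("+1 (555) 010-9999"));             // 15550109999
        System.out.println(collapseWhitespace("  Egor \t Alla\n Anna ")); // Egor Alla Anna
        System.out.println(isWord("user_name42"));                        // true
        System.out.println(isWord("user name"));                          // false
    }
}
```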
Control character | Purpose |
---|---|
\t | tab character |
\n | newline (line feed) character |
\r | carriage return character |
\f | form feed character |
\u0085 | next line character |
\u2028 | line separator character |
\u2029 | paragraph separator character |
Character class | Purpose |
---|---|
[abc] | any of the listed characters (a, b, or c) |
[^abc] | any character except those listed (not a, b, or c) |
[a-zA-Z] | union of ranges (Latin letters from a to z in either case) |
[a-d[m-p]] | union of character sets (a to d or m to p) |
[a-z&&[def]] | intersection of character sets (d, e, or f) |
[a-z&&[^bc]] | subtraction of character sets (a and d to z) |
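To see what each character class accepts, the sketch below filters a sample string one character at a time. CharClassDemo, the keep helper and the sample text are assumptions introduced for illustration.

```java
import java.util.regex.Pattern;

public class CharClassDemo {

    // Hypothetical helper: keeps only the characters of the text that match the class.
    public static String keep(String charClass, String text) {
        Pattern pattern = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String ch = String.valueOf(text.charAt(i));
            if (pattern.matcher(ch).matches()) {
                kept.append(ch);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String text = "abcdefmnopxyz";
        System.out.println(keep("[abc]", text));        // abc
        System.out.println(keep("[^abc]", text));       // defmnopxyz
        System.out.println(keep("[a-d[m-p]]", text));   // abcdmnop
        System.out.println(keep("[a-z&&[def]]", text)); // def
        System.out.println(keep("[a-z&&[^bc]]", text)); // adefmnopxyz
    }
}
```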
Quantifier | Purpose |
---|---|
? | once or not at all |
* | zero or more times |
+ | one or more times |
{n} | exactly n times |
{n,} | n times or more |
{n,m} | at least n times and at most m times |
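A quantifier applies to the character or group immediately before it. The short sketch below checks a few whole-string matches; the class name QuantifierDemo, the fullMatch helper and the sample patterns are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {

    // Hypothetical helper: true if the entire text matches the regex.
    public static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));   // true:  "u" is optional
        System.out.println(fullMatch("colou?r", "colour"));  // true
        System.out.println(fullMatch("a*", ""));             // true:  zero repetitions allowed
        System.out.println(fullMatch("a+", ""));             // false: at least one "a" required
        System.out.println(fullMatch("\\d{3}", "1234"));     // false: exactly three digits
        System.out.println(fullMatch("\\d{2,}", "12345"));   // true:  two or more digits
        System.out.println(fullMatch("\\d{2,4}", "12345"));  // false: at most four digits
    }
}
```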
Greedy quantifier mode
A feature of quantifiers is that they can be used in different modes: greedy, possessive (also called super-greedy), and lazy. Possessive mode is enabled by adding the "+" character after the quantifier, and lazy mode by adding the "?" character. For example:
"A.+a" // greedy mode
"A.++a" // possessive mode
"A.+?a" // lazy mode
Let's use this pattern as an example to understand how quantifiers work in the different modes. By default, a quantifier works in greedy mode. This means that it looks for the longest possible match in the string. As a result of executing this code:
public static void main(String[] args) {
    String text = "Egor Alla Alexander";
    Pattern pattern = Pattern.compile("A.+a");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        System.out.println(text.substring(matcher.start(), matcher.end()));
    }
}
we get the following output:
Alla Alexa
The search for the pattern "A.+a" is carried out in the following sequence:
- The first character of the pattern is the Latin letter A. Matcher compares it with each character of the text, starting at position 0. Position 0 of our text holds the character E, so Matcher moves through the text character by character until it finds a match with the pattern. In our example, this is the character at position 5.
- After a match is found for the first character of the pattern, Matcher checks the second character of the pattern. In our case, this is the "." symbol, which stands for any character. Position 6 holds the letter l, which of course matches "any character".
- Matcher proceeds to the next element of the pattern. In our pattern it is given by the quantifier in ".+". Since "any character" may be repeated one or more times, Matcher keeps taking the next character from the string and checking it against the pattern for as long as the "any character" condition holds, which in our example means until the end of the string (positions 7 to 18 of the text). In effect, Matcher captures the entire line to the end: this is precisely its "greed".
- After Matcher has reached the end of the text and finished checking the "A.+" part of the pattern, it starts checking the rest of the pattern, the letter a. Since there is no text left in the forward direction, the check proceeds in the reverse direction, starting from the last character.
- Matcher "remembers" the number of repetitions of ".+" with which it reached the end of the text, so it reduces the number of repetitions by one and checks the pattern against the text again, repeating this until a match is found. In our example, this happens once ".+" has given back the characters at positions 15 to 18 and the letter a matches at position 14, so the match covers positions 5 to 14: Alla Alexa.
Possessive (super-greedy) quantifier mode
In possessive mode, the matcher works much as it does in greedy mode. The difference is that after capturing the text to the end of the line, it does not search in the reverse direction. That is, the first three stages in possessive mode are the same as in greedy mode. After capturing the entire string, the matcher tries to match the remainder of the pattern, finds no text left for it, and gives back none of the captured characters. In our example, executing the main method with the pattern "A.++a" finds no matches.
Lazy quantifier mode
- In this mode, the initial stage is the same as in greedy mode: a match is sought for the first character of the pattern.
- Next, Matcher looks for a match with the next element of the pattern, any character.
- Unlike greedy mode, lazy mode looks for the shortest match in the text. So, after matching the second element of the pattern (the dot) against the character at position 6 of the text, Matcher checks whether the text matches the rest of the pattern, the character "a".
- Since no match is found at this point (position 7 of the text holds the character l), Matcher adds one more "any character" to the pattern, which is allowed because it is specified as one or more times, and again compares the pattern with the text at positions 5 to 8.
- In our case, a match is found, but the end of the text has not yet been reached. Therefore, starting from position 9, the check begins again with the search for the first character of the pattern, using the same algorithm, and repeats until the end of the text.
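The three modes can be compared side by side on the same text. This is a minimal sketch; the class name QuantifierModesDemo and the findAll helper are assumptions made for the example, while the text and the patterns are the ones used in the walkthroughs above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModesDemo {

    // Hypothetical helper: collects every substring of the text that the regex finds.
    public static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(text.substring(matcher.start(), matcher.end()));
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        System.out.println(findAll("A.+a", text));  // greedy:     [Alla Alexa]
        System.out.println(findAll("A.++a", text)); // possessive: []
        System.out.println(findAll("A.+?a", text)); // lazy:       [Alla, Alexa]
    }
}
```

The greedy pattern yields one long match, the possessive pattern yields none because ".++" never gives characters back, and the lazy pattern yields two short matches.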
When the main method is executed with the pattern "A.+?a", we get two separate matches, each printed on its own line:
Alla
Alexa
As you can see from our example, using different quantifier modes with the same pattern gives different results. Therefore, take this feature into account and choose the mode that produces the result you need.
Escaping characters in regular expressions
Since a regular expression in Java, or rather its initial representation, is specified using a string literal, you must follow the rules of the Java specification that apply to string literals. In particular, the backslash character "\" in string literals in Java source code is interpreted as an escape character, which tells the compiler that the character following it is special and must be treated in a special way. For example:
String s = "The root directory is \nWindows"; // moves Windows to a new line
String s = "The root directory is \u00A7Windows"; // inserts a section sign (§) before Windows
Therefore, in string literals that describe a regular expression and use the "\" character (for example, for metacharacters), the backslash must be doubled so that the Java compiler does not interpret it in its own way. For example:
String regex = "\\s"; // pattern for finding whitespace characters
String regex = "\"Windows\""; // pattern for finding the string "Windows", including the quotation marks
The double backslash should also be used to escape special characters if we want to use them as "regular" characters. For example:
String regex = "How\\?"; // pattern for finding the string "How?"
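The effect of escaping is easy to check with whole-string matches. The sketch below is illustrative: the class name EscapingDemo and its helpers are assumptions, and Pattern.quote, which the article does not mention, is a standard static method that escapes an entire string so that it is matched literally.

```java
import java.util.regex.Pattern;

public class EscapingDemo {

    // Hypothetical helper: true if the entire input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: treats the first argument as plain text, not as a regex.
    public static boolean literalMatch(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped "?" is a quantifier: the pattern means "Ho" plus an optional "w".
        System.out.println(fullMatch("How?", "How?"));      // false
        // "\\?" in source code is \? in the regex: a literal question mark.
        System.out.println(fullMatch("How\\?", "How?"));    // true
        // Unescaped "+" is a quantifier, so the literal text does not match itself.
        System.out.println(fullMatch("1+1=2", "1+1=2"));    // false
        System.out.println(literalMatch("1+1=2", "1+1=2")); // true
    }
}
```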
Methods of the Pattern class
The Pattern class has other methods for working with regular expressions:
String pattern() returns the original string representation of the regular expression from which the Pattern object was created:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.pattern()); // abc
static boolean matches(String regex, CharSequence input) checks the regular expression passed in the regex parameter against the text passed in the input parameter. It returns true if the entire text matches the pattern, and false otherwise. Example:
System.out.println(Pattern.matches("A.+a", "Alla")); // true
System.out.println(Pattern.matches("A.+a", "Egor Alla Alexander")); // false
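The second call returns false because matches requires the whole text to fit the pattern, whereas a search with Matcher.find() only needs some part of the text to fit. The difference can be sketched as follows; the class name MatchesVsFindDemo and both helper names are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class MatchesVsFindDemo {

    // True only if the entire text matches the regex.
    public static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // True if any part of the text matches the regex.
    public static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        System.out.println(matchesWhole("A.+a", text));    // false: the text starts with "Egor"
        System.out.println(containsMatch("A.+a", text));   // true:  "Alla Alexa" is found inside
        System.out.println(matchesWhole("A.+a", "Alla"));  // true
    }
}
```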
int flags() returns the value of the flags parameter that was set when the pattern was created, or 0 if it was not set. Example:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.flags()); // 0
Pattern pattern2 = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
System.out.println(pattern2.flags()); // 2
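The value returned by flags() is a bit mask, so several flags can be combined with the bitwise OR operator and tested with bitwise AND. The sketch below assumes a hypothetical class FlagsDemo with illustrative helpers; the flag constants themselves are defined in Pattern.

```java
import java.util.regex.Pattern;

public class FlagsDemo {

    // Builds a pattern with two flags combined by bitwise OR.
    public static Pattern build() {
        return Pattern.compile("^java$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    }

    // Hypothetical helper: true if the given flag bit is set in the pattern.
    public static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern pattern = build();
        System.out.println(pattern.flags());                     // 10 = 2 (CASE_INSENSITIVE) + 8 (MULTILINE)
        System.out.println(hasFlag(pattern, Pattern.MULTILINE)); // true
        System.out.println(hasFlag(pattern, Pattern.DOTALL));    // false
        // With MULTILINE, ^ and $ also match at the boundaries of each line.
        System.out.println(pattern.matcher("C++\nJAVA\nGo").find()); // true
    }
}
```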
String[] split(CharSequence text, int limit) splits the text passed as a parameter into an array of String elements, using the matches of the pattern as delimiters. The limit parameter determines the maximum number of matches that are searched for in the text:
- when limit > 0, at most limit-1 matches are searched for, so the array has at most limit elements;
- when limit < 0, all matches in the text are searched for;
- when limit = 0, all matches in the text are searched for, and empty strings at the end of the array are discarded.
public static void main(String[] args) {
    String text = "Egor Alla Anna";
    Pattern pattern = Pattern.compile("\\s");
    // limit = 2: at most one split, so at most two elements
    String[] strings = pattern.split(text, 2);
    for (String s : strings) {
        System.out.println(s);
    }
    System.out.println("---------");
    // no limit argument: split on every match
    String[] strings1 = pattern.split(text);
    for (String s : strings1) {
        System.out.println(s);
    }
}
Console output:
Egor
Alla Anna
---------
Egor
Alla
Anna
Below we consider another method of the class, the one used to create a Matcher object.
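Before moving on, note that the difference between the three limit cases of split is easiest to see on a text that ends with delimiters. The following sketch uses a hypothetical class SplitLimitDemo and an illustrative comma-separated string; neither comes from the original article.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLimitDemo {

    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: splits the text on commas with the given limit.
    public static String[] splitByComma(String text, int limit) {
        return COMMA.split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,";
        // limit > 0: only limit-1 delimiters are used, the rest of the text stays intact
        System.out.println(Arrays.toString(splitByComma(text, 2)));  // [a, b,c,,]
        // limit < 0: every delimiter is used, trailing empty strings are kept
        System.out.println(splitByComma(text, -1).length);           // 5
        // limit = 0: every delimiter is used, trailing empty strings are dropped
        System.out.println(Arrays.toString(splitByComma(text, 0)));  // [a, b, c]
    }
}
```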
Methods of the Matcher class
Matcher is the class whose objects perform pattern matching. Matcher is the "search engine" of regular expressions. To search, it needs two things: a search pattern and an "address" to search in. To create a Matcher object, the Pattern class provides the following method: public Matcher matcher(CharSequence input). The method takes as an argument the sequence of characters in which the search will be performed. This is an object of a class that implements the CharSequence interface, so you can pass not only a String but also a StringBuffer, StringBuilder, Segment, or CharBuffer. The search pattern is the Pattern object on which the matcher method is called. Example of creating a matcher:
Pattern p = Pattern.compile("a*b"); // compile the regular expression into its internal representation
Matcher m = p.matcher("aaaaab"); // create a search engine for the text "aaaaab" using the pattern "a*b"
Now, with the help of our "search engine", we can search for matches, find out the position of a match in the text, and replace text using the methods of the class. The boolean find() method looks for the next match of the pattern in the text. Using this method and a loop, you can analyze the entire text following an event model (performing the necessary operations when an event occurs, namely when a match is found in the text). For example, using the int start() and int end() methods of this class, you can determine the positions of a match in the text, and using the String replaceFirst(String replacement) and String replaceAll(String replacement) methods, you can replace matches in the text with other text. Example:
public static void main(String[] args) {
    String text = "Egor Alla Anna";
    Pattern pattern = Pattern.compile("A.+?a");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        int start = matcher.start();
        int end = matcher.end(); // index just past the last matched character
        System.out.println("Match found " + text.substring(start, end) + " from position " + start + " to " + (end - 1));
    }
    System.out.println(matcher.replaceFirst("Ira"));
    System.out.println(matcher.replaceAll("Olga"));
    System.out.println(text);
}
Program output:
Match found Alla from position 5 to 8
Match found Anna from position 10 to 13
Egor Ira Anna
Egor Olga Olga
Egor Alla Anna
The example shows that the replaceFirst and replaceAll methods create a new String object, in which the matches of the pattern are replaced with the text passed to the method as an argument. Moreover, replaceFirst replaces only the first match, and replaceAll replaces all matches in the text. The original text remains unchanged. The use of other Matcher class methods, as well as examples of regular expressions, can be found in this series of articles. The most frequent operations with regular expressions when working with text from the Pattern and Matcher classes are built into the String class. These are methods such as split, matches, replaceFirst, and replaceAll. But in fact, "under the hood" they use the Pattern and Matcher classes. Therefore, if you need to replace text or compare strings in a program without writing extra code, use the methods of the String class. If you need advanced features, think about the Pattern and Matcher classes.
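The equivalence between the String shortcuts and the Pattern and Matcher classes can be illustrated with a short sketch. The class name StringRegexDemo and the two helper methods are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class StringRegexDemo {

    // Shortcut: the String method takes the regex as a plain string.
    public static String replaceAllViaString(String text, String regex, String replacement) {
        return text.replaceAll(regex, replacement);
    }

    // The same operation written out with Pattern and Matcher.
    public static String replaceAllViaMatcher(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "Egor Alla Anna";
        System.out.println(replaceAllViaString(text, "A.+?a", "Olga"));  // Egor Olga Olga
        System.out.println(replaceAllViaMatcher(text, "A.+?a", "Olga")); // Egor Olga Olga
        System.out.println(text.replaceFirst("A.+?a", "Ira"));           // Egor Ira Anna
        System.out.println(text.split("\\s").length);                    // 3
        System.out.println("Alla".matches("A.+a"));                      // true
    }
}
```

One possible reason to prefer the explicit classes is reuse: a Pattern compiled once can be applied to many texts, while the String methods take the regular expression as a string on every call.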
Conclusion
A regular expression is described in a Java program using a string that follows the rules of pattern syntax. When the code is executed, Java compiles this string into a Pattern object and uses a Matcher object to find matches in the text. As I said at the beginning, regular expressions are very often put off until later and considered a difficult topic. However, once you understand the basics of the syntax, metacharacters, and escaping, and study some examples of regular expressions, they turn out to be much easier than they seem at first glance.