What is a regular expression (RegEx)?
A regular expression (RegEx) is a pattern for searching for a string in text. In Java, the initial representation of this pattern is always a string, that is, an object of the String class. However, not every string can be compiled into a regular expression, only those that follow the rules of regular expression syntax described in the documentation of the Pattern class. To write a regular expression, you use alphabetic and numeric characters, as well as metacharacters: characters that have a special meaning in regular expression syntax. For example:
String regex = "java"; // pattern for the string "java"
String regex = "\\d{3}"; // pattern for three digit characters
Creating Regular Expressions in Java
To create a RegEx in Java, you need to follow two simple steps:
- write it as a string using regular expression syntax;
- compile this string into a regular expression.
A compiled regular expression is an object of the Pattern class. To create it, call one of the two static compile methods available in the class. The first method takes one argument, the string literal of the regular expression, and the second takes an additional parameter that turns on a mode for matching the pattern against text:
public static Pattern compile (String literal)
public static Pattern compile (String literal, int flags)
The possible values of the flags parameter are defined in the Pattern class and are available as static constants. For example:
Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE); // matches are searched for case-insensitively
Essentially, the Pattern class is a constructor of regular expressions. Under the hood, the compile method calls the private constructor of the Pattern class to create the compiled representation. Creating instances this way makes a Pattern an immutable object. During creation, the syntax of the regular expression is checked. If the string contains errors, a PatternSyntaxException is thrown.
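The two compile overloads and the syntax check can be tried out with a minimal sketch like the one below. The class name CompileDemo and the helper isValidRegex are hypothetical names chosen for illustration; Pattern.compile, Pattern.CASE_INSENSITIVE and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileDemo {

    // Hypothetical helper: returns true if the string compiles into a Pattern.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Thrown by compile() when the syntax check fails.
            return false;
        }
    }

    public static void main(String[] args) {
        // Without flags the search is case-sensitive.
        Pattern strict = Pattern.compile("java");
        // With CASE_INSENSITIVE the same literal also matches "JAVA".
        Pattern relaxed = Pattern.compile("java", Pattern.CASE_INSENSITIVE);

        System.out.println(strict.matcher("I like JAVA").find());  // false
        System.out.println(relaxed.matcher("I like JAVA").find()); // true

        System.out.println(isValidRegex("\\d{3}")); // true
        System.out.println(isValidRegex("(java"));  // false: unclosed group
    }
}
```

Catching PatternSyntaxException is mostly useful when the pattern comes from user input; for a pattern hard-coded in the source, an error usually points to a typo that should simply be fixed.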
Regular expression syntax
Regular expression syntax is based on the use of the symbols <([{\^-=$!|]})?*+.>, which can be combined with alphabetic characters. Depending on their role, they can be divided into several groups:
Metacharacters for matching line or text boundaries:

Metacharacter | Purpose |
---|---|
^ | start of line |
$ | end of line |
\b | word boundary |
\B | non-word boundary |
\A | start of input |
\G | end of previous match |
\Z | end of input, except for the final line terminator, if any |
\z | end of input |

Metacharacters for predefined character classes:

Metacharacter | Purpose |
---|---|
\d | digit character |
\D | non-digit character |
\s | whitespace character |
\S | non-whitespace character |
\w | alphanumeric character or underscore |
\W | any character other than a letter, digit, or underscore |
. | any character |

Metacharacters for control characters:

Metacharacter | Purpose |
---|---|
\t | tab character |
\n | newline character |
\r | carriage return character |
\f | form feed character |
\u0085 | next line character |
\u2028 | line separator character |
\u2029 | paragraph separator character |

Metacharacters for character classes:

Metacharacter | Purpose |
---|---|
[abc] | any of the listed characters (a, b, or c) |
[^abc] | any character other than those listed (not a, b, or c) |
[a-zA-Z] | union of ranges (Latin letters a to z, in either case) |
[a-d[m-p]] | union of characters (a to d and m to p) |
[a-z&&[def]] | intersection of characters (d, e, f) |
[a-z&&[^bc]] | subtraction of characters (a and d to z) |
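
Metacharacters that specify the number of repetitions (quantifiers). A quantifier always follows a character or a group of characters:

Metacharacter | Purpose |
---|---|
? | once or not at all |
* | zero or more times |
+ | one or more times |
{n} | n times |
{n,} | n times or more |
{n,m} | no less than n times and no more than m times |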
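
To see how metacharacters from the groups above combine, here is a small sketch. The sample strings and the class name SyntaxDemo are made up for illustration. It relies on Pattern.matches, which checks whether the whole input matches the pattern, and on find, which looks for a match anywhere in the input.

```java
import java.util.regex.Pattern;

class SyntaxDemo {
    public static void main(String[] args) {
        // Predefined class plus quantifier: exactly three digits.
        System.out.println(Pattern.matches("\\d{3}", "123")); // true
        System.out.println(Pattern.matches("\\d{3}", "12a")); // false

        // Range with subtraction: any of a-z except b and c.
        System.out.println(Pattern.matches("[a-z&&[^bc]]", "d")); // true
        System.out.println(Pattern.matches("[a-z&&[^bc]]", "b")); // false

        // Optional character: matches both "colour" and "color".
        System.out.println(Pattern.matches("colou?r", "color")); // true

        // Word boundary: "cat" as a whole word, not inside "concat".
        Pattern word = Pattern.compile("\\bcat\\b");
        System.out.println(word.matcher("a cat sat").find()); // true
        System.out.println(word.matcher("concat").find());    // false
    }
}
```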
Greedy quantifier mode
A special feature of quantifiers is that they can be used in different modes: greedy, possessive (also called super-greedy), and lazy. Possessive mode is turned on by adding the symbol "+" after the quantifier, and lazy mode by adding the symbol "?". For example:
"A.+a" // greedy mode
"A.++a" // possessive (super-greedy) mode
"A.+?a" // lazy mode
Using this pattern as an example, let's try to understand how quantifiers work in different modes. By default, a quantifier operates in greedy mode. This means that it looks for the longest possible match in the string. As a result of running this code:
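import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyExample {
    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        Pattern pattern = Pattern.compile("A.+a");
        Matcher matcher = pattern.matcher(text);
        // Print every substring of the text that matches the pattern
        while (matcher.find()) {
            System.out.println(text.substring(matcher.start(), matcher.end()));
        }
    }
}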
we will get the following output:
Alla Alexa
The search for the pattern "A.+a" is performed in the following sequence:
1. The first character of the pattern is the letter A. Matcher compares it with each character of the text, starting at position 0. Position 0 of our text holds the character E, so Matcher moves through the text character by character until it finds a match with the pattern. In our example, this is the character at position 5.
2. After the first character of the pattern has been matched, Matcher checks the second character of the pattern. In our case, this is the symbol ".", which stands for any character. Position 6 holds the letter l, which of course matches "any character".
3. Matcher moves on to the next part of the pattern. In our pattern, the dot carries the "+" quantifier (".+"). Since "any character" may be repeated one or more times, Matcher takes the next characters of the text in turn and checks them against the pattern for as long as the "any character" condition holds; in our example, until the end of the text (positions 7 to 18). In effect, Matcher captures the entire text to the end, and this is where its "greed" shows.
4. After reaching the end of the text and finishing the check for the "A.+" part of the pattern, Matcher starts checking the rest of the pattern, the letter a. Since there is no text left in the forward direction, the check proceeds in the reverse direction, starting from the last character.
5. Matcher "remembers" the number of repetitions of ".+" at which it reached the end of the text, so it reduces the number of repetitions by one and checks the pattern against the text again, repeating this until a match is found. In our example, this happens when ".+" ends at position 13 and the letter a matches at position 14.
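The steps above can be sketched as a hand-written loop. This is only a simplified model of greedy matching for the single pattern "A.+a", written for illustration; it is not how java.util.regex is implemented, and the class name GreedyWalkthrough and the method matchGreedyAt are hypothetical. It also assumes the text contains no line terminators, which "." does not match by default.

```java
class GreedyWalkthrough {

    // Simplified model of matching "A.+a" greedily at one start position.
    // Returns the index just past the match, or -1 if there is no match.
    static int matchGreedyAt(String text, int start) {
        if (start >= text.length() || text.charAt(start) != 'A') {
            return -1; // the first pattern character must be 'A'
        }
        // dotEnd is the index just past the characters taken by ".+".
        // Greedy: start with everything up to the end of the text, then
        // give characters back one by one; ".+" must keep at least one.
        for (int dotEnd = text.length(); dotEnd >= start + 2; dotEnd--) {
            if (dotEnd < text.length() && text.charAt(dotEnd) == 'a') {
                return dotEnd + 1; // the closing 'a' follows ".+"
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        for (int start = 0; start < text.length(); start++) {
            int end = matchGreedyAt(text, start);
            if (end != -1) {
                System.out.println(text.substring(start, end)); // Alla Alexa
                break; // a real matcher would keep searching from "end"
            }
        }
    }
}
```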
Possessive (super-greedy) quantifier mode
In possessive mode, the matcher works much like it does in greedy mode. The difference is that once the text has been captured to the end of the line, there is no backward search. That is, the first three stages in possessive mode are the same as in greedy mode. After capturing the entire text, the matcher tries to match the rest of the pattern, finds no characters left, and does not give back any of the captured characters, so the attempt fails. In our example, when the main method is executed with the pattern "A.++a", no matches will be found.
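A possessive quantifier does not always prevent a match; it only fails when backtracking would have been required. The sketch below contrasts the two situations. The helper findAll and the second pattern "A[^a]++a" are illustrative additions, not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {

    // Hypothetical helper: collects every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        // ".++" swallows the rest of the text and never gives it back,
        // so the trailing "a" has nothing left to match.
        System.out.println(findAll("A.++a", text));    // []
        // "[^a]++" stops by itself before the next "a", so no backtracking
        // is needed and the possessive quantifier still finds matches.
        System.out.println(findAll("A[^a]++a", text)); // [Alla, Alexa]
    }
}
```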
Lazy quantifier mode
1. In this mode, at the initial stage, as in greedy mode, a match is sought with the first character of the pattern.
2. Next, the matcher looks for a match with the next character in the pattern, any character.
3. Unlike greedy mode, lazy mode searches for the shortest match in the text. So after matching the second character of the pattern, which is specified by a dot and matches the character at position 6 of the text, Matcher checks whether the text matches the rest of the pattern, the character "a".
4. Since no match with the pattern is found (position 7 of the text holds the character "l"), Matcher adds another "any character" to the pattern, since it is specified as one or more times, and again compares the pattern with the text at positions 5 to 8.
5. In our case, a match is found, but the end of the text has not yet been reached. Therefore, starting at position 9, the search for the first character of the pattern begins again using the same algorithm, and this repeats until the end of the text.
As a result of running the main method with the pattern "A.+?a", we will get the following output:
Alla
Alexa
As can be seen from our example, using different quantifier modes with the same pattern gives different results. Therefore, you need to take this feature into account and choose the mode according to the result you want from the search.
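The three modes can be compared side by side on the text used throughout this section. This is a minimal sketch; the class name QuantifierModes and the helper matches are hypothetical names introduced here.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {

    // Hypothetical helper: returns all matches of regex in text joined
    // with " | ", or an empty string if nothing was found.
    static String matches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringJoiner joined = new StringJoiner(" | ");
        while (matcher.find()) {
            joined.add(matcher.group());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String text = "Egor Alla Alexander";
        System.out.println("greedy:     " + matches("A.+a", text));  // Alla Alexa
        System.out.println("possessive: " + matches("A.++a", text)); // nothing found
        System.out.println("lazy:       " + matches("A.+?a", text)); // Alla | Alexa
    }
}
```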
Escaping characters in regular expressions
Since a regular expression in Java, or more precisely its initial representation, is specified as a string literal, you have to take into account the rules of the Java specification that apply to string literals. In particular, the backslash character "\" in string literals in Java source code is interpreted as an escape character: it tells the compiler that the character following it is special and must be interpreted in a special way. For example:
String s = "The root directory is \nWindows"; // moves Windows to a new line
String s = "The root directory is \u00A7Windows"; // inserts the section sign (§) before Windows
Therefore, in string literals that describe a regular expression and use the "\" character (for example, for metacharacters), the backslash must be doubled so that the Java compiler does not interpret it as the start of an escape sequence. For example:
String regex = "\\s"; // pattern for searching for whitespace characters
String regex = "\"Windows\""; // pattern for searching for the string "Windows", including the quotes
A double backslash should also be used to escape special characters when we want to use them as "regular" characters. For example:
String regex = "How\\?"; // pattern for searching for the string "How?"
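The difference between an escaped and an unescaped metacharacter can be checked with a short sketch. The class name EscapingDemo and the sample inputs are illustrative. The sketch also uses Pattern.quote, a standard static method not covered in this article, which returns a pattern string that matches the given text literally.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    public static void main(String[] args) {
        // "How\\?" in source code is the regex How\? : a literal question mark.
        System.out.println(Pattern.matches("How\\?", "How?")); // true

        // Unescaped, "?" is a quantifier: "Ho" followed by an optional "w".
        System.out.println(Pattern.matches("How?", "How?"));   // false
        System.out.println(Pattern.matches("How?", "Ho"));     // true

        // Pattern.quote wraps the text so that all of it is taken literally.
        String literal = Pattern.quote("1+1=2?");
        System.out.println(Pattern.matches(literal, "1+1=2?")); // true
    }
}
```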
Methods of the Pattern class
The Pattern class has other methods for working with regular expressions:
String pattern() – returns the original string representation of the regular expression from which the Pattern object was created:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.pattern()); // abc
static boolean matches(String regex, CharSequence input) – checks the regular expression passed in the regex parameter against the text passed in the input parameter. Returns true if the entire text matches the pattern and false otherwise. Example:
System.out.println(Pattern.matches("A.+a", "Alla")); // true
System.out.println(Pattern.matches("A.+a", "Egor Alla Alexander")); // false
int flags() – returns the value of the flags parameter that was set when the pattern was created, or 0 if this parameter was not set. Example:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.flags()); // 0
Pattern caseInsensitive = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
System.out.println(caseInsensitive.flags()); // 2
String[] split(CharSequence text, int limit) – splits the text passed as a parameter into an array of String elements. The limit parameter determines the maximum number of matches that are searched for in the text:
- when limit > 0, at most limit-1 matches are searched for;
- when limit < 0, all matches in the text are searched for;
- when limit = 0, all matches in the text are searched for, and empty strings at the end of the array are discarded.
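import java.util.regex.Pattern;

public class SplitExample {
    public static void main(String[] args) {
        String text = "Egor Alla Anna";
        Pattern pattern = Pattern.compile("\\s");
        // limit = 2: at most one split, so the array has two elements
        String[] strings = pattern.split(text, 2);
        for (String s : strings) {
            System.out.println(s);
        }
        System.out.println("---------");
        // no limit: split at every whitespace character
        String[] strings1 = pattern.split(text);
        for (String s : strings1) {
            System.out.println(s);
        }
    }
}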
Console output:
Egor
Alla Anna
---------
Egor
Alla
Anna
We will look at one more method of the class, the one that creates a Matcher object, below.
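The three cases of the limit parameter of split are easiest to tell apart on an input with empty fields at the end. The sketch below uses a made-up input string and a hypothetical class name, SplitLimitDemo.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");
        String text = "a,b,,"; // made-up input with two trailing empty fields

        // limit > 0: at most limit - 1 splits, the rest stays in the last element.
        System.out.println(Arrays.toString(comma.split(text, 2)));  // [a, b,,]
        // limit < 0: split everywhere and keep trailing empty strings.
        System.out.println(Arrays.toString(comma.split(text, -1))); // [a, b, , ]
        // limit = 0: split everywhere, trailing empty strings are dropped.
        System.out.println(Arrays.toString(comma.split(text, 0)));  // [a, b]
    }
}
```

Calling split(text) without a limit behaves like limit = 0.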
Matcher class methods
Matcher is the class whose objects search for a pattern in text. A Matcher is the "search engine" of regular expressions. To search, it needs two things: a search pattern and a text to search in. To create a Matcher object, the Pattern class provides the following method:
public Matcher matcher(CharSequence input)
As an argument, the method takes the sequence of characters in which the search will be performed. This can be an object of any class that implements the CharSequence interface: you can pass not only a String, but also a StringBuffer, StringBuilder, Segment, or CharBuffer. The search pattern is the Pattern object on which the matcher method is called. Example of creating a matcher:
Pattern p = Pattern.compile("a*b"); // compile the regular expression into its internal representation
Matcher m = p.matcher("aaaaab"); // create a "search engine" for the text "aaaaab" using the pattern "a*b"
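Since matcher accepts any CharSequence, the same pattern can be applied to different kinds of input. The following is a small sketch with a hypothetical class name and made-up input; it uses the matches(), find() and group() methods of the Matcher class.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherInputDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("a*b");

        // Any CharSequence works as input, not only String.
        Matcher fromBuilder = p.matcher(new StringBuilder("aaaaab"));
        Matcher fromBuffer = p.matcher(CharBuffer.wrap("xxaab"));

        // matches() requires the whole input to match the pattern.
        System.out.println(fromBuilder.matches()); // true
        // find() only needs a matching substring somewhere in the input.
        System.out.println(fromBuffer.find());     // true
        System.out.println(fromBuffer.group());    // aab
    }
}
```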
Now, with the help of our "search engine", we can search for matches, find out the position of a match in the text, and replace text using the methods of the class. The boolean find() method searches for the next match of the pattern in the text. Using this method in a loop, you can analyze the entire text in an event-driven way, performing the necessary operations whenever an event occurs, that is, whenever a match is found. For example, the int start() and int end() methods of this class give the positions of the match in the text, and the String replaceFirst(String replacement) and String replaceAll(String replacement) methods replace matches in the text with the replacement text. Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherExample {
    public static void main(String[] args) {
        String text = "Egor Alla Anna";
        Pattern pattern = Pattern.compile("A.+?a");
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            int start = matcher.start();
            int end = matcher.end(); // index just past the last matched character
            System.out.println("Match found: " + text.substring(start, end) + " from position " + start + " to position " + (end - 1));
        }
        System.out.println(matcher.replaceFirst("Ira"));
        System.out.println(matcher.replaceAll("Olga"));
        System.out.println(text);
    }
}
Program output:
Match found: Alla from position 5 to position 8
Match found: Anna from position 10 to position 13
Egor Ira Anna
Egor Olga Olga
Egor Alla Anna
From the example it is clear that the replaceFirst and replaceAll methods create a new String object: a string that is the source text in which matches with the pattern are replaced by the text passed to the method as an argument. The replaceFirst method replaces only the first match, and replaceAll replaces all matches in the text. The original text remains unchanged. The use of other methods of the Matcher class, as well as examples of regular expressions, can be found in this series of articles.
The most common regular expression operations of the Pattern and Matcher classes are built into the String class. These are methods such as split, matches, replaceFirst, and replaceAll. But "under the hood" they use the Pattern and Matcher classes. Therefore, if you need to replace text or compare strings in a program without writing extra code, use the methods of the String class. If you need advanced capabilities, consider the Pattern and Matcher classes.
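For comparison, here is a minimal sketch of the same operations done through the String class alone. The class name StringRegexDemo is hypothetical; the text and patterns are the ones used earlier in this article.

```java
class StringRegexDemo {
    public static void main(String[] args) {
        String text = "Egor Alla Anna";

        // Same result as Pattern.compile("\\s").split(text).
        String[] names = text.split("\\s");
        System.out.println(names.length); // 3

        // Same result as Pattern.matches("A.+a", "Alla").
        System.out.println("Alla".matches("A.+a")); // true

        // Same results as replaceFirst and replaceAll called on a Matcher.
        System.out.println(text.replaceFirst("A.+?a", "Ira")); // Egor Ira Anna
        System.out.println(text.replaceAll("A.+?a", "Olga"));  // Egor Olga Olga
    }
}
```

When the same pattern is applied many times, compiling it once with Pattern.compile and reusing the Pattern object avoids repeating the compilation on every call.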
Conclusion
A regular expression is described in a Java program using a string that follows the rules of regular expression syntax. When the code runs, Java compiles this string into an object of the Pattern class and uses an object of the Matcher class to find matches in the text. As I said at the beginning, regular expressions are very often put aside for later and considered a difficult topic. However, if you understand the basics of the syntax, metacharacters, and escaping, and study examples of regular expressions, they turn out to be much simpler than they seem at first glance.