
Regular expressions in Java, part 1

Published in the Random EN group
We bring to your attention a translation of a short guide to regular expressions in the Java language, written by Jeff Friesen for the JavaWorld website. For ease of reading, we have divided the article into several parts.

Using the Regular Expression API in Java Programs to Recognize and Describe Patterns

Java's character and string data types provide low-level support for pattern matching, but using them for this purpose typically adds significant code complexity. You get simpler and better-performing code by using the Regex API ("Regular Expression API"). This tutorial will help you get started with regular expressions and the Regex API. We'll first take a general look at the three most interesting classes in the java.util.regex package, and then look inside the Pattern class and explore its sophisticated pattern-matching constructs. Attention: You can download the source code (created by Jeff Friesen for the JavaWorld site) of the demo application from this article from here.

What are regular expressions?

A regular expression (also known as a regex or regexp) is a string whose pattern describes a certain set of strings. The pattern determines which strings belong to the set. The pattern consists of literals and metacharacters (characters with a special meaning rather than a literal meaning). Pattern matching is the process of searching text to find matches, that is, strings that match a regular expression pattern. Java supports pattern matching through its Regex API. This API consists of three classes, Pattern, Matcher, and PatternSyntaxException, all located in the java.util.regex package:
  • Pattern objects, also called patterns, are compiled regular expressions.
  • Matcher objects, or matchers, are engines that interpret patterns to find matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
  • PatternSyntaxException objects describe invalid regular expression patterns.
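To make the division of labor between the three classes concrete, here is a minimal sketch. The class name RegexClassesDemo and the helper describeError() are hypothetical names chosen for this illustration; only Pattern, Matcher, and PatternSyntaxException come from the API. The helper returns null for a valid pattern and the exception's description for an invalid one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexClassesDemo
{
   // Hypothetical helper: returns null if the regex compiles,
   // otherwise the description carried by the PatternSyntaxException.
   static String describeError(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return null;
      }
      catch (PatternSyntaxException pse)
      {
         return pse.getDescription();
      }
   }

   public static void main(String[] args)
   {
      // Pattern: the compiled regular expression.
      Pattern p = Pattern.compile("apple");
      // Matcher: the engine that applies the pattern to a CharSequence.
      Matcher m = p.matcher("crabapple");
      System.out.println("Match found: " + m.find());

      // PatternSyntaxException: reports an invalid pattern
      // (here, a character class that is never closed).
      System.out.println("Valid pattern:   " + describeError("apple"));
      System.out.println("Invalid pattern: " + describeError("[abc"));
   }
}
```

The exact wording of the description depends on the Java version, so the sketch prints it rather than relying on a particular message.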
Java also supports pattern matching through various methods of the java.lang.String class. For example, the method boolean matches(String regex) returns true only if the entire calling string matches the regular expression regex.
Convenience methods
matches() and the other regex-oriented convenience methods of the String class are implemented under the hood in terms of the Regex API.
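The difference between matching the whole string and finding a match somewhere inside it is easy to overlook. The following sketch compares the two; the class name MatchesDemo and its three helper methods are hypothetical names introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesDemo
{
   // Whole-string match through the String convenience method.
   static boolean wholeMatch(String input, String regex)
   {
      return input.matches(regex);
   }

   // The same whole-string check written against the Regex API directly.
   static boolean wholeMatchViaApi(String input, String regex)
   {
      Pattern p = Pattern.compile(regex);
      Matcher m = p.matcher(input);
      return m.matches();
   }

   // Partial match: true if the pattern occurs anywhere in the input.
   static boolean containsMatch(String input, String regex)
   {
      return Pattern.compile(regex).matcher(input).find();
   }

   public static void main(String[] args)
   {
      System.out.println(wholeMatch("apple", "apple"));        // true
      System.out.println(wholeMatch("applet", "apple"));       // false
      System.out.println(wholeMatchViaApi("applet", "apple")); // false
      System.out.println(containsMatch("applet", "apple"));    // true
   }
}
```

Note that String.matches() compiles the regular expression on every call, so when the same pattern is applied many times it is usually worth compiling it once with Pattern.compile() and reusing the Pattern object.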

RegexDemo

I created the RegexDemo application to demonstrate Java regular expressions and various methods of the Pattern, Matcher, and PatternSyntaxException classes. Below is the source code for this demo application.
Listing 1. Regular expression demonstration
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert each two-character sequence \n (backslash followed by n) into an actual newline character.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}
The first thing the main() method of the RegexDemo class does is check its command line. It requires two arguments: the first is a regular expression, and the second is the input text to be searched for matches of that regular expression. You may need to use a newline character (\n) within the input text. The only way to do this is to specify the character \ followed by the character n. The main() method converts this character sequence to the Unicode value 10.

The bulk of RegexDemo's code is enclosed in a try-catch construct. The try block first outputs the given regular expression and the input text, and then creates a Pattern object that stores the compiled regular expression (regular expressions are compiled to improve pattern matching performance). A matcher is obtained from the Pattern object and used to search for matches repeatedly until all of them are found. The catch block calls several PatternSyntaxException methods to retrieve useful information about the exception. This information is then written to the standard error stream.

There is no need to know the details of how the code works yet: they will become clear when we study the API in the second part of the article. However, you must compile Listing 1. Take the code from Listing 1, and then type the following command at the command prompt to compile RegexDemo:
javac RegexDemo.java
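The newline conversion performed by main() involves two levels of escaping, which is why the listing contains four backslashes. The sketch below isolates that one step; the class name NewlineArgDemo and the helper unescapeNewlines() are hypothetical names used only for this illustration.

```java
public class NewlineArgDemo
{
   // The Java literal "\\\\n" is the regex \\n, which matches a literal
   // backslash followed by the letter n. The replacement "\n" is a real
   // newline character (Unicode value 10).
   static String unescapeNewlines(String arg)
   {
      return arg.replaceAll("\\\\n", "\n");
   }

   public static void main(String[] args)
   {
      // Simulates a command-line argument typed as: line1\nline2
      String raw = "line1\\nline2";
      String converted = unescapeNewlines(raw);

      System.out.println(raw.length());             // 12 (backslash and n are two characters)
      System.out.println(converted.length());       // 11 (replaced by a single newline)
      System.out.println((int) converted.charAt(5)); // 10
   }
}
```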

The Pattern class and its constructs

The Pattern class, the first of the three classes that make up the Regex API, is a compiled representation of a regular expression. The SDK documentation for the Pattern class describes a variety of regular expression constructs, but if you don't actively use regular expressions, parts of this documentation may be confusing. What are quantifiers, and what is the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I will answer these and other questions in the following sections.

Literal strings

The simplest regular expression construct is a literal string. For pattern matching to succeed, some part of the input text must match the pattern of that construct. Consider the following example:
java RegexDemo apple applet
In this example, we are trying to find a match for the pattern apple in the input text applet. The following result shows the match found:
regex = apple
input = applet
Found [apple] starting at 0 and ending at 4
The output shows the regular expression and the input text, followed by an indication that apple was successfully found in applet. In addition, the starting and ending positions of this match are given: 0 and 4, respectively. The starting position is the index of the first character of the match, and the ending position is the index of its last character. Now let's say we used the following command line:
java RegexDemo apple crabapple
This time we get the following result, with different starting and ending positions:
regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8
Conversely, with applet as the regular expression and apple as the input text, no matches are found. The entire regular expression must match, but in this case the input text does not contain a t after apple.
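The three command lines above can also be reproduced programmatically. The following sketch assumes a hypothetical class LiteralDemo with a helper findAll() that records each match together with its starting index and the index of its last character, using the same m.end() - 1 convention as Listing 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralDemo
{
   // Collects every match as "text@start-end", where end is the index
   // of the last matched character (m.end() - 1), as in RegexDemo.
   static List<String> findAll(String regex, String input)
   {
      List<String> results = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         results.add(m.group() + "@" + m.start() + "-" + (m.end() - 1));
      return results;
   }

   public static void main(String[] args)
   {
      System.out.println(findAll("apple", "applet"));    // [apple@0-4]
      System.out.println(findAll("apple", "crabapple")); // [apple@4-8]
      System.out.println(findAll("applet", "apple"));    // []
   }
}
```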

Metacharacters

More interesting regular expression constructs combine literal characters with metacharacters. For example, in the regular expression a.b, the dot metacharacter (.) stands for any single character between a and b. Consider the following example:
java RegexDemo .ox "The quick brown fox jumps over the lazy ox."
This example uses .ox as the regular expression and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that start with any character and end with ox. The results of its execution are as follows:
regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41
In the output we see two matches: fox and ox (preceded by a space character). The metacharacter . matches the character f in the first case and a space in the second. What happens if we replace .ox with the dot metacharacter alone? That is, what do we get from the following command line:
java RegexDemo . "The quick brown fox jumps over the lazy ox."
Since the dot metacharacter matches any character, RegexDemo outputs a match for every character of the input text (including the trailing period):
regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
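Rather than reading through the full listing, the same two experiments can be summarized by counting matches. This is a minimal sketch; DotDemo and countMatches() are hypothetical names introduced for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotDemo
{
   // Counts how many times the regex is found in the input.
   static int countMatches(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      String text = "The quick brown fox jumps over the lazy ox.";

      System.out.println(countMatches(".ox", text)); // 2
      System.out.println(countMatches(".", text));   // 43, one per character
      System.out.println(text.length());             // 43

      // The dot requires exactly one character between a and b.
      System.out.println("a-b".matches("a.b"));      // true
      System.out.println("ab".matches("a.b"));       // false
   }
}
```

One detail worth knowing: by default the dot does not match line terminators such as \n, so the count equals the text length here only because the input contains none.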
Quoting metacharacters
To specify . or any other metacharacter as a literal character in a regular expression construct, you must quote it in one of the following ways:
  • Precede it with a backslash character.
  • Place the metacharacter between \Q and \E (for example, \Q.\E).
Remember to double every backslash (for example, \\. or \\Q.\\E) that appears in a string literal, such as String regex = "\\.";. Do not double backslashes that are part of a command-line argument.
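The following sketch shows both quoting styles written as Java string literals, with the backslashes doubled. It also uses Pattern.quote(), a standard library method that is not mentioned in the article but produces a quoted pattern for you. QuoteDemo and countMatches() are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteDemo
{
   // Counts how many times the regex is found in the input.
   static int countMatches(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      String text = "The quick brown fox jumps over the lazy ox.";

      // Unquoted dot: matches every character.
      System.out.println(countMatches(".", text));                // 43

      // Quoted dot: matches only the literal period at the end.
      System.out.println(countMatches("\\.", text));              // 1
      System.out.println(countMatches("\\Q.\\E", text));          // 1

      // Pattern.quote() builds a literal pattern without manual escaping.
      System.out.println(countMatches(Pattern.quote("."), text)); // 1
   }
}
```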

Character classes

Sometimes you have to limit the matches you're looking for to a specific set of characters. For example, you might search the text for the vowels a, e, i, o, and u, with each occurrence of a vowel counting as a match. Character classes help with such tasks: they define sets of characters between the square bracket metacharacters ([ ]). The Pattern class supports simple, negation, range, union, intersection, and subtraction character classes. We'll look at all of them now.

Simple Character Classes

A simple character class consists of characters placed side by side and matches only those characters. For example, the class [abc] matches the characters a, b, and c. Consider the following example:
java RegexDemo [csw] cave
As you can see from the results, this example matches only the character c, the one class member that occurs in cave:
regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0
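As a sketch of the same idea in code, the example below applies [csw] and the vowel class mentioned earlier, [aeiou], to the same input. SimpleClassDemo and its matches() helper are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleClassDemo
{
   // Returns the text of every match, in order of appearance.
   static List<String> matches(String regex, String input)
   {
      List<String> found = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         found.add(m.group());
      return found;
   }

   public static void main(String[] args)
   {
      System.out.println(matches("[csw]", "cave"));   // [c]
      System.out.println(matches("[aeiou]", "cave")); // [a, e]
   }
}
```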

Negation character classes

A negation character class begins with the metacharacter ^ and matches only those characters that it does not contain. For example, the class [^abc] matches all characters except a, b, and c. Consider the following example:
java RegexDemo "[^csw]" cave
Note that on my operating system (Windows) the double quotes are required because the shell treats ^ as an escape character. As you can see, this example matches only the characters a, v, and e, the characters of cave that are not in the class:
regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
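Inside a Java program the shell is not involved, so the pattern needs no extra quoting beyond the usual string literal. A minimal sketch follows, with NegatedClassDemo and matches() as hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo
{
   // Returns the text of every match, in order of appearance.
   static List<String> matches(String regex, String input)
   {
      List<String> found = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         found.add(m.group());
      return found;
   }

   public static void main(String[] args)
   {
      // Every character except c, s, and w is a match.
      System.out.println(matches("[^csw]", "cave")); // [a, v, e]
      System.out.println(matches("[^csw]", "cows")); // [o]
   }
}
```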

Range character classes

A range character class consists of two characters separated by a hyphen (-). All characters, starting with the character to the left of the hyphen and ending with the character to the right, are part of the range. For example, the range [a-z] matches all lowercase Latin letters. This is equivalent to defining the simple class [abcdefghijklmnopqrstuvwxyz]. Consider the following example:
java RegexDemo [a-c] clown
This example matches only the character c, the one character in the range that occurs in clown:
regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0
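The equivalence between a range and the corresponding simple class can be checked directly. In the sketch below, RangeClassDemo and matches() are hypothetical names, and the input "Java 8" is an arbitrary sample chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeClassDemo
{
   // Returns the text of every match, in order of appearance.
   static List<String> matches(String regex, String input)
   {
      List<String> found = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         found.add(m.group());
      return found;
   }

   public static void main(String[] args)
   {
      System.out.println(matches("[a-c]", "clown"));  // [c]

      // The range and the spelled-out class select the same characters.
      List<String> byRange = matches("[a-z]", "Java 8");
      List<String> bySimple = matches("[abcdefghijklmnopqrstuvwxyz]", "Java 8");
      System.out.println(byRange);                    // [a, v, a]
      System.out.println(byRange.equals(bySimple));   // true
   }
}
```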
Continued in: Regular Expressions in Java, Part 2; Regular Expressions in Java, Part 3; Regular Expressions in Java, Part 4; Regular Expressions in Java, Part 5.