
Regular expressions in Java, part 5

We present a translation of a short guide to regular expressions in Java, written by Jeff Friesen for the JavaWorld website. For ease of reading, we have divided the article into several parts; this part is the final one. Previous parts: Regular Expressions in Java, Part 1; Regular Expressions in Java, Part 2; Regular Expressions in Java, Part 3; Regular Expressions in Java, Part 4.

Using regular expressions for lexical analysis

An even more useful application of regular expressions is a library of reusable code for performing lexical analysis, a key component of any compiler or assembler. In this case, the input stream of characters is grouped into an output stream of tokens: names for sequences of characters that have a common meaning. For example, on encountering the character sequence c, o, u, n, t, e, r in the input stream, the lexical analyzer can output an ID (identifier) token. The sequence of characters corresponding to a token is called a lexeme.
More about tokens and lexemes
Tokens such as ID can match many character sequences. For such tokens, the compiler, assembler, or other utility that performs lexical analysis also needs the actual lexeme corresponding to the token. For tokens that represent one specific character sequence, such as the PLUS token, which corresponds only to the character +, the actual lexeme is not needed, since it can be uniquely determined from the token.
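To make the distinction concrete, here is a minimal sketch based on the standard java.util.regex API (the class name and the two patterns are purely illustrative and are not part of the Lexan library): each lexeme is classified under a token name.

import java.util.regex.Pattern;

public final class TokenVsLexeme
{
   public static void main(String[] args)
   {
      // Illustrative token patterns: an identifier and the plus operator.
      Pattern id = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");
      Pattern plus = Pattern.compile("\\+");
      for (String lexeme: new String[] { "counter", "+" })
      {
         // The token is the category name; the lexeme is the matched text.
         if (id.matcher(lexeme).matches())
            System.out.printf("ID: %s%n", lexeme);
         else if (plus.matcher(lexeme).matches())
            System.out.printf("PLUS: %s%n", lexeme);
      }
   }
}

Running this sketch prints ID: counter and PLUS: +, mirroring the ID and PLUS tokens described above.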
Regular expressions are far more efficient to work with than state-based lexical analyzers, which must be written by hand and generally cannot be reused. An example of a regular expression-based lexical analyzer is JLex, a lexical analyzer generator for the Java language that uses regular expressions to define rules for breaking an input data stream into tokens. Another example is Lexan.

Getting to know Lexan

Lexan is a reusable Java library for lexical analysis. It is based on code from the Writing a Parser in Java series of blog posts on the Cogito Learning website. The library consists of the following classes, which live in the ca.javajeff.lexan package included in the downloadable code for this article:
  • Lexan: the lexical analyzer itself;
  • LexanException: an exception thrown by the Lexan class constructor;
  • LexException: an exception thrown when incorrect syntax is detected during lexical analysis;
  • Token: a token name with a regular expression attribute;
  • TokLex: a token/lexeme pair.
The Lexan(java.lang.Class tokensClass) constructor creates a new lexical analyzer. It takes a single argument: a java.lang.Class object corresponding to a class of static Token constants. Using the Reflection API, the constructor reads all the Token constants into a Token[] array. If there are no Token constants, a LexanException is thrown. (A small stand-alone illustration of this reflection step follows the method list below.) The Lexan class also provides the following two methods:
  • The method List<TokLex> getTokLexes() returns this lexical analyzer's list of TokLexes;
  • The method void lex(String str) performs lexical analysis of the input string, placing the result into a list of TokLex values. If a character is encountered that does not match any of the patterns in the Token[] array, a LexException is thrown.
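As a stand-alone illustration of the reflection step described above, the following sketch reads static String constants instead of Token constants, so it does not depend on the Lexan library; the class and field names are invented for the example, and the real implementation appears later in Listing 4.

import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public final class ReflectConstants
{
   // A stand-in constants class that plays the role of a Token set.
   public static final class Colors
   {
      public final static String RED = "red";
      public final static String GREEN = "green";
   }

   public static void main(String[] args) throws IllegalAccessException
   {
      List<String> values = new ArrayList<>();
      // Collect every static constant of the desired type, much as Lexan collects Token constants.
      for (Field field: Colors.class.getDeclaredFields())
         if (field.getType() == String.class)
            values.add((String) field.get(null)); // null because the field is static
      System.out.println(values); // e.g. [red, green]
   }
}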
The LexanException class has no methods of its own; it relies on the inherited getMessage() method to return the exception message. In contrast, the LexException class provides the following methods:
  • The method int getBadCharIndex() returns the position of the character that does not match any of the token patterns.
  • The method String getText() returns the text that was being analyzed when the exception was thrown.
The Token class overrides the toString() method to return the token's name. It also provides a getPattern() method that returns the token's regular expression attribute. The TokLex class provides a Token getToken() method that returns its token, and a String getLexeme() method that returns its lexeme.

Demonstration of the Lexan library

To demonstrate how the Lexan library works, I wrote a LexanDemo application. It consists of the classes LexanDemo, BinTokens, MathTokens, and NoTokens. The source code of the LexanDemo application is shown in Listing 2.

Listing 2. Demonstrating the Lexan library in action
import ca.javajeff.lexan.Lexan;
import ca.javajeff.lexan.LexanException;
import ca.javajeff.lexan.LexException;
import ca.javajeff.lexan.TokLex;

public final class LexanDemo
{
   public static void main(String[] args)
   {
      lex(MathTokens.class, " sin(x) * (1 + var_12) ");
      lex(BinTokens.class, " 1 0 1 0 1");
      lex(BinTokens.class, "110");
      lex(BinTokens.class, "1 20");
      lex(NoTokens.class, "");
   }

   private static void lex(Class tokensClass, String text)
   {
      try
      {
         Lexan lexan = new Lexan(tokensClass);
         lexan.lex(text);
         for (TokLex tokLex: lexan.getTokLexes())
            System.out.printf("%s: %s%n", tokLex.getToken(),
                              tokLex.getLexeme());
      }
      catch (LexanException le)
      {
         System.err.println(le.getMessage());
      }
      catch (LexException le)
      {
         System.err.println(le.getText());
         for (int i = 0; i < le.getBadCharIndex(); i++)
            System.err.print("-");
         System.err.println("^");
         System.err.println(le.getMessage());
      }
      System.out.println();
   }
}
The main() method in Listing 2 calls the lex() utility method to demonstrate lexical analysis using Lexan. Each call to this method is passed the token class as a Class object and the string to parse. The lex() method first creates a Lexan object by passing the Class object to the Lexan class constructor, and then calls Lexan's lex() method on the string. If the lexical analysis succeeds, Lexan's getTokLexes() method is called to return a list of TokLex objects. For each of these objects, its getToken() method is called to return the token, and its getLexeme() method is called to return the lexeme. Both values are printed to standard output. If lexical analysis fails, a LexanException or LexException is thrown and handled accordingly. For brevity, let's consider only the MathTokens class from this application. Listing 3 shows its source code.

Listing 3. Describing a set of tokens for a small mathematical language
import ca.javajeff.lexan.Token;

public final class MathTokens
{
   public final static Token FUNC = new Token("FUNC", "sin|cos|exp|ln|sqrt");
   public final static Token LPAREN = new Token("LPAREN", "\\(");
   public final static Token RPAREN = new Token("RPAREN", "\\)");
   public final static Token PLUSMIN = new Token("PLUSMIN", "[+-]");
   public final static Token TIMESDIV = new Token("TIMESDIV", "[*/]");
   public final static Token CARET = new Token("CARET", "\\^");
   public final static Token INTEGER = new Token("INTEGER", "[0-9]+");
   public final static Token ID = new Token("ID", "[a-zA-Z][a-zA-Z0-9_]*");
}
Listing 3 shows that the MathTokens class describes a sequence of constants of type Token. Each of them is assigned a Token object. The constructor of this object receives a string that is the token's name, along with a regular expression that describes all the character strings associated with that token. For clarity, it is desirable for the token's string name to be the same as the name of the constant, but this is not required. The position of a Token constant in the list of tokens is important: constants located higher in the list take precedence over those located lower. For example, when it encounters sin, Lexan chooses the FUNC token rather than ID. If the ID token had preceded FUNC, ID would have been selected.
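The effect of this ordering can be reproduced with the Regex API alone. In the following sketch (my own illustration; only the two patterns are taken from Listing 3), the first pattern in the array that matches the beginning of the input wins, so sin is reported as FUNC rather than ID; swapping the two rows would make ID win.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TokenOrder
{
   public static void main(String[] args)
   {
      // Order matters: FUNC is tried before ID, just as in MathTokens.
      String[][] tokens = {
         { "FUNC", "sin|cos|exp|ln|sqrt" },
         { "ID", "[a-zA-Z][a-zA-Z0-9_]*" }
      };
      String input = "sin";
      for (String[] token: tokens)
      {
         Matcher m = Pattern.compile(token[1]).matcher(input);
         if (m.lookingAt()) // match at the beginning of the input
         {
            System.out.printf("%s: %s%n", token[0], m.group());
            break; // the first matching token wins
         }
      }
   }
}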

Compiling and running the LexanDemo application

The downloadable code for this article includes a lexan.zip archive containing all the files of the Lexan distribution. Unpack this archive and go to the demos subdirectory of the lexan root directory. If you are using Windows, run the following command to compile the demo application's source code:
javac -cp ..\library\lexan.jar *.java
If compilation succeeds, execute the following command to run the demo application:
java -cp ..\library\lexan.jar;. LexanDemo
You should see the following results:
FUNC: sin
LPAREN: (
ID: x
RPAREN: )
TIMESDIV: *
LPAREN: (
INTEGER: 1
PLUSMIN: +
ID: var_12
RPAREN: )
ONE: 1
ZERO: 0
ONE: 1
ZERO: 0
ONE: 1
ONE: 1
ONE: 1
ZERO: 0
1 20
--^
Unexpected character in input text: 20
tokens are missing
The message Unexpected character in input text: 20 results from a LexException being thrown because the BinTokens class does not declare a Token constant whose regular expression matches the character 2. Note that the exception handler outputs the position of the offending character in the analyzed text. The message tokens are missing results from a LexanException being thrown because no Token constants are declared in the NoTokens class.
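The BinTokens and NoTokens classes are not listed in this article. Judging by the demo output and the exception messages, they presumably look something like the following reconstruction (not the author's original code). BinTokens declares one Token constant per binary digit:

import ca.javajeff.lexan.Token;

public final class BinTokens
{
   public final static Token ZERO = new Token("ZERO", "0");
   public final static Token ONE = new Token("ONE", "1");
}

NoTokens, in its own source file, deliberately declares no Token constants, which is why the Lexan constructor throws a LexanException:

public final class NoTokens
{
}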

Behind the scenes

The Lexan library uses the Lexan class as its engine. Take a look at the implementation of this class in Listing 4 and note how regular expressions contribute to making the engine reusable.

Listing 4. Creating a lexical analyzer architecture based on regular expressions
package ca.javajeff.lexan;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;

/**
 *  A lexical analyzer. This class can be used to convert an input stream
 *  of characters into an output stream of tokens.
 *
 *  @author Jeff Friesen
 */

public final class Lexan
{
   private List<TokLex> tokLexes;

   private Token[] values;

   /**
    *  Initializes the lexical analyzer with a set of Token objects.
    *
    *  @param tokensClass a Class object for the class containing
    *       the set of Token objects
    *
    *  @throws LexanException if the Lexan object cannot be constructed,
    *       for example because the class contains no Token objects
    */

   public Lexan(Class tokensClass) throws LexanException
   {
      try
      {
         tokLexes = new ArrayList<>();
         List<Token> _values = new ArrayList<>();
         Field[] fields = tokensClass.getDeclaredFields();
         for (Field field: fields)
            if (field.getType().getName().equals("ca.javajeff.lexan.Token"))
               _values.add((Token) field.get(null));
         values = _values.toArray(new Token[0]);
         if (values.length == 0)
            throw new LexanException("tokens are missing");
      }
      catch (IllegalAccessException iae)
      {
         throw new LexanException(iae.getMessage());
      }
   }

   /**
    *  Gets this lexical analyzer's list of TokLexes.
    *
    *  @return the list of TokLexes
    */

   public List<TokLex> getTokLexes()
   {
      return tokLexes;
   }

   /**
    *  Performs lexical analysis of the input string, placing the result
    *  into the list of TokLexes.
    *
    *  @param str the string to be lexically analyzed
    *
    *  @throws LexException if an unexpected character is found
    *       in the input
    */

   public void lex(String str) throws LexException
   {
      String s = str.trim(); // remove leading (and trailing) whitespace
      int index = (str.length() - s.length());
      tokLexes.clear();
      while (!s.equals(""))
      {
         boolean match = false;
         for (int i = 0; i < values.length; i++)
         {
            Token token = values[i];
            Matcher m = token.getPattern().matcher(s);
            if (m.find())
            {
               match = true;
               tokLexes.add(new TokLex(token, m.group().trim()));
               String t = s;
               s = m.replaceFirst("").trim(); // remove the matched lexeme and any leading whitespace
               index += (t.length() - s.length());
               break;
            }
         }
         if (!match)
            throw new LexException("Unexpected character in input text: "
                                    + s, str, index);
      }
   }
}
The code of the lex() method is based on the code provided in the blog post Writing a Parser in Java: A Token Generator on the Cogito Learning website. Read that post to learn more about how Lexan uses the Regex API.

Conclusion

Regular expressions are a useful tool for any developer. The Regex API of the Java programming language makes them easy to use in applications and libraries. Now that you have a basic understanding of regular expressions and this API, take a look at the SDK documentation for java.util.regex to learn even more about regular expressions and additional Regex API methods.