

8 August 2023

RegEx: 20 short steps for mastering regular expressions. Part 1

The original of this article is here. There is probably no such thing as too much theory, and at the end of the article I will provide several links to more detailed material on regex. Still, it seemed to me that a topic like regular expressions is much more interesting to dig into when you can consolidate what you learn right away by solving small tasks along the way, rather than just cramming.

Let's get started. A common quotation against the use of regular expressions ('RegEx' or simply 'regex') in programming is attributed to Jamie Zawinski: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." In fact, using regular expressions is neither a good nor a bad idea in itself. It will not add problems, and it will not solve any on its own. It's just a tool, and how you use it (correctly or incorrectly) determines the results you will see. If you try to use regex to build, say, an HTML parser, you will most likely experience pain. But if you just want to extract, for example, timestamps from some strings, you'll probably be fine. To make regular expressions easier to learn, I have put together this tutorial, which takes you from scratch in just twenty short steps. It focuses mainly on the basic concepts and only dives into more advanced topics as needed.

Step 1: Why Use Regular Expressions


Regular expressions are used to search text for matches to given patterns. With regex we can easily extract ~~the raisins from the cupcake~~ words from a text, as well as individual literal and meta (special) characters and their sequences that meet certain criteria. Here's what Wikipedia tells us about them: regular expressions are a formal language for searching and manipulating substrings in text, based on the use of metacharacters (wildcard characters). A search uses a pattern string (also called a "template" or "mask") that consists of characters and metacharacters and defines the search rule. For text manipulation, a replacement string is additionally specified, which can also contain special characters. The pattern can be as simple as the word `dog` in this sentence:

The quick brown fox jumps over the lazy dog.

This regular expression looks like this:

dog

... Easy enough, isn't it? The pattern can also be any word that contains the letter o. A regular expression to search for such a pattern might look like this:

\w*o\w*

(You can try this regular expression here.) As the requirements for a "match" become more complex, so does the regular expression. There are additional notations for specifying groups of characters and for matching repeated patterns, which I'll explain below. But once we find a pattern match in some text, what can we do with it? Modern regular expression engines let you extract characters or their sequences (substrings) from the text, remove them, or replace them with other text. In general, regular expressions are used to parse and manipulate text. We can extract, for example, substrings that look like IP addresses and then try to check them. Or we can extract names and email addresses and store them in a database. Or we can use regular expressions to find sensitive information (such as passport numbers or phone numbers) in emails and warn users that they may be putting themselves at risk. Regex really is a versatile tool that is easy to learn but hard to master: "Just as there is a difference between playing a piece of music well and making music, there is a difference between knowing regular expressions and understanding them." (Jeffrey E. F. Friedl, Mastering Regular Expressions)
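To make the "extract or replace" idea concrete, here is a minimal sketch using Python's standard `re` module. The choice of Python and the variable names are assumptions made for illustration; the two patterns are the ones discussed above.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# Extract every word that contains the letter "o".
words_with_o = re.findall(r"\w*o\w*", sentence)
print(words_with_o)  # ['brown', 'fox', 'over', 'dog']

# Replace each literal match of "dog" with other text.
replaced = re.sub(r"dog", "cat", sentence)
print(replaced)  # The quick brown fox jumps over the lazy cat.
```

`re.findall` returns the matched substrings as a list, while `re.sub` returns a new string and leaves the original untouched.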

Step 2: square brackets `[]`

The simplest regular expressions are those that just look for a character-by-character match between the pattern and the target string. Let's try to find a cat, for example:

pattern: cat
string:  The cat was cut when it ran under the car.
matches:     ^^^

(See how it works in practice here.) NB! The solutions in this article are only some of the possible ones. In regular expressions, as in programming in general, the same problem can be solved in different ways. Besides strict character-by-character comparison, we can also specify alternative matches using square brackets:

pattern: ca[rt]
string:  The cat was cut when it ran under the car.
matches:     ^^^                               ^^^

(How it works.) The opening and closing square brackets tell the regular expression engine to look for any one of the listed characters, but only one. The regular expression above will not find, for example, the whole word `cart`, but only a part of it:

pattern: ca[rt]
string:  The cat was cut when it ran under the cart.
matches:     ^^^                               ^^^

(How it works.) When you use square brackets, you tell the regular expression engine to match exactly one of the characters inside the brackets. The engine finds a `c`, then an `a`, and if the next character is neither `r` nor `t`, there is no match. If it finds `ca` followed by either `r` or `t`, it stops there. It does not try to match more characters, because the square brackets stand for just one of the listed characters. In the word `cart`, it finds `ca`, then `r`, and stops because it has already matched the sequence `car`.
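This behavior can be checked in code. The following is a minimal sketch using Python's standard `re` module (the language choice and variable names are assumptions for illustration); it lists each match of `ca[rt]` together with the index where it starts:

```python
import re

text = "The cat was cut when it ran under the cart."

# [rt] stands for exactly one character: either "r" or "t".
matches = [(m.group(), m.start()) for m in re.finditer(r"ca[rt]", text)]
print(matches)  # [('cat', 4), ('car', 38)]
```

Note that `cut` is skipped entirely, and only the first three letters of `cart` are reported.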

Practice tasks:

Write a regular expression that finds all 10 occurrences of `had` and `Had` in this untranslatable bit of wordplay:

pattern:
string:  Jim, where Bill had had "had", had had "had had". "Had had" had been correct.
matches:                 ^^^ ^^^  ^^^   ^^^ ^^^  ^^^ ^^^    ^^^ ^^^  ^^^

(See a possible solution here.) What about all the animal names in the next sentence?

pattern:
string:  A bat, a cat, and a rat walked into a bar...
matches:   ^^^    ^^^        ^^^

(Possible solution.) Or even simpler: find only the words `bar` and `bat`:

pattern:
string:  A bat, a cat, and a rat walked into a bar...
matches:   ^^^                                 ^^^

(Possible solution.) We have already learned how to write more or less complex regular expressions, and we are only at step 2! Let's continue!

Step 3: escape sequences


In the previous step, we learned about square brackets `[]` and how they help us find alternative matches. But what if we want to match the opening and closing square brackets `[]` themselves? When we wanted a character-by-character match for the word `cat`, we gave the regex engine that exact sequence of characters (`cat`). Let's try to find the square brackets `[]` the same way:

pattern: []
string:  You can't match [] using regex! You will regret it!
matches:

(See what happened.) Something didn't work, however... This is because square brackets are special characters for the regex engine: they normally mean something else and are not treated as literal characters to search for. As we remember from step 2, they are used to specify alternatives, so that the engine can match any one of the characters between them. If you put no characters between them, the result may be an error. To match these special characters literally, we must escape them by prefixing each with a backslash `\`. The backslash is another special character; it tells the regex engine to treat the next character literally rather than as a metacharacter. The engine will look for a literal `[` and `]` only if each of them is preceded by a backslash:

pattern: \[\]
string:  You can't match [] using regex! You will regret it!
matches:                 ^^

(Let's see what happened this time.) OK, but what if we want to find the backslash itself? The answer is simple. Since the backslash `\` is also a special character, it must be escaped too. With what? With a backslash!

pattern: \\
string:  C:\Users\Tanja\Pictures\Dogs
matches:   ^     ^     ^        ^

(The same example in action.) Only special characters need to be preceded by a backslash. All other characters are interpreted literally by default. For example, the regular expression `t` matches only the lowercase letter `t`:

pattern: t
string:  tttt
matches: ^^^^

(Example.) However, the sequence `\t` works differently. It is a pattern that matches a tab character:

pattern: \t
string:  t→t→t→t   (each → stands for one tab character)
matches:  ^ ^ ^

(Example.) Some common escape sequences include `\n` (UNIX-style line breaks) and `\r` (used in Windows-style line breaks, `\r\n`). `\r` is the "carriage return" character and `\n` is the "line feed" character; both were defined along with the ASCII standard, when teletypes were still in common use. Other common escape sequences will be covered later in this guide.
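The escaping rules from this step can be sketched with Python's standard `re` module. The language choice and the variable names are assumptions for illustration, and other engines may react differently to the empty brackets (Python's engine rejects them as an error).

```python
import re

# An unescaped, empty pair of brackets is not a valid pattern in Python's engine.
try:
    re.compile("[]")
    empty_brackets_ok = True
except re.error:
    empty_brackets_ok = False

# Escaped brackets are matched literally.
brackets = re.findall(r"\[\]", "You can't match [] using regex!")

# A literal backslash is matched by an escaped backslash.
backslashes = re.findall(r"\\", r"C:\Users\Tanja\Pictures\Dogs")

# "t" matches the letter, while "\t" matches a tab character.
sample = "t\tt\tt\tt"  # four letters separated by three tabs
letters = re.findall(r"t", sample)
tabs = re.findall(r"\t", sample)

print(empty_brackets_ok, brackets, len(backslashes), len(letters), len(tabs))
# False ['[]'] 4 4 3
```

The patterns are written as raw strings (`r"..."`) so that Python itself does not interpret the backslashes before the regex engine sees them.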

In the meantime, let's consolidate the material with a couple of simple tasks:

Try writing a regex to search for... a regex ;) The result should look something like this:

pattern:
string:  ...match this regex `\[\]` with a regex?
matches:                      ^^^^

(Solution.) Did you manage it? Well done! Now try creating a regex that finds these escape sequences:

pattern:
string:  `\r`, `\t`, and `\n` are all regex escape sequences.
matches:  ^^    ^^        ^^

(Solution)

Step 4: matching "any" character with a dot `.`


While writing the solutions for finding escape sequences in the previous step, you may have wondered: "Can I match a backslash and then any other character that follows it?" Of course you can! There is another special character that matches (almost) any character: the dot (full stop). Here is what it does:

pattern: .
string:  I'm sorry, Dave. I'm afraid I can't do that.
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Example.) If you only want to match patterns that look like escape sequences, you can do something like this:

pattern: \\.
string:  Hi Walmart is my grandson there his name is "\n \r \t".
matches:                                              ^^ ^^ ^^

(Example.) And, as with all special characters, if you want to match a literal `.`, you need to precede it with a `\`:

pattern: \.
string:  War is Peace. Freedom is Slavery. Ignorance is Strength.
matches:             ^                   ^                      ^

( Example )
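The three uses of the dot from this step can be compared side by side. This is a minimal sketch using Python's standard `re` module (the language choice, the shortened sample string and the variable names are assumptions for illustration). It also shows why the dot matches only "almost" any character: by default it skips the line feed.

```python
import re

# The dot matches every character of this one-line string.
quote = "I'm sorry, Dave. I'm afraid I can't do that."
all_chars = re.findall(r".", quote)

# By default the dot does not match a line feed.
no_newline = re.findall(r".", "a\nb")

# A backslash followed by any one character.
escapes = re.findall(r"\\.", r'his name is "\n \r \t".')

# An escaped dot matches only a literal full stop.
dots = re.findall(r"\.", "War is Peace. Freedom is Slavery. Ignorance is Strength.")

print(len(all_chars) == len(quote), no_newline, escapes, len(dots))
# True ['a', 'b'] ['\\n', '\\r', '\\t'] 3
```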

Step 5: character ranges


What if you don't need just any character, but want to find only letters in the text? Or digits? Or vowels? Character classes and ranges let us do this.

`\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.

Characters are "whitespace" if they do not create a visible mark in the text: a space " ", a line break, or a tab. Suppose we want to find the escape sequences that represent whitespace characters, `\n`, `\r` and `\t`, in the passage above, but not the other escape sequences. How could we do that?

pattern: \\[nrt]
string:  `\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^        ^^

(Example.) This works, but it's not a very elegant solution. What if we later need to match the escape sequence for the "form feed" character, `\f`? (This character is used to indicate page breaks in text.)

pattern: \\[nrt]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^

(A solution that does not work.) With this approach, we have to list every lowercase letter we want to match inside the square brackets. An easier way is to use a character range that matches any lowercase letter:

pattern: \\[a-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

(This one works.) Character ranges work as you might expect from the example above. Put the first and last letters you want to match in square brackets, with a hyphen between them. For example, if you only want to find a backslash `\` followed by a single letter from `a` to `m`, you can do the following:

pattern: \\[a-m]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:                        ^^

(Example.) If you want to match multiple ranges, just put them back to back between the square brackets:

pattern: \\[a-gq-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:        ^^    ^^        ^^

(Example.) Other common character ranges include `A-Z` and `0-9`.
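The three range patterns from this step can be run against the same passage. This is a minimal sketch using Python's standard `re` module (the language choice and variable names are assumptions for illustration); the passage is written as a raw string so that its backslashes stay literal.

```python
import re

text = r"`\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not."

# A backslash followed by any lowercase letter.
any_lowercase = re.findall(r"\\[a-z]", text)

# A backslash followed by a letter from a to m.
a_to_m = re.findall(r"\\[a-m]", text)

# Two ranges back to back: a to g, or q to z.
two_ranges = re.findall(r"\\[a-gq-z]", text)

print(any_lowercase)  # ['\\n', '\\r', '\\t', '\\f']
print(a_to_m)         # ['\\f']
print(two_ranges)     # ['\\r', '\\t', '\\f']
```

Python prints each backslash doubled because it shows the list items in their escaped form; every item is really two characters long.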

Let's try them in practice and solve a couple of problems:

Hexadecimal numbers can contain the digits `0-9` as well as the letters `A-F`. When used to specify colors, hexadecimal codes can be as short as three characters. Create a regular expression to find the valid hex codes in the list below:

pattern:
string:  1H8 4E2 8FF 0P1 T8B 776 42B G12
matches:     ^^^ ^^^         ^^^ ^^^

(Solution.) Using character ranges, create a regular expression that selects only the lowercase consonants (non-vowel letters, including `y`) in the sentence below:

pattern:
string:  The walls in the mall are totally, totally tall.
matches:  ^  ^ ^^^  ^ ^^  ^ ^^  ^  ^ ^ ^^^  ^ ^ ^^^ ^ ^^

(Solution)

Step 6: "not", caret, circumflex... character `^`


Indeed, this symbol has over 9000 names :) For simplicity, let's settle on "not". My solution to the last problem is a bit long: it took 17 characters to say "get the whole alphabet except the vowels". Of course, there is an easier way. The "not" sign `^` lets us specify characters and ranges of characters that must not be matched. An easier solution to the last problem is to find the characters that are not vowels:

pattern: [^aeiou]
string:  The walls in the mall are totally, totally tall.
matches: ^^ ^^ ^^^^ ^^^^ ^^ ^^^ ^ ^^ ^ ^^^^^^ ^ ^^^^^ ^^^

(Example.) The "not" sign `^` as the leftmost character inside square brackets `[]` tells the regular expression engine to match one (any) character that is not listed in the brackets. This means that the regex above also matches all the spaces, the full stop `.`, the comma `,` and the capital `T` at the beginning of the sentence. To exclude them, we can simply put them in the square brackets as well:

pattern: [^aeiou .,T]
string:  The walls in the mall are totally, totally tall.
matches:  ^  ^ ^^^  ^ ^^  ^ ^^  ^  ^ ^ ^^^  ^ ^ ^^^ ^ ^^

(Example.) Note that in this case we don't need to escape the dot with a backslash, as we did earlier when matching it outside square brackets. In many engines, special characters are treated literally inside square brackets, including the opening bracket `[`, but not the closing bracket `]` (guess why?). The backslash `\` is not taken literally either. If you want to match a literal backslash `\` inside square brackets, you must escape it with another backslash: `\\`. This behavior is intentional, so that escape sequences for whitespace characters can also be used inside square brackets:

pattern: [\t]
string:  t→t→t→t   (each → stands for one tab character)
matches:  ^ ^ ^

(Example.) The "not" sign `^` can also be used with ranges. If I wanted to capture only the characters `a`, `b`, `c`, `x`, `y` and `z`, I could do something like this:

pattern: [abcxyz]
string:  abcdefghijklmnopqrstuvwxyz
matches: ^^^                    ^^^

(Example.) ... or I could specify that I want to find any character that is not between `d` and `w`:

pattern: [^d-w]
string:  abcdefghijklmnopqrstuvwxyz
matches: ^^^                    ^^^

(Example.) However, be careful with the "not" sign `^`. It's easy to think: "Well, I specified `[^b-f]`, so I should get the lowercase letter `a` or a letter after `f`." That's not the case. This regular expression matches any character outside that range, including letters, digits, punctuation, and spaces.

pattern: [^d-w]
string:  abcdefg h.i,j-klmnopqrstuvwxyz
matches: ^^^    ^ ^ ^ ^             ^^^

( Example )
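The pitfalls of negation are easy to see in code. This is a minimal sketch using Python's standard `re` module; the language choice, the variable names and the last sample string (a lowercase alphabet with a space and some punctuation mixed in) are assumptions for illustration.

```python
import re

sentence = "The walls in the mall are totally, totally tall."

# Everything that is not a lowercase vowel: this includes the spaces,
# the punctuation and the capital "T".
not_vowels = re.findall(r"[^aeiou]", sentence)

# Listing those extra characters in the brackets as well leaves only the
# lowercase consonants. The dot needs no backslash inside the brackets.
consonants = "".join(re.findall(r"[^aeiou .,T]", sentence))

# A negated range matches ANY character outside the range, not only letters.
outside_d_to_w = re.findall(r"[^d-w]", "abcdefg h.i,j-klmnopqrstuvwxyz")

print(len(not_vowels))   # 36
print(consonants)        # hwllsnthmllrttllyttllytll
print(outside_d_to_w)    # ['a', 'b', 'c', ' ', '.', ',', '-', 'x', 'y', 'z']
```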

Practice tasks:

Use the "not" sign `^` in square brackets to match all the words below that do not end in `y`:

pattern:
string:  day dog hog hay bog bay ray rub
matches:     ^^^ ^^^     ^^^         ^^^

(Solution.) Write a regular expression using a range and the "not" sign `^` to find all the years between 1977 and 1982 (inclusive):

pattern:
string:  1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
matches:           ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^

(Solution.) Write a regular expression that finds all characters that are not the "not" sign `^`:

pattern:
string:  abc1^23*()
matches: ^^^^ ^^^^^

(Solution)

Step 7: Character classes

Character classes are even simpler than character ranges. Different regex engines offer different classes, so I'll only cover the main ones here. (Check which regex flavor you're using, because it may have more classes, or they may differ from those shown here.) Character classes work much like ranges, except that you can't specify the "start" and "end" values:

Class	Characters
`\d`	"digits" `[0-9]`
`\w`	"word characters" `[A-Za-z0-9_]`
`\s`	"whitespace" `[ \t\r\n\f]`

The "word" character class `\w` is especially useful because this character set is often what is required for valid identifiers (names of variables, functions, etc.) in various programming languages. We can use `\w` to simplify a regular expression we saw earlier:

pattern: \\[a-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

Using `\w`, we can write it like this:

pattern: \\\w
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

( Example )
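The three classes from the table can be tried out with a minimal sketch in Python's standard `re` module (the language choice, the sample strings and the variable names are assumptions for illustration). In Python, `\d`, `\w` and `\s` also match non-ASCII characters by default, so the sketch passes the `re.ASCII` flag to stay close to the sets in the table.

```python
import re

text = r"`\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not."

# On this passage, the range and the class find the same four sequences.
with_range = re.findall(r"\\[a-z]", text)
with_class = re.findall(r"\\\w", text, flags=re.ASCII)

# With re.ASCII, \d is [0-9] and \w is [A-Za-z0-9_]; \s covers the
# characters from the table plus the vertical tab.
digits = re.findall(r"\d", "abc1^23*()", flags=re.ASCII)
word_chars = re.findall(r"\w", "foo_1!", flags=re.ASCII)
spaces = re.findall(r"\s", "a b\tc\n", flags=re.ASCII)

print(with_range == with_class)  # True
print(digits)                    # ['1', '2', '3']
print(word_chars)                # ['f', 'o', 'o', '_', '1']
print(spaces)                    # [' ', '\t', '\n']
```

Keep in mind that `\\\w` is broader than `\\[a-z]`: it would also match a backslash followed by a digit, an uppercase letter or an underscore.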

Two tasks for good luck:

As you and I know, in Java an identifier (the name of a variable, class, function, etc.) can only begin with a letter `a-z` or `A-Z`, a dollar sign `$`, or an underscore `_`. (Translator's note: starting a name with an underscore is, of course, bad style, but the compiler accepts it.) The remaining characters must be "word" characters `\w`. Using one or more character classes, create a regular expression that finds the valid Java identifiers among the following three-character sequences:

pattern:
string:  __e $12 .x2 foo Bar 3mm
matches: ^^^ ^^^     ^^^ ^^^

(Solution.) US Social Security Numbers (SSNs) are 9-digit numbers in the format XXX-XX-XXXX, where each X can be any digit `[0-9]`. Using one or more character classes, write a regular expression to find the properly formatted SSNs in the list below:

pattern:
string:  113-25=1902 182-82-0192 H23-_3-9982 1I1-O0-E38B
matches:             ^^^^^^^^^^^

(Solution)

Continued in: RegEx: 20 short steps for mastering regular expressions, Part 2, Part 3, and Part 4.
