

28 February 2021

RegEx: 20 short steps to master regular expressions. Part 1

The original of this article is here. There is probably no such thing as too much theory, and I will provide several links to more detailed material on regex at the end of the article. But it seemed to me that digging into a topic like regular expressions is much more interesting when you can not only cram, but also consolidate what you have learned right away by completing small tasks along the way.

Let's get started. Opponents of using regular expressions ('RegEx' or simply 'regex') in programming typically cite the following quote, attributed to Jamie Zawinski: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." In fact, using regular expressions is neither a good nor a bad idea in itself. It will not add problems on its own, and it will not solve any of them either. It's just a tool, and how you use it (right or wrong) determines the results you'll see. If you try to use regex to build, say, an HTML parser, you will most likely experience pain. But if you just want to extract, for example, timestamps from some strings, you'll probably be fine. To make things easier for you, I've put together this lesson, which will help you master regular expressions from scratch in just twenty short steps. This tutorial focuses mainly on the basic concepts of regular expressions and delves into more advanced topics only as necessary.

Step 1: Why use regular expressions


Regular expressions are used to search for matches in text using specified patterns. Using regex, we can easily extract words from text, as well as individual literal characters, metacharacters, and sequences of them that meet certain criteria. Here's what Wikipedia tells us about them: Regular expressions are a formal language for searching and manipulating substrings in text, based on the use of metacharacters (wildcard characters). The search uses a pattern string (also called a "template" or "mask") that consists of characters and metacharacters and defines the search rule. To manipulate text, a replacement string is additionally specified, which can also contain special characters. The pattern can be as simple as the word dog in this sentence:

The quick brown fox jumps over the lazy dog.

This regular expression looks like this:

dog

...Easy enough, isn't it? The pattern can also be any word that contains the letter o. A regular expression to find such a pattern might look like this:

\w*o\w*

( You can try this regular expression here. ) You will notice that as the matching requirements become more complex, the regular expression becomes more complex too. There are additional forms of notation for specifying groups of characters and matching repeating patterns, which I will explain below.

But once we find a match for a pattern in some text, what can we do with it? Modern regular expression engines let you extract characters or sequences of characters (substrings) from the text, remove them, or replace them with other text. In general, regular expressions are used to parse and manipulate text. We can extract, for example, substrings that look like IP addresses and then try to verify them. Or we can extract names and email addresses and store them in a database. Or we can use regular expressions to find sensitive information (such as passport numbers or phone numbers) in emails and alert the user that they may be putting themselves at risk.

Regex is truly a versatile tool that is easy to learn but difficult to master: "Just as there is a difference between playing a piece of music well and creating music, there is a difference between knowing regular expressions and understanding them." - Jeffrey E. F. Friedl, Mastering Regular Expressions
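Since this is a Java blog, here is a minimal sketch of the "find, then extract or replace" idea using the standard java.util.regex package. The class name WordsWithO and the helper findAll are made up for this illustration; they are not from the original article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsWithO {
    // Hypothetical helper: returns every substring of text that the regex matches, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        // The regex \w*o\w* has to be written with doubled backslashes in Java source.
        System.out.println(findAll("\\w*o\\w*", text)); // [brown, fox, over, dog]
        // Replacing a match with other text.
        System.out.println(text.replaceAll("dog", "cat"));
    }
}
```

The doubled backslashes are a Java string-literal detail, not part of the regular expression itself.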

Step 2: Square Brackets `[]`

The simplest regular expressions are those that just look for a character-by-character match between the pattern and the target string. Let's, for example, try to find a cat:

pattern: cat
string:  The cat was cut when it ran under the car.
matches:     ^^^

( How it works in practice - see here ) NB! All solutions here are presented as possible solutions only. In regular expressions, as in programming in general, the same problem can be solved in different ways. Besides strict character-by-character comparison, we can also specify alternative matches using square brackets:

pattern: ca[rt]
string:  The cat was cut when it ran under the car.
matches:     ^^^                               ^^^

( How it works ) The opening and closing square brackets tell the regular expression engine to match any one of the specified characters, but only one. The regular expression above will not match, for example, the whole word cart, only part of it:

pattern: ca[rt]
string:  The cat was cut when it ran under the cart.
matches:     ^^^                               ^^^

( How it works ) When you use square brackets, you tell the regular expression engine to match exactly one of the characters inside the brackets. The engine finds the character c, then the character a, but if the next character is not r or t, there is no complete match. If it finds ca followed by either r or t, it stops. It won't try to match more characters, because the square brackets indicate that only one of the contained characters needs to be matched. When it finds ca and then the r in the word cart, it stops, because it has already found a match for the sequence car.
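The behaviour described above can be checked with a short Java sketch. The class BracketDemo and the helper findWithPositions are illustrative names only; the helper records each match together with the index where it starts, which makes it visible that only car is taken from cart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketDemo {
    // Hypothetical helper: collects each match together with its start index.
    static List<String> findWithPositions(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group() + "@" + m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The cat was cut when it ran under the cart.";
        // "cut" is skipped, and only the first three letters of "cart" match.
        System.out.println(findWithPositions("ca[rt]", text)); // [cat@4, car@38]
    }
}
```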

Training objectives:

Write a regular expression that matches all 10 occurrences of had and Had in this excerpt of untranslatable wordplay:

pattern:
string:  Jim, where Bill had had "had", had had "had had". "Had had" had been correct.
matches:                 ^^^ ^^^  ^^^   ^^^ ^^^  ^^^ ^^^    ^^^ ^^^  ^^^

( See possible solution here ) What about all the animal names in the following sentence?

pattern:
string:  A bat, a cat, and a rat walked into a bar...
matches:   ^^^    ^^^        ^^^

( Possible solution ) Or even simpler: find the words bar or bat:

pattern:
string:  A bat, a cat, and a rat walked into a bar...
matches:   ^^^                                 ^^^

( Possible solution ) Now we have already learned how to write more or less complex regular expressions, and we are only at step 2! Let's continue!

Step 3: Escape Sequences


In the previous step, we learned about square brackets [] and how they help us find alternative matches with the regex engine. But what if we want to match the opening and closing square brackets [] themselves? When we wanted a character-by-character match of the word cat, we gave the regex engine that exact sequence of characters (cat). Let's try to find the square brackets [] in the same way:

pattern: []
string:  You can't match [] using regex! You will regret this!
matches:

( Let's see what happened ) Something didn't work... This is because square brackets are special characters for the regex engine: they normally indicate something else rather than a literal pattern to match. As we remember from step 2, they are used to find alternative matches, so that the regex engine can match any one of the characters between them. If you put no characters between them, it may cause an error. To match these special characters themselves, we must escape them by preceding each with a backslash character \. The backslash is another special character; it tells the regex engine to treat the next character literally rather than as a metacharacter. The regex engine will look for the characters [ and ] literally only if both are preceded by a backslash:

pattern: \[\]
string:  You can't match [] using regex! You will regret this!
matches:                 ^^

( Let's see what happened this time ) OK, what if we want to find the backslash itself? The answer is simple. Since the backslash \ is also a special character, it needs to be escaped too. How? With a backslash!

pattern: \\
string:  C:\Users\Tanja\Pictures\Dogs
matches:   ^     ^     ^        ^

( Same example in practice ) Only special characters need to be preceded by a backslash. All other characters are interpreted literally by default. For example, the regular expression t matches only the literal lowercase letter t (in the strings below, the gaps between the letters are tab characters):

pattern: t
string:  t t t t
matches: ^ ^ ^ ^

( Example ) The sequence \t, however, works differently. It is a pattern that matches a tab character:

pattern: \t
string:  t t t t
matches:  ^ ^ ^

( Example ) Some common escape sequences include \n (UNIX-style line breaks) and \r (used in Windows-style line breaks, \r\n). \r is the "carriage return" character and \n is the "line feed" character; both were defined along with the ASCII standard, when teletypewriters were still in widespread use. Other common escape sequences will be covered later in this tutorial.
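One practical wrinkle when trying these patterns from Java: the backslash is also the escape character of Java string literals, so every regex backslash has to be doubled in source code. The sketch below assumes nothing beyond java.util.regex; the class EscapeDemo and the helper countMatches are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: counts how many times the regex matches in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sentence = "You can't match [] using regex! You will regret this!";
        String path = "C:\\Users\\Tanja\\Pictures\\Dogs"; // the text C:\Users\Tanja\Pictures\Dogs
        String tabbed = "t\tt\tt\tt";                      // four letters t separated by real tabs

        System.out.println(countMatches("\\[\\]", sentence)); // regex \[\]  -> 1
        System.out.println(countMatches("\\\\", path));       // regex \\    -> 4
        System.out.println(countMatches("t", tabbed));        // regex t     -> 4
        System.out.println(countMatches("\\t", tabbed));      // regex \t    -> 3
        // Pattern.quote builds a pattern that treats its whole argument literally.
        System.out.println(countMatches(Pattern.quote("[]"), sentence)); // 1
    }
}
```

Pattern.quote is handy when the text to search for comes from user input and may contain special characters.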

In the meantime, let’s reinforce the material with a couple of simple puzzles:

Try writing a regular expression to find... a regular expression ;) The result should be something like this:

pattern:
string:  ...match this regex `\[\]` with a regex?
matches:                      ^^^^

( Solution ) Did you manage? Well done! Now try creating a regex to search for escape sequences like this:

pattern:
string:  `\r`, `\t`, and `\n` are all regex escape sequences.
matches:  ^^    ^^        ^^

( Solution )

Step 4: Look for "any" character using a dot `.`


When writing the escape sequence matching solutions we saw in the previous step, you may have wondered, "Can I match the backslash character and then any other character that follows it?"... Of course you can! There is another special character that is used to match (almost) any character - the dot (full stop) character. Here's what it does:

pattern: .
string:  I'm sorry, Dave. I'm afraid I can't do that.
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

( Example ) If you only want to match patterns that look like escape sequences, you can do something like this:

pattern: \\.
string:  Hi Walmart is my grandson there his name is "\n \r \t".
matches:                                              ^^ ^^ ^^

( Example ) And, as with all special characters, if you want to match a literal ., you need to precede it with a backslash \:

pattern: \.
string:  War is Peace. Freedom is Slavery. Ignorance is Strength.
matches:             ^                   ^                      ^

( Example )
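The "(almost)" mentioned above deserves a quick illustration. In Java's engine the dot does not match line terminators by default; the Pattern.DOTALL flag changes that. This is a small sketch with invented names (DotDemo, count), not code from the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotDemo {
    // Hypothetical helper: counts the matches of an already compiled pattern.
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // By default the dot skips the line break between a and b.
        System.out.println(count(Pattern.compile("."), "a\nb"));                 // 2
        // With DOTALL the line break is matched as well.
        System.out.println(count(Pattern.compile(".", Pattern.DOTALL), "a\nb")); // 3
        // Regex \\. : a literal backslash followed by any character.
        System.out.println(count(Pattern.compile("\\\\."), "\"\\n \\r \\t\".")); // 3
        // Regex \. : a literal dot.
        System.out.println(count(Pattern.compile("\\."),
                "War is Peace. Freedom is Slavery. Ignorance is Strength."));    // 3
    }
}
```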

Step 5: Character Ranges


What if you don't want just any character, but only letters? Or digits? Or vowels? Searching by character classes and ranges lets us do this.

`\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.

Characters are "whitespace" if they do not create a visible mark in the text. A space " ", a line break, and a tab are all whitespace. Let's say we want to find the escape sequences that represent whitespace characters only (\n, \r and \t) in the passage above, but not the other escape sequences. How could we do this?

pattern: \\[nrt]
string:  `\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^        ^^

( Example ) This works, but it's not a very elegant solution. What if we later need to match the escape sequence for the "form feed" character, \f? (This character is used to indicate page breaks in text.)

pattern: \\[nrt]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^

( Not working solution ) With this approach, we have to list each lowercase letter we want to match separately inside the square brackets. An easier way is to use a character range to match any lowercase letter:

pattern: \\[a-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

( And this already works ) Character ranges work as you might expect, given the example above. Put the first and last letters you want to match inside square brackets, with a hyphen between them. For example, if you only wanted to find "sets" of a backslash \ and one letter from a to m, you could do the following:

pattern: \\[a-m]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:                        ^^

( Example ) If you want to match multiple ranges, simply place them end-to-end between square brackets:

pattern: \\[a-gq-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:        ^^    ^^        ^^

( Example ) Other common character ranges include A-Z and 0-9.
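The three range patterns from this step can be run side by side in Java. This is only a sketch: the class RangeDemo and the helper findAll are names made up for the illustration, and the sample text is the sentence used in the examples above, written as a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeDemo {
    // Hypothetical helper: returns every substring of text that the regex matches.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // The sample sentence; each \\ in the literal is one backslash in the text.
    static final String TEXT =
            "`\\n`, `\\r`, `\\t`, and `\\f` are whitespace characters, `\\.`, `\\\\` and `\\[` are not.";

    public static void main(String[] args) {
        System.out.println(findAll("\\\\[a-z]", TEXT));    // [\n, \r, \t, \f]
        System.out.println(findAll("\\\\[a-m]", TEXT));    // [\f]
        System.out.println(findAll("\\\\[a-gq-z]", TEXT)); // [\r, \t, \f]
    }
}
```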

Let's try them in practice and solve a couple of problems:

Hexadecimal numbers can contain the digits 0-9 as well as the letters A-F. When used to specify colors, hexadecimal codes can be as short as three characters. Create a regular expression to find the valid hexadecimal codes in the list below:

pattern:
string:  1H8 4E2 8FF 0P1 T8B 776 42B G12
matches:     ^^^ ^^^         ^^^ ^^^

( Solution ) Using character ranges, create a regular expression that selects only the lowercase consonants (non-vowel letters, including y) in the sentence below:

pattern:
string:  The walls in the mall are totally, totally tall.
matches:  ^  ^ ^^^  ^ ^^  ^ ^^  ^  ^ ^ ^^^  ^ ^ ^^^ ^ ^^

( Solution )

Step 6: "not", caret, circumflex... symbol `^`


Truly, there are over 9000 names for this symbol :) But for simplicity, let's settle on "not". My solution to the last problem is a bit long. It took 17 characters to say "get the entire alphabet except the vowels." Of course, there is an easier way. The "not" sign ^ lets us specify characters and ranges of characters that must not match. A simpler solution to the last problem is to match characters that are not vowels:

pattern: [^aeiou]
string:  The walls in the mall are totally, totally tall.
matches: ^^ ^^ ^^^^ ^^^^ ^^ ^^^ ^ ^^ ^ ^^^^^^ ^ ^^^^^ ^^^

( Example ) The "not" sign ^ as the leftmost character inside the square brackets [] tells the regular expression engine to match one (any) character that is not listed in the brackets. This means that the regular expression above also matches all the spaces, the period ., the comma ,, and the capital T at the beginning of the sentence. To exclude them, we can put them in the square brackets as well:

pattern: [^aeiou .,T]
string:  The walls in the mall are totally, totally tall.
matches:  ^  ^ ^^^  ^ ^^  ^ ^^  ^  ^ ^ ^^^  ^ ^ ^^^ ^ ^^

( Example ) Note that in this case we don't need to escape the period with a backslash, as we did before when we looked for it outside square brackets. Many special characters are treated literally inside square brackets, including the opening bracket [, but not the closing bracket ] (can you guess why?). The backslash character \ is not interpreted literally either. If you want to match a literal backslash \ inside square brackets, you must escape it with another backslash: \\. This behavior was designed so that whitespace escape sequences could also be placed in square brackets for matching (the gaps between the letters in the string below are tab characters):

pattern: [\t]
string:  t t t t
matches:  ^ ^ ^

( Example ) The "not" sign ^ can also be used with ranges. If I wanted to capture only the characters a, b, c, x, y and z, I could do something like this:

pattern: [abcxyz]
string:  abcdefghijklmnopqrstuvwxyz
matches: ^^^                    ^^^

( Example ) ...or I could specify that I want to find any character that is not between d and w:

pattern: [^d-w]
string:  abcdefghijklmnopqrstuvwxyz
matches: ^^^                    ^^^

( Example ) However, be careful with the "not" sign ^. It's easy to think, "Well, I specified [^b-f], so I should get the lowercase letter a or something after f." That's not the case. This regex will match any character not in that range, including letters, numbers, punctuation, and spaces.

pattern: [^d-w]
string:  abcdefg h.i,j-klmnopqrstuvwxyz
matches: ^^^    ^ ^ ^ ^             ^^^

( Example )
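The pitfall above is easy to reproduce in Java. The sketch below uses invented names (NegationDemo, keepMatches) and its own sample strings. The last pattern relies on character class intersection (&&), a feature of Java's regex syntax that other engines may not support, as one possible way to say "a lowercase letter, but not one between d and w".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegationDemo {
    // Hypothetical helper: glues all matches together so the surviving characters are easy to read.
    static String keepMatches(String regex, String text) {
        StringBuilder kept = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        String mixed = "abcdefg h.i,j-klmnopqrstuvwxyz";

        System.out.println(keepMatches("[abcxyz]", alphabet)); // abcxyz
        System.out.println(keepMatches("[^d-w]", alphabet));   // abcxyz
        // The negated range also lets the space and punctuation through.
        System.out.println(keepMatches("[^d-w]", mixed));      // abc .,-xyz
        // Java-specific: intersect a-z with "not d-w" to keep letters only.
        System.out.println(keepMatches("[a-z&&[^d-w]]", mixed)); // abcxyz
    }
}
```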

Leveling up tasks:

Use the "not" sign ^ in square brackets to match all the words below that do not end in y:

pattern:
string:  day dog hog hay bog bay ray rub
matches:     ^^^ ^^^     ^^^         ^^^

( Solution ) Write a regular expression using a range and a "not" sign ^ to find all years between 1977 and 1982 (inclusive):

pattern:
string:  1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
matches:           ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^

( Solution ) Write a regular expression to find all characters that are not the "not" sign character ^:

pattern:
string:  abc1^23*()
matches: ^^^^ ^^^^^

( Solution )

Step 7: Character Classes

Character classes are even simpler than character ranges. Different regular expression engines provide different classes, so I'll only cover the main ones here. (Check which regex flavor you are using, because it may offer more classes, or they may differ from those shown here.) Character classes work almost like ranges, but you cannot specify the 'start' and 'end' values:

Class	Meaning	Equivalent
`\d`	"digits"	`[0-9]`
`\w`	"word characters"	`[A-Za-z0-9_]`
`\s`	"whitespace"	`[ \t\r\n\f]`

The "word" character class \w is especially useful because this character set is often what is required for valid identifiers (variable names, function names, etc.) in various programming languages. We can use \w to simplify the regular expression we saw earlier:

pattern: \\[a-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

Using \w, we can write it like this:

pattern: \\\w
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

( Example )
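As a rough check of the table above in Java, the sketch below runs each class against a small sample string. The names ClassDemo and findAll, and the sample strings, are invented for the example. Note that Java's own documentation defines \s slightly more broadly than the table: it also includes the vertical tab character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassDemo {
    // Hypothetical helper: returns every substring of text that the regex matches.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \d: digits only.
        System.out.println(findAll("\\d", "Room 101, floor 3")); // [1, 0, 1, 3]
        // \w: letters, digits and underscore; the space and $ are skipped.
        System.out.println(findAll("\\w", "a_1 $"));             // [a, _, 1]
        // \s: the space, the tab and the line break are matched.
        System.out.println(findAll("\\s", "a b\tc\nd").size());  // 3
        // Regex \\\w : a backslash followed by one word character.
        String text = "`\\n`, `\\r`, `\\t`, and `\\f` are whitespace characters, `\\.` and `\\[` are not.";
        System.out.println(findAll("\\\\\\w", text));            // [\n, \r, \t, \f]
    }
}
```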

2 tasks for luck:

As you and I know, in Java an identifier (the name of a variable, class, function, etc.) can only begin with a letter a-z or A-Z, a dollar sign $, or an underscore _. (Starting a name with an underscore is, of course, bad style, but the compiler allows it. Translator's note.) The remaining characters must be "word" characters \w. Using one or more character classes, create a regular expression to find the valid Java identifiers among the following three-character sequences:

pattern:
string:  __e $12 .x2 foo Bar 3mm
matches: ^^^ ^^^     ^^^ ^^^

( Solution ) US Social Security Numbers (SSNs) are 9-digit numbers in the format XXX-XX-XXXX, where each X can be any digit [0-9]. Using one or more character classes, write a regular expression to find the correctly formatted SSNs in the list below:

pattern:
string:  113-25=1902 182-82-0192 H23-_3-9982 1I1-O0-E38B
matches:             ^^^^^^^^^^^

( Solution )

Continued in:
RegEx: 20 short steps to master regular expressions. Part 2
RegEx: 20 short steps to master regular expressions. Part 3
RegEx: 20 short steps to master regular expressions. Part 4
