28 February 2021

RegEx: 20 short steps to master regular expressions. Part 3

Earlier parts of this series: RegEx: 20 short steps to master regular expressions. Part 1 and Part 2.

In this part we'll move on to things that are a little more complex. But mastering them, as before, will not be difficult. I repeat that RegEx is actually easier than it might seem at first, and you don't need to be a rocket scientist to master it and start using it in practice. The English original of this article is here.

Step 11: Parentheses `()` as Capturing Groups

In the last problem, we looked for different types of integer values and floating point (dot) numeric values. But the regular expression engine didn't differentiate between these two types of values, since everything was captured in one big regular expression. We can tell the regular expression engine to differentiate between different types of matches if we enclose our mini-patterns in parentheses:

pattern: ([A-Z])|([a-z])
string:   The current President of Bolivia is Evo Morales.
matches:  ^^^ ^^^^^^^ ^^^^^^^^^ ^^ ^^^^^^^ ^^ ^^^ ^^^^^^^
group:    122 2222222 122222222 22 1222222 22 122 1222222

The above regular expression defines two capture groups, which are indexed starting at 1. The first capture group matches any single uppercase letter, and the second capture group matches any single lowercase letter. By using the 'or' sign | and parentheses () as capturing groups, we can define a single regular expression that matches multiple kinds of strings. If we apply this to our long/float search regex from the previous part of the article, the regex engine will capture the corresponding matches in the appropriate groups. By checking which group a substring matched, we can immediately determine whether it is a float value or a long value:

pattern: (\d*\.\d+[fF]|\d+\.\d*[fF]|\d+[fF])|(\d+[lL])
string:   42L 12 x 3.4f 6l 3.3 0F LF .2F 0.
matches:  ^^^      ^^^^ ^^     ^^    ^^^
group:    222      1111 22     11    111

This regular expression is quite complex, and to understand it better, let's break it down and look at each of these patterns:

( // matches any "float" substring
  \d*\.\d+[fF]
  |
  \d+\.\d*[fF]
  |
  \d+[fF]
)
| //OR
( // matches any "long" substring
  \d+[lL]
)

The | sign and capturing groups in parentheses () allow us to match different types of substrings. In this case, we are matching either floating point numbers ("float") or long integers ("long").

(
  \d*\.\d+[fF] // 1+ digits to the right of the decimal point
  |
  \d+\.\d*[fF] // 1+ digits to the left of the decimal point
  |
  \d+[fF] // no dot, only 1+ digits
)
|
(
  \d+[lL] // no dot, only 1+ digits
)

In the "float" capture group, we have three options: numbers with at least 1 digit to the right of the decimal point, numbers with at least 1 digit to the left of the decimal point, and numbers with no decimal point. Any of them is a "float" as long as it has the letter "f" or "F" appended to the end. Inside the "long" capture group we have only one option: 1 or more digits followed by the character "l" or "L". The regular expression engine will look for these substrings in a given string and index them into the appropriate capture group.

Note that we are not matching any of the numbers that do not end in "l", "L", "f" or "F". How should these numbers be classified? Well, if they have a decimal point, the Java language defaults to "double". Otherwise they must be "int".
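As a rough sketch of how the float/long regex can be used from code, the example below uses Python's re module (an assumption of this sketch; the article's patterns are not tied to any one engine). Verbose mode lets the pattern keep the same comments as the breakdown above, and the match object's lastindex attribute tells us which capture group matched. The names GROUP_NAMES and classify_numbers are made up for illustration.

```python
import re

# re.VERBOSE ignores whitespace and allows "#" comments inside the pattern,
# so the regex can be laid out like the commented breakdown in the text.
NUMBER = re.compile(r"""
    (                    # group 1: "float" literals
        \d*\.\d+[fF]     #   1+ digits to the right of the decimal point
      | \d+\.\d*[fF]     #   1+ digits to the left of the decimal point
      | \d+[fF]          #   no dot, only 1+ digits
    )
  |
    (                    # group 2: "long" literals
        \d+[lL]          #   no dot, only 1+ digits
    )
""", re.VERBOSE)

# Hypothetical labels for this sketch: capture group index -> kind of literal.
GROUP_NAMES = {1: "float", 2: "long"}


def classify_numbers(text):
    """Return (substring, kind) for every match, based on the group that captured it."""
    return [(m.group(0), GROUP_NAMES[m.lastindex]) for m in NUMBER.finditer(text)]


print(classify_numbers("42L 12 x 3.4f 6l 3.3 0F LF .2F 0."))
```

Numbers without a suffix ("12", "3.3", "0.") are skipped, exactly as in the example string above.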

Let’s consolidate what we’ve learned with a couple of puzzles:

Add two more capture groups to the above regex so that it also classifies double or int numbers. (This is another tricky question, don't be discouraged if it takes a while, as a last resort see my solution.)

pattern:
string:   42L 12 x 3.4f 6l 3.3 0F LF .2F 0.
matches:  ^^^ ^^   ^^^^ ^^ ^^^ ^^    ^^^ ^^
group:    333 44   1111 33 222 11    111 22

( Solution )

The next problem is a little simpler. Use parenthesized capture groups (), the 'or' sign | and character ranges to sort the following ages into "legal to drink in the US" (>= 21) and "not allowed to drink in the US" (< 21):

pattern:
string:   7 10 17 18 19 20 21 22 23 24 30 40 100 120
matches:  ^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^^ ^^^
group:    2 22 22 22 22 22 11 11 11 11 11 11 111 111

( Solution )

Step 12: Identify More Specific Matches First

You may have had some trouble with the last task if you tried to define "legal drinkers" as the first capture group rather than the second. To understand why, let's look at another example. Suppose we want to capture separately surnames containing fewer than 4 characters and surnames containing 4 or more characters. Let's put the shorter names in the first capture group and see what happens:

pattern: ([A-Z][a-z]?[a-z]?)|([A-Z][a-z][a-z][a-z]+)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^  ^^ ^^^   ^^^  ^^^ ^^^
group:    111 111  11 111   111  111 111

By default, most regular expression engines use greedy matching for the basic constructs we've seen so far, and they try the alternatives around | from left to right. This means that the engine takes the first alternative that can match at the current position, capturing as much as that alternative allows. So although the second group above could capture more characters in names such as "Jobs" and "Cloyd", the first three characters of those names were already captured by the first capture group and cannot be captured again by the second. Now let's make a small correction: simply change the order of the capture groups, placing the more specific (longer) group first:

pattern: ([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

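To see the effect of group order outside a regex tester, here is a small sketch using Python's re module (an assumption of this sketch, chosen only for illustration). The helper name tag_surnames is made up; it reports, for each match, which capture group produced it.

```python
import re

TEXT = "Kim Jobs Xu Cloyd Mohr Ngo Rock."

# Same two groups, in the two orders discussed above.
SHORT_FIRST = re.compile(r"([A-Z][a-z]?[a-z]?)|([A-Z][a-z][a-z][a-z]+)")
LONG_FIRST = re.compile(r"([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)")


def tag_surnames(pattern, text):
    """Return (matched text, number of the capture group that matched)."""
    return [(m.group(0), m.lastindex) for m in pattern.finditer(text)]


# With the short group first, long names are cut off after three letters.
print(tag_surnames(SHORT_FIRST, TEXT))
# With the long (more specific) group first, every name lands in the right group.
print(tag_surnames(LONG_FIRST, TEXT))
```

The first call reports only group 1 and truncates "Jobs" to "Job"; the second call matches each surname in full.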

Task... this time only one :)

A "more specific" pattern almost always means a "longer" one. Let's say we want to find two kinds of "words": first those that start with vowels (the more specific kind), then those that don't start with vowels (any other word). Try writing a regular expression to capture and identify strings that match these two groups. (The groups below are lettered rather than numbered. You must determine which group should correspond to the first and which to the second.)

pattern:
string:   pds6f uub 24r2gp ewqrty l ui_op
matches:  ^^^^^ ^^^ ^^^^^^ ^^^^^^ ^ ^^^^^
group:    NNNNN VVV NNNNNN VVVVVV N VVVVV

( Solution )

In general, the more precise your regular expression, the longer it will end up. And the more precise it is, the less likely you are to capture something you don't need. So while they may look scary, longer regexes ~= better regexes. Unfortunately.

Step 13: Curly braces `{}` for a specific number of repetitions

In the example with last names from the previous step, we had 2 almost repeating groups in one pattern:

pattern: ([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

For the first group, we needed last names with four or more letters. The second group had to capture surnames with three or fewer letters. Is there an easier way to write this than repeating these [a-z] groups over and over again? There is, if you use curly braces {}. Curly braces {} allow us to specify the minimum and (optionally) maximum number of repetitions of the previous character or capture group. There are three use cases for {}:

{X} // matches exactly X times
{X,} // matches >= X times
{X,Y} // matches >= X and <= Y times

Here are examples of these three different syntaxes:

pattern: [a-z]{11}
string:   humuhumunukunukuapua'a.
matches:  ^^^^^^^^^^^

pattern: [a-z]{18,}
string:   humuhumunukunukuapua'a.
matches:  ^^^^^^^^^^^^^^^^^^^^

pattern: [a-z]{11,18}
string:   humuhumunukunukuapua'a.
matches:  ^^^^^^^^^^^^^^^^^^

There are several points to note in the above examples. First, with the {X} notation, the previous character or group will match exactly X times. If there are more characters in the "word" that could match the pattern (as shown in the first example), they will not be included in the match. If there are fewer than X characters, the whole match will fail (try changing 11 to 99 in the first example).

Second, the notations {X,} and {X,Y} are greedy. They will try to match as many characters as possible while still satisfying the given regular expression. If you specify {3,7}, then 3 to 7 characters can be matched, and if the next 7 characters are valid, all 7 will be matched. If you specify {1,} and all of the next 14,000 characters match, then all 14,000 of those characters will be included in the match.

How can we use this knowledge to rewrite our expression above? The simplest improvement might be to replace the neighboring [a-z] groups with [a-z]{N}, where N is chosen accordingly:

pattern: ([A-Z][a-z]{2}[a-z]+)|([A-Z][a-z]?[a-z]?)

...but that doesn't make things much better. Look at the first capture group: we have [a-z]{2} (which matches exactly 2 lowercase letters) followed by [a-z]+ (which matches 1 or more lowercase letters). We can simplify this by asking for 3 or more lowercase letters using curly braces:

pattern: ([A-Z][a-z]{3,})|([A-Z][a-z]?[a-z]?)

The second capture group is different. We need no more than three characters in these last names, which means we have an upper limit on the lowercase letters (two, after the initial capital), but our lower limit is zero:

pattern: ([A-Z][a-z]{3,})|([A-Z][a-z]{0,2})

Specificity is always better when using regular expressions, so it would be wise to stop there, but I can't help but notice that these two character ranges ([A-Z] and [a-z]) next to each other look almost like the "word character" class \w ([A-Za-z0-9_]). If we were confident that our data only contained well-formatted last names, then we could simplify our regular expression and write simply:

pattern: (\w{4,})|(\w{1,3})

The first group captures any sequence of 4 or more "word characters" ([A-Za-z0-9_]), and the second group captures any sequence of 1 to 3 "word characters" (inclusive). Will this work?

pattern: (\w{4,})|(\w{1,3})
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

It worked! How about that? And it's much cleaner than our previous example. Since the first capture group matches all surnames with four or more characters, we could even change the second capture group to simply \w+, since this would capture all remaining surnames (with 1, 2, or 3 characters):

pattern: (\w{4,})|(\w+)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

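The three curly-brace forms and the simplified surname regex can be tried out with a short sketch like the one below. It uses Python's re module as an assumed stand-in for whatever engine you work with, and the helper name surname_groups is made up for this example.

```python
import re

FISH = "humuhumunukunukuapua'a."

# {X}: exactly 11 letters; the rest of the word is left unmatched.
exactly_11 = re.findall(r"[a-z]{11}", FISH)
# {X,}: 18 or more letters; greedy, so it takes all 20 letters before the apostrophe.
at_least_18 = re.findall(r"[a-z]{18,}", FISH)
# {X,Y}: between 11 and 18 letters; greedy, so it stops at the upper limit of 18.
from_11_to_18 = re.findall(r"[a-z]{11,18}", FISH)

# The simplified surname regex from this step.
SURNAMES = re.compile(r"(\w{4,})|(\w+)")


def surname_groups(text):
    """Return (surname, capture group number) for each match."""
    return [(m.group(0), m.lastindex) for m in SURNAMES.finditer(text)]


print(exactly_11, at_least_18, from_11_to_18)
print(surname_groups("Kim Jobs Xu Cloyd Mohr Ngo Rock."))
```

Because \w+ sits in the second group, it only ever sees names that the first group (4 or more characters) has already rejected.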

Let's help the brain learn this and solve the following 2 problems:

Use curly braces {} to rewrite the social security number lookup regular expression from step 7:

pattern:
string:   113-25=1902 182-82-0192 H23-_3-9982 1I1-O0-E38B
matches:              ^^^^^^^^^^^

( Solution )

Assume that a website's password strength checker requires user passwords to be between 6 and 12 characters long. Write a regular expression that flags the invalid passwords in the list below. Each password is contained in parentheses () for easy matching, so make sure the regular expression begins and ends with literal ( and ) characters. Hint: make sure you disallow literal parentheses in passwords with [^()] or similar, otherwise you'll end up matching the entire string!

pattern:
string:   (12345) (my password) (Xanadu.2112) (su_do) (OfSalesmen!)
matches:  ^^^^^^^ ^^^^^^^^^^^^^               ^^^^^^^

( Solution )

Step 14: `\b` Zero-Width Word Boundary

The last task was quite difficult. But what if we made it a little more complicated by enclosing passwords in quotes "" instead of parentheses ()? Can we write a similar solution by simply replacing all parenthesis characters with quote characters?

pattern: \"[^"]{0,5}\"|\"[^"]+\s[^"]*\"
string:   "12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"
matches:  ^^^^^^^ ^^^^^^^^^^^^^             ^^^     ^^^

It didn't turn out very impressive. Have you already guessed why? The problem is that we are looking for incorrect passwords here. "Xanadu.2112" is a good password, so when the regex engine realizes that this sequence contains neither spaces nor literal " characters, it gives up just before the " character that closes the password on the right. (Because we specified, using [^"], that " characters cannot appear inside passwords.) Once the engine is satisfied that those characters do not match the regular expression, it starts again exactly where it left off: at the " character that closes "Xanadu.2112" on the right. From there it sees one space character followed by another " character, and to the engine this is an invalid password! Basically, it finds the sequence " " and moves on. This is not at all what we would like to get...

It would be great if we could specify that the first character of the password must not be a space. Is there a way to do this? (By now, you've probably realized that the answer to all my rhetorical questions is "yes.") Yes! There is such a way! Many regular expression engines provide an escape sequence for a "word boundary", \b. The "word boundary" \b is a zero-width escape sequence which, oddly enough, matches a word boundary. Remember that when we say "word", we mean any sequence of characters in the class \w, that is, [A-Za-z0-9_]. A word boundary match means that, of the two characters immediately before and immediately after the \b sequence, one is a word character and the other is not (the start or end of the string also counts as "not a word character"). However, when matching, we do not include either character in our captured substring: \b is zero width. To see how this works, let's look at a small example:

pattern: \b[^ ]+\b
string:   Ve still vant ze money, Lebowski.
matches:  ^^ ^^^^^ ^^^^ ^^ ^^^^^  ^^^^^^^^

The sequence [^ ] should match any character that is not a literal space character. So why doesn't this match the comma , after money or the period . after Lebowski? This is because the comma , and the period . are not word characters, so boundaries are created between word characters and non-word characters. They appear between the y at the end of the word money and the comma , that follows it, and between the i at the end of the word Lebowski and the period . that follows it. The regular expression matches up to the boundaries of these words (but not the non-word characters that only help define them). But what happens if we don't include the \b sequence in our pattern?

pattern: [^ ]+
string:   Ve still vant ze money, Lebowski.
matches:  ^^ ^^^^^ ^^^^ ^^ ^^^^^^ ^^^^^^^^^

Yeah, now we find these punctuation marks too. Now let's use word boundaries to fix the regex for quoted passwords:

pattern: \"\b[^"]{0,5}\b\"|\"\b[^"]+\s[^"]*\b\"
string:   "12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"
matches:  ^^^^^^^ ^^^^^^^^^^^^^               ^^^^^^^

By placing word boundaries inside the quotation marks ("\b ... \b"), we are effectively saying that the first and last characters of matching passwords must be "word characters". So this works fine here, but won't work as well if the first or last character of the user's password is not a word character:

pattern: \"\b[^"]{0,5}\b\"|\"\b[^"]+\s[^"]*\b\"
string:   "thefollowingpasswordistooshort" "C++"
matches:

See how the second password is not marked as "invalid" even though it is clearly too short. You must be careful with \b sequences, since they only match boundaries between \w characters and non-\w characters. In the above example, since we allowed non-\w characters in passwords, the boundary between the " and the first/last character of the password is not guaranteed to be a word boundary \b.
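The behaviour described in this step can be reproduced with a short sketch. It assumes Python's re module (other engines with \b support should behave the same way for these inputs), and the names NAIVE, WITH_BOUNDARIES and PASSWORDS are made up for illustration.

```python
import re

SENTENCE = "Ve still vant ze money, Lebowski."

# With \b the trailing comma and period are left out; without it they are included.
with_b = re.findall(r"\b[^ ]+\b", SENTENCE)
without_b = re.findall(r"[^ ]+", SENTENCE)

# Quoted-password patterns from the text: the naive version, and the \b version.
NAIVE = re.compile(r'"[^"]{0,5}"|"[^"]+\s[^"]*"')
WITH_BOUNDARIES = re.compile(r'"\b[^"]{0,5}\b"|"\b[^"]+\s[^"]*\b"')

PASSWORDS = '"12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"'

# The naive pattern also "finds" the quote-space-quote gaps between passwords.
naive_hits = NAIVE.findall(PASSWORDS)
# Word boundaries reject those gaps, because a space is not a word character.
boundary_hits = WITH_BOUNDARIES.findall(PASSWORDS)
# ...but they also miss "C++", whose last character is not a word character.
missed = WITH_BOUNDARIES.findall('"thefollowingpasswordistooshort" "C++"')

print(with_b, without_b, naive_hits, boundary_hits, missed, sep="\n")
```

The empty list for the last input is the failure mode discussed above: "C++" is too short, but \b cannot match between + and the closing quote.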

To complete this step, we will solve only one simple problem:

Word boundaries are useful in syntax highlighting engines when we want to match a specific sequence of characters, but want to make sure they only occur at the beginning or end of a word (or on their own). Let's say we're writing syntax highlighting and want to highlight the word var, but only when it appears on its own (without touching other characters in the word). Can you write a regular expression for this? Of course you can, it's a very simple task ;)

pattern:
string:   var varx _var ( var j) barvarcar * var var -> { var }
matches:  ^^^             ^^^                ^^^ ^^^      ^^^

( Solution )

Step 15: "caret" `^` as "beginning of line" and dollar sign `$` as "end of line"

The word boundary sequence \b (from the previous step) is not the only special zero-width sequence available for use in regular expressions. The two most popular ones are the "caret" ^ for "beginning of line" and the dollar sign $ for "end of line". Including one of these in your regular expression means that the match must appear at the beginning or end of the source string:

pattern: ^start|end$
string:   start end start end start end start end
matches:  ^^^^^                               ^^^

If your string contains line breaks (and, in most engines, multiline mode is enabled), ^start will match the sequence "start" at the beginning of any line, and end$ will match the sequence "end" at the end of any line (though this is difficult to show here). These symbols are especially useful when working with data that contains delimiters. Let's go back to the "file size" problem from step 9 using ^ as "start of line". In this example, our file sizes are separated by spaces " ". So we want each file size to start with a number, preceded by a space character or the start of a line:

pattern: (^| )(\d+|\d+\.\d+)[KMGT]B
string:   6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB.
matches:  ^^^^^       ^^^^^   ^^^^^^          ^^^^
group:    222         122     1222            12

We are already so close to the goal! But you may notice that we still have one small problem: we are matching the space character before the valid file size. We can simply ignore this capturing group (1) when our regular expression engine finds it, or we can use a non-capturing group, which we will see in the next step.
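Here is a small sketch of both anchors and of the file-size regex, again assuming Python's re module. In Python, ^ and $ apply to the whole string unless the re.MULTILINE flag is passed; other engines have an equivalent switch. The helper name file_sizes is made up for this example.

```python
import re

ANCHORED = r"^start|end$"
ONE_LINE = "start end start end start end start end"
TWO_LINES = "start end start end\nstart end start end"

# Default mode: only the very first "start" and the very last "end" match.
whole_string = re.findall(ANCHORED, TWO_LINES)
# Multiline mode: ^ and $ also match at the start and end of every line.
per_line = re.findall(ANCHORED, TWO_LINES, re.MULTILINE)

FILE_SIZE = re.compile(r"(^| )(\d+|\d+\.\d+)[KMGT]B")


def file_sizes(text):
    """Return (group 1, group 2, full match) for each file size found."""
    return [(m.group(1), m.group(2), m.group(0)) for m in FILE_SIZE.finditer(text)]


print(re.findall(ANCHORED, ONE_LINE))
print(whole_string, per_line)
print(file_sizes("6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB."))
```

Group 1 holds either an empty string (start of the string) or the leading space, which is the unwanted character mentioned above; group 2 holds just the number.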

In the meantime, let's solve 2 more problems to stay in shape:

Continuing with our syntax highlighting example from the last step, some syntax highlighting will mark trailing spaces, that is, any spaces that come between a non-whitespace character and the end of the line. Can you write a regex to highlight only trailing spaces?

pattern:
string: myvec <- c(1, 2, 3, 4, 5)  
matches:                          ^^^^^^^

( Solution )

A simple comma-separated value (CSV) parser will look for "tokens" separated by commas. Generally, whitespace has no meaning unless it is enclosed in quotation marks "". Write a simple CSV parsing regular expression that matches tokens between commas, but ignores (does not capture) whitespace that is not between quotes.

pattern:
string:   a, "b", "c d",e,f, "g h", dfgi,, k, "", l
matches:  ^^ ^^^^ ^^^^^^^^^^ ^^^^^^ ^^^^^^ ^^ ^^^ ^
group:    21 2221 2222212121 222221 222211 21 221 2

( Solution )

Continued in RegEx: 20 short steps to master regular expressions. Part 4.
