

8 August 2023

RegEx: 20 short steps for mastering regular expressions. Part 3

This is the third part of the series. The earlier parts are RegEx: 20 short steps for mastering regular expressions. Part 1 and RegEx: 20 Short Steps to Master Regular Expressions. Part 2. In this part we move on to slightly more complicated things, but, as before, mastering them will not be difficult. To repeat: RegEx is actually easier than it might seem at first, and you don't need to be a genius to master it and start using it in practice. The English original of this article is here.


Step 11: Parentheses `()` as Capturing Groups


In the last task, we were looking for different kinds of integer and floating-point numeric values. But the regex engine didn't differentiate between these two types of values, because everything was captured by one big regex. We can tell the regex engine to distinguish between different kinds of matches by enclosing our mini-patterns in parentheses:

pattern:  ([A-Z])|([a-z])
string:   The current President of Bolivia is Evo Morales.
matches:  ^^^ ^^^^^^^ ^^^^^^^^^ ^^ ^^^^^^^ ^^ ^^^ ^^^^^^^
group:    122 2222222 122222222 22 1222222 22 122 1222222

(Example)

The above regular expression defines two capturing groups, which are indexed starting from 1. The first capturing group matches any single uppercase letter, and the second capturing group matches any single lowercase letter. By using the 'or' sign | and parentheses () as capturing groups, we can define a single regular expression that matches multiple kinds of strings. If we apply this to our long / float regex from the previous part of the article, the regex engine will capture each match in the appropriate group. By checking which group a substring belongs to, we can immediately determine whether it is a float value or a long value:

pattern:  (\d*\.\d+[fF]|\d+\.\d*[fF]|\d+[fF])|(\d+[lL])
string:   42L 12 x 3.4f 6l 3.3 0F LF .2F 0.
matches:  ^^^      ^^^^ ^^     ^^    ^^^
group:    222      1111 22     11    111

(Example)

This regular expression is quite complex, so to understand it better, let's break it down and look at each of these patterns:

( // matches any "float" substring
  \d*\.\d+[fF]
  |
  \d+\.\d*[fF]
  |
  \d+[fF]
)
| //OR
( // matches any "long" substring
  \d+[lL]
)

The | sign and parenthesized capturing groups () allow us to match different types of substrings. In this case, we are matching either "float" floating-point numbers or "long" long integers.

(
  \d*\.\d+[fF] // 1+ digits to the right of the decimal point
  |
  \d+\.\d*[fF] // 1+ digits to the left of the decimal point
  |
  \d+[fF] // no dot, only 1+ digits
)
|
(
  \d+[lL] // no dot, only 1+ digits
)

In the "float" capturing group, we have three options: numbers with at least one digit to the right of the decimal point, numbers with at least one digit to the left of the decimal point, and numbers without a decimal point. Any of them is a "float", provided that the letter "f" or "F" is added to the end. Inside the "long" capturing group, we have only one option: 1 or more digits followed by an "l" or "L" character. The regex engine will look for these substrings in the given string and index them into the appropriate capturing group.

Note that we don't match any of the numbers that have none of "l", "L", "f", or "F" appended. How should these numbers be classified? Well, if they have a decimal point, the Java language defaults to "double". Otherwise they must be "int".
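To see how the group check might look in code, here is a minimal Java sketch. The class name `LiteralClassifier` and the `classify` helper are made up for illustration; only `java.util.regex` comes from the standard library. It relies on the fact that `Matcher.group(n)` returns `null` when group n did not take part in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralClassifier {
    // Group 1: "float" literals, group 2: "long" literals (the pattern from the text).
    static final Pattern LITERAL = Pattern.compile(
            "(\\d*\\.\\d+[fF]|\\d+\\.\\d*[fF]|\\d+[fF])|(\\d+[lL])");

    // Returns entries such as "float:3.4f" for every match found in the input.
    static List<String> classify(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = LITERAL.matcher(input);
        while (m.find()) {
            // group(1) is null when the match came from the second alternative
            String kind = (m.group(1) != null) ? "float" : "long";
            result.add(kind + ":" + m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [long:42L, float:3.4f, long:6l, float:0F, float:.2F]
        System.out.println(classify("42L 12 x 3.4f 6l 3.3 0F LF .2F 0."));
    }
}
```

Tokens such as 12, 3.3 and 0. are skipped here, exactly as in the example above, because they carry no suffix.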

Let's reinforce what we've learned with a couple of tasks:

Add two more capturing groups to the above regular expression so that it also classifies doubles or ints. (This is another tricky question, don't be discouraged if it takes a while, see my solution as a last resort.)

pattern:
string:   42L 12 x 3.4f 6l 3.3 0F LF .2F 0.
matches:  ^^^ ^^   ^^^^ ^^ ^^^ ^^    ^^^ ^^
group:    333 44   1111 33 222 11    111 22

(Solution)

The next problem is a bit easier. Use parenthesized capturing groups (), the 'or' sign | and character ranges to sort the following ages into "legal to drink in the US" (>= 21) and "not allowed to drink in the US" (< 21):

pattern:
string:   7 10 17 18 19 20 21 22 23 24 30 40 100 120
matches:  ^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^^ ^^^
group:    2 22 22 22 22 22 11 11 11 11 11 11 111 111

(Solution)

Step 12: Identify More Specific Matches First


You may have had some trouble with the last task if you tried to define "legal drinkers" as the second capturing group rather than the first. To understand why, let's look at another example. Suppose we want to capture separately last names containing fewer than 4 characters and last names containing 4 or more characters. Let's assign the shorter names to the first capturing group and see what happens:

pattern:  ([A-Z][a-z]?[a-z]?)|([A-Z][a-z][a-z][a-z]+)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^  ^^ ^^^   ^^^  ^^^ ^^^
group:    111 111  11 111   111  111 111

(Example)

By default, most regex engines use greedy matching with the basic characters we've seen so far, and they try the alternatives separated by | from left to right. This means that the regex engine accepts the first alternative that matches at the current position and captures as many characters as that alternative allows. So although the second group above could capture more characters in names such as "Jobs" and "Cloyd", the first three characters of those names have already been captured by the first capturing group, so they cannot be captured again by the second. Now let's make a small fix: just change the order of the capturing groups, putting the more specific (longer) group first:

pattern:  ([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

(Example)
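If you want to watch the effect of the ordering yourself, the two patterns can be compared side by side. This is only a sketch: the class `AlternationOrder` and its `tag` helper are hypothetical names, and the output format ("group:text") is chosen just for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Collects "groupNumber:text" for each match of a two-group pattern.
    static List<String> tag(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            int group = (m.group(1) != null) ? 1 : 2;
            out.add(group + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String names = "Kim Jobs Xu Cloyd Mohr Ngo Rock.";
        String shortFirst = "([A-Z][a-z]?[a-z]?)|([A-Z][a-z][a-z][a-z]+)";
        String longFirst = "([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)";
        // prints [1:Kim, 1:Job, 1:Xu, 1:Clo, 1:Moh, 1:Ngo, 1:Roc]
        System.out.println(tag(shortFirst, names));
        // prints [2:Kim, 1:Jobs, 2:Xu, 1:Cloyd, 1:Mohr, 2:Ngo, 1:Rock]
        System.out.println(tag(longFirst, names));
    }
}
```

With the short group first, the leftover letters ("s", "yd" and so on) are never matched at all, because neither alternative can start with a lowercase letter.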

Task... only one this time :)

A "more specific" pattern almost always means a "longer" one. Suppose we want to find two kinds of "words": first those that start with a vowel (the more specific kind), then those that don't start with a vowel (any other word). Try writing a regular expression to capture and identify strings that match those two groups. (The groups below are lettered, not numbered. You must determine which group should be matched first and which second.)

pattern:
string:   pds6f uub 24r2gp ewqrty l ui_op
matches:  ^^^^^ ^^^ ^^^^^^ ^^^^^^ ^ ^^^^^
group:    NNNNN VVV NNNNNN VVVVVV N VVVVV

(Solution)

In general, the more precise your regular expression is, the longer it will end up. And the more precise it is, the less likely you are to grab something you don't need. So while they may look intimidating, longer regexes ~= better regexes. Unfortunately.

Step 13: Curly Braces `{}` for a Specific Number of Repetitions


In the last-name example from the previous step, we had 2 almost identical groups in the same pattern:

pattern:  ([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

For the first group, we needed last names with four or more letters. The second group was meant to capture last names with three letters or fewer. Is there an easier way to write this than repeating these [a-z] groups over and over again? There is, if you use curly braces {}. Curly braces {} allow us to specify a minimum and (optionally) a maximum number of repetitions of the preceding character or capturing group. There are three ways to use {}:

{X} // matches exactly X times
{X,} // matches >= X times
{X,Y} // matches >= X and <= Y times
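As a quick sanity check, the three forms can be tried in Java on a deliberately simple string. This is an illustrative sketch only; `QuantifierForms` and `firstMatch` are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierForms {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "abcdefgh";
        System.out.println(firstMatch("[a-z]{3}", s));   // abc       (exactly 3)
        System.out.println(firstMatch("[a-z]{3,}", s));  // abcdefgh  (3 or more, greedy)
        System.out.println(firstMatch("[a-z]{3,5}", s)); // abcde     (3 to 5, greedy)
        System.out.println(firstMatch("[a-z]{9}", s));   // null      (only 8 letters available)
    }
}
```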

Here are examples of these three different syntaxes:

pattern:  [a-z]{11}
string:   humuhumunukunukuapua'a.
matches:  ^^^^^^^^^^^

(Example)

pattern:  [a-z]{18,}
string:   humuhumunukunukuapua'a.
matches:  ^^^^^^^^^^^^^^^^^^^^

(Example)

pattern:  [a-z]{11,18}
string:   humuhumunukunukuapua'a.
matches:  ^^^^^^^^^^^^^^^^^^

(Example)

There are several points worth noting in the examples above. First, with the {X} notation, the preceding character or group is matched exactly that number (X) of times. If the "word" contains more characters (than the number X) that could match the pattern (as in the first example), they will not be included in the match. If the number of characters is less than X, then the whole match will fail (try changing 11 to 99 in the first example).

Second, the {X,} and {X,Y} notations are greedy. They will try to match as many characters as possible while still satisfying the given regular expression. If you specify {3,7}, then 3 to 7 characters can be matched, and if the next 7 characters are valid, then all 7 characters will be matched. If you specify {1,} and the next 14,000 characters all match, then all 14,000 of those characters will be included in the match.

How can we use this knowledge to rewrite our expression above? The simplest improvement might be to replace adjacent [a-z] groups with [a-z]{N}, where N is chosen appropriately:

pattern:  ([A-Z][a-z]{2}[a-z]+)|([A-Z][a-z]?[a-z]?)

... but that doesn't make things much better. Look at the first capturing group: we have [a-z]{2} (which matches exactly 2 lowercase letters) followed by [a-z]+ (which matches 1 or more lowercase letters). We can simplify this by asking for 3 or more lowercase letters using curly braces:

pattern:  ([A-Z][a-z]{3,})|([A-Z][a-z]?[a-z]?)

The second capture group is different. We need at most three characters in these last names, which means we have an upper limit, but our lower limit is zero:

pattern:  ([A-Z][a-z]{3,})|([A-Z][a-z]{0,2})

Specificity is always better when using regular expressions, so it would be wise to stop there, but I can't help noticing that these two character ranges ([A-Z] and [a-z]) next to each other look almost like the "word character" class, \w ([A-Za-z0-9_]). If we are sure that our data only contains well-formatted last names, then we could simplify our regular expression and write simply:

pattern:  (\w{4,})|(\w{1,3})

The first group captures any sequence of 4 or more "word characters" ([A-Za-z0-9_]), and the second group captures any sequence of 1 to 3 "word characters" (inclusive). Will it work?

pattern:  (\w{4,})|(\w{1,3})
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

(Example)

It worked! And this approach is much cleaner than our previous example. Since the first capturing group matches all last names with four or more characters, we could even change the second capturing group to just \w+, as this would capture all the remaining last names (with 1, 2, or 3 characters):

pattern:  (\w{4,})|(\w+)
string:   Kim Jobs Xu Cloyd Mohr Ngo Rock.
matches:  ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:    222 1111 22 11111 1111 222 1111

(Example)
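Here is a small Java sketch of that final pattern in use. The class `SurnameLength` and the "long"/"short" labels are invented for illustration. The second call is a reminder of the trade-off mentioned above: \w also accepts digits and underscores, so the pattern is only safe on well-formatted data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SurnameLength {
    // Group 1: four or more word characters; group 2: whatever shorter words remain.
    static final Pattern SURNAME = Pattern.compile("(\\w{4,})|(\\w+)");

    // Labels each word as "long" (group 1) or "short" (group 2).
    static List<String> tag(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = SURNAME.matcher(input);
        while (m.find()) {
            out.add((m.group(1) != null ? "long:" : "short:") + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [short:Kim, long:Jobs, short:Xu, long:Cloyd, long:Mohr, short:Ngo, long:Rock]
        System.out.println(tag("Kim Jobs Xu Cloyd Mohr Ngo Rock."));
        // \w is broader than letters: prints [short:x_1, long:R2_D2]
        System.out.println(tag("x_1 R2_D2"));
    }
}
```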

Let's help this sink in by solving the following 2 tasks:

Use curly braces {} to rewrite the regular expression for finding a social security number from step 7:

pattern:
string:   113-25=1902 182-82-0192 H23-_3-9982 1I1-O0-E38B
matches:              ^^^^^^^^^^^

(Solution)

Assume that the password strength system of a website requires user passwords to be between 6 and 12 characters long. Write a regular expression that flags the bad passwords in the list below. Each password is enclosed in parentheses () for easy matching, so make sure the regular expression starts and ends with the literal ( and ) symbols. Hint: make sure you forbid literal parentheses inside passwords with [^()] or similar, otherwise you'll end up matching the entire string!

pattern:
string:   (12345) (my password) (Xanadu.2112) (su_do) (OfSalesmen!)
matches:  ^^^^^^^ ^^^^^^^^^^^^^               ^^^^^^^

(Solution)

Step 14: `\b` Zero-Width Word Boundary Character


The last task was quite difficult. But what if we make it a little more complicated by enclosing passwords in quotation marks "" instead of parentheses ()? Can we write a similar solution by simply replacing all parentheses with quotation marks?

pattern:  \"[^"]{0,5}\"|\"[^"]+\s[^"]*\"
string:   "12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"
matches:  ^^^^^^^ ^^^^^^^^^^^^^             ^^^     ^^^

(Example)

The result is not very impressive. Have you already guessed why? The problem is that we are looking for bad passwords here. "Xanadu.2112" is a good password, so when the regex engine realizes that this sequence contains no whitespace and no literal " characters, it gives up just before the " character that closes the password on the right. (That is because we specified, using [^"], that " characters cannot appear inside passwords.) Once the regex engine has established that those characters do not match the regular expression, it starts again exactly where it left off: at the " character that closes "Xanadu.2112" on the right. From there it sees a ", a single space character and another ", and to the engine this is a bad password! So it matches the sequence " " and moves on. This is not at all what we want.

It would be great if we could specify that the first character of the password must not be a space. Is there a way to do this? (By now, you've probably figured out that the answer to all my rhetorical questions is "yes".) Yes, there is! Many regular expression engines provide an escape sequence called the "word boundary", \b. The "word boundary" \b is a zero-width escape sequence that, oddly enough, matches a word boundary. Remember that when we say "word" we mean any sequence of characters in the class \w, that is, [A-Za-z0-9_]. A word boundary match means that the character immediately before or immediately after the \b must not be a word character, while the character on the other side must be one. However, when matching, we do not include this character in our captured substring. That is what zero width means. To see how this works, let's look at a small example:

pattern:  \b[^ ]+\b
string:   Ve still vant ze money, Lebowski.
matches:  ^^ ^^^^^ ^^^^ ^^ ^^^^^  ^^^^^^^^

(Example)

The sequence [^ ] should match any character that is not a literal space character. So why doesn't it match the comma , after money or the period . after Lebowski? This is because the comma , and the period . are not word characters, so boundaries are created between word characters and non-word characters. They appear between the y at the end of the word money and the comma , that follows it, and between the i at the end of the word Lebowski and the period . (full stop) that follows it. The regular expression matches up to the boundaries of these words (but not the non-word characters, which only help to define them). But what happens if we leave the \b sequences out of our pattern?

pattern:  [^ ]+
string:   Ve still vant ze money, Lebowski.
matches:  ^^ ^^^^^ ^^^^ ^^ ^^^^^^ ^^^^^^^^^

(Example)

Yep, now we match those punctuation marks too. Now let's use word boundaries to fix the regular expression for quoted passwords:

pattern:  \"\b[^"]{0,5}\b\"|\"\b[^"]+\s[^"]*\b\"
string:   "12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"
matches:  ^^^^^^^ ^^^^^^^^^^^^^               ^^^^^^^

(Example)

By placing word boundaries inside the quotes ("\b ... \b"), we are effectively saying that the first and last characters of matching passwords must be "word characters". So this works fine here, but it won't work as well if the first or last character of the user's password is not a word character:

pattern:  \"\b[^"]{0,5}\b\"|\"\b[^"]+\s[^"]*\b\"
string:   "thefollowingpasswordistooshort" "C++"
matches:

(Example)

Notice that the second password isn't marked as "invalid", even though it's clearly too short. You must be careful with \b sequences, since they only match boundaries between \w characters and non-\w characters. In the example above, since we allowed non-\w characters in passwords, the boundary between the " and the first/last character of the password is not guaranteed to be a word boundary \b.
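The same pitfall is easy to reproduce in Java. The sketch below uses the quoted-password pattern from this step; the class `BoundaryPitfall` and the `flagged` helper are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPitfall {
    // Flags quoted passwords that are too short (0-5 chars) or contain whitespace.
    // The \b right after the opening quote and right before the closing quote
    // requires the first and last password characters to be word characters.
    static final Pattern BAD = Pattern.compile(
            "\"\\b[^\"]{0,5}\\b\"|\"\\b[^\"]+\\s[^\"]*\\b\"");

    // Returns every quoted password (quotes included) that the pattern flags.
    static List<String> flagged(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = BAD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints ["12345", "my password", "su_do"]
        System.out.println(flagged(
                "\"12345\" \"my password\" \"Xanadu.2112\" \"su_do\" \"OfSalesmen!\""));
        // prints [] because there is no word boundary between + and the closing quote
        System.out.println(flagged("\"C++\""));
    }
}
```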

At the end of this step, we will solve only one simple task:

Word boundaries are useful in syntax highlighting engines when we want to match a specific sequence of characters, but want to make sure it only occurs at the beginning or end of a word (or on its own). Let's say we're writing a syntax highlighter and we want to highlight the word var, but only when it appears on its own (without touching other word characters). Can you write a regular expression for this? Of course you can, it's a very simple task ;)

pattern:
string:   var varx _var (var j) barvarcar *var var-> {var}
matches:  ^^^            ^^^               ^^^ ^^^    ^^^

(Solution)

Step 15: "caret" `^` as "beginning of line" and dollar sign `$` as "end of line"


The word boundary sequence \b (from the previous step) is not the only zero-width special sequence available in regular expressions. The two most popular ones are the "caret" ^ for "beginning of line" and the dollar sign $ for "end of line". Including one of these in your regular expression means that the match must appear at the beginning or end of the source string:

pattern:  ^start|end$
string:   start end start end start end start end
matches:  ^^^^^                               ^^^

(Example)

If your string contains line breaks, ^start will match "start" at the beginning of any line, and end$ will match "end" at the end of any line (though this is hard to show here; note that many engines, Java included, only behave this way when multiline mode is enabled). These characters are especially useful when working with data that contains delimiters. Let's go back to the "file size" problem from step 9, this time using ^ as "beginning of line". In this example, our file sizes are separated by spaces " ". So we want each file size to start with a digit preceded by either a space character or the beginning of the line:

pattern:  (^| )(\d+|\d+\.\d+)[KMGT]B
string:   6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB.
matches:  ^^^^^       ^^^^^   ^^^^^^          ^^^^
group:    222         122     1222            12

(Example)

We are already so close to the goal! But you may notice that we still have one small problem: we are matching the space character before each valid file size. For now we can simply ignore this capturing group (1) when our regex engine finds it, or we can use a non-capturing group, which we will see in the next step.
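Both ideas from this step can be sketched in Java. In the example below, `AnchorDemo`, `count` and `fileSizes` are hypothetical names; `Pattern.MULTILINE` is the standard flag that makes ^ and $ match at line breaks. The `fileSizes` helper works around the leading space by starting the extracted substring at `m.start(2)`, that is, at the beginning of group 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts how many times the pattern matches in the input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Extracts file sizes, skipping the leading space captured by group 1.
    static List<String> fileSizes(String input) {
        Matcher m = Pattern.compile("(^| )(\\d+|\\d+\\.\\d+)[KMGT]B").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // start(2) is where the number begins; end() is where the whole match ends
            out.add(input.substring(m.start(2), m.end()));
        }
        return out;
    }

    public static void main(String[] args) {
        String twoLines = "start end\nstart end";
        // Default mode: ^ and $ only apply to the whole input, so this prints 2
        System.out.println(count(Pattern.compile("^start|end$"), twoLines));
        // Multiline mode: ^ and $ also match at each line break, so this prints 4
        System.out.println(count(Pattern.compile("^start|end$", Pattern.MULTILINE), twoLines));
        // prints [6.6KB, 12KB, 3.3MB, 9MB]
        System.out.println(fileSizes("6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB."));
    }
}
```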

In the meantime, let's solve 2 more tasks to stay in shape:

Continuing with our syntax highlighting example from the last step: some syntax highlighters mark trailing spaces, that is, any spaces between a non-whitespace character and the end of a line. Can you write a regex that highlights only trailing spaces?

pattern:
string:   myvec <- c(1, 2, 3, 4, 5)       
matches:                           ^^^^^^^

(Solution)

A simple comma-separated value (CSV) parser will look for "tokens" separated by commas. Generally, whitespace has no meaning unless it is enclosed in quotation marks "". Write a simple CSV-parsing regex that matches tokens between commas but ignores (does not capture) whitespace that is not between quotes.

pattern:
string:   a, "b", "c d",e,f, "g h", dfgi,, k, "", l
matches:  ^^ ^^^^ ^^^^^^^^^^ ^^^^^^ ^^^^^^ ^^ ^^^ ^
group:    21 2221 2222212121 222221 222211 21 221 2

(Solution)

Continued in RegEx: 20 Short Steps to Master Regular Expressions. Part 4
