Artur
Level 40
Tallinn

RegEx: 20 short steps to master regular expressions. Part 2

Published in the Random EN group
Previous part: RegEx: 20 short steps to master regular expressions. Part 1
In the last part we mastered the simplest regular expressions and have already learned a thing or two. In this part we will study slightly more complex constructs, but believe me, they are not as difficult as they might seem. So let's continue!

Step 8: Star * and Plus Sign +

So far, we have more or less only been able to match strings of a given length. But in the latest problems we've approached the limit of what we can do with the notation we've seen so far. Let's assume, for example, that we are not limited to 3-character Java identifiers, but that we can have identifiers of any length. A solution that may have worked in the previous example will not work in the following example:
pattern: [a-zA-Z_$]\w\w
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^         ^^^  ^^^
Note that when an identifier is valid but longer than 3 characters, only the first three characters are matched. And when the identifier is valid but contains fewer than 3 characters, the regex does not find it at all! The problem is that bracket expressions [] match exactly one character, as do character classes such as \w. This means that any match of the above regular expression must be exactly three characters long. So it doesn't work as well as we might have hoped.

The special characters * and + can help here. These are modifiers that can be added to the right of any expression to match that expression more than once. The Kleene star (or "asterisk") * indicates that the preceding token may be matched any number of times, including zero. The plus sign + indicates that the preceding token must be matched one or more times. Thus, the expression that precedes + is mandatory (at least once), while the expression that precedes * is optional, but when it appears, it can appear any number of times. Now, with this knowledge, we can correct the above regular expression:
pattern: [a-zA-Z_$]\w*
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^^     ^^ ^^^^ ^^^^^ ^^ ^
Now we match valid identifiers of any length! Bingo! But what would happen if we used + instead of *?
pattern: [a-zA-Z_$]\w+
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^^     ^^ ^^^^ ^^^^^ ^^
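To reproduce the three identifier patterns shown so far outside a regex tester, here is a minimal sketch. It assumes Python's standard `re` module (the article does not name a regex engine), and the variable names `fixed`, `star` and `plus` are purely illustrative.

```python
import re

text = "__e $123 3.2 fo Barr a23mm ab x"

# Exactly three characters: a valid first character plus two word characters.
fixed = re.findall(r"[a-zA-Z_$]\w\w", text)

# A valid first character followed by zero or more word characters.
star = re.findall(r"[a-zA-Z_$]\w*", text)

# A valid first character followed by one or more word characters.
plus = re.findall(r"[a-zA-Z_$]\w+", text)

print(fixed)  # ['__e', '$12', 'Bar', 'a23']
print(star)   # ['__e', '$123', 'fo', 'Barr', 'a23mm', 'ab', 'x']
print(plus)   # ['__e', '$123', 'fo', 'Barr', 'a23mm', 'ab']
```

The only difference between the last two results is the single-character identifier x.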
We missed the last match, x. This is because + requires at least one character to match, but the bracket expression [] preceding \w+ has already 'eaten' the character x, so there are no more characters available and the match fails.

When should we use +? When we need at least one match, but do not care how many more there are. For example, if we want to find any numbers containing a decimal point:
pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^
Note that by making the digits to the left of the decimal point optional, we were able to find both 0.011 and .2. To do this, we needed to match exactly one decimal point with \. and at least one digit to the right of the decimal point with \d+. The above regular expression will not match a number like 3., because it requires at least one digit to the right of the decimal point.
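The decimal-number pattern can be checked the same way. This is a sketch assuming Python's `re` module; the test string is the one from the example with `3.` appended (my addition) to show the case the text says is not matched.

```python
import re

# The article's sample string, plus "3." to show a number with no digits after the point.
text = "0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 3."

# \d*  optional digits before the point
# \.   exactly one literal point
# \d+  at least one digit after the point
decimals = re.findall(r"\d*\.\d+", text)

print(decimals)  # ['0.011', '.2', '2.0', '3.33', '4.000', '7.89012']
```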

As usual, let’s solve a couple of simple problems:

Find all the English words in the passage below.
pattern:
string:  3 plus 3 is six but 4 plus three is 7
matches:   ^^^^   ^^ ^^^ ^^^   ^^^^ ^^^^^ ^^

Find all file size designations in the list below. A file size consists of a number (with or without a decimal point) followed by KB, MB, GB or TB:
pattern:
string:  11TB 13 14.4MB 22HB 9.9GB TB 0KB
matches: ^^^^    ^^^^^^      ^^^^^    ^^^
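To check your own answers to problems like these, a small helper can compare what a candidate pattern finds against the expected matches. This is only a sketch: `check` is a hypothetical helper name, Python's `re` module is an assumption, and the demonstration uses a pattern already shown in this step rather than a solution to the problems above.

```python
import re

def check(pattern, text, expected):
    """Return True if the pattern finds exactly the expected matches, in order."""
    # finditer + group(0) always yields the whole match, even if the pattern has groups.
    found = [m.group(0) for m in re.finditer(pattern, text)]
    if found != expected:
        print(f"expected {expected}, got {found}")
    return found == expected

identifiers = "__e $123 3.2 fo Barr a23mm ab x"
wanted = ["__e", "$123", "fo", "Barr", "a23mm", "ab", "x"]

print(check(r"[a-zA-Z_$]\w*", identifiers, wanted))  # True
print(check(r"[a-zA-Z_$]\w+", identifiers, wanted))  # False (x is missed)
```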

Step 9: The "Optional" Question Mark ?

Have you already written a regex to solve the last problem? Did it work? Now try applying it here:
pattern:
string: 1..3KB 5...GB ..6TB
matches:  
Obviously, none of these designations is a valid file size, so a good regular expression should not match any of them. The solution I wrote for the last problem matches all of them, at least in part:
pattern: \d+\.*\d*[KMGT]B
string:  1..3KB 5...GB ..6TB
matches: ^^^^^^ ^^^^^^   ^^^
So what's the problem? In fact, we only need to find one decimal point, if there is one. But * allows any number of matches, including zero. Is there a way to match only zero times or once, but no more? Of course there is. The "optional" ? is a modifier that matches the preceding token zero times or once, but no more:
pattern: \d+\.?\d*[KMGT]B
string:  1..3KB 5...GB ..6TB
matches:    ^^^          ^^^
We are closer to a solution here, but this is not quite what we need. We'll see how to fix this a few steps later.
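The difference between \.* and \.? in the two file-size patterns above can be seen side by side in a short sketch. It assumes Python's `re` module, and the names `any_points` and `one_point` are illustrative.

```python
import re

text = "1..3KB 5...GB ..6TB"

# \.* accepts any number of points between the digits, including none.
any_points = re.findall(r"\d+\.*\d*[KMGT]B", text)

# \.? accepts at most one point, so the engine falls back to shorter matches.
one_point = re.findall(r"\d+\.?\d*[KMGT]B", text)

print(any_points)  # ['1..3KB', '5...GB', '6TB']
print(one_point)   # ['3KB', '6TB']
```

With ?, the invalid strings are no longer matched in full, but fragments such as 3KB and 6TB still slip through, which is why the text calls this only a step closer.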

In the meantime, let's solve this problem:

In some programming languages (e.g. Java), integer and floating point numbers may be followed by l/L and f/F to indicate that they should be treated as long/float (respectively) rather than as regular int/double. Find all valid "long" numbers in the line below:
pattern:
string:  13L long 2l 19 L lL 0
matches: ^^^      ^^ ^^      ^

Step 10: The "Or" Sign |

In step 8, we had some difficulty finding the different types of floating point numbers:
pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^
The above pattern matches numbers with a decimal point and at least one digit to the right of the decimal point. But what if we also want to match strings like 0. (with no digits to the right of the decimal point)? We could write a regular expression like this:
pattern: \d*\.\d*
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^ ^
This matches 0., but it also matches a lone point ., as you can see above. What we are actually trying to match are two different classes of strings:
  1. numbers with at least one digit to the right of the decimal point
  2. numbers with at least one digit to the left of the decimal point
Let's write the following 2 regular expressions, independent of each other:
pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^
pattern: \d+\.\d*
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^       ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^
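The three decimal patterns in this step can be run against the same string to see exactly what each one picks up. This is a sketch assuming Python's `re` module; the variable names are illustrative.

```python
import re

text = "0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. ."

# Digits optional on both sides: also accepts a lone point.
both_optional = re.findall(r"\d*\.\d*", text)

# At least one digit to the right of the point.
digits_right = re.findall(r"\d*\.\d+", text)

# At least one digit to the left of the point.
digits_left = re.findall(r"\d+\.\d*", text)

print(both_optional)  # ['0.011', '.2', '2.0', '3.33', '4.000', '7.89012', '0.', '.']
print(digits_right)   # ['0.011', '.2', '2.0', '3.33', '4.000', '7.89012']
print(digits_left)    # ['0.011', '2.0', '3.33', '4.000', '7.89012', '0.']
```

Each of the two stricter patterns rejects the lone point, but each also misses one form that the other finds: .2 and 0. respectively.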
We see that neither pattern matches the substrings 42, 5, 6 or the lone point. To get the result we want, we need to combine these two regular expressions. How can we achieve this? The "or" sign | allows us to specify several alternative sequences in a single regular expression. Just as square brackets [] let us specify alternative single characters, | lets us specify alternative multi-character expressions. For example, if we wanted to find "dog" or "cat", we could try something like this:
pattern: \w\w\w
string:  Obviously, a dog is a better pet than a cat.
matches: ^^^^^^^^^    ^^^      ^^^^^^ ^^^ ^^^    ^^^
... but this matches every three-character sequence of "word" characters. And "dog" and "cat" don't even have letters in common, so square brackets won't help us here. Here's the simplest regular expression that matches these two words and nothing else:
pattern: dog|cat
string:  Obviously, a dog is a better pet than a cat.
matches:              ^^^                        ^^^
The regular expression engine first tries to match the entire sequence to the left of the | character; if that fails, it tries to match the sequence to the right of the |. Multiple | characters can be chained to match more than two alternative sequences:
pattern: dog|cat|pet
string:  Obviously, a dog is a better pet than a cat.
matches:              ^^^             ^^^        ^^^
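The alternation examples above can be tried out with a short sketch, assuming Python's `re` module. The last two lines use the word "category", which is not from the article; it is my own illustration of the left-to-right order in which a backtracking engine such as Python's tries the alternatives.

```python
import re

text = "Obviously, a dog is a better pet than a cat."

triples = re.findall(r"\w\w\w", text)       # every run of three word characters
two = re.findall(r"dog|cat", text)          # either of two words
three = re.findall(r"dog|cat|pet", text)    # chained alternatives

print(triples)  # ['Obv', 'iou', 'sly', 'dog', 'bet', 'ter', 'pet', 'tha', 'cat']
print(two)      # ['dog', 'cat']
print(three)    # ['dog', 'pet', 'cat']

# Alternatives are tried left to right, so the first one that succeeds wins.
short_first = re.findall(r"cat|category", "category")
long_first = re.findall(r"category|cat", "category")

print(short_first)  # ['cat']
print(long_first)   # ['category']
```

Results come back in the order they appear in the text, not in the order the alternatives are written in the pattern.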

Now let’s solve another couple of problems to better understand this step:

Use the | sign to correct the decimal regular expression above so that it produces this result:
pattern:
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^

Use the | sign, character classes, the "optional" ?, etc. to create a single regular expression that matches both integer and floating point numbers, as discussed in the problem at the end of the previous step (this problem is a little more complicated, yes ;)):
pattern:
string:  42L 12 x 3.4f 6l 3.3 0F LF .2F 0.
matches: ^^^ ^^   ^^^^ ^^ ^^^ ^^     ^^^ ^^

Continued in: RegEx: 20 short steps to master regular expressions. Part 3 and RegEx: 20 short steps to master regular expressions. Part 4