Artur
Level 40
Tallinn

RegEx: 20 short steps for mastering regular expressions. Part 2

Published in the Random EN group
Previous part: RegEx: 20 short steps for mastering regular expressions. Part 1
Original here.
In the last part we mastered the simplest regular expressions and have already learned a thing or two. In this part we will study slightly more complex constructs, but believe me, it will not be as difficult as it might seem. So, let's continue!

Step 8: Asterisk * and Plus Sign +

So far, we've more or less been able to match only strings of a given length. But in the most recent problems, we reached the limit of what the notation we have seen so far can do. Suppose, for example, that we are not limited to 3-character Java identifiers, but can have identifiers of any length. A solution that worked in the previous example will not work in the following one:
pattern: [a-zA-Z_$]\w\w
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^         ^^^  ^^^
Note that when an identifier is valid but longer than 3 characters, only its first three characters are matched. And when an identifier is valid but shorter than 3 characters, the regex doesn't find it at all! The problem is that bracket expressions [] match exactly one character, as do character classes like \w. This means that any match of the above regular expression must be exactly three characters long, so it doesn't work as we might hope.

The special characters * and + can help here. These are modifiers that can be added to the right of any expression to match that expression more than once. The Kleene star (or "asterisk") * matches the preceding token any number of times, including zero. The plus sign + matches the preceding token one or more times. Thus, the expression that precedes + is mandatory (it must appear at least once), while the expression that precedes * is optional, but when it appears, it can appear any number of times. Now, with this knowledge, we can fix the above regular expression:
pattern: [a-zA-Z_$]\w*
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^^     ^^ ^^^^ ^^^^^ ^^ ^
We now match valid identifiers of any length! Bingo! But what would happen if we used + instead of *?
pattern: [a-zA-Z_$]\w+
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^^     ^^ ^^^^ ^^^^^ ^^
We missed the last match, x. This is because + requires at least one character to match, but since the bracket expression [] preceding \w+ has already "eaten" the character x, there are no more characters available, so the match fails. When should we use +? When we need at least one match, but it doesn't matter how many times the given expression matches beyond that. For example, if we want to find any numbers containing a decimal point:
pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^
Note that by making the digits to the left of the decimal point optional, we were able to find both 0.011 and .2. To do this, we needed to match exactly one decimal point with \. and at least one digit to the right of the decimal point with \d+. The above regular expression will not match a number like 3., because it requires at least one digit to the right of the decimal point.
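The difference between a fixed length, * and + can also be checked in code. The following is a minimal sketch in Python (the choice of language and the helper name find_all are assumptions of this example, not part of the original article). It runs the patterns from this step on the same test strings using the standard re module.

```python
import re

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

identifiers = "__e $123 3.2 fo Barr a23mm ab x"

# Exactly three characters: long identifiers are cut off, short ones are missed.
print(find_all(r"[a-zA-Z_$]\w\w", identifiers))

# One start character followed by zero or more word characters.
print(find_all(r"[a-zA-Z_$]\w*", identifiers))

# One start character followed by one or more word characters: "x" is lost.
print(find_all(r"[a-zA-Z_$]\w+", identifiers))

numbers = "0.011 .2 42 2.0 3.33 4.000 5 6 7.89012"

# Optional digits, a literal dot, then at least one digit.
print(find_all(r"\d*\.\d+", numbers))
```

Printing lists of matches instead of caret diagrams makes it easy to compare the three identifier patterns side by side.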

By tradition, we will solve a couple of simple problems:

Find all the English words in the passage below.
pattern:
string:  3 plus 3 is six but 4 plus three is 7
matches:   ^^^^   ^^ ^^^ ^^^   ^^^^ ^^^^^ ^^
Find all file size designations in the list below. A file size consists of a number (with or without a decimal point) followed by KB, MB, GB or TB:
pattern:
string:  11TB 13 14.4MB 22HB 9.9GB TB 0KB
matches: ^^^^    ^^^^^^      ^^^^^    ^^^
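If you want to check your answers, here is one possible pair of solutions, sketched in Python with the standard re module (the language and the variable names are assumptions of this example). The word pattern is only one reasonable choice. The file size pattern is the one the article itself quotes at the start of Step 9, where it turns out to be too permissive for malformed input.

```python
import re

# Exercise 1: a "word" is taken here to be a run of one or more ASCII letters.
words_text = "3 plus 3 is six but 4 plus three is 7"
words = re.findall(r"[a-zA-Z]+", words_text)
print(words)

# Exercise 2: digits, any number of dots, optional digits, then a unit.
sizes_text = "11TB 13 14.4MB 22HB 9.9GB TB 0KB"
sizes = re.findall(r"\d+\.*\d*[KMGT]B", sizes_text)
print(sizes)
```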

Step 9: "Optional" Question Mark ?

Have you already written a regex for the last task? Did it work? Now try applying it here:
pattern:
string: 1..3KB 5...GB ..6TB
matches:  
Obviously, none of these designations are valid file sizes, so a good regex should not match any of them. The solution I wrote for the last problem matches them all, at least in part:
pattern: \d+\.*\d*[KMGT]B
string:  1..3KB 5...GB ..6TB
matches: ^^^^^^ ^^^^^^   ^^^
So what's the problem? We only want to match a single decimal point, if there is one. But * allows any number of matches, including zero. Is there a way to match only zero times or one time, but not more than once? Of course there is. The "optional" ? is a modifier that matches zero or one of the preceding token, but no more:
pattern: \d+\.?\d*[KMGT]B
string:  1..3KB 5...GB ..6TB
matches:    ^^^          ^^^
We're getting closer to the solution here, but it's not quite what we need. We will see how to fix this a few steps from now.
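The effect of swapping * for ? can be reproduced with a short sketch in Python (an assumption of this example; any engine with the same syntax should behave alike). It runs both patterns on the malformed input, then runs the ? version on a few well-formed sizes to show that valid input is still matched.

```python
import re

bad_sizes = "1..3KB 5...GB ..6TB"

# With \.* any number of dots is accepted, so malformed sizes match in full.
star = re.findall(r"\d+\.*\d*[KMGT]B", bad_sizes)
print(star)

# With \.? at most one dot is accepted, so only the clean tails still match.
optional = re.findall(r"\d+\.?\d*[KMGT]B", bad_sizes)
print(optional)

# Well-formed sizes are unaffected by the change.
good_sizes = "11TB 14.4MB 9.9GB 0KB"
good = re.findall(r"\d+\.?\d*[KMGT]B", good_sizes)
print(good)
```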

In the meantime, let's solve this problem:

In some programming languages (such as Java), integer and floating-point literals may be followed by l/L or f/F to indicate that they should be treated as long/float (respectively) rather than as regular int/double. Find all valid "long" numbers in the line below:
pattern:
string:  13L long 2l 19 L lL 0
matches: ^^^      ^^ ^^      ^
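One possible answer, sketched in Python (the language and the variable names are assumptions of this example): one or more digits followed by an optional l or L. It reproduces the matches marked above, including the plain integers 19 and 0.

```python
import re

text = "13L long 2l 19 L lL 0"

# One or more digits, then zero or one "l"/"L" suffix.
longs = re.findall(r"\d+[lL]?", text)
print(longs)
```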

Step 10: "Or" Sign |

In step 8, we had some difficulty finding different types of floating-point numbers:
pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^
The pattern above matches numbers with a decimal point and at least one digit to the right of it. But what if we also want to match strings like 0. (with no digits to the right of the decimal point)? We could write a regular expression like this:
pattern: \d*\.\d*
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^ ^
This matches 0., but it also matches a lone dot, as you can see above. What we are actually trying to match is two different classes of strings:
  1. numbers with at least one digit to the right of the decimal point
  2. numbers with at least one digit to the left of the decimal point
Let's write two regular expressions, independent of each other, one for each class:
pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^
pattern: \d+\.\d*
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^       ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^
We see that neither of these patterns finds the substrings 42, 5, 6 or the lone dot. To obtain the desired result, we need to combine these regular expressions. How can we achieve this? The "or" sign | allows us to specify several alternative sequences in a regular expression at once. Just as [] allows us to specify alternative single characters, the "or" sign | lets us specify alternative multi-character expressions. For example, if we wanted to find "dog" or "cat", we could write something like this:
pattern: \w\w\w
string:  Obviously, a dog is a better pet than a cat.
matches: ^^^^^^^^^    ^^^      ^^^^^^ ^^^ ^^^    ^^^
... but this matches every three-character sequence of "word" characters. And "dog" and "cat" don't even have letters in common, so square brackets won't help us here either. Here is the simplest regex that matches both of those words, and only those two:
pattern: dog|cat
string:  Obviously, a dog is a better pet than a cat.
matches:              ^^^                        ^^^
The regular expression engine first tries to match the whole sequence to the left of the |, and if that fails, it tries to match the sequence to the right of the |. Multiple | characters can also be chained to match more than two alternative sequences:
pattern: dog|cat|pet
string:  Obviously, a dog is a better pet than a cat.
matches:              ^^^             ^^^        ^^^
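The caret diagrams used throughout this article can be generated mechanically. Below is a minimal sketch in Python (the language and the helper name caret_line are assumptions of this example) that marks every matched character with ^ and prints the three "dog"/"cat" patterns from this step in the same layout.

```python
import re

def caret_line(pattern, text):
    """Build a line with '^' under every character matched by pattern."""
    marks = [" "] * len(text)
    for m in re.finditer(pattern, text):
        for i in range(m.start(), m.end()):
            marks[i] = "^"
    # Trailing spaces carry no information, so drop them.
    return "".join(marks).rstrip()

sentence = "Obviously, a dog is a better pet than a cat."

for pattern in (r"\w\w\w", r"dog|cat", r"dog|cat|pet"):
    print("pattern:", pattern)
    print("string: ", sentence)
    print("matches:", caret_line(pattern, sentence))
```

Adjacent matches, such as the three consecutive \w\w\w matches inside "Obviously", run together into one block of carets, which is why the first diagram starts with nine of them.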

Now let's solve another couple of problems in order to better understand this step:

Use the | sign to fix the decimal regular expression above and get this result:
pattern:
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^
Use the | sign, character classes, the "optional" ?, etc. to create a single regular expression that matches both integers and floating-point numbers, as discussed in the problem at the end of the previous step (this problem is a little more complicated, yes ;))
pattern:
string:  42L 12 x 3.4f 6l 3.3 0F LF .2F 0.
matches: ^^^ ^^   ^^^^ ^^ ^^^ ^^    ^^^ ^^
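For comparison with your own answers, here is one possible solution to each of the two exercises above, sketched in Python (the language, the variable names and the exact patterns are assumptions of this example; other patterns can produce the same matches). Both use only |, character classes, * and +, and the "optional" ?.

```python
import re

decimals = "0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. ."

# Either digits after the dot are required, or digits before it are.
decimal_pattern = r"\d*\.\d+|\d+\.\d*"
decimal_matches = re.findall(decimal_pattern, decimals)
print(decimal_matches)

literals = "42L 12 x 3.4f 6l 3.3 0F LF .2F 0."

# The two decimal forms with an optional f/F suffix, then plain integers
# with an optional l/L/f/F suffix. The decimal alternatives come first so
# that "3.4f" is not cut short after its integer part.
number_pattern = r"\d*\.\d+[fF]?|\d+\.\d*[fF]?|\d+[lLfF]?"
number_matches = re.findall(number_pattern, literals)
print(number_matches)
```

The order of the alternatives matters in the second pattern: if the integer alternative came first, the engine would accept "3" from "3.4f" and stop there.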
Continued in:
RegEx: 20 short steps for mastering regular expressions. Part 3
RegEx: 20 short steps for mastering regular expressions. Part 4