Artur

RegEx: 20 short steps for mastering regular expressions. Part 4

Published in the Random EN group
Previous parts: Part 1, Part 2, Part 3.

This final part will, from about its midpoint, touch on things that are mainly used by masters of regular expressions. But the material from the previous parts was easy for you, right? So you can handle this material with the same ease! Original here

<h2>Step 16: Non-capturing groups (?:)</h2>

In the two examples in the previous step, we were capturing text that we don't really need. In the "File Sizes" task, we captured the spaces before the first digit of the file sizes, and in the "CSV" task, we captured the commas between the tokens. We don't need to capture these characters, but we do need them to structure our regular expression. These are ideal candidates for a non-capturing group, (?:). A non-capturing group does exactly what its name suggests: it lets characters be grouped and used in a regular expression, but does not capture them in a numbered group:
pattern: (?:")([^"]+)(?:")
string:  I only want "the text inside these quotes".
matches:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:                1111111111111111111111111111
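To see the difference between what is matched and what is captured, the pattern above can be tried with Python's built-in re module. This is only a small sketch; the variable names are illustrative and not part of the original example.

```python
import re

# Pattern from the example above: the quotes are matched but not captured.
pattern = re.compile(r'(?:")([^"]+)(?:")')
text = 'I only want "the text inside these quotes".'

m = pattern.search(text)
whole_match = m.group(0)  # the full match, quote characters included
captured = m.group(1)     # group 1: only the text between the quotes

print(whole_match)
print(captured)
```

Because both quote groups are non-capturing, the pattern has a single numbered group.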
The regular expression now matches the quoted text as well as the quote characters themselves, but the capturing group captures only the quoted text. Why would we do this? Most regular expression engines let you retrieve the text from the capturing groups defined in your regular expression. If we can leave the extra characters we don't want out of our capturing groups, the text becomes easier to parse and manipulate later. Here's how to clean up the CSV parser from the previous step:
pattern: (?:^|,)\s*(?:\"([^",]*)\"|([^", ]*))
string:  a, "b", "c d",e,f,   "g h", dfgi,, k, "", l
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:   2   1    111  2 2     111   2222   2      2
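A minimal Python sketch of how the cleaned-up pattern could be used to pull fields out of a line. The sample line and the names csv_field, line, and fields are assumptions made for illustration; quoted fields land in group 1 and unquoted fields in group 2.

```python
import re

# Same pattern as above; in a raw Python string the quotes need no backslash.
csv_field = re.compile(r'(?:^|,)\s*(?:"([^",]*)"|([^", ]*))')

line = 'a, "b", "c d",e,, "", l'  # hypothetical sample line

fields = []
for m in csv_field.finditer(line):
    # Exactly one of the two capturing groups takes part in each match.
    quoted, bare = m.group(1), m.group(2)
    fields.append(quoted if quoted is not None else bare)

print(fields)
```

The empty field between the two adjacent commas and the empty quoted field both come back as zero-length strings.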
There are a few things to note here. First, we no longer capture the commas, since we changed the capturing group (^|,) to a non-capturing group (?:^|,). Second, we nested capturing groups within a non-capturing group. This is useful when, for example, you need a group of characters to appear in a particular order but only care about a subset of those characters. In our case, we want the non-quote, non-comma characters [^",]* to appear within quotes, but we don't need the quote characters themselves, so they are not captured. Finally, <mark>pay attention</mark> to the empty field between k and l. The quotes "" are the matched substring, but there are no characters between the quotes, so the captured substring contains no characters (has zero length).

<h3>Shall we consolidate our knowledge? Here are two and a half problems to help us with this:</h3>

Using non-capturing groups (and capturing groups, character classes, etc.), write a regular expression that captures only the properly formatted file sizes on the line below:
pattern:
string:  6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB.
matches: ^^^^^       ^^^^^   ^^^^^^          ^^^^
group:   11111        1111    11111           111
(Solution)

Opening HTML tags start with the character < and end with the character >. Closing HTML tags begin with the character sequence </ and end with the character >. The tag name is contained between these characters. Can you write a regex that captures only the names in the following tags? (You may be able to solve this problem without non-capturing groups. Try it both ways: once with them and once without.)
pattern:
string:  <p> </span> <div> </kbd> <link>
matches: ^^^ ^^^^^^^ ^^^^^ ^^^^^^ ^^^^^^
group:    1    1111   111    111   1111
(Solution with non-capturing groups) (Solution without non-capturing groups)

<h2>Step 17: Backreferences \N and named capturing groups</h2>

Although I warned you in the introduction that trying to build an HTML parser with regular expressions usually leads to heartache, the last example is a nice transition to another (sometimes) useful feature of most regular expression engines: backreferences. Backreferences are like repeating groups in that you can try to capture the same text twice. But they differ in one important respect: they match only the same text, character for character. While a repeating group will allow us to match something like this:
pattern: (he(?:[a-z])+)
string:  heyabcdefg hey heyo heyellow heyyyyyyyyy
matches: ^^^^^^^^^^ ^^^ ^^^^ ^^^^^^^^ ^^^^^^^^^^^
group:   1111111111 111 1111 11111111 11111111111
...then the backreference will match only this:
pattern: (he([a-z])(\2+))
string:  heyabcdefg hey heyo heyellow heyyyyyyyyy
matches:                              ^^^^^^^^^^^
group:                                11233333333
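The contrast between the two patterns can be checked with a short Python sketch. The patterns are written here with the [a-z] character class, and the variable names are illustrative assumptions.

```python
import re

text = "heyabcdefg hey heyo heyellow heyyyyyyyyy"

# Repeating group: "he" followed by one or more lowercase letters.
repeating_group = re.compile(r"(he(?:[a-z])+)")
# Backreference: "he", one letter, then that very same letter repeated.
backreference = re.compile(r"(he([a-z])(\2+))")

same_pattern = [m.group(1) for m in repeating_group.finditer(text)]
same_text = [m.group(1) for m in backreference.finditer(text)]

print(same_pattern)  # every word in the string
print(same_text)     # only the word that repeats the captured letter
```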
Repeating capturing groups are useful when you want to match the same pattern again, while backreferences are useful when you want to match the same text again. For example, we could use a backreference to try to find matching opening and closing HTML tags:
pattern: <(\w+)[^>]*>[^<]+<\/\1>
string:  <span style="color: red">hey</span>
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:    1111
<mark>Please note</mark> that this is an extremely simplified example, and I strongly recommend that you do not attempt to write an HTML parser based on regular expressions. HTML has a very complex syntax, and the attempt will most likely make you miserable.

Named capturing groups are closely related to backreferences, so I'll cover them briefly here. The only difference between a numbered capturing group and a named capturing group is that... a named capturing group has a name, which a backreference can then use instead of the group number:
pattern: <(?<tag>\w+)[^>]*>[^<]+<\/(?P=tag)>
string:  <span style="color: red">hey</span>
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:    1111
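As a hedged illustration, here is the same tag-matching idea in Python. Python's built-in re module accepts only the (?P<tag>...) spelling for named groups, so the pattern is adjusted accordingly; the variable names and the mismatched sample string are assumptions made for this sketch.

```python
import re

# Numbered backreference (\1) and named backreference ((?P=tag)) side by side.
numbered = re.compile(r"<(\w+)[^>]*>[^<]+</\1>")
named = re.compile(r"<(?P<tag>\w+)[^>]*>[^<]+</(?P=tag)>")

html = '<span style="color: red">hey</span>'

tag_by_number = numbered.search(html).group(1)
tag_by_name = named.search(html).group("tag")

# The closing tag must repeat the captured name exactly, so this finds nothing.
mismatched = named.search("<span>hey</div>")

print(tag_by_number, tag_by_name, mismatched)
```

A named group still receives a number, so group("tag") and group(1) return the same text for the named pattern.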
You can create a named capturing group with the (?<name>...) or (?'name'...) syntax (.NET-compatible regular expressions) or with the (?P<name>...) syntax (Python-compatible regular expressions). Since we are using PCRE (Perl Compatible Regular Expressions), which supports both versions, we can use either one here. (Java 7 copied the .NET syntax, but only the angle bracket variant.) To refer back to a named capturing group later in a regular expression, we use \k<name> or \k'name' (.NET) or (?P=name) (Python). Again, PCRE supports all of these options. You can read more about named capturing groups here, but that is most of what you really need to know about them.

<h3>Question to help us:</h3>

Use backreferences to help me remember... umm... this person's name.
pattern:
string: "Hi my name's Joe." [later] "What's that guy's name? Joe ?".
matches:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^ 
group:                  111    
(Solution)

<h2>Step 18: Lookahead and lookbehind</h2>

We will now delve into some of the advanced features of regular expressions. I use everything up to Step 16 quite often. But these last few steps are only for people who are very serious about using regex to match very complex expressions. In other words, regular expression wizards.

"Lookahead" and "lookbehind" may sound complicated, but they really aren't. They let you do something similar to what we did with non-capturing groups earlier: check whether some text exists just before or just after the actual text we want to match. For example, suppose we want to match the names of things that people love, but only if they are enthusiastic about them (only if they end their sentence with an exclamation point). We could do something like:
pattern: (\w+)(?=!)
string:  I like desk. I appreciate stapler. I love lamp!
matches:                                           ^^^^
group:                                             1111
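A small Python sketch makes the lookahead's effect visible by comparing it with a non-capturing group on the same sentence. The variable names are illustrative only.

```python
import re

text = "I like desk. I appreciate stapler. I love lamp!"

lookahead = re.search(r"(\w+)(?=!)", text)      # '!' is checked, not consumed
non_capturing = re.search(r"(\w+)(?:!)", text)  # '!' is consumed, not captured

print(lookahead.group(0))
print(non_capturing.group(0))
```

Both patterns capture the same word in group 1; they differ only in whether the exclamation mark is part of the overall match.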
You can see how the capturing group (\w+) above, which would usually match any of the words in the passage, matches only the word lamp. The positive lookahead (?=!) means that we can only match sequences that are followed by !, but we don't actually match the exclamation mark itself. This is an important distinction: with non-capturing groups, we match a character but don't capture it. With lookaheads and lookbehinds, we use a character to build our regular expression, but it does not become part of the match at all, so we can still match it later in the regular expression.

There are four kinds of lookaheads and lookbehinds: positive lookahead (?=...), negative lookahead (?!...), positive lookbehind (?<=...), and negative lookbehind (?<!...). They do what their names suggest: positive lookahead and lookbehind allow the regex engine to continue matching only when the text contained in the lookahead or lookbehind actually matches. Negative lookahead and lookbehind do the opposite: they allow the regular expression to match only when the text contained in the lookahead or lookbehind does not match.

For example, suppose we want to match only the method names in a chain of method calls, not the object they operate on. In this case, each method name must be preceded by a . character. A regular expression using a simple lookbehind can help here:
pattern: (?<=\.)(\w+)
string:  myArray.flatMap.aggregate.summarise.print!
matches:         ^^^^^^^ ^^^^^^^^^ ^^^^^^^^^ ^^^^^
group:           1111111 111111111 111111111 11111
In the text above, we match any sequence of word characters \w+, but only if it is preceded by a . character. We could achieve something similar using non-capturing groups, but the result is a bit messier:
pattern: (?:\.)(\w+)
string:  myArray.flatMap.aggregate.summarise.print!
matches:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:           1111111 111111111 111111111 11111
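The difference between the two patterns shows up in the overall matches rather than in the captured groups. A quick Python sketch, with illustrative variable names:

```python
import re

chain = "myArray.flatMap.aggregate.summarise.print!"

# Full matches (group 0) for the lookbehind and the non-capturing versions.
with_lookbehind = [m.group(0) for m in re.finditer(r"(?<=\.)(\w+)", chain)]
with_non_capturing = [m.group(0) for m in re.finditer(r"(?:\.)(\w+)", chain)]

print(with_lookbehind)     # method names only
print(with_non_capturing)  # method names with their leading dots
```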
Even though it's shorter, it matches characters we don't want. While this example may seem trivial, lookaheads and lookbehinds can really help us clean up our regular expressions.

<h3>It's not long before the finish line! The following 2 tasks will bring us one step closer to it:</h3>

The negative lookbehind (?<!...) allows the regex engine to keep trying to find a match only if the text inside the negative lookbehind does not appear immediately before the rest of the text to be matched. For example, we could use a regular expression to match only the last names of the women attending a conference. To do this, we want to make sure that the person's last name is not preceded by Mr. Can you write a regular expression for this? (You can assume that last names are at least four characters long.)
pattern:
string:  Mr. Brown, Ms. Smith, Mrs. Jones, Miss Daisy, Mr. Green
matches:                ^^^^^       ^^^^^       ^^^^^
group:                  11111       11111       11111
(Solution)

Suppose we are cleaning up a database and have a column of values that represent percentages. Unfortunately, some people have written the numbers as decimal values in the range [0.0, 1.0], others have written percentages in the range [0.0%, 100.0%], and still others have written percentage values but forgotten the literal percent sign %. Using negative lookahead (?!...), can you mark only those values that should be percentages but lack the %? These must be values strictly greater than 1.00 with no trailing %. (No number can contain more than two digits before or after the decimal point.)

<mark>Please note</mark> that this problem is extremely difficult. If you can solve it without looking at my answer, then you already have great regular expression skills!
pattern:
string:  0.32 100.00 5.6 0.27 98% 12.2% 1.01 0.99% 0.99 13.13 1.10
matches:      ^^^^^^ ^^^                ^^^^            ^^^^^ ^^^^
group:        111111 111                1111            11111 1111
(Solution)

<h2>Step 19: Conditions in Regular Expressions</h2>

We have now reached the point where most people stop using regular expressions. We've covered probably 95% of the use cases for simple regular expressions, and everything done in Steps 19 and 20 is usually handled by a more full-featured text manipulation language such as awk or sed (or a general-purpose programming language). Still, let's continue, just so you know what regular expressions are really capable of.

Although regular expressions are not Turing complete, some regex engines offer features that come very close to those of a complete programming language. One such feature is the conditional. Regex conditionals allow if-then-else statements in which the chosen branch is determined by the lookahead or lookbehind we learned about in the previous step. For example, you might want to match only the valid entries in a list of dates:
pattern: (?<=Feb )([1-2][0-9])|(?<=Mar )([1-2][0-9]|3[0-1])
string:  Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31
matches:                   ^^      ^^              ^^      ^^
group:                     11      11              22      22
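Since each lookbehind here has a fixed width, this pattern also runs unchanged in Python's re module. The sketch below sorts the matches by which group captured them; the by_month dictionary is an illustrative assumption, not part of the original example.

```python
import re

dates = "Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31"
pattern = re.compile(r"(?<=Feb )([1-2][0-9])|(?<=Mar )([1-2][0-9]|3[0-1])")

by_month = {"Feb": [], "Mar": []}
for m in pattern.finditer(dates):
    if m.group(1) is not None:   # group 1 only takes part after "Feb "
        by_month["Feb"].append(m.group(1))
    else:                        # otherwise group 2 matched after "Mar "
        by_month["Mar"].append(m.group(2))

print(by_month)
```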
<mark>Please note</mark> that the groups above are also indexed by month. We could write a regular expression for all 12 months and capture only valid dates, which would then be sorted into groups indexed by month of the year.

The pattern above uses a kind of if-like structure that matches the first group only if "Feb " precedes the number (and similarly for the second). But what if we wanted special handling for February only? Something like "if the number is preceded by 'Feb ', do this, otherwise do that other thing." This is how conditionals do it:
pattern: (?(?<=Feb )([1-2][0-9])|([1-2][0-9]|3[0-1]))
string:  Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31
matches:                   ^^      ^^              ^^      ^^
group:                     11      11              22      22
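Python's built-in re module does not accept a lookaround as a condition (its conditionals only test whether a group matched), so the pattern above cannot be run there as written. As a rough sketch under that constraint, the same if-then-else logic can be spelled as two alternatives guarded by a positive and a negative lookbehind. The names emulated, feb_days, and other_days are illustrative.

```python
import re

dates = "Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31"

# "If preceded by 'Feb ' then 10-29, else 10-31", as two guarded branches.
emulated = re.compile(r"(?<=Feb )([1-2][0-9])|(?<!Feb )([1-2][0-9]|3[0-1])")

matches = list(emulated.finditer(dates))
feb_days = [m.group(1) for m in matches if m.group(1) is not None]
other_days = [m.group(2) for m in matches if m.group(2) is not None]

print(feb_days)
print(other_days)
```

The negative lookbehind on the second branch is what keeps "Feb 30" from slipping through the else case.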
An if-then-else structure looks like (?(if)then|else), where (if) is replaced by a lookahead or lookbehind. In the example above, (if) is written as (?<=Feb ). You can see that we matched dates greater than 29, but only if they didn't follow "Feb ". Using lookbehinds in conditionals is useful when you want to make sure that some text precedes the match.

Positive lookahead conditionals can be confusing because the condition itself doesn't consume any text. So if you want the then branch to ever match, it has to match the same text that the lookahead checked, as below:
pattern: (?(?=exact)exact|else)wo
string:  exact else exactwo elsewo
matches:            ^^^^^^^ ^^^^^^
This means that positive lookahead conditionals are useless. You check whether some text is ahead and then provide a pattern to match that same text when it is. The conditional doesn't help us here at all. You can just replace the above with a simpler regular expression:
pattern: (?:exact|else)wo
string:  exact else exactwo elsewo
matches:            ^^^^^^^ ^^^^^^
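To check that the conditional really adds nothing here, the sketch below compares the plain alternation with a version in which the lookahead condition is spelled out explicitly (Python's re module has no lookaround conditionals, so the condition is emulated with a positive and a negative lookahead). The variable names are illustrative.

```python
import re

text = "exact else exactwo elsewo"

simple = re.findall(r"(?:exact|else)wo", text)
# Emulated conditional: if "exact" is ahead, match it; otherwise match "else".
guarded = re.findall(r"(?:(?=exact)exact|(?!exact)else)wo", text)

print(simple)
print(guarded)
```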
So, the rule of thumb for conditionals is: test, test, and test again. Otherwise, solutions that you think are obvious will fail in the most exciting and unexpected ways :)

<h3>Now we come to the last block of tasks that separates us from the final, 20th step:</h3>

Write a regular expression that uses a negative lookahead conditional to check whether the next word starts with a capital letter. If it does, grab exactly one uppercase letter followed by lowercase letters. If it doesn't, grab any word characters.
pattern:
string:  Jones Smith 9sfjn Hobbes 23r4tgr9h CSV Csv vVv
matches: ^^^^^ ^^^^^ ^^^^^ ^^^^^^ ^^^^^^^^^     ^^^ ^^^
group:   22222 22222 11111 222222 111111111     222 111
(Solution)

Write a negative lookbehind conditional that captures the text owns only if it is not preceded by the text cl, and captures the text ouds only when it is preceded by the text cl. (A slightly contrived example, but what can you do...)
pattern:
string:  Those clowns owns some clouds. ouds.
matches:              ^^^^        ^^^^
(Solution)

<h2>Step 20: Recursion and further training</h2>

There is a lot to cram into a 20-step introduction to any topic, and regular expressions are no exception. There are many different implementations of and standards for regular expressions to be found online. If you want to learn more, I suggest you visit the great site regularexpressions.info; it's a fantastic reference, and I have certainly learned a lot about regular expressions from it. I highly recommend it, as well as regex101.com for testing and publishing your creations.

In this final step, I will give you a little more knowledge about regular expressions, namely how to write recursive expressions. Simple recursion is fairly straightforward, but let's think about what it means in the context of a regular expression. The syntax for simple recursion in a regular expression is (?R)?. Of course, this syntax must appear within the expression itself. What we're going to do is nest the expression inside itself an arbitrary number of times. For example:
pattern: (hey(?R)?oh)
string:  heyoh heyyoh heyheyohoh hey oh heyhey hey heyheyohoh
matches: ^^^^^        ^^^^^^^^^^                   ^^^^^^^^^^
group:   11111        1111111111                   1111111111
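Python's built-in re module has no (?R), so the pattern above cannot be run there directly. As a rough illustration of what the recursion means, the sketch below pastes the expression into itself a fixed number of times. The helper name expand and the depth limit are assumptions made for this sketch; real recursion has no such limit.

```python
import re

def expand(prefix: str, suffix: str, depth: int) -> str:
    """Build prefix(?R)?suffix by hand, nesting the pattern `depth` times."""
    pattern = f"(?:{prefix}{suffix})"  # innermost level: recursion skipped
    for _ in range(depth):
        # Wrap the previous level as an optional part between prefix and suffix.
        pattern = f"(?:{prefix}{pattern}?{suffix})"
    return pattern

nested = re.compile(expand("hey", "oh", depth=3))
text = "heyoh heyyoh heyheyohoh hey oh heyhey hey heyheyohoh"
found = nested.findall(text)

print(expand("hey", "oh", depth=1))
print(found)
```

With depth=3 the expanded pattern accepts up to four nested levels, which is more than enough for the sample string.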
Because the nested expression is optional ((?R) is followed by ?), the simplest match is to ignore the recursion entirely. So hey followed by oh matches (heyoh). To match anything more complex than this, we must find that matching substring nested inside itself at the point in the expression where we inserted the (?R) sequence. In other words, we could find heyheyohoh or heyheyheyohohoh, and so on.

One of the great things about these nested expressions is that, unlike backreferences and named capturing groups, they don't restrict you to the exact text you matched earlier, character for character. For example:
pattern: ([Hh][Ee][Yy](?R)?oh)
string:  heyoh heyyoh hEyHeYohoh hey oh heyhey hEyHeYHEyohohoh
matches: ^^^^^        ^^^^^^^^^^               ^^^^^^^^^^^^^^^
group:   11111        1111111111               111111111111111
You can imagine that the regular expression engine literally copies and pastes your regular expression into itself an arbitrary number of times. Of course, this means that it may sometimes not do what you expect:
pattern: ((?:\(\*)[^*)]*(?R)?(?:\*\)))
string:  (* comment (* nested *) not *)
matches:            ^^^^^^^^^^^^
group:              111111111111
Can you tell why this regular expression captured only the nested comment and not the outer comment? One thing is for sure: when writing complex regular expressions, always test them to make sure they work the way you think they do.

This is the end of this high-speed rally along the roads of regular expressions. I hope you enjoyed the trip. And finally, as promised at the beginning, I will leave here a few useful links for a more in-depth study of the material: