Artur
Level 40

RegEx: 20 short steps to master regular expressions. Part 4

Published in the Random EN group
RegEx: 20 short steps to master regular expressions. Part 1
RegEx: 20 short steps to master regular expressions. Part 2
RegEx: 20 short steps to master regular expressions. Part 3

This final part touches on things that are mainly used by regular expression masters. But the material from the previous parts was easy for you, right? That means you can handle this material with the same ease! Original here

<h2>Step 16: groups without capturing (?:)</h2>

In the two examples in the previous step, we captured text that we didn't really need. In the File Sizes task, we captured the spaces before the first digit of the file sizes, and in the CSV task, we captured the commas between tokens. We don't need to capture these characters, but we do need them to structure our regular expression. These are ideal cases for a non-capturing group, (?:). A non-capturing group does exactly what it sounds like: it allows characters to be grouped and used in a regular expression, but does not capture them in a numbered group:
pattern: (?:")([^"]+)(?:")
string:  I only want "the text inside these quotes".
matches:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:                1111111111111111111111111111
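A quick way to see the difference between the whole match and the numbered group is to run the pattern above in code. This is a minimal sketch using Python's built-in re module (Python and the variable names are my choice for illustration; the article itself works in a PCRE tester):

```python
import re

text = 'I only want "the text inside these quotes".'

# Both quote characters sit in non-capturing groups, so only the
# text between them ends up in a numbered group.
match = re.search(r'(?:")([^"]+)(?:")', text)

whole = match.group(0)  # the full match, quotes included
inner = match.group(1)  # capture group 1, quotes excluded

print(whole)  # "the text inside these quotes"
print(inner)  # the text inside these quotes
```

The pattern defines exactly one numbered group, even though it contains three pairs of parentheses.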
(Example)

The regular expression now matches the quoted text as well as the quote characters themselves, but the capture group only captures the quoted text. Why should we do this? Most regular expression engines let you retrieve the text of the capture groups defined in your regular expression. If we can leave the extra characters we don't need out of our capture groups, it becomes easier to parse and manipulate the text later. Here's how to clean up the CSV parser from the previous step:
pattern: (?:^|,)\s*(?:\"([^",]*)\"|([^", ]*))
string:  a, "b", "c d", e, f, "g h", dfgi,, k, "", l
matches: ^   ^    ^^^   ^  ^   ^^^   ^^^^   ^      ^
group:   2   1    111   2  2   111   2222   2      2
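The same parser can be sketched in Python to see what ends up in each group. The input line below is a cleaned-up CSV line in the spirit of the example, and parse_csv_line is a hypothetical helper name; Python's re module stands in for the PCRE tester here:

```python
import re

# Same pattern as above: the separator and the quotes are matched
# but never captured; group 1 holds quoted tokens, group 2 bare ones.
CSV_TOKEN = re.compile(r'(?:^|,)\s*(?:\"([^",]*)\"|([^", ]*))')

def parse_csv_line(line):
    """Return the captured token of every match, in order."""
    tokens = []
    for m in CSV_TOKEN.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        # Exactly one of the two groups takes part in each match.
        tokens.append(quoted if quoted is not None else bare)
    return tokens

tokens = parse_csv_line('a, "b", "c d", e, f, "g h", dfgi,, k, "", l')
print(tokens)
# ['a', 'b', 'c d', 'e', 'f', 'g h', 'dfgi', '', 'k', '', 'l']
```

The two empty strings are the zero-length captures: one from the empty field between the two commas, one from the empty pair of quotes.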
(Example)

There are a few things to <mark>notice here:</mark> First, we are no longer capturing commas, since we changed the capturing group (^|,) to a non-capturing group (?:^|,). Second, we nested a capture group within a non-capturing group. This is useful when, for example, you need a group of characters to appear in a specific order, but you only care about a subset of those characters. In our case, we needed the non-quote, non-comma characters [^",]* to appear inside quotes, but we didn't need the quote characters themselves, so they didn't need to be captured. Finally, <mark>note</mark> that in the example above there is also a zero-length match between the k and the l. The quotes "" are the matched substring, but there are no characters between them, so the captured substring contains no characters (zero length).

<h3>Shall we consolidate our knowledge? Here are two and a half tasks that will help us with this:</h3>

Using non-capturing groups (and capturing groups, character classes, and so on), write a regular expression that captures only the properly formatted file sizes in the line below:
pattern:
string:  6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB.
matches: ^^^^^       ^^^^^   ^^^^^^          ^^^^
group:   11111        1111    11111           111
(Solution)

HTML opening tags start with < and end with >. HTML closing tags begin with the character sequence </ and end with the character >. The tag name sits between these characters. Can you write a regular expression that captures only the names in the following tags? (You may be able to solve this problem without non-capturing groups. Try solving it two ways: once with them and once without.)
pattern:
string:  <p> </span> <div> </kbd> <link>
matches: ^^^ ^^^^^^^ ^^^^^ ^^^^^^ ^^^^^^
group:    1    1111   111    111   1111
(Solution using non-capturing groups) (Solution without using non-capturing groups)

<h2>Step 17: Backreferences \N and named capturing groups</h2>

Although I warned you in the introduction that trying to build an HTML parser with regular expressions usually leads to heartache, this last example is a nice segue into another (sometimes) useful feature of most regular expression engines: backreferences. Backreferences are like repeating groups in that you can try to match the same text twice. But they differ in one important respect: they will only match the exact same text, character by character. While a repeating group lets us capture something like this:
pattern: (he(?:[a-z])+)
string:  heyabcdefg hey heyo heyellow heyyyyyyyyy
matches: ^^^^^^^^^^ ^^^ ^^^^ ^^^^^^^^ ^^^^^^^^^^^
group:   1111111111 111 1111 11111111 11111111111
(Example)

...a backreference will match only this:
pattern: (he([a-z])(\2+))
string:  heyabcdefg hey heyo heyellow heyyyyyyyyy
matches:                              ^^^^^^^^^^^
group:                                11233333333
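The contrast between the two patterns is easy to check in code. A minimal Python sketch (the re module and the variable names are assumptions made for illustration, not part of the original):

```python
import re

text = "heyabcdefg hey heyo heyellow heyyyyyyyyy"

# Repeating group: "he" followed by any run of lowercase letters.
repeated = [m.group(0) for m in re.finditer(r'(he(?:[a-z])+)', text)]

# Backreference: \2 must repeat the exact character that group 2 captured.
backref = re.search(r'(he([a-z])(\2+))', text)

print(repeated)          # all five words
print(backref.group(0))  # heyyyyyyyyy
print(backref.group(2))  # y
print(backref.group(3))  # yyyyyyyy
```

Only the last word has the same letter repeated right after "he", so it is the only one the backreference version accepts.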
(Example)

Repeating capture groups are useful when you want to match the same pattern repeatedly, whereas backreferences are good when you want to match the same text. For example, we could use a backreference to try to find matching opening and closing HTML tags:
pattern: <(\w+)[^>]*>[^<]+<\/\1>
string:  <span style="color: red">hey</span>
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:    1111
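To see that the backreference really ties the closing tag to the opening one, the pattern above can be tried against a mismatched pair as well. A small Python sketch (the second input string and the names are made up for illustration):

```python
import re

# \1 must repeat whatever tag name group 1 captured.
PAIR = re.compile(r'<(\w+)[^>]*>[^<]+</\1>')

ok = PAIR.search('<span style="color: red">hey</span>')
bad = PAIR.search('<span style="color: red">hey</div>')

print(ok.group(1))  # span
print(bad)          # None, because </div> does not repeat "span"
```

This only checks one flat pair of tags; it is an illustration of \1, not a way to parse HTML.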
(Example)

<mark>Please note</mark> that this is an extremely simplified example, and I strongly recommend that you do not try to write a regular-expression-based HTML parser. HTML has a very complex syntax, and the attempt will most likely make you sick.

Named capture groups are very similar to backreferences, so I'll cover them briefly here. The only difference between a numbered backreference and a named capture group is that... a named capture group has a name:
pattern: <(?<tag>\w+)[^>]*>[^<]+<\/(?P=tag)>
string:  <span style="color: red">hey</span>
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:    1111
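In Python's built-in re module the same idea is spelled with the (?P<name>...) form for the group and (?P=name) for the reference back to it. A minimal sketch, with NAMED_PAIR and the variable names chosen only for illustration:

```python
import re

# (?P<tag>...) defines the named group, (?P=tag) refers back to it.
NAMED_PAIR = re.compile(r'<(?P<tag>\w+)[^>]*>[^<]+</(?P=tag)>')

m = NAMED_PAIR.search('<span style="color: red">hey</span>')

tag_by_name = m.group('tag')  # look the group up by its name
tag_by_number = m.group(1)    # a named group still gets a number

print(tag_by_name, tag_by_number)  # span span
print(m.groupdict())               # {'tag': 'span'}
```

The name is a convenience on top of the numbered group: both lookups return the same captured text.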
(Example)

You can create a named capturing group using the (?<name>...) or (?'name'...) syntax (.NET-compatible regular expressions) or the (?P<name>...) syntax (Python-compatible regular expressions). Since we are using PCRE (Perl Compatible Regular Expressions), which supports both styles, we can use either one here. (Java 7 copied the .NET syntax, but only the angle-bracket version. Translator's note.) To refer back to a named capturing group later in a regular expression, we use \k<name> or \k'name' (.NET) or (?P=name) (Python). Again, PCRE supports all of these options. You can read more about named capture groups here, but this is most of what you really need to know about them.

<h3>Task to help us:</h3>

Use backreferences to help me remember... ummm... this person's name.
pattern:
string: "Hi my name's Joe." [later] "What's that guy's name? Joe?"
matches:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^ 
group:                  111    
(Solution)

<h2>Step 18: lookahead and lookbehind</h2>

Now we'll dive into some of the advanced features of regular expressions. I use everything up to step 16 quite often. But these last few steps are only for people who use regex very seriously to match very complex expressions. In other words, masters of regular expressions.

Lookahead and lookbehind may seem quite complicated, but they really aren't. They allow you to do something similar to what we did with non-capturing groups earlier: check whether some text appears immediately before or immediately after the actual text we want to match. For example, suppose we want to match the names of things that people like, but only if they are enthusiastic about them (only if they end their sentence with an exclamation point). We could do something like:
pattern: (\w+)(?=!)
string:  I like desk. I appreciate stapler. I love lamp!
matches:                                           ^^^^
group:                                             1111
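The key property of the lookahead, that the ! is required but never becomes part of the match, can be checked directly. A minimal Python sketch (the re module and variable names are my own choice for illustration):

```python
import re

text = "I like desk. I appreciate stapler. I love lamp!"

# (?=!) demands an exclamation mark right after the word,
# but does not consume it.
m = re.search(r'(\w+)(?=!)', text)

word = m.group(0)
next_char = text[m.end()]  # the character right after the match

print(word)       # lamp
print(next_char)  # !
```

Because the match stops before the !, that character is still available to whatever comes next in a longer pattern.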
(Example)

You can see how the capture group (\w+) above, which would normally match any of the words in the passage, only matches the word lamp. The positive lookahead (?=!) means that we can only match sequences that are followed by !, but we don't actually match the exclamation mark itself. This is an important distinction: with a non-capturing group, we match the character but don't capture it. With lookaheads and lookbehinds, we use a character to build our regular expression, but we don't even include it in the match. We can still match it later in our regular expression.

There are four types of lookaheads and lookbehinds: positive lookahead (?=...), negative lookahead (?!...), positive lookbehind (?<=...) and negative lookbehind (?<!...). They do what they sound like: positive lookahead and lookbehind let the regular expression engine continue matching only when the text inside the lookahead/lookbehind actually matches. Negative lookahead and lookbehind do the opposite: they let the regex match only when the text inside the lookahead/lookbehind does not match. For example, suppose we want to match only the method names in a chain of method calls, not the object they operate on. In this case, each method name must be preceded by a . character. A regular expression using a simple lookbehind can help here:
pattern: (?<=\.)(\w+)
string:  myArray.flatMap.aggregate.summarise.print!
matches:         ^^^^^^^ ^^^^^^^^^ ^^^^^^^^^ ^^^^^
group:           1111111 111111111 111111111 11111
(Example)

In the text above, we match any sequence of word characters \w+, but only if it is preceded by the . character. We could achieve something similar using non-capturing groups, but the result is a little messier:
pattern: (?:\.)(\w+)
string:  myArray.flatMap.aggregate.summarise.print!
matches:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
group:           1111111 111111111 111111111 11111
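Putting the lookbehind and the non-capturing version side by side shows where the extra characters come from. A small Python sketch (names are illustrative assumptions):

```python
import re

chain = "myArray.flatMap.aggregate.summarise.print!"

# Lookbehind: the dot must be there, but it is not part of the match.
with_lookbehind = [m.group(0) for m in re.finditer(r'(?<=\.)(\w+)', chain)]

# Non-capturing group: the dot is matched (consumed), just not captured.
with_noncapture = [m.group(0) for m in re.finditer(r'(?:\.)(\w+)', chain)]

# Group 1 is the same in both cases; only the full match differs.
captured_only = re.findall(r'(?:\.)(\w+)', chain)

print(with_lookbehind)  # ['flatMap', 'aggregate', 'summarise', 'print']
print(with_noncapture)  # ['.flatMap', '.aggregate', '.summarise', '.print']
print(captured_only)    # ['flatMap', 'aggregate', 'summarise', 'print']
```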
(Example)

Even though it's shorter, it matches characters that we don't need. While this example may seem trivial, lookaheads and lookbehinds can really help us clean up our regular expressions.

<h3>There are very few left until the finish! The following 2 tasks will bring us 1 step closer to it:</h3>

Negative lookbehind (?<!...) lets the regular expression engine continue looking for a match only if the text inside the negative lookbehind does not appear immediately before the rest of the text you need to match. For example, we could use a regular expression to match only the last names of women attending a conference. To do this, we want to make sure that the person's last name is not preceded by Mr. (including the dot). Can you write a regular expression for this? (Last names can be assumed to be at least four characters long.)
pattern:
string:  Mr. Brown, Ms. Smith, Mrs. Jones, Miss Daisy, Mr. Green
matches:                ^^^^^       ^^^^^       ^^^^^
group:                  11111       11111       11111
(Solution)

Let's say we are cleaning up a database and we have a column of information that represents percentages. Unfortunately, some people wrote numbers as decimal values in the range [0.0, 1.0], others wrote percentages in the range [0.0%, 100.0%], and still others wrote percentage values but forgot the literal percent sign %. Using negative lookahead (?!...), can you mark only those values that should be percentages but are missing the % sign? These must be values strictly greater than 1.00, but without a trailing %. (No number can contain more than two digits before or after the decimal point.)

<mark>Note</mark> that this solution is extremely difficult. If you can solve this problem without looking at my answer, then you already have serious skills in regular expressions!
pattern:
string:  0.32 100.00 5.6 0.27 98% 12.2% 1.01 0.99% 0.99 13.13 1.10
matches:      ^^^^^^ ^^^                ^^^^            ^^^^^ ^^^^
group:        111111 111                1111            11111 1111
(Solution)

<h2>Step 19: Conditions in Regular Expressions</h2>

We have now reached the point where most people stop using regular expressions. We've covered probably 95% of the use cases for simple regular expressions, and everything done in steps 19 and 20 is typically handled by a more full-featured text manipulation language like awk or sed (or a general-purpose programming language). That said, let's move on, just so you know what a regular expression can really do.

Although regular expressions are not Turing complete, some regular expression engines offer features that come very close to those of a complete programming language. One such feature is the conditional. Regex conditionals allow if-then-else statements, where the chosen branch is determined by the lookahead or lookbehind we learned about in the previous step. For example, you might want to match only valid entries in a list of dates:
pattern: (?<=Feb )([1-2][0-9])|(?<=Mar )([1-2][0-9]|3[0-1])
string:  Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31
matches:                   ^^      ^^              ^^      ^^
group:                     11      11              22      22
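Since both lookbehinds have a fixed width, the pattern above also runs unchanged in Python's re module. The sketch below (names are illustrative) records which group each date landed in, which is what "indexed by month" amounts to in practice:

```python
import re

DATES = re.compile(r'(?<=Feb )([1-2][0-9])|(?<=Mar )([1-2][0-9]|3[0-1])')

text = "Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31"

# m.lastindex is the number of the group that took part in the match:
# 1 for the February branch, 2 for the March branch.
valid = [(m.group(0), m.lastindex) for m in DATES.finditer(text)]

print(valid)  # [('28', 1), ('29', 1), ('30', 2), ('31', 2)]
```

"Feb 30" is missing from the result because the February branch only allows days starting with 1 or 2.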
(Example)

<mark>Note</mark> that the groups above are also indexed by month. We could write a regular expression for all 12 months and capture only valid dates, which would then fall into groups indexed by month of the year.

The above uses a sort of if-like structure that only looks for matches in the first group if "Feb" precedes the number (and similarly for the second). But what if we only wanted special processing for February? Something like "if the number is preceded by "Feb", do this, otherwise do this other thing." Here's how conditionals do it:
pattern: (?(?<=Feb )([1-2][0-9])|([1-2][0-9]|3[0-1]))
string:  Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31
matches:                   ^^      ^^              ^^      ^^
group:                     11      11              22      22
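Python's built-in re module only supports conditionals that test whether a group has matched, not lookaround conditions, so the pattern above cannot be pasted in as-is. A common workaround, shown here as a sketch and not as the article's method, is to spell the if-then-else as two alternatives: the condition followed by the then branch, or the negated condition followed by the else branch. The name EMULATED and the extra test strings are made up for illustration:

```python
import re

# (?(?<=Feb )THEN|ELSE) rewritten as (?<=Feb )THEN | (?<!Feb )ELSE
EMULATED = re.compile(r'(?<=Feb )([1-2][0-9])|(?<!Feb )([1-2][0-9]|3[0-1])')

text = "Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31"
results = [(m.group(0), m.lastindex) for m in EMULATED.finditer(text)]

print(results)  # [('28', 1), ('29', 1), ('30', 2), ('31', 2)]

# The else branch applies to every month that is not "Feb":
jan = EMULATED.search("Jan 31")
feb = EMULATED.search("Feb 31")
print(jan.group(0))  # 31
print(feb)           # None
```

Unlike the two-lookbehind version from the previous example, this one accepts days after any month name, and only February gets the stricter rule.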
(Example)

The if-then-else structure looks like (?(if)then|else), where (if) is replaced by a lookahead or lookbehind. In the example above, (if) is written as (?<=Feb ). You can see that we matched dates greater than 29, but only if they did not follow "Feb". Using lookbehinds in conditional expressions is useful if you want to make sure that the match is preceded by some text.

Positive lookahead conditionals can be confusing because the condition itself does not consume any text. So if you want the if condition to ever be true, the then branch must match the same text as the lookahead, as below:
pattern: (?(?=exact)exact|else)wo
string:  exact else exactwo elsewo
matches:            ^^^^^^^ ^^^^^^
(Example)

This means that positive lookahead conditionals are largely useless. You check whether some text is ahead and then provide a pattern that matches that same text when it is. The conditional expression doesn't help us here at all. You could just replace the above with a simpler regular expression:
pattern: (?:exact|else)wo
string:  exact else exactwo elsewo
matches:            ^^^^^^^ ^^^^^^
(Example)

So, the rule of thumb for conditional expressions is: test, test, and test again. Otherwise, solutions that you think are obvious will fail in the most exciting and unexpected ways :)

<h3>Here we come to the last block of tasks that separates us from the final, 20th step:</h3>

Write a regular expression that uses a negative lookahead conditional to test whether the next word begins with a capital letter. If it does, grab only one capital letter followed by lowercase letters. If it doesn't, grab any word characters.
pattern:
string:  Jones Smith 9sfjn Hobbes 23r4tgr9h CSV Csv vVv
matches: ^^^^^ ^^^^^ ^^^^^ ^^^^^^ ^^^^^^^^^     ^^^ ^^^
group:   22222 22222 11111 222222 111111111     222 111
(Solution)

Write a negative lookbehind conditional expression that captures the text owns only if it is not preceded by the text cl, and that captures the text ouds only when it is preceded by the text cl. (A bit of a contrived example, but what can you do...)
pattern:
string:  Those clowns owns some clouds. ouds.
matches:              ^^^^        ^^^^
(Solution)

<h2>Step 20: Recursion and Further Study</h2>

In fact, there is only so much that can be squeezed into a 20-step introduction to any topic, and regular expressions are no exception. There are many different implementations of and standards for regular expressions to be found on the Internet. If you want to learn more, I suggest you check out the wonderful site regularexpressions.info; it's a fantastic reference, and I certainly learned a lot about regular expressions from it. I highly recommend it, as well as regex101.com for testing and publishing your creations.

In this final step, I'll give you a little more knowledge about regular expressions, namely how to write recursive expressions. Simple recursion is pretty simple, but let's think about what it means in the context of a regular expression. The syntax for simple recursion in a regular expression is (?R)?. Of course, this syntax must appear within the expression itself. What we are doing is nesting the expression within itself an arbitrary number of times. For example:
pattern: (hey(?R)?oh)
string:  heyoh heyyoh heyheyohoh hey oh heyhey hey heyheyohoh
matches: ^^^^^        ^^^^^^^^^^                   ^^^^^^^^^^
group:   11111        1111111111                   1111111111
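Python's built-in re module has no (?R), so the pattern above cannot be run there directly. What the recursion does can still be sketched with an ordinary recursive function that plays the role of the engine for this one pattern. This is a hand-written stand-in under that assumption, not how a regex engine is implemented, and match_nested and find_nested are hypothetical names:

```python
def match_nested(text, pos=0):
    """Mimic (hey(?R)?oh) starting at pos; return the end index or None."""
    if not text.startswith("hey", pos):
        return None
    pos += 3
    # The optional (?R)? step: first try to match the whole pattern again.
    inner_end = match_nested(text, pos)
    if inner_end is not None and text.startswith("oh", inner_end):
        return inner_end + 2
    # Recursion skipped: "oh" must follow "hey" directly.
    if text.startswith("oh", pos):
        return pos + 2
    return None

def find_nested(text):
    """Collect non-overlapping matches, scanning left to right."""
    found, i = [], 0
    while i < len(text):
        end = match_nested(text, i)
        if end is None:
            i += 1
        else:
            found.append(text[i:end])
            i = end
    return found

found = find_nested("heyoh heyyoh heyheyohoh hey oh heyhey hey heyheyohoh")
print(found)  # ['heyoh', 'heyheyohoh', 'heyheyohoh']
```

Each recursive call corresponds to one more copy of the pattern nested at the (?R) position, which is why every "hey" needs its own closing "oh".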
(Example)

Since the nested expression is optional ((?R) followed by ?), the simplest match is to ignore the recursion completely. So hey followed by oh matches (heyoh). To match anything more complex than this, we must find the matching substring nested inside itself at the point in the expression where we inserted the (?R) sequence. In other words, we could find heyheyohoh or heyheyheyohohoh, and so on.

One of the great things about these nested expressions is that, unlike backreferences and named capturing groups, they don't restrict you to the exact text you matched previously, character by character. For example:
pattern: ([Hh][Ee][Yy](?R)?oh)
string:  heyoh heyyoh hEyHeYohoh hey oh heyhey hEyHeYHEyohohoh
matches: ^^^^^        ^^^^^^^^^^               ^^^^^^^^^^^^^^^
group:   11111        1111111111               111111111111111
(Example)

You can imagine that the regular expression engine literally copies and pastes your regular expression into itself an arbitrary number of times. Of course, this means that sometimes it may not do what you might have hoped:
pattern: ((?:\(\*)[^*)]*(?R)?(?:\*\)))
string:  (* comment (* nested *) not *)
matches:            ^^^^^^^^^^^^
group:              111111111111
(Example)

Can you tell why this regex captured only the nested comment and not the outer comment? One thing is for sure: when writing complex regular expressions, always test them to make sure they work the way you think they will.

This high-speed rally along the roads of regular expressions has come to an end. I hope you enjoyed the journey. And finally, as I promised at the beginning, I will leave here several useful links for a more in-depth study of the material: