Regex: Demystifying the Hieroglyphics

Time

Wednesday, 1:15 pm CDT - Wednesday, 2:15 pm CDT

Location

Speaker presented remotely with presentation projected in Room 325

Organizer Note

This is a Hybrid session where the speaker will present remotely. The session will be displayed on a projector in the assigned room on-site. Participants may either join remotely via the linked Zoom or attend in-person.

Description

You're working with some data and find yourself needing to find a specific piece of information, but your searches keep matching things that you don’t want. While basic searches are fine for some things, every once in a while you need something more powerful.

Enter regular expressions (also referred to as "regex"). BUT WAIT! Before you scroll to the next session description: did you know that this powerful searching capability is available in things like Word and Excel, as well as Google Search Console and Google Analytics? And not only can you use it to search for information, you can also use it to transform information!

In this session, we'll jump into a brief history of regex, discuss pattern matching, and dive into the fundamentals of how we can use regex to surface the data we've been searching for in powerful ways. We'll play some regex-based games to help hone our skills, all while demystifying the idea that regular expressions are dark magic, beyond the capabilities of mere mortals.

Speakers

Paul Gilzow

Developer Relations Engineer @ Platform.sh

Developer Relations Engineer at Platform.sh. Former Programmer/Analyst-Principal at the University of Missouri. Web application security and accessibility evangelist. Software instructor. Conference lecturer and presenter. Runs on passion and coffee. Outside of work, you'll find Gilzow mountain biking, snowboarding, enjoying live music with his kids, and dancing wherever the mood strikes.

Track

Back-End

DevOps

Front-End

Transcript

PAUL GILZOW:
Alright. Hello, everyone. I hope everybody's having a great time at MidCamp, there in person as well as online. I really, really wanted to be there today with all of you, I just wasn't able to make it work. My name is Paul. I am with Platform.sh. We are a secure enterprise-grade platform as a service for building, scaling and deploying applications. We focus on maintaining, managing and most importantly, automating your infrastructure so you can focus on building the next amazing thing. But that's probably not why you're here. You're here to learn about regular expressions. So specifically, what I wanna cover today is I wanna talk about what regular expressions are, a little bit about what they're not, some use cases for when you might use regular expressions, including some times that maybe you didn't know you could use regular expressions, and then we're gonna just dive into the hieroglyphics. We're gonna go into all those different symbols and characters, and I'm gonna walk through each of those and show you in real time how using those different characters can change how that regular expression works.

And then we're gonna go play a game together and we're gonna build some regular expressions to solve a puzzle. So, I do encourage you to follow along if you can. I've got a regular expression repository out at my github profile and show you that here real quick. So it should be pinned up here at the top. If you go into there, I've got some different word banks that I'm gonna be using. I'm also gonna be using a tool called regex101.com. So again, if you can follow along, that's great, if not, not a problem. I'm gonna be showing you all of this and stepping through it with you. So, kind of an introduction. Regular expressions originated back in the early 1950s by a mathematician named Stephen Cole Kleene, who was trying to describe regular languages in formal language theory. Essentially what he wanted to do is come up with a mathematical formula to describe languages. And while that's extremely interesting from an academic standpoint, we could probably spend multiple semesters going over regular expressions.

That's probably not what you think of when you hear the phrase regular expressions. What you probably think of is something more like this. If you happen to be listening in, we've got an animation of Bert from Sesame Street, who's passing out after seeing a fairly long, complex, regular expression. So in terms of this presentation, when I say regular expression, what I'm referring to is a sequence of characters that we use to represent some type of pattern that we're attempting to find inside a larger body of text or string. So just to kind of back up, though, some quick examples of what regular expressions are not. And first is regular expressions are not a programming language. While we do have some basic logic and some branching, it's not a programming language. It's also not unlearnable. It's, I've had, been giving this presentation before, I've had multiple people say something like this. And this is from an actual attendee in the past who said, you know, "there's nothing regular about regular expressions," and they do have a bit of a learning curve.

They are somewhat complex, and we're gonna get into why some of that complexity is there. But I really do strongly believe that they're a very powerful tool for you to have available in your toolbox. Now, I don't know about you, but often when I get a new tool, especially a powerful one, be it a physical tool or one in programming, I often begin to see that every problem I have could be solved with this tool. So, regular expressions are not the solution to every problem. You've probably seen this XKCD comic before and if you're listening in, we've got two stick figures and one saying, "if you're having Perl problems, I feel bad for you, son. I got 99 problems. So I used regular expressions and now I have 100 problems." And it's true. In some situations, regular expressions are not the answer and can actually make things worse. So we're gonna talk a little bit about that. However, once you kind of get used to them and can begin to understand when they're appropriate to use and how to optimize them, they can kind of make you feel like a superhero.

So, what can we use them for? Well, in their basic use, we use them to find text inside these larger bodies of text or larger strings. You can kind of think of them as control F or command F on steroids. As an example, I'm gonna show this to you live. You know, maybe I've got this document about wolves and I need to find every instance in this document, we'll pretend this is hundreds, thousands of pages long. I need to find every instance of the word wolf or wolves as it pertains to, you know, gray wolves or brown wolves or rare wolves. I wanna find any instance except for those that have to do with red wolves. I don't wanna find red wolves. So I could do control F and I could search for wolves. And, you know, I do find wolves, rare wolves, but I'm also finding red wolves. And I'm not finding wolf, you know, I have to do a separate one for that. And instead I need to find both at the exact same time. So, I don't know if you knew this, but inside of both, and I have to move my Zoom window, there we go, so I can see this.

Inside both Word and Google Docs you can use regular expressions. So I could use a regular expression, grab this and move it out of the way. And now notice I'm finding gray wolf, I'm finding rare wolves, brown wolves, but notice, I'm not matching or not finding red wolves. Alright, another example is we can use it to validate text. So let's say we're receiving some text from an outside source via, you know, maybe an API or it's a form, we can use regular expressions to ensure that the information we're receiving not only contains exactly what we're expecting, but is also in a specific format. Another use case is string manipulation. We can actually take a string, we can pluck out pieces from it, reformulate it, and put it back. As an example, maybe we've got some phone numbers. Grab this one here real quick and I'll apologize to anyone watching that's not in the United States. All of these examples are formulated, excuse me, formulated for the United States. So, but let's say I've got phone numbers coming in and you know, some are using dashes, some are using parens, some are using spaces, some are using dots, all over the place.

And I need these all to be the exact same format. Well, by using a regular expression, might paste this in here, paste, I can pluck out those numbers and then reformulate them so that every single one is the exact same formula. Alright, so then how do we actually use it? How do we begin to use these characters to do some of this? Well, as I mentioned earlier, regular expressions are a sequence of characters that are used to represent patterns, right. So we've got all different types of characters. We've got literal characters, we have special characters, we have something called character classes. As we begin to build larger and larger character classes, they get a little complex. So sometimes we might have access to shorthand character classes, just tons and tons of different characters. So what I wanna do is assume you know nothing and start at the beginning, at the ground floor and then build up through all of these different types of characters showing you along the way how each of these changes what we're finding and what we're matching.
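The phone number clean-up described above can be sketched in Python. The exact pattern used in the demo is not in the transcript, so the pattern, the sample numbers, and the `normalize_phone` helper below are all assumptions made for illustration. It also uses groups and quantifiers, which the talk explains later.

```python
import re

# Hypothetical pattern: area code, prefix and line number, each optionally
# separated by a dash, dot or space, with optional parens around the area code.
PHONE = re.compile(r"\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})")


def normalize_phone(raw: str) -> str:
    """Rewrite a US-style phone number in the form 555-123-4567."""
    # \1, \2 and \3 refer back to the three captured groups of digits.
    return PHONE.sub(r"\1-\2-\3", raw)


# Made-up sample input in several formats.
samples = ["(555) 123-4567", "555.123.4567", "555 123 4567", "555-123-4567"]
normalized = [normalize_phone(s) for s in samples]
print(normalized)
```

Every sample comes out in the same dashed format, which is the "pluck out and reformulate" idea from the talk.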

So the first are literal characters. F, o and o is a valid regular expression, that is a collection of literal characters. The literal character f followed by an o, followed by an o. And if I go back over to my tool, I've got this word bank of just random words. And if I type in the literal characters f, o and o I do see some matches. I've got foo here at the end of some path to foo, foo, foot, the foo in foot, the foo in foobar, the foo in barfoo etc. Alright, so f, o and o by themselves, literal characters are a valid regular expression. Now I wanna pause real quick and point out a couple of things and hopefully you can see my mouse. I've got this little character here before my regular expression. I've got a character here after the regular expression, that forward slash, that is a delimiter. A delimiter is a character that's used to represent the edges or the boundaries of your regular expression. So when we give this regular expression to a regular expression engine, we're saying, what's contained inside these delimiters is the regular expression that I want you to use in order to find this pattern.

Now, most regular expression engines will allow you to change the character, although that forward slash is the most common. Now, why might we need to change the character? Well, let's say I'm trying to match path to foo. You know, I need to actually use that forward slash. And notice as I've tried to use that as a pattern, I automatically get this pattern error saying, "wait, you can't do that," because you've used the delimiter inside the regular expression and now I don't know what the actual regular expression is supposed to be. So in this particular case, with this tool, I can change my delimiter here before my pattern, I can change it to something else, and now notice I'm matching exactly what I typed out, again, those literal characters. Now, I do wanna pause real fast because I've used the term regular expression engine several times, and I haven't exactly explained what I mean by that. How is an engine different than a regular expression? Well, a regular expression engine is a piece of software that can ingest a regular expression and then use it against that larger body of text or string in order to find those patterns.
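As a rough illustration of literal matching, here is a minimal Python sketch. The word list is a made-up stand-in for the word bank in the demo. Python's `re` module takes the pattern as a plain string rather than between delimiters, so the forward slash needs no special handling there; that is a property of this particular engine and not of regular expressions in general.

```python
import re

# Hypothetical stand-in for the word bank used in the demo.
words = ["some/path/to/foo", "foo", "foot", "foobar", "barfoo", "bar"]

# Literal characters only: match wherever f, o, o appear in sequence.
has_foo = [w for w in words if re.search(r"foo", w)]

# No delimiters in Python, so the forward slash is just another literal.
has_path = [w for w in words if re.search(r"path/to/foo", w)]

print(has_foo)
print(has_path)
```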

Now, they are sometimes referred to as flavors because each one is different. Now, if you're listening in, I've got an animation of a red light flashing to give us an indication of an issue or a warning, and that is that not every engine is the same. If you are old enough to remember the web browsers, the Internet back in the early 2000s, you'll probably remember that most websites had a little thing on it that said "best viewed in Internet Explorer" or "best viewed in Netscape Navigator." And that was because back then every browser was introducing proprietary features that didn't necessarily work in another one, and it's the exact same with regular expression engines. They're free to implement features that not everybody else can use. And part of the problem is that the standards aren't really there. There's really only one true standard for regular expressions, and that's POSIX basic regular expressions, and it's very limited and it's pretty old. Now, if you've used regular expressions before, you might have heard of PCRE or Perl Compatible Regular Expressions, and it's been adopted as a quasi standard, but it's not really, it's an open source library (UNKNOWN) that originated because they were trying to bundle in features that were compatible with what Perl had introduced in its regular expression engine.

Now, interestingly enough, because it is an open source library, it has gone through several versions and several iterations and in fact at one point Perl adopted features from Perl Compatible Regular Expressions to make Perl's instance compatible with Perl Compatible Regular Expressions. You can kind of see where that's going. Yeah, so just make sure that you're always testing in a regular expression tool. And the reason is you wanna make sure that the regular expression that you're building is going to actually work in the specific engine that you're going to use. Now, besides just warnings, I'm gonna try to give you a series of bonuses. So if you're listening in, I've got an animation of Oprah Winfrey pointing out to the audience saying, "you get a bonus and you get a bonus and everybody gets a bonus." And that bonus is use a regular expression builder. There are a ton available. The three I've got listed here are just examples. Today I'm using regex101, but there's RegExr, there's Rubular. Again, there's a ton out there that you can use.

If you prefer a native or an installed option, you've got RegexBuddy on Windows, which is phenomenal. There's also Expressions on macOS, but use these tools because they give you so much information and help. You noticed that earlier it gave us that warning that there was a pattern error. Notice here on the left hand side, in this particular tool, I can choose from those flavors, I can change whether I'm matching or substituting or listing. It gives me information on what's going on with my regular expression pattern, in this case, I've got an error. Or if I were to change it back to f, o and o it says, "hey, this is what you're doing, you're trying to match those literal characters." When it finds matches, it lists them out, and it also gives me a quick reference or quick access to documentation on regular expressions. So, just make sure that you're using a regular expression tool. OK, so now into the special characters. We have 12 special characters that we're going to go through. If you noticed earlier, when I had that path to foo and it highlighted that I used that forward slash, when I hover over it, it says, "hey, you've got an unescaped delimiter." In most engines or in most languages, you escape that with a backslash.

So that backslash character is our first special character. It is the escape character, it tells the regular expression engine, "hey, the character that is next, don't treat it how you normally would, instead use its alternate meaning." So if I use a backslash, then a forward slash it says don't treat that forward slash as a delimiter, treat it as the literal forward slash character. And we're gonna see this delimiter come up several times as we go through some of these examples. Our next character is part of a subgroup called anchors, and it's the caret symbol. An anchor tells the regular expression engine that the pattern has to appear in a specific location inside the string or in the line. In this case, this anchor character, this caret symbol is saying, "hey, this pattern has to begin or has to show up at the beginning or the start of a string or a line." So I'll come back out here and I type in my literal characters foo again. Notice I am matching foo at the end of some path to foo, I'm matching barfoo, I'm matching foobar.

If I now use that anchor, notice it's no longer matching those instances of f,o,o where they show up somewhere else. It's only matching the ones where they start at the beginning of a line. Alright, if we have an anchor to the beginning, we probably need an anchor at the end and that's the dollar sign. So the dollar sign is also an anchor, but it tells the regular expression engine, "hey, this pattern has to appear at the end of a string or line." So I come out to my tool and I get rid of the anchor back to foo. So notice again, I'm matching foot, foobar, foobar, etc. If I use the dollar sign to anchor to the end, now I'm only matching some path to foo and foo by itself, barfoo etc. So only those instances where f, o and o show up at the end of a string or a line. Alright, so I got another bonus for you. That is to always anchor, whenever you can you should anchor. And the reason is optimization. As a regular expression is parsing, excuse me, as the regular expression engine is parsing your regular expression, comparing it against those strings and lines, it is walking through that string or line, step by step, by step by step.
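The two anchors demonstrated above can be sketched in Python. The word list is again a made-up stand-in for the demo's word bank. One engine-specific detail: in Python's `re`, `^` and `$` anchor to the whole string unless the `re.MULTILINE` flag is passed, in which case they also anchor to each line.

```python
import re

# Hypothetical stand-in for the word bank used in the demo.
words = ["some/path/to/foo", "foo", "foot", "foobar", "barfoo"]

# Caret: the pattern must appear at the start.
starts_with_foo = [w for w in words if re.search(r"^foo", w)]

# Dollar sign: the pattern must appear at the end.
ends_with_foo = [w for w in words if re.search(r"foo$", w)]

# With one word per line, re.MULTILINE makes ^ match at the start of each line.
line_starts = re.findall(r"^foo", "\n".join(words), flags=re.MULTILINE)

print(starts_with_foo)
print(ends_with_foo)
print(line_starts)
```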

If we can use an anchor, then the regular expression engine can more quickly determine whether or not it should continue evaluating that string or line. So if we said, "hey, this pattern has to show up at the beginning," well then as it checks the beginning, if it's not true, it doesn't have to continue down the rest, it can skip and go to the next. Same for the end. If we've said, "hey, this has to be at the end", it can evaluate the end of that string or the line. And if that pattern is not there, it doesn't have to continue evaluating and can move on. So even if your regular expression is matching what you anticipate, you should still go ahead and try to use anchors whenever possible. The next special character we have is the square bracket. The square bracket tells the regular expression engine, "hey, we are going to be using what is known as a character class." A character class allows us to define a set of characters that we want to match. It tells it that we wanna match a single literal character from the list inside the character class.

It also allows us to define a range of literal characters. One thing I failed to mention earlier is that regular expression engines by default are gonna be case sensitive. So if you wanted to find a word and you didn't care if it was upper or lowercase, inside your character class, you'd have to say a, b, c, d, e, f all lowercase and then capital A through Z, right. Well, instead we can have these character classes define ranges. And we could say, OK, I want you to find a through z lowercase and then A through Z uppercase. So we don't have to type out all the specific literal characters. And then just as a special note, the ending square bracket is not a special character on its own. It's only a special character when used in conjunction with the opening square bracket to create that character class. So let's come back out and take a look at it real quick. So, I could do a character class of, say, maybe foo, well not foo, f, and o, and o. And that's the exact same as we did earlier, right.

We were looking for f, o and o. But maybe I don't wanna find just foo, maybe I wanna find f or b, o, o. So now notice I'm finding foo, foo, foot, foot, boo, boo, but I'm not matching roobar or coobar or anything else, right. Alright, let's try a range. Maybe I wanna find any single literal character between b and s. And now notice I'm matching foo, but I'm also matching coofar, but not roobar. Or I could say, "hey, you know, I wanna find b, f or r." And now I'm matching foo, boo, and roo, but not coo. Alright, so character class, we can match any character inside that class, whether it's a range or specific characters. Hopefully everybody's still good? OK, cause the next special character, whoops, I almost forgot. Oops, sorry. So, inside that character class, we don't have to escape other special characters. The only exceptions are the ending square bracket cause that creates the character class. The backslash character, the escape character, because that allows us to escape.

Say we actually want to include the ending square bracket. Then there's the caret symbol, which we'll talk about in a second, and that dash, cause that dash is what allows us to create that range. Now, the next special character is the caret. It is the negation character for character classes. If you're listening in, I've got an animation of Michael Scott from The Office who's holding up his hand saying, "wait a second." And this is the first prime example of why regular expressions can be very challenging to learn. That's because what these special characters stand for, what they represent, can change depending on the characters they're next to or where they're used inside of a regular expression. So, when you use the caret inside of the character class, you are negating or inverting the character class. So what do I mean by that? If I were to come back out here like I did earlier and say, you know, f, o and o, if I use the caret inside the character class that says find any character that isn't an f.
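A minimal Python sketch of the character classes described above: a list of characters, a range, and escaping inside a class. The word list is hypothetical, and the range `[b-f]` is picked here for illustration and is not necessarily the range typed in the demo.

```python
import re

# Hypothetical stand-in for the word bank used in the demo.
words = ["foo", "foot", "boo", "coofar", "roobar"]

# A single character from the list: f or b, followed by o, o.
f_or_b = [w for w in words if re.search(r"[fb]oo", w)]

# A range: any single character from b through f, followed by o, o.
b_to_f = [w for w in words if re.search(r"[b-f]oo", w)]

# Two ranges in one class cover both cases of the alphabet.
letters = re.findall(r"[a-zA-Z]", "a1B2")

# Inside a class, the closing bracket and the dash are escaped with a
# backslash when they should be matched literally.
brackets_and_dashes = re.findall(r"[\]\-]", "a]b-c")

print(f_or_b, b_to_f, letters, brackets_and_dashes)
```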

Now notice I'm matching question mark o, o because that question mark is not an f. I'm matching coo and boo and roo. We're matching any character that isn't an f. So inside the character class, when you use that caret it negates the character class. Outside, and this is where it gets really confusing. I can also use it outside and now notice it's only matching those strings that are some character that isn't o, excuse me, isn't an f followed by two o's where that is at the beginning of the line. So, just know that if you are seeing some of these characters used in multiple locations, there's a good chance that its representation has changed because of where it's being used. Whew, OK. The next one is, no, it's our shorthand character class. OK, so we have character classes. We just talked about those, right. And if we wanted to do, you know, capital A through Z and lowercase a through z, etc., we'd have to build those out. And that gets kind of long. If I come back in here and I said, I wanna find A through Z and then a through z and zero through nine, that's a pretty big character class, right?
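The two meanings of the caret can be put side by side in a short Python sketch. The word list is a made-up stand-in for the demo's word bank.

```python
import re

# Hypothetical stand-in for the word bank used in the demo.
words = ["foo", "?oo", "coo", "boo", "barcoo"]

# Caret inside the class: any single character that is NOT f, then o, o.
not_f = [w for w in words if re.search(r"[^f]oo", w)]

# Caret outside the class is the start anchor; inside it still negates.
not_f_at_start = [w for w in words if re.search(r"^[^f]oo", w)]

print(not_f)
print(not_f_at_start)
```

"barcoo" matches the first pattern because of the "coo" at its end, but not the second, because the anchor requires the match to begin at the start of the string.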

They start to get kind of long and hard to read. So shorthand character classes, sometimes referred to as special sequences, are that escape character followed by what is normally a literal character, and it represents a larger character class. So some examples. The backslash lowercase d is shorthand for the character class zero through nine. The backslash lowercase w is shorthand for capital A through Z, lowercase a through z, zero through nine and the underscore. And there's about 26 or 27 more that you can use. Now you don't have to worry about memorizing these cause remember, we have that quick reference and I can see right away. Here it says, hey, you know, if you want any whitespace character, well, that's the backslash lowercase s. As well as if you see them, if you use it, then remember that that explanation section will tell you, "hey, the backslash lowercase d is the equivalent to the character class zero through nine." So if you're looking at a regular expression and you're trying to understand what it's doing and you see that backslash character and what should normally be a literal character, that's a good indication that's a shorthand character class, or special sequence.
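Here is a small Python sketch of the shorthand classes mentioned above, using made-up sample strings. One assumption worth flagging: in Python 3, `\d` and `\w` also match non-ASCII digits and letters by default, so the `re.ASCII` flag is used here to get the exact equivalences described in the talk.

```python
import re

# \d with re.ASCII behaves like the character class [0-9].
digits = re.findall(r"\d", "Room 325", flags=re.ASCII)
same_as_class = re.findall(r"[0-9]", "Room 325")

# \w with re.ASCII behaves like [A-Za-z0-9_].
word_chars = re.findall(r"\w", "a_1 -?", flags=re.ASCII)

# \s matches whitespace characters such as spaces and tabs.
spaces = re.findall(r"\s", "a b\tc")

print(digits, same_as_class, word_chars, spaces)
```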

Oh, OK, the next one's a dot. I call it the weird one. The reason I call it the weird one is earlier, you know, the caret and the dollar sign went together for anchoring. And we had the square bracket, it has the ending square bracket to create, it has like a partner. The dot's the weird one cause it doesn't really have anybody else. It doesn't have any other characters that are connected with it. And the dot simply matches any single character, excuse me, except for line breaks. So if I come in here and I recreate my regular expression and I use a dot with two o's, well then it's gonna match any character that's not a new line and an o, o. And I can see there I've matched foo, but I've also matched coo and boo and I've even matched that question mark o, o. So, the dot just simply stands in for any literal character with the exception of line breaks. This one's a fun one. The pipe symbol. The pipe symbol creates a branching structure or similar to like an or in programming. In this case, in my example on the screen, it's saying, I want you to find bar or I want you to find foo.
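The dot behavior described above, sketched in Python with a hypothetical word list. By default the dot in Python's `re` does not match a newline, which lines up with the "except for line breaks" rule.

```python
import re

# Hypothetical stand-in for the word bank used in the demo.
words = ["foo", "coo", "boo", "?oo", "oo"]

# Dot: any single character (except a line break), followed by o, o.
dot_matches = [w for w in words if re.search(r".oo", w)]

# A newline in front of "oo" does not satisfy the dot.
newline_match = re.search(r".oo", "\noo")

print(dot_matches)
print(newline_match)
```

The bare "oo" is not matched because the dot still requires one character to be present before the two o's.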

So I come out, oops, (UNKNOWN) too many screens. Here we go. So if I type in foo or boo, notice it's matching foo, but it's also matching boo. So it says, "hey, I want you to find f, o and o and if you can't find f, o and o, then also see if you can find b, o and o." Now there is a warning about alternations, and that is that regular expressions are, excuse me, regular expression engines are, what's a good word? Very eager, they're very eager to return a match. They wanna find a match as quickly as possible and then return that match, so it can go on and keep processing. So in the example, "there were many cats near the bowl with one cat by the door," if we were to use an alternation of cat or cats, notice that matched the c, a, t in cats and returned even though c, a, t, s would be more complete. It's because we've said find c, a, t, it said, "hey, I found c, a, t. There's an s but (UNKNOWN) I found c, a, t. Here it is." Alright, so just remember, it's not that you shouldn't use alternations or that you can't, it's just you need to remember that the left hand side in an alternation inside a regular expression engine is always gonna get precedence.
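The eagerness of alternation can be reproduced in Python with the sentence from the talk. This sketch assumes a backtracking engine such as Python's `re`, which tries the alternatives from left to right and returns the first one that succeeds.

```python
import re

sentence = "there were many cats near the bowl with one cat by the door"

# Simple alternation: find foo or boo.
foo_or_boo = re.findall(r"foo|boo", "foo boo roo")

# "cat" is on the left, so it wins even where "cats" would be more complete.
cat_first = re.findall(r"cat|cats", sentence)

# Putting the longer alternative on the left gives it precedence instead.
cats_first = re.findall(r"cats|cat", sentence)

print(foo_or_boo)
print(cat_first)
print(cats_first)
```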

If it can find that there's a match for that one first, it's always gonna return it, even if the second one is more complete or more appropriate. Alright, the next special character we have is part of a subgroup called quantifiers. Quantifiers tell the regular expression engine to repeat the previous token. I'm gonna come back to that word token, the previous token in the regular expression so many times. With the question mark, it's either zero times or one times. What I mean by a token is, in a moment we're gonna talk about sub-patterns and that token (UNKNOWN) stands for or means any complete pattern. So an o, a literal character by itself is a complete pattern. Or if we were to create a sub pattern, that's a complete pattern. So we could repeat that. So in this case it's saying find f and o, then you can either find an o once or none as long as there's a bar. So if I come out here and we recreate that, so if I say f, o, o question bar, notice it matched. It did match foobar cause there's one o but it also matched, so let's go here, fobar because there's not a second o.

So it says find either one instance of it or find zero instances. Now there is another warning. And that is that quantifiers turn on greediness. Greediness means that for these quantifiers, the regular expression engine is going to attempt to find as many instances of that pattern as it can until it no longer can find matches. So, if we take a similar example where we've got "it's raining cats and dogs" and we do a pattern of c, a, t, s question, meaning the s in cat is optional, well, it's always gonna try to find cats instead of just cat if it can. It's almost like we flipped the alternation and we moved cats from the right hand side to the left hand side. So it's just important to remember that with this greediness, you have to be careful with the quantifiers because you might end up matching things you didn't anticipate. And I'll show you an example of that in just a second. The next quantifier we have or next special character is the asterisk. The asterisk says match that preceding token anywhere from zero times to infinity times.
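A short Python sketch of the question mark quantifier, using made-up sample text. The first pattern makes the second o optional; the second shows the greedy preference for "cats" over "cat".

```python
import re

# f, o, then an optional second o, then bar.
optional_o = re.findall(r"foo?bar", "fobar foobar fooobar")

# The s is optional, but the engine takes it whenever it is there.
optional_s = re.findall(r"cats?", "it's raining cats and dogs, said the cat")

print(optional_o)
print(optional_s)
```

"fooobar" is not matched by the first pattern because it has three o's, one more than the pattern allows.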

So the question is zero or once, the asterisk is zero to as many times as possible. So if I come back over and change that question to an o, not to an o, excuse me, to an asterisk, excuse me. We are still matching fobar because it says you can match it zero times, but I'm also matching fooooobar because I've said you can match as many o's as you want. The reason this matters, coming back to that greediness, is if I were to use something like a dot star, we'll notice now, not only did I match fobar and fooooobar, I also now matched foobarbar, which isn't an o at all, right. Cause that dot stands for any literal character. The asterisk says match the previous as many times as you want, so it's just matched anything as long as there's a b, a, r at the end of the pattern. So just be aware of that greediness as you're using these quantifiers. Alright, the next one is a plus sign. So, it's another quantifier that says the previous token has to be matched at least once, but then can match as many times as we want or as many times as possible.
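The asterisk, the plus sign, and the greedy dot star can be compared in a small Python sketch. The sample strings are made up for illustration and are not the exact word bank from the demo.

```python
import re

text = "fobar foobar fooooobar"

# Asterisk: the second o may appear zero or more times.
star = re.findall(r"foo*bar", text)

# Plus: the second o must appear at least once.
plus = re.findall(r"foo+bar", text)

# Greedy dot star: grabs as much as it can, then backs off only far
# enough to find the LAST "bar" in the string.
greedy = re.findall(r"fo.*bar", "fobar, then foobarbar")

print(star)
print(plus)
print(greedy)
```

The greedy pattern returns one long match spanning the whole sample string, not two short ones, which is the kind of unanticipated match the warning above is about.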

So if I change these back to a second o and a plus, well, now I'm matching foobar, but I'm also matching fooooobar. Alright, now sometimes we're gonna need more fine grained control over the number of times that we're repeating things, and that's where the curly brace comes into play. So the beginning or the opening curly brace combined with a closing curly brace allows us to specify a specific minimum number and maximum number of repetitions. So the pattern is {min,max} where the min is a zero or a positive integer indicating the minimum times we have to match. The max is equal to or greater than the min, also a positive integer indicating the maximum number we can match. If we drop off the max but leave the comma, then max becomes infinity. So if you think back to the other special characters, other quantifier characters, the question mark is really the same as {0,1}. As we said, you can find it zero times or once. The plus, excuse me, the asterisk is the same as saying zero, comma, nothing, or infinity, which means the plus is saying the same thing as one, infinity.

Then if you omit the comma and the max, then we designate a specific number of times. So if we had {2}, that's saying it has to repeat exactly twice. So if I were to come back here and use these curlies, maybe if I said, you know, foo, you have to find between two to three instances of that first (UNKNOWN) matched here. Or I could change that, I'd say something like five and now I've extended that out and matched all five of those. Right, so if you're not seeing how you might use this yet, we're gonna begin. We're gonna go into some other patterns here in a second, and I think you'll begin to see that. And just a note that the ending curly brace by itself is not a special character. It's only a special character when used in conjunction with that beginning curly brace, just like the opening and closing square brackets. So that's ten of 12. Alright, so we got two left. The next special character that we have are the opening and closing parentheses. This allows us to create those sub patterns or group that I mentioned earlier.
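The curly brace forms described above, and their equivalence to the other quantifiers, can be checked with a short Python sketch. The sample text is generated here for illustration so the number of o's in each word is explicit.

```python
import re

# Words with 1, 2, 3 and 5 o's: "fobar foobar fooobar fooooobar".
text = " ".join("f" + "o" * n + "bar" for n in (1, 2, 3, 5))

two_to_three = re.findall(r"fo{2,3}bar", text)   # between 2 and 3 o's
exactly_five = re.findall(r"fo{5}bar", text)     # exactly 5 o's
at_least_two = re.findall(r"fo{2,}bar", text)    # 2 or more o's

# The other quantifiers written as curly brace ranges.
question_equiv = re.findall(r"fo{0,1}bar", text) == re.findall(r"fo?bar", text)
star_equiv = re.findall(r"fo{0,}bar", text) == re.findall(r"fo*bar", text)
plus_equiv = re.findall(r"fo{1,}bar", text) == re.findall(r"fo+bar", text)

print(two_to_three, exactly_five, at_least_two)
print(question_equiv, star_equiv, plus_equiv)
```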

And I really like the example of trying to match both the American English and the UK English spelling of theater. So we, in the United States we spell it t-h-e-a-t-e-r, in the UK they spell it t-h-e-a-t-r-e. And if we wanted to match both, well, you know, it'd be kind of hard. If I had, you know, I could do e question r and that gets close, then I put another e and r but that's kind of messy. So instead what I can do is I can say, hey, I need a sub-pattern or a group and I wanna find e, r or use that alternator and find r, e. Now notice I'm matching theater, theatre. Or we could do something like maybe I want to find dish or dishes. And now I could add a quantifier and say, alright, find dish and then find es zero or one time and sure enough, I'm matching dish or dishes. Makes sense? Hopefully so. Sorry, I can't see anybody, so I'm hoping so. Alright, the last set of special characters that we have access to are the open and closing parentheses and these allow us to create capturing groups.
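The two grouping examples above translate directly to Python. The sample sentences are made up. `re.finditer` with `group(0)` is used here to get the full matched text, since `re.findall` returns only the group contents when a pattern contains a capturing group.

```python
import re

# Group with alternation: "theat" followed by either "er" or "re".
spellings = [m.group(0)
             for m in re.finditer(r"theat(er|re)", "theater theatre theatrical")]

# Group with a quantifier: "dish" followed by "es" zero or one time.
dishes = [m.group(0)
          for m in re.finditer(r"dish(es)?", "one dish, two dishes")]

print(spellings)
print(dishes)
```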

So if you're listening in, I've got an animation of Rudy Huxtable, who is about four or five years old. She's smacking herself in the forehead and saying, "not again". Well it's not quite as bad. All groups by default are capturing groups. And what I mean by a capturing group is when you create a group in these little sub patterns, the regular expression engine is going to hold on to that match in memory, allowing you to access it again later on. Now we access them in the order that they were created. So if we've got group one, group two, group three, then we match, we access them ordinally. So we access group one, group two, group three, etc. This allows us to do things with the group. So if you remember back earlier to my example of the phone numbers. Well, I created groups where I matched on digits and then later access those matches, those captured sub patterns to recreate the string. So if I, went too far, there we go. So if I come back in here, maybe I'm gonna match foo and bar.

Well, if I come in to maybe substitution, I can say, "hey, where I matched, I want you to replace it with dash dash dollar one dash dash." And how you access that group will vary depending on the regular expression engine. In this case, it uses the dollar sign plus the ordinal position to access it. And so now I've matched on bar, it captured the bar and now I'm replacing it with dash dash, then the captured piece, dash dash. Hopefully this makes sense. We're gonna use all of these again in just a second. We're gonna use all these special characters to play our game. So here are the 12 special characters. Normally, in a live demonstration of this, I would do a pop quiz and quiz you on these. Instead, I'm gonna go back over them real fast and kind of give you a summary. So the backslashes are, excuse me, is that a (UNKNOWN). It's our escape character. It tells the regular expression engine not to use the regular meaning of the character that follows, but to use its alternate meaning.
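Here is a small sketch of the substitution described above, assuming Python's re module. As the talk points out, the way you reference a group depends on the engine: the demo tool uses $1, while Python uses \1 (or \g<1>). The phone number formats are hypothetical stand-ins for the earlier phone number example.

```python
import re

# Capture "bar", then reuse the captured text in the replacement.
wrapped = re.sub(r"(bar)", r"--\1--", "foo bar")

# Hypothetical phone-number cleanup: capture the three runs of digits,
# ignore whatever separators came in, and rebuild one consistent format.
phone = re.compile(r"(\d{3})\D*(\d{3})\D*(\d{4})")
raw_numbers = ["555.123.4567", "555 123 4567", "5551234567"]
cleaned = [phone.sub(r"(\1) \2-\3", raw) for raw in raw_numbers]

print(wrapped)
print(cleaned)
```

Groups are numbered in the order their opening parentheses appear, so \1, \2 and \3 here are the area code, the prefix and the last four digits.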

The caret symbol by itself is an anchor that tells the regular expression engine the pattern must be at the beginning of a string or line. The dollar sign is also an anchor, but it says match to the end of a string or a line. Then we have the square bracket, which begins a character class where we're gonna tell the regular expression engine we wanna match one of the characters inside the character class. The caret symbol used inside the character class negates the character class. The dot then stands for any single literal character except for new lines. The pipe creates an alternation where we're telling the regular expression engine, I want you to find one of two things. The question mark is a quantifier where it says, I want you to match the previous token zero or one time, the asterisk zero to infinity, the plus, one to infinity. Or if we need more fine grained control, we've got the curly brace where we can designate a minimum and maximum number of repetitions (UNKNOWN). And then we've got the opening and closing parentheses that allow us to create groups.

And not only can we create sub patterns in those groups, we can re-access that found information from those groups as capture groups. Alright. That's all. (UNKNOWN) I take a big, deep breath, OK. Cause I got one more bonus for you. This bonus is more advanced. The idea here is if you don't understand it, it's totally OK. That is not the purpose. I want you to be exposed to this concept because, one, we're gonna use it in the game. But second, I want you to have it in the back of your mind. For the future, if you're using a regular expression and it's not quite matching what you want, then you can think back, oh yeah, there was something that Paul talked about. The concept is called a look around. Look arounds are sometimes referred to as zero-length assertions. They don't necessarily match something. They don't create a group. It's not a sub pattern. It simply says, is something true or false? So they're kind of similar to the anchor. Is the pattern at the beginning of the line, is it at the end of the line?

So it's similar to that. There are two main types. There's a look ahead and a look behind. So we can say, "hey, from where I am, regular expression engine, I want you to look ahead into the string and see if something is (positive) or isn't (negative) there, or, from where I am in the pattern, I want you to look behind me from where you've matched and see if something is or isn't there." Now, if you think back to the example I did earlier of the wolves, right. I wanted to find brown wolves and gray wolves or even the word wolf, but not red. That was an example of a negative look behind. So I want you to find the word wolf, and then I want you to look behind and make sure that it was not preceded by the word red. Now, if it's still a little fuzzy, that's OK. It took me quite a while to kind of grasp this concept. And what finally did it for me was I had a situation where I needed to match the letter q in words where it was not followed by u. Now we have some tools available to us already from what we've just learned, right.

How do we create something that says don't find this thing? Well we use a character, a negated character class. So I could say, OK, I wanna find a q and then I want you to find something that's not a u. And that gets us close, right. I've got qwerty, there's no u there, qintars, no u, tariqat. But notice it didn't match Iraq because the character has to match, excuse me, the character class has to match one, at least one literal character from the group. And there isn't anything after that q. So it didn't match Iraq. Also notice that it's matching that second letter. And I just wanna match the q in the word, right? So this is where a negative lookahead comes in. I'm saying I want you to find a q and then I want you to stop. And then I want you to look ahead and tell me if the next character is a u or is it not a u. And I only wanna match those that are not a u. We create look arounds using the parentheses. So again, it's that changing of what these things stand for.

We create a look around then by using the question mark inside the parentheses, and I create a negative lookahead using the exclamation mark and then say don't find a u, or make sure that a u does not follow. And notice now I've got q in qwerty, q in qintars, tariqat, cinqs and Iraq. Alright, hopefully this is making a little bit of sense because we are gonna need it for the game. The game is we're gonna play Wordle. If you've not played Wordle, I'll explain it in just a second. And it's not to solve the wordle, although we are going to solve the wordle. The goal of what we're going to do is to try to build a regular expression that takes us from 600,000 words in the English dictionary down to a very small collection that would actually solve the puzzle. So we're solving the wordle. Now, normally in a live demonstration, I would have you and you only build the regular expression. I wouldn't touch it at all, you'd have to build it. But in this case, I'm just gonna have to walk you through the steps. Hopefully it's gonna make sense.
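To make the look arounds above concrete, here is a hedged Python sketch comparing the negated character class with the negative lookahead, using the words mentioned in the talk plus one hypothetical word with "qu" in it. The negative look behind for the wolves is a guess at the pattern, since the transcript doesn't show it.

```python
import re

words = ["qwerty", "qintars", "tariqat", "cinqs", "Iraq", "queen"]

# q followed by one character that is not a u: consumes that second character.
negated_class = re.compile(r"q[^u]")

# q, then an assertion that a u does not follow: consumes only the q.
negative_lookahead = re.compile(r"q(?!u)")

class_hits = {}
lookahead_hits = {}
for word in words:
    found = negated_class.search(word)
    if found:
        class_hits[word] = found.group()
    found = negative_lookahead.search(word)
    if found:
        lookahead_hits[word] = found.group()

# Negative look behind: match "wolf" only when "red " does not precede it.
marked = re.sub(r"(?<!red )wolf", "WOLF", "red wolf, gray wolf, brown wolf")

print(class_hits)      # two-character matches, and no entry for Iraq
print(lookahead_hits)  # just the q, including the one at the end of Iraq
print(marked)
```

The lookahead is zero-length, so it can succeed at the very end of a string where the character class has nothing left to consume.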

So I'm gonna bring up Wordle. I've got it down here somewhere. There it is. Alright, if you have never played Wordle before, we are trying to guess a five letter word in (UNKNOWN) math. So it's all, the case doesn't matter, but it's not gonna be proper nouns. It's all lowercase, five letter words. If we guess the word and I'll apologize, I'm colorblind. So I don't know if this is orange or green. It's, feel free to laugh at me, that's totally OK. But if it's this color, that means the letter is in the right location for the word. If it is blue, that means the letter is used in the word but not in that position. And if it's gray, it means that letter is nowhere in that word. Alright. If you're following along, in my repo, I've got a file called (UNKNOWN). It, excuse me. It contains a collection, a whole bunch of, I think 13,000 something words that we can use to guess inside Wordle. But let's pretend for a moment that we didn't have this list. This was actually 600,000. We already have the tools from what we've seen, available to limit that selection, that 600,000 down to only five letter lowercase words.

So the first thing we always wanna do is we always wanna make sure we have anchors. And in this case, I wanna capture the group so that I can display it down here. And we know that we only want lowercase alpha characters. So I could create a character class of a through z and I know I need exactly five of those. So, you already have all the tools you need to go from 600,000 down to, if you'll notice, 12,971. So you've already limited all those words down to a much smaller collection using a regular expression. Alright, so let's start the puzzle. Oops, I went to the wrong one, there we go. In fact, I'm gonna minimize this window temporarily so we can flip back and forth much easier. OK, so I usually start with the word audio. I start with the word audio cause that knocks out a-u-i and o, just leaving me an e, just to kind of give me some idea of how I need to adjust the regular expression. So we can see, alright. I know the letters a, u, d and i are not used in the word at all. The letter o is used in the word and it's the very last character.
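A minimal sketch of that first filter, assuming Python's re module and a tiny hypothetical word list standing in for the real dictionary file. The MULTILINE flag is what lets the anchors apply to each line rather than to the whole text.

```python
import re

# Hypothetical stand-in for the much larger dictionary, one word per line.
dictionary = "audio\nBento\nmetro\nregex\ncat\ntheatre\nretro\ndon't\n"

# Anchors on both ends, a capturing group, and exactly five lowercase letters.
five_letter = re.compile(r"^([a-z]{5})$", re.MULTILINE)

# With a single group in the pattern, findall returns the captured text.
candidates = five_letter.findall(dictionary)

print(candidates)
```

Without the anchors, [a-z]{5} would also match the first five letters of "theatre", which is why they go in first.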

So, right off the bat, I can come back to my regular expression. And I know the last character is an o. No matches, alright, cause I didn't tell it how many, all we know is we need four characters before it. There we go, we got 389. But we also know that it should not be a, u, d or i. And we use that character class for a, u, d and i, we negate it. And now we've gone from 600,000 to 13,000 to 110 matches in one guess, two regular expressions. This is usually where there's a big round of applause and cheers and everybody's like, wow, that's amazing. Alright, so we now need to guess a second word to see how we can limit this even more. So I'm just gonna grab the first one, bento. So I'm gonna go b, e, n, oops. I think I hit a space, t and o, and let's see what happens. OK, so now we know we don't have b or n. So I'm gonna go ahead and add that into the negated character class cause we know we don't have those. And then I know the first, or is that the second one is an e, so I'm gonna say we don't need four of them anymore.

We know the second one is an e, then I'll replicate not a, u, d, i, b or n. So we have one, two, three and four, and we know the fourth character is not a t, but the t is used. So I'm gonna do (UNKNOWN) three and then I'm gonna say not a, u, d, i, b, n or t. Alright, so we've gone from 600,000 to 13,000 to 110 to 21. But we also know that the t has to be there somewhere. So if you think back to what I just showed you with the look arounds, I could say I want you to look ahead, that's the question equals. And I want you to make sure that a t is used in the word somewhere. So this is saying look forward into the pattern and ensure that a t is somewhere in the five letter word. And now we're down to five matches. So we've had two word guesses, we've built three regular expressions, and we've already gone from 600,000 to five. So now we're down to, I've never heard of the word (UNKNOWN). I'm gonna guess tempo, t-e-m-p-o. Oh, OK, so we know we still have a t. We now know the third one. We can't have a p.
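The narrowing steps after the audio and bento guesses can be sketched like this. The patterns are reconstructed from the spoken description, so treat them as an approximation of what was typed, with the capturing group left out for brevity. The word list is a small hypothetical sample, so the counts won't match the ones in the talk.

```python
import re

# Hypothetical sample of five-letter words, not the real 12,971-word list.
words = ["audio", "regex", "bento", "hello", "metro",
         "retro", "tempo", "metho", "pesto", "servo"]

steps = [
    # After "audio": the last letter is o, with four letters before it.
    r"^[a-z]{4}o$",
    # ...and none of those four letters is a, u, d or i.
    r"^[^audi]{4}o$",
    # After "bento": no b or n, second letter is e, fourth letter is not t.
    r"^[^audibn]e[^audibn][^audibnt]o$",
    # Positive lookahead: a t has to appear somewhere in the word.
    r"^(?=.*t)[^audibn]e[^audibn][^audibnt]o$",
]

def narrow(pattern, candidates):
    """Keep only the candidate words that the pattern matches."""
    regex = re.compile(pattern)
    return [word for word in candidates if regex.search(word)]

survivors = [narrow(pattern, words) for pattern in steps]
for pattern, kept in zip(steps, survivors):
    print(len(kept), pattern, kept)
```

The lookahead sits right after the caret, so it scans the whole word for a t without consuming anything, and the positional character classes then do the matching.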

We know the third one can't be an m so I could go add the p to all of them. Although at this point it's pretty clear on what the word is gonna be. And then we said the third one can't be an m. So now we're down to metho, metro or retro. So I'm gonna try metro, m-e-t-r-o. And we solved it. OK, in what? Four guesses. Cool, alright. So, there's applause I hear. (UNKNOWN) perfect. So, hopefully you can see, and again, doing this wordle is a really good way of forcing yourself to see the information you have and adjust the regular expression to find the matches that you want. So, if I can come back in here real quick, where is, there's mine. So, if you (UNKNOWN). So if you liked that and I hope you did and you wanna try this on your own, there are multiple wordle clones and I would encourage you to try that. You know, grab this dictionary collection of words, try to build a regular expression that limits that number and go over and over it again. Another great one is, there is this crossword puzzle.
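As a rough sketch of that last adjustment after the tempo guess, again assuming Python and a pattern reconstructed from the description: adding m to the third position's negated class is enough to drop tempo. The candidate list only contains the four words named in the talk, and the variant that also excludes p everywhere is the optional step mentioned above.

```python
import re

# The candidates named in the talk at this point (the fifth one is unknown).
candidates = ["tempo", "metho", "metro", "retro"]

# Third letter can no longer be m.
third_not_m = re.compile(r"^(?=.*t)[^audibn]e[^audibnm][^audibnt]o$")

# Optional extra step: also exclude p from every open position.
also_no_p = re.compile(r"^(?=.*t)[^audibnp]e[^audibnmp][^audibntp]o$")

remaining = [word for word in candidates if third_not_m.search(word)]
remaining_no_p = [word for word in candidates if also_no_p.search(word)]

print(remaining)
print(remaining_no_p)
```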

And you'll have access to this presentation deck so you don't have to copy these down. But this is another great puzzle where you're given regular expressions in a crossword and you have to try to guess the word based on what the regular expression is. So another great way to kind of build up your regular expression skills. Now, I wanna pause for just a moment and acknowledge some resources that are phenomenal and have been extremely beneficial to my own learning, and that is regular-expressions.info and rexegg, I can never say that, rexegg.com, both phenomenal sources for learning regular expressions, as well as Carl Alexander's site, and he has an article, a beginner's guide to regular expressions. All phenomenal sources, I would not be able to do nearly as much with regular expressions as I can without these sources. And I think we have some time for questions. As we're getting those questions up and over to me, I'll put this up here. I'm odd in that I love regular expressions. I really do.

I think they're fun to do, I see them as a giant puzzle. So feel free to reach out to me at any point, if you have questions, you know. If you have something about the presentation that you're still struggling with or you're trying to write a regular expression, don't hesitate to reach out to me and ask. I'd be happy to help out. So do we have some questions? I'm in the YouTube chat here, let me check. Not in the YouTube chat just yet, but if you are listening in there, feel free to post in that chat and I'll read it off. Anybody in the room here have a question?

STUDENT:
What are some other uses of regular expressions besides solving wordle?

SPEAKER:
The question was what are some other uses for regular expressions beyond solving wordle?

PAUL GILZOW:
Yeah, so a big one that you'll probably run into, especially in the Drupal space, are rewrite and redirect rules inside of either (UNKNOWN). You know, most of our sites are living and they change a lot and we have old URLs that get removed or they get moved somewhere else or, you know, redirects have to change. And so you end up having to do quite a bit of regex inside of there. I've used (UNKNOWN) in Word docs, Excel sheets, databases. I do a ton of pattern matching for forms to make sure the data is the way I need it. I've done a lot of data manipulation, kind of like I showed you with the phone number where I've got data coming in that's not clean and I use regular expressions to clean up that data. So I guess, again, you don't wanna use it for everything. But there are a plethora of areas where it can be an extremely beneficial tool to have access to.

STUDENT:
Explain the difference between a (UNKNOWN).

SPEAKER:
The question was, could you explain the difference between a capturing group and a non capturing group? And when would you wanna use one versus the other?

PAUL GILZOW:
So to do a non capture group, why am I not looking in my screen? Oh, I was on Zoom, OK. So the pattern for a non capture group is question colon, and that just tells the regular expression engine don't store this into memory. You would use it when you need to have a sub pattern but you don't necessarily need to capture the information. Because when you capture information, it is storing it in memory, and with regular expressions, especially when you've got, you know, 600,000, a million lines of text it's trying to parse through, that's a lot to try to hold in memory if you have a ton of groups as well. So, if you don't need the information later on, it's a good idea to make those groups non capturing just so it's not trying to hold on to that information. In terms of when, you know, back to the (UNKNOWN) access rules, it could be that you're saying, you know, I wanna match when the request comes in for either dub dub dub or non dub dub dub, you know, maybe the dub dub dub doesn't matter, but you need to have it as a sub pattern.

So you can say this is a non capturing group. I don't, I'm not gonna do anything with it, I just need it as a sub pattern. Hopefully that answered the question.
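A small sketch of the capturing versus non capturing difference from this answer, assuming Python's re module. The host name example.com is a placeholder, and the pattern is only an illustration of the www-or-not case, not an actual rewrite rule.

```python
import re

# Capturing: the optional "www." is stored as group 1.
capturing = re.compile(r"^(www\.)?example\.com$")

# Non capturing: (?: ... ) still groups the sub pattern, but stores nothing.
non_capturing = re.compile(r"^(?:www\.)?example\.com$")

hosts = ["www.example.com", "example.com", "blog.example.com"]

matched_capturing = [h for h in hosts if capturing.search(h)]
matched_non_capturing = [h for h in hosts if non_capturing.search(h)]

# Pattern.groups reports how many capturing groups a pattern has.
print(capturing.groups, non_capturing.groups)
print(matched_capturing == matched_non_capturing, matched_non_capturing)
```

Both patterns accept the same hosts. The only difference is whether there is a group to read back afterwards.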

SPEAKER:
(UNKNOWN)

PAUL GILZOW:
Good. Good, OK.

SPEAKER:
Anyone else? It looks like that's it for questions in the room here.

PAUL GILZOW:
Alright. Well, thank you very much. Again, don't hesitate to reach out to me if you've got questions. I've loved talking about this stuff.

MidCamp 2023

DePaul University - Lincoln Park Student Center
