Regular Expressions (RegEx) in Python

Regular expressions, aka regex, are incredibly common for parsing data. Before we discuss how, let’s consider a practical example using US phone numbers. The following are all valid written phone number formats:

+1-555-555-3121
1-555-555-3121
555-555-3121
+1(555)-555-3121
+15555553121

It’s amazing that all of these are the exact same number, just formatted slightly differently. So how would we search a whole document for every possible variation of the phone number format?

“Machine learning!” you say. Well, that would probably work but it’s overcomplicating this particular challenge. Instead, we can use pattern matching, aka regular expressions, to simplify the challenge.

Regular expressions are intimidating and take some time to wrap your head around. So I created this guide as a way to unpack how to effectively use regular expressions in Python. Many of these regex patterns and concepts carry over to other languages, especially since Python’s regex syntax was inspired by Perl.

Let’s look at some code.

my_phone_number = "555-867-5309"

How do we get all the numbers (not the dashes -) from the above string? Let’s first talk about the harder and more amateur way to do it:

numbers = []
for char in my_phone_number:
    number_val = None
    try:
        number_val = int(char)
    except ValueError:
        # not a digit (for example, the dash), so skip it
        pass
    if number_val is not None:
        numbers.append(number_val)

numbers_as_str = "".join([f"{x}" for x in numbers])
numbers_as_str
'5558675309'

Here’s another way your intuition may take you:

numbers_as_str2 = my_phone_number.replace("-", "")
numbers_as_str2
'5558675309'

Finally, Python strings (str) have a built-in method .isdigit() that returns True only when every character in the string is a digit. Applied one character at a time, it lets us keep just the digits. Here’s how it’s done:

numbers_as_str3 = "".join([f"{x}" for x in my_phone_number if x.isdigit()])
numbers_as_str3
'5558675309'
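To make the behavior of .isdigit() concrete, here is a small sketch that calls it on a few sample strings. The sample strings and the name digit_checks are only illustrations and are not part of the original example.

```python
# Hypothetical sample strings, chosen only to illustrate str.isdigit().
samples = ["5", "-", "555", "555-867", ""]

# Map each sample to the result of calling .isdigit() on it.
digit_checks = {s: s.isdigit() for s in samples}

print(digit_checks["5"])        # True: a single digit
print(digit_checks["-"])        # False: a dash is not a digit
print(digit_checks["555"])      # True: every character is a digit
print(digit_checks["555-867"])  # False: one non-digit is enough to fail
print(digit_checks[""])         # False: an empty string has no digits
```

This is why the one-liner above checks each character separately instead of calling .isdigit() on the whole phone number.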

All of these methods are valid in that they achieve the goal but there’s a more practical and robust way to do this.

And that’s with regular expressions. Let’s see our first regex example:

import re  # the built-in regex library

pattern = r"\d+"
matches = re.findall(pattern, my_phone_number)
matches
['555', '867', '5309']

The re.findall function actually gives a much better result: each group of digits has been parsed out by default.

To me, it’s easier to infer that ['555', '867', '5309'] is a phone number over something like 5558675309. That’s because I’m from the USA and that’s how we typically group numbers.
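If you still want the single string of digits that the earlier approaches produced, regex can do that too. The following is a minimal sketch of two ways to go about it. The names digits_joined and digits_subbed are just for illustration, and \D (which matches any character that is not a digit) is a token this guide has not covered yet.

```python
import re

my_phone_number = "555-867-5309"

# Option 1: glue the groups returned by re.findall back together.
digits_joined = "".join(re.findall(r"\d+", my_phone_number))

# Option 2: delete every character that is not a digit (\D is the opposite of \d).
digits_subbed = re.sub(r"\D", "", my_phone_number)

print(digits_joined)  # 5558675309
print(digits_subbed)  # 5558675309
```

Both give the same result as the loop, .replace(), and .isdigit() versions, without hard-coding the dash as the separator.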

We still haven’t gotten to the core reason as to why we use regex. Let’s think of another example.

my_other_phone_numbers = "Hi there, my home number is 555-867-5309 and my cell number is +1-555-555-0007."

pattern = r"\d+"
matches = re.findall(pattern, my_other_phone_numbers)
matches
['555', '867', '5309', '1', '555', '555', '0007']

In ['555', '867', '5309', '1', '555', '555', '0007'], it is much harder to tell where one phone number ends and the next begins. That string was only 79 characters long (including spaces and punctuation). Imagine if we had thousands of characters.

What to do? The answer, again, is regex. And this is where regex really shines.

The reason is that we’re looking for a specific pattern in our text, not just digits. We actually want to ignore digits that don’t match this pattern. Say, for instance, I gave you a time and my phone number:

meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007."

If we try to only extract digits, we’ll get a few extra we don’t need. Take a look:

pattern = r"\d+"
matches2 = re.findall(pattern, meeting_str)
matches2
['8', '30', '1', '555', '555', '0007']

So what we need to do is improve our regular expression pattern. Let’s see how:

phone_pattern = r"\+\d{1}-\d{3}-\d{3}-\d{4}"
matches3 = re.findall(phone_pattern, meeting_str)
matches3
['+1-555-555-0007']

Whoa. Now, you’ve really lost me. What the heck is r"\+\d{1}-\d{3}-\d{3}-\d{4}"?

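To match any digit, you use the string r"\d". The r in the front makes this a Python raw string, which leaves backslashes alone so that \d reaches the regex engine untouched; it does not by itself mark the string as a regular expression. The \d is the pattern that matches any single digit. I’ll explain the curly braces parts in a minute but let’s dive into the \d a bit more.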

numbers_with_decimals = r"\d+\.\d+"
matches4 = re.findall(numbers_with_decimals, "123.122")
no_matches = re.findall(numbers_with_decimals, "12")
print(matches4, no_matches)
['123.122'] []

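In the last two patterns we saw something strange with the characters + and . (the period). That’s because regex treats these characters differently than English does: + means “one or more of the preceding item” and . matches any single character (except a newline by default). So if our regex pattern needs to match a literal + or . we have to escape them with \+ and \. respectively.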
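Here is a small sketch that puts the escaping rules side by side and then spells out the phone pattern piece by piece. It assumes the re.VERBOSE flag, which lets a pattern span several lines with comments. The names same_pattern, sample, loose, strict, and phone_pattern_verbose are illustrative and not part of the original code.

```python
import re

meeting_str = "Hey, give me a call at 8:30 on my cell at +1-555-555-0007."

# Python level: the r prefix keeps the backslash, so these two spellings are equal.
same_pattern = r"\d" == "\\d"

# Regex level: an unescaped . matches any character, an escaped \. only a literal dot.
sample = "123.122 and 123x122"
loose = re.findall(r"\d+.\d+", sample)
strict = re.findall(r"\d+\.\d+", sample)

# The phone pattern again, spread out with re.VERBOSE so each piece can carry a comment.
phone_pattern_verbose = re.compile(
    r"""
    \+      # a literal plus sign (escaped, so it is not a quantifier)
    \d{1}   # exactly one digit
    -       # a literal dash
    \d{3}   # exactly three digits
    -
    \d{3}   # exactly three digits
    -
    \d{4}   # exactly four digits
    """,
    re.VERBOSE,
)
verbose_matches = phone_pattern_verbose.findall(meeting_str)

print(same_pattern)     # True
print(loose)            # ['123.122', '123x122']
print(strict)           # ['123.122']
print(verbose_matches)  # ['+1-555-555-0007']
```

The verbose version matches exactly what the one-line phone_pattern matches. It is only a more readable spelling, and it shows that a number in curly braces means “repeat the preceding item exactly this many times.”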


codingforentrepreneurs.com
