An Introduction to Regex in Python

What is Regex?

Regex stands for Regular Expression and essentially is an *easy *way to define a pattern of characters. Regex is mostly used in pattern identification, text mining or input validation.

Regex puts a lot of people off, because it looks like gibberish on first look; as for the people who know how to use it, they can’t seem to stop! It’s a very powerful tool that is worth learning about if you don’t already know.

Introduction to Regex

The first thing you need to know about regex, is that you can match a specific character or words.

Let’s assume, that we want to know whether a specific string, contains the letter ‘a’ or word ‘lot’. If that is the case, we can use the following python code:

import re
str = "Learning regex can be a lot of fun"
lst = re.findall('a', str)
lst2 = re.findall('lot', str)
print(lst)
print(lst2)

which will return, a list with 3 matches and a list of 1:

['a', 'a', 'a']
['lot']

Keeping our set up the same, imagine that you want to search for the following 3 letters in any order a, b or c. You can use a list, by using square brackets:

lst = re.findall('[abc]', str)
lst2 = re.findall('[a-c]', str)
print(lst)
print(lst2)

returning:

['a', 'c', 'a', 'b', 'a']
['a', 'c', 'a', 'b', 'a']

Photo by Dayne Topkin on Unsplash

The Regex Cheat Sheet

Every time I am about to write a complicate regular expression, my first port of contact is the following list, by Dr Chuck Severance:

Python Regular Expression Quick Guide

^        Matches the beginning of a line
$        Matches the end of the line
.        Matches any character
\s       Matches whitespace
\S       Matches any non-whitespace character
*        Repeats a character zero or more times
*?       Repeats a character zero or more times 
         (non-greedy)
+        Repeats a character one or more times
+?       Repeats a character one or more times 
         (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end

Using the above cheat sheet as a guide, you can pretty much come up with any syntax. Let’s take a closer look in some more complicate search patterns.

Stepping it Up

Imagine that you are building some sort of validation on an input field where the user can input any number followed by the letter d, m or y.

Your regex algorithm would look something like this:

^[0-9]+[dmy]$

Decomposing the above: ^ signifies the beginning of the match followed by a 0–9 number. However the + sign means it needs to be at least one 0–9 number though there can be more. Then the string needs to be followed by d, m or y, which have to be at the end because of $.

Testing the above in python:

import re
str = '1d'
str2 = '200y'
str3 = 'y200'
lst = re.findall('^[0-9]+[dmy]$', str)
lst2 = re.findall('^[0-9]+[dmy]$', str2)
lst3 = re.findall('^[0-9]+[dmy]$', str3)
print(lst)
print(lst2)
print(lst3)

Returning:

['1d']
['200y']
[]

Photo by Arget on Unsplash

Escaping Special Characters

When it comes to regular expressions, certain characters are special. For instance, dot, star and dollar sign are all used for matching purposes. So what happens if you want to match those characters?

In that case, we can use the back slash.

import re
str = 'Sentences have dots. How do we escape them?'
lst = re.findall('.', str)
lst1 = re.findall('\.', str)
print(lst)
print(lst1)

The above example is using dot, and backslash dot. As you would expect, it returns two results. The first one matches all characters, while the second one, only the dot.

['S', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', ' ', 'h', 'a', 'v', 'e', ' ', 'd', 'o', 't', 's', '.', ' ', 'H', 'o', 'w', ' ', 'd', 'o', ' ', 'w', 'e', ' ', 'e', 's', 'c', 'a', 'p', 'e', ' ', 't', 'h', 'e', 'm', '?']
['.']

Matching exact number of characters

Imagine that you want to match a date. You know that what the format will be, DD/MM/YYYY. Sometimes there will be 2Ds or 2Ms, sometimes just one, but always 4Ys.

import re
str = 'The date is 22/10/2018'
str1 = 'The date is 3/1/2019'
lst = re.findall('[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}', str)
lst = re.findall('[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}', str1)
print(lst)
print(lst1)

Which gives the following results:

['22/10/2018']
['3/1/2019']

Extracting the matched pattern

There are certain times, that knowing the fact that you’re matching a pattern is not enough. You want to have the ability to extract information from the match.

For instance, imagine that you are scanning a large data set looking for email addresses. If you use what we learnt about, you could search for a pattern of:

Could start with a letter, number, dot or underscore
Then followed by at least another letter, or number
Which could be followed by a dot or an underscore
Then there’s a @
Then follow the same logic again as before the @
Finally look for a dot followed by at least a letter

^[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\@[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\.[a-zA-z]+

From the above match, you only want to extract the domain name ie everything after the @. All you have to do is add brackets around what you’re after:

import re
str = 'email123_test@gmail.com'
lst = re.findall('^[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\@([a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\.[a-zA-z]+)', str)
print(lst)

Returning:

['gmail.com']

In Summary

In summary, you can use regex to match strings of data and it can be used in a number of different ways. Python includes a regex package called re, which will allow you to use this. Should you find yourself on a Unix machine however, you can use regular expression along with grep, awk or sed. On Windows should you want to access all these commands, you can use tools like Cygwin.

Thanks for reading ❤

If you liked this post, share it with all of your programming buddies!

#python #regex