An Introduction to Regex in Python

What is Regex?

Regex stands for Regular Expression and is essentially a compact way to define a pattern of characters. Regex is mostly used for pattern identification, text mining and input validation.

Regex puts a lot of people off because it looks like gibberish at first glance; the people who know how to use it, on the other hand, can’t seem to stop! It’s a very powerful tool that is worth learning if you don’t already know it.

Introduction to Regex

The first thing you need to know about regex is that you can match a specific character or word.
Let’s assume that we want to know whether a specific string contains the letter ‘a’ or the word ‘lot’. In that case, we can use the following Python code:

import re
str = "Learning regex can be a lot of fun"
lst = re.findall('a', str)
lst2 = re.findall('lot', str)
print(lst)
print(lst2)

which returns a list with three matches for ‘a’ and a list with one match for ‘lot’:

['a', 'a', 'a']
['lot']
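
If the goal is only to check whether a pattern occurs, rather than to collect every match, re.search may be a better fit: it returns a match object for the first hit, or None when there is none. A minimal sketch reusing the sentence above (the variable names are illustrative, not from the original example):

```python
import re

sentence = "Learning regex can be a lot of fun"

# re.search stops at the first match and returns a match object, or None
has_lot = re.search('lot', sentence) is not None
has_zebra = re.search('zebra', sentence) is not None

print(has_lot)    # True
print(has_zebra)  # False
```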

Keeping our setup the same, imagine that you want to find any of the letters a, b or c, wherever they appear. You can use a character set, written with square brackets, either listing the characters or giving a range:

lst = re.findall('[abc]', str)
lst2 = re.findall('[a-c]', str)
print(lst)
print(lst2)

returning:

['a', 'c', 'a', 'b', 'a']
['a', 'c', 'a', 'b', 'a']
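
Character sets can also be negated by placing ^ directly after the opening bracket. The sketch below is an illustration on the same sentence (variable names are assumptions for the example):

```python
import re

sentence = "Learning regex can be a lot of fun"

# [aeiou] matches any single lowercase vowel
vowels = re.findall('[aeiou]', sentence)

# A leading ^ inside the brackets negates the set:
# here, anything that is neither a lowercase letter nor a space
not_lower = re.findall('[^a-z ]', sentence)

print(vowels)
print(not_lower)  # ['L']
```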

The Regex Cheat Sheet

Every time I am about to write a complicated regular expression, my first port of call is the following list by Dr Chuck Severance:

Python Regular Expression Quick Guide

^        Matches the beginning of a line
$        Matches the end of the line
.        Matches any character
\s       Matches whitespace
\S       Matches any non-whitespace character
*        Repeats a character zero or more times
*?       Repeats a character zero or more times 
         (non-greedy)
+        Repeats a character one or more times
+?       Repeats a character one or more times 
         (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end

Using the above cheat sheet as a guide, you can build pretty much any pattern you need. Let’s take a closer look at some more complicated search patterns.
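
The difference between the greedy and non-greedy entries in the list is easiest to see side by side. The following is a small sketch on a made-up string; the sample text and variable names are assumptions for illustration:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'
greedy = re.findall('<.+>', text)

# Non-greedy: .+? grabs as little as possible, so each tag matches on its own
lazy = re.findall('<.+?>', text)

# ^ and \S combined: the first run of non-whitespace characters in the string
first_word = re.findall(r'^\S+', "hello  world")

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']
print(first_word)  # ['hello']
```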

Stepping it Up

Imagine that you are building some sort of validation on an input field where the user can input any number followed by the letter d, m or y.

Your regex pattern would look something like this:

^[0-9]+[dmy]$

Decomposing the above: ^ anchors the match to the beginning of the string, and [0-9] matches a single digit. The + sign means there must be at least one digit, though there can be more. The digits must then be followed by d, m or y, which has to come at the end of the string because of $.

Testing the above in Python:

import re
str = '1d'
str2 = '200y'
str3 = 'y200'
lst = re.findall('^[0-9]+[dmy]$', str)
lst2 = re.findall('^[0-9]+[dmy]$', str2)
lst3 = re.findall('^[0-9]+[dmy]$', str3)
print(lst)
print(lst2)
print(lst3)

Returning:

['1d']
['200y']
[]
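
For validation specifically, re.fullmatch can stand in for the ^ and $ anchors, since it only succeeds when the whole string matches. It is also slightly stricter: $ accepts a trailing newline at the end of the string, while fullmatch does not. A minimal sketch, in which parse_period and PERIOD are hypothetical names introduced here:

```python
import re

# Hypothetical helper, not part of the original example
PERIOD = re.compile(r'([0-9]+)([dmy])')

def parse_period(value):
    """Return (amount, unit) if value looks like '30d', otherwise None."""
    match = PERIOD.fullmatch(value)
    if match is None:
        return None
    # group(1) holds the digits, group(2) the unit letter
    return int(match.group(1)), match.group(2)

print(parse_period('1d'))    # (1, 'd')
print(parse_period('200y'))  # (200, 'y')
print(parse_period('y200'))  # None
print(parse_period('1d\n'))  # None, although '^[0-9]+[dmy]$' would accept it
```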

Escaping Special Characters

When it comes to regular expressions, certain characters are special. For instance, dot, star and dollar sign are all used for matching purposes. So what happens if you want to match those characters?

In that case, we can escape the character with a backslash.

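import re
str = 'Sentences have dots. How do we escape them?'
# An unescaped dot matches any character (except a newline)
lst = re.findall('.', str)
# A raw string keeps the backslash intact, so \. matches a literal dot
lst1 = re.findall(r'\.', str)
print(lst)
print(lst1)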

The above example searches first for a dot and then for a backslash followed by a dot. As you would expect, it returns two results. The first matches every character, while the second matches only the literal dot.

['S', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', ' ', 'h', 'a', 'v', 'e', ' ', 'd', 'o', 't', 's', '.', ' ', 'H', 'o', 'w', ' ', 'd', 'o', ' ', 'w', 'e', ' ', 'e', 's', 'c', 'a', 'p', 'e', ' ', 't', 'h', 'e', 'm', '?']
['.']
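
When the text to search for comes from a variable, escaping every special character by hand is error-prone. The re module provides re.escape for this. A short sketch, where the price string and sample text are made up for illustration:

```python
import re

price = '$4.99'  # hypothetical literal text containing special characters
text = 'Was $4.99, now $4x99? No, still $4.99.'

# Unescaped, '$' is an end-of-string anchor, so nothing can match
print(re.findall(price, text))  # []

# re.escape adds the backslashes needed to treat the text literally
matches = re.findall(re.escape(price), text)
print(matches)  # ['$4.99', '$4.99']
```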

Matching exact number of characters

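Imagine that you want to match a date. You know that the format will be DD/MM/YYYY: the day and the month may have one or two digits, but the year always has four. Curly braces state how many repetitions are allowed: {1,2} means one or two, and {4} means exactly four.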

import re
str = 'The date is 22/10/2018'
str1 = 'The date is 3/1/2019'
# {1,2} allows one or two digits; {4} requires exactly four
lst = re.findall(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}', str)
lst1 = re.findall(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}', str1)
print(lst)
print(lst1)

Which gives the following results:

['22/10/2018']
['3/1/2019']
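
One caveat: without anchors, a counted pattern can also match inside a longer run of digits. Adding the word boundary \b (not in the cheat sheet above, but part of Python's re syntax) is one way to guard against that. A sketch on made-up text:

```python
import re

text = 'Valid: 22/10/2018 and 3/1/2019. Not a date: 123/45/67890'

# Without boundaries, part of the last number sequence is picked up too
loose = re.findall(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}', text)

# \b requires the match to start and end at a word boundary
strict = re.findall(r'\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\b', text)

print(loose)   # ['22/10/2018', '3/1/2019', '23/45/6789']
print(strict)  # ['22/10/2018', '3/1/2019']
```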

Extracting the matched pattern

Sometimes knowing that a pattern matches is not enough. You also want to extract information from the match.

For instance, imagine that you are scanning a large data set looking for email addresses. Using what we have learnt so far, you could search for a pattern like this:

  • Could start with a letter, number, dot or underscore
  • Then followed by at least another letter, or number
  • Which could be followed by a dot or an underscore
  • Then there’s a @
  • Then follow the same logic again as before the @
  • Finally look for a dot followed by at least a letter

^[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\@[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\.[a-zA-Z]+

From the above match, you only want to extract the domain name, i.e. everything after the @. All you have to do is add parentheses around the part you’re after:

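import re
# Placeholder address used for illustration
str = 'someone@gmail.com'
# The parentheses mark the part of the match that findall returns
lst = re.findall(r'^[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\@([a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\.[a-zA-Z]+)', str)
print(lst)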

Returning:

['gmail.com']
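
A pattern can contain more than one pair of parentheses. In that case re.findall returns one tuple per match, and a match object from re.search exposes each group by number. The sketch below uses a deliberately simplified pattern and a made-up address, both assumptions for illustration rather than the full pattern above:

```python
import re

address = 'jane.doe@example.com'  # hypothetical address

# Simplified pattern: group 1 is the part before the @, group 2 the domain
pattern = r'^([a-zA-Z0-9._]+)@([a-zA-Z0-9._]+\.[a-zA-Z]+)$'

pairs = re.findall(pattern, address)
print(pairs)  # [('jane.doe', 'example.com')]

match = re.search(pattern, address)
if match:
    print(match.group(1))  # jane.doe
    print(match.group(2))  # example.com
```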

In Summary

In summary, you can use regex to match, validate and extract patterns in strings of data. Python’s standard library includes a regex module called re, which provides everything shown here. Should you find yourself on a Unix machine, however, you can use regular expressions with grep, awk or sed. On Windows, tools like Cygwin give you access to these commands.

Thanks for reading ❤