Capturing dates of a particular format from raw data

I'm trying to capture dates of the form -

20 Apr 2009

20 April 2009

20 Apr. 2009

20 April, 2009

...from raw text in a pandas DataFrame. I want to get rid of the rest of the text and keep only the dates.

I've been partially successful in my attempt:

df['some_column'] = df['some_column'].str.replace(r'(.*?)(\d{1,2}[ ](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?,?[ ]\d{4})(.*?)\n\1', lambda x: x.groups()[1])

But in some cases I'm getting the preceding/succeeding text as well. Any input would be appreciated.
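One possible approach, sketched under the assumption that each cell holds at most one such date: search for the date and keep only the match, instead of replacing the text around it. A replace only rewrites the portion the pattern matches, and `.` does not cross newlines by default, so any text outside the match is left in place. The sample strings and the helper name `extract_date` below are made up for illustration.

```python
import re

# Day, month name (abbreviated or full), optional "." and ",", then a 4-digit year.
DATE_RE = re.compile(
    r"\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?,? \d{4}"
)

def extract_date(text):
    """Return the first date found in text, or None if there is none."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None

# Hypothetical rows, including one with a newline and one without a date.
samples = [
    "Seen on 20 Apr 2009 at the clinic",
    "Follow-up: 20 April, 2009 (routine)",
    "Note 20 Apr. 2009\nsecond line of text",
    "no date here",
]
extracted = [extract_date(s) for s in samples]
print(extracted)
```

On a DataFrame, the helper could be applied with `df['some_column'].map(extract_date)`; wrapping the same pattern in one capture group and passing it to `df['some_column'].str.extract(...)` should give a similar result.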

An Introduction to Regex in Python

What is Regex?

Regex stands for Regular Expression and is essentially an easy way to define a pattern of characters. Regex is mostly used for pattern identification, text mining or input validation.

Regex puts a lot of people off, because it looks like gibberish at first glance; as for the people who know how to use it, they can’t seem to stop! It’s a very powerful tool that is worth learning if you don’t already know it.

Introduction to Regex

The first thing you need to know about regex is that you can match a specific character or word.

Let’s assume that we want to know whether a string contains the letter ‘a’ or the word ‘lot’. In that case, we can use the following Python code:

import re
str = "Learning regex can be a lot of fun"
lst = re.findall('a', str)
lst2 = re.findall('lot', str)
print(lst)
print(lst2)

which returns a list with three matches and a list with one:

['a', 'a', 'a']
['lot']

Keeping the same setup, imagine that you want to find any of the letters a, b or c. You can use a character set, written with square brackets; a hyphen inside the brackets gives a range:

lst = re.findall('[abc]', str)
lst2 = re.findall('[a-c]', str)
print(lst)
print(lst2)

returning:

['a', 'c', 'a', 'b', 'a']
['a', 'c', 'a', 'b', 'a']

The Regex Cheat Sheet

Every time I am about to write a complicated regular expression, my first port of call is the following list, by Dr Chuck Severance:

Python Regular Expression Quick Guide

^        Matches the beginning of a line
$        Matches the end of the line
.        Matches any character
\s       Matches whitespace
\S       Matches any non-whitespace character
*        Repeats a character zero or more times
*?       Repeats a character zero or more times 
         (non-greedy)
+        Repeats a character one or more times
+?       Repeats a character one or more times 
         (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end

Using the above cheat sheet as a guide, you can build pretty much any pattern. Let’s take a closer look at some more complicated search patterns.
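As a small illustration of the greedy and non-greedy entries in the cheat sheet, here is a sketch with a made-up sample string. The only difference between the two patterns is the `?` after the `+`.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", text)

print(greedy)
print(lazy)
```

The greedy pattern returns the whole string as one match, while the non-greedy one returns the four tags separately.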

Stepping it Up

Imagine that you are building some sort of validation on an input field where the user can input any number followed by the letter d, m or y.

Your regex pattern would look something like this:

^[0-9]+[dmy]$

Decomposing the above: ^ anchors the match at the beginning of the string, and [0-9] matches a single digit. The + sign means there must be at least one digit, though there can be more. The digits must then be followed by d, m or y, which has to be at the end of the string because of $.

Testing the above in Python:

import re
str = '1d'
str2 = '200y'
str3 = 'y200'
lst = re.findall('^[0-9]+[dmy]$', str)
lst2 = re.findall('^[0-9]+[dmy]$', str2)
lst3 = re.findall('^[0-9]+[dmy]$', str3)
print(lst)
print(lst2)
print(lst3)

Returning:

['1d']
['200y']
[]
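For an actual validation routine, one option is `re.fullmatch`, which requires the whole string to match and so makes the ^ and $ anchors unnecessary. It is also slightly stricter: `$` matches just before a trailing newline, so '1d\n' would pass the check above. The sketch below assumes you also want the number and the unit separately; the function name `parse_duration` is hypothetical.

```python
import re

# Group 1 captures the digits, group 2 the unit letter.
DURATION_RE = re.compile(r"([0-9]+)([dmy])")

def parse_duration(value):
    """Return (amount, unit) for input such as '200y', or None if invalid."""
    match = DURATION_RE.fullmatch(value)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

print(parse_duration("1d"))
print(parse_duration("200y"))
print(parse_duration("y200"))
```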

Escaping Special Characters

When it comes to regular expressions, certain characters are special. For instance, the dot, star and dollar sign all have a meaning of their own in a pattern. So what happens if you want to match those characters literally?

In that case, we can escape them with a backslash.

import re
str = 'Sentences have dots. How do we escape them?'
lst = re.findall('.', str)     # unescaped dot: matches every character
lst1 = re.findall(r'\.', str)  # escaped dot (raw string): matches only a literal '.'
print(lst)
print(lst1)

The example above uses a plain dot and an escaped dot (backslash dot), so it prints two results. The first pattern matches every character, while the second matches only the literal dot.

['S', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', ' ', 'h', 'a', 'v', 'e', ' ', 'd', 'o', 't', 's', '.', ' ', 'H', 'o', 'w', ' ', 'd', 'o', ' ', 'w', 'e', ' ', 'e', 's', 'c', 'a', 'p', 'e', ' ', 't', 'h', 'e', 'm', '?']
['.']
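When the text to match comes from a variable, escaping each special character by hand is error-prone. Python's `re.escape` does it for you. A minimal sketch, with a made-up price string:

```python
import re

price = "$4.99"

# re.escape backslash-escapes "$" and "." so they are matched literally.
pattern = re.escape(price)
print(pattern)

text = "Was $4.99, now $4x99? No, still $4.99."
matches = re.findall(pattern, text)
print(matches)
```

Without the escaping, the `$` would be read as an end-of-line anchor and the `.` as "any character".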

Matching exact number of characters

Imagine that you want to match a date. You know that the format will be DD/MM/YYYY. Sometimes there will be two digits for the day or the month, sometimes just one, but there will always be four digits for the year.

import re
str = 'The date is 22/10/2018'
str1 = 'The date is 3/1/2019'
# {1,2} allows one or two digits; {4} requires exactly four
lst = re.findall(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}', str)
lst1 = re.findall(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}', str1)
print(lst)
print(lst1)

Which gives the following results:

['22/10/2018']
['3/1/2019']

Extracting the matched pattern

There are times when knowing that a pattern matches is not enough. You also want to extract information from the match.

For instance, imagine that you are scanning a large data set looking for email addresses. Using what we have learnt so far, you could search for the following pattern:

  • It could start with letters, numbers, dots or underscores
  • Then followed by at least one letter or number
  • Which could be followed by dots or underscores
  • Then there’s an @
  • Then follow the same logic again as before the @
  • Finally look for a dot followed by at least one letter

^[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\@[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\.[a-zA-Z]+

From the above match, you only want to extract the domain name, i.e. everything after the @. All you have to do is add parentheses around the part you’re after:

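import re
# Placeholder address for illustration
str = 'someone@gmail.com'
# The parentheses mark the part of the match that findall returns
lst = re.findall(r'^[a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\@([a-zA-Z0-9\.\_]*[a-zA-Z0-9]+[\.\_]*\.[a-zA-Z]+)', str)
print(lst)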

Returning:

['gmail.com']
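A pattern can contain more than one group. As a sketch, here is the date pattern from the previous section with a group around each part; the group names `day`, `month` and `year` are my own choice, using Python's `(?P<name>...)` syntax.

```python
import re

DATE_RE = re.compile(r"(?P<day>[0-9]{1,2})/(?P<month>[0-9]{1,2})/(?P<year>[0-9]{4})")

# search returns a match object; groupdict maps group names to captured text.
match = DATE_RE.search("The date is 3/1/2019")
parts = match.groupdict()
print(parts)

# With several groups, findall returns one tuple per match.
all_dates = DATE_RE.findall("Dates: 22/10/2018 and 3/1/2019")
print(all_dates)
```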

In Summary

In summary, you can use regex to match strings of data in a number of different ways. Python includes a regex module called re, which provides everything used above. Should you find yourself on a Unix machine, you can use regular expressions with grep, awk or sed. On Windows, tools like Cygwin give you access to all of these commands.

How to create a regex that accepts specific characters?

I have this regex:

^[[email protected]#$%&'*+-/=?^`{|}~!(),:;<>[-\]]{8,}$

I need a regex that accepts input with a minimum length of 8 characters, made up of letters (uppercase and lowercase), numbers and these characters:

!#$%&'*+-/=?^_`{|}~"(),:;<>@[]

It works when I tested it here.

This is how I used it in Java Android.

public static final String regex = "^[[email protected]#$%&'*+-/=?^`{|}~!(),:;<>[-\\]]{8,}$";

This is the error that I received.

java.util.regex.PatternSyntaxException: Missing closing bracket in character class near index 49
    ^[a-zA-Z0[email protected]#$%&'*+-/=?^`{|}~!(),:;<>[-\]]{8,}$
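A likely cause, stated as an assumption since the pattern above is partly garbled: in Java's java.util.regex, an unescaped [ inside a character class opens a nested class (Java supports unions such as [a-d[m-p]]), so a literal [ probably has to be written as \\[ in the Java string. The sketch below shows the same validation in Python rather than Java, to match the other examples on this page. It lets `re.escape` handle the characters that are special inside a class; the names `SPECIALS`, `INPUT_RE` and `is_valid` are hypothetical.

```python
import re
import string

# The special characters listed in the question.
SPECIALS = "!#$%&'*+-/=?^_`{|}~\"(),:;<>@[]"

# re.escape backslash-escapes characters such as "-", "^", "[" and "]",
# so they are treated literally inside the character class.
ALLOWED_CLASS = "[" + re.escape(string.ascii_letters + string.digits + SPECIALS) + "]"
INPUT_RE = re.compile(ALLOWED_CLASS + "{8,}")

def is_valid(candidate):
    """True if candidate has at least 8 characters, all from the allowed set."""
    return INPUT_RE.fullmatch(candidate) is not None

print(is_valid("a[b]c-d^e"))
print(is_valid("short!1"))
print(is_valid("with space1"))
```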


An Introduction to Regex for Web Developers

An Introduction to Regex (Regular Expressions) for Web Developers

This was originally posted as a twitter thread: https://twitter.com/chrisachard/status/1181583499112976384

1. Regular expressions find parts of a string that match a pattern

In JavaScript they're created in between forward slashes //, or with new RegExp()

and then used in methods like match, test, or replace

You can define the regex beforehand, or directly when calling the method
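For readers following along with the Python examples earlier on this page, here is a rough equivalent of point 1 using Python's `re` module. The mapping to the JavaScript methods is approximate: `re.search` plays the role of test and match, and `re.sub` the role of replace. The sample sentence is made up.

```python
import re

text = "The cat sat on the mat"

# Define the regex beforehand...
cat_re = re.compile(r"cat")
has_cat = cat_re.search(text) is not None

# ...or pass the pattern directly when calling the function.
first_at_word = re.search(r"[a-z]at", text).group(0)
replaced = re.sub(r"cat", "dog", text)

print(has_cat)
print(first_at_word)
print(replaced)
```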

2. Match individual characters one at a time,

or put multiple characters in square brackets [] to match any one of them

Match a range of characters with a hyphen -

3. Add optional flags to the end of a regex to modify how the matcher works.

In JavaScript, these flags are:

i = case insensitive
m = multi line matching
g = global match (find all, instead of find one)
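Python has counterparts for the first two flags, `re.IGNORECASE` and `re.MULTILINE`. It has no g flag; instead you pick the function: `re.search` finds one match and `re.findall` finds all. A short sketch with a made-up three-line string:

```python
import re

text = "Apple pie\napple tart\nAPPLE crumble"

# No flags: case-sensitive, and ^ only matches at the start of the string.
plain = re.findall(r"^apple", text)

# Case-insensitive, but ^ still only matches at the very start.
ignore_case = re.findall(r"^apple", text, re.IGNORECASE)

# Multi-line mode lets ^ match at the start of every line.
every_line = re.findall(r"^apple", text, re.IGNORECASE | re.MULTILINE)

print(plain)
print(ignore_case)
print(every_line)
```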

4. Using a caret ^ at the start means "start of string"

Using a dollar sign $ at the end means "end of string"

Start putting groups of matches together to match longer strings

5. Use wildcards and special escaped characters to match larger classes of characters

. = any character except line break

\d = digit
\D = NOT a digit

\s = white space
\S = any NON white space

\n new line
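These escapes behave the same way in Python's `re` module, which makes them easy to try out. A minimal sketch on a made-up string:

```python
import re

text = "Order 66\nshipped on day 7"

digits = re.findall(r"\d", text)            # each digit separately
non_digit_runs = re.findall(r"\D+", text)   # runs of anything that is not a digit
words = re.split(r"\s+", text)              # split on any run of white space
dot_chars = re.findall(r".", "a\nb")        # "." skips the line break

print(digits)
print(non_digit_runs)
print(words)
print(dot_chars)
```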

6. Match only certain counts of matched characters or groups with quantifiers
* = zero or more
+ = one or more
? = 0 or 1
{3} = exactly 3 times
{2,4} = two, three, or four times
{2,} = two or more times
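To see the quantifiers side by side, the sketch below (in Python, with made-up sample words) checks which words each pattern matches in full. Each pattern applies a different quantifier to the letter a.

```python
import re

samples = ["ct", "cat", "caat", "caaat", "caaaaat"]
patterns = ["ca*t", "ca+t", "ca?t", "ca{3}t", "ca{2,4}t", "ca{2,}t"]

# For each pattern, keep the samples it matches in full.
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
```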

7. Use parens () to capture in a group

match will return the full match plus the groups, unless you use the g flag

Use the pipe operator | inside of parens () to specify what that group matches

| = or
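Groups and the pipe work the same way in Python, although the return values differ from JavaScript's match: `re.search` gives a match object holding the full match and each group, while `re.findall` returns only the group contents when the pattern has a group. A sketch with a made-up sentence:

```python
import re

text = "I adopted a dog and two cats"

# The group matches either "cat" or "dog".
match = re.search(r"two (cat|dog)s", text)
full_match = match.group(0)   # the whole match
animal = match.group(1)       # just the first group

# With one group in the pattern, findall returns the group contents.
animals = re.findall(r"(cat|dog)", text)

print(full_match)
print(animal)
print(animals)
```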

8. To match special characters, escape them with a backslash \

Special characters in JS regex are: ^ $ \ . * + ? ( ) [ ] { } |

So to match an asterisk, you'd use:

\*

Instead of just *

9. To match anything BUT a certain character, use a caret ^ inside of square brackets

This means ^ has two meanings, which can be confusing.

It means both "start of string" when it is at the front of a regex, and "not this character" when used inside of square brackets.
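Both meanings can appear in the same pattern. In this Python sketch (sample strings made up), the first caret anchors the match to the start of the string and the second one negates the set:

```python
import re

# Inside square brackets, ^ means "anything except these characters".
non_digits = re.findall(r"[^0-9]", "a1b2c3")

# The first ^ is the start-of-string anchor, the second negates the set.
leading_non_digits = re.findall(r"^[^0-9]+", "abc123def")

print(non_digits)
print(leading_non_digits)
```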

10. Regexes can be used to find and match all sorts of things, from urls to filenames

HOWEVER! be careful if you try to use regexes for really complex tasks, such as parsing emails (which get really confusing, really fast), or HTML (which is not a regular language, and so can't be fully parsed by a regular expression)

There is (of course) much more to regex like lazy vs greedy, lookahead, and capturing

but most of what web developers want to do with regular expressions can use just these base building blocks.