We live in a data-centric age. Data has been described as the new oil. But just like oil, data isn’t always useful in its raw form. One form of data that is particularly hard to use in its raw form is unstructured data.

A lot of data is unstructured: it doesn’t fit neatly into a format built for analysis, like an Excel spreadsheet or a data frame. Text is a common type of unstructured data, and its lack of structure makes it difficult to work with. Enter regular expressions, or regex for short. They may look a little intimidating at first, but once you get started, using them will be a picnic!

More comfortable with Python? Try my tutorial on using regex with Python instead:

A Gentle Introduction to Regular Expressions with Python (towardsdatascience.com)

The stringr Library

We’ll use the stringr library. stringr is built on top of the stringi package, which in turn relies on the ICU C library, so its functions are fast.

To install and load the stringr library in R, use the following commands:

## Install stringr
install.packages("stringr")

## Load stringr
library(stringr)

See how easy that is? To make things even easier, most function names in the stringr package start with str_. Let’s take a look at a couple of the functions this package makes available to us:

  1. str_extract_all(string, pattern): This function returns a list with a vector containing all instances of pattern in string
  2. str_replace_all(string, pattern, replacement): This function returns string with instances of pattern in string replaced with replacement
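The _all suffix on both functions is significant: stringr also provides str_extract() and str_replace(), which stop at the first match. Here is a minimal sketch of the difference, using a made-up string (twoAlex is not part of the picnic example) in which a name appears twice:

```r
library(stringr)

# Hypothetical string in which "Alex" appears twice
twoAlex <- "Alex has 4 hamburgers, and Alex has 2 sodas"

# Without the _all suffix, only the first match is used
firstMatch <- str_extract(twoAlex, "Alex")            # a character vector
firstOnly  <- str_replace(twoAlex, "Alex", "Shawn")

# With the _all suffix, every match is used
allMatches <- str_extract_all(twoAlex, "Alex")        # a list of vectors
everyMatch <- str_replace_all(twoAlex, "Alex", "Shawn")
```

In firstOnly the second Alex is left untouched, while everyMatch has both names replaced.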

You may have already used these functions. They have pretty straightforward applications even without regex. Think back to the times before social distancing and imagine a nice picnic in the park. Here’s an example string with what everyone is bringing to the picnic. We can use it to demonstrate the basic usage of these functions:

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

If I want to pull every instance of one person’s name from this string, I would simply pass basicString and the name to str_extract_all():

basicExtractAll <- str_extract_all(basicString, "Drew")
print(basicExtractAll)

The result is a list holding every occurrence of the pattern. In this example, basicExtractAll is a list containing a single one-element vector:

[[1]]
[1] "Drew"
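The list wrapper starts to matter once more than one string is passed in, because str_extract_all() returns one character vector per input string. A small sketch, assuming a made-up input vector (picnicNotes is hypothetical and not part of the example above):

```r
library(stringr)

# Hypothetical input: two separate strings instead of one
picnicNotes <- c("Drew has 3 watermelons, and Drew has 2 pies",
                 "Alex has 4 hamburgers")

# One list element per input string
matches <- str_extract_all(picnicNotes, "Drew")

# Use [[ ]] to pull out the matches for a single string
firstMatches <- matches[[1]]

# lengths() counts the matches found in each string
matchCounts <- lengths(matches)
```

A string with no matches yields an empty character vector rather than an error, so matchCounts is 0 for the second string.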

Now let’s imagine that Alex left his 4 hamburgers unattended at the picnic and they were stolen by Shawn. str_replace_all can replace any instances of Alex with Shawn:

basicReplaceAll <- str_replace_all(basicString, "Alex", "Shawn")
print(basicReplaceAll)

The resulting string shows that Shawn now has 4 hamburgers. What a lucky guy 🍔.

[1] "Drew has 3 watermelons, Shawn has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

The examples so far are pretty basic. There is a time and place for them, but what if we want to know how many total food items there are at the picnic? Who are all the people with items? What if we need this data in a data frame for further analysis? This is where you will start to see the benefits of regex.
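As a rough preview, here is one way those three questions could be answered with the same str_extract_all() function once real patterns are involved. The two patterns and the variable names below are my own assumptions for this sketch, and they only work because every name in the string is a single capitalized word:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# "[0-9]+" matches one or more consecutive digits
itemCounts <- as.numeric(str_extract_all(basicString, "[0-9]+")[[1]])

# "[A-Z][a-z]+" matches a capital letter followed by lowercase letters
people <- str_extract_all(basicString, "[A-Z][a-z]+")[[1]]

# How many total food items are at the picnic?
totalItems <- sum(itemCounts)

# One row per person, ready for further analysis
picnic <- data.frame(person = people, items = itemCounts)
```

The name pattern would also pick up any other capitalized word, so a messier string would need a more careful pattern.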


A Gentle Introduction to Regular Expressions with R