Regex (Regular Expressions) Demystified

To fully utilize the power of shell scripting (and programming), one needs to master Regular Expressions. Certain commands and utilities commonly used in scripts, such as grep, expr, sed, and awk, use REs.

This article introduces Regular Expressions: what they are, how to write and escape patterns, and how regex engines process them.

What is Regex?

Regular Expressions are sets of characters and/or metacharacters that match (or specify) patterns. The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters — a string or a part of a string.

Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

Regex Pattern:

In many tools and languages, such as sed, awk, Perl, Ruby, and JavaScript, you write a regex pattern by enclosing it (without any additional quotes) between two forward slashes. For example, /\w/ and /[aeiou]/.

Case Sensitivity:

Note that regex engines are case sensitive by default, unless you tell the regex engine to ignore the differences in case.
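As a small illustration of the default case sensitivity, here is a sketch using Python's built-in re module, chosen only as one example flavor. The sample sentence is made up for the demonstration.

```python
import re

text = "Grep, GREP and grep are the same tool."

# Default matching is case sensitive: only the all-lowercase word matches.
case_sensitive = re.findall(r"grep", text)

# re.IGNORECASE tells the engine to ignore differences in case.
case_insensitive = re.findall(r"grep", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['grep']
print(case_insensitive)  # ['Grep', 'GREP', 'grep']
```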

Regex uses:

When you scan a string (which may be multi-line) with a regex pattern, you can get the following information:

  • Whether there is any match or not
  • Matched substrings within the given string
  • Position of these substrings within the given string
  • Group back references for every substring
  • When the pattern is anchored with \A and \Z, whether the whole of the given string matches as a unit, rather than just a substring of it
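The kinds of information listed above can be sketched with Python's re module. The key=value sample text and the variable names are assumptions made for this illustration only.

```python
import re

text = "alice=30\nbob=25"
pattern = re.compile(r"(\w+)=(\d+)")

# Whether there is any match at all.
has_match = pattern.search(text) is not None

# Matched substrings, their positions, and the captured groups.
results = [(m.group(0), m.span(), m.groups()) for m in pattern.finditer(text)]

# Anchoring with \A and \Z makes the pattern describe the whole string.
whole_string = re.search(r"\A\w+=\d+\Z", "alice=30") is not None
multi_line = re.search(r"\A\w+=\d+\Z", text) is not None

print(has_match)      # True
print(results)        # [('alice=30', (0, 8), ('alice', '30')), ('bob=25', (9, 15), ('bob', '25'))]
print(whole_string)   # True
print(multi_line)     # False
```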

Regex Metacharacters

Inside a pattern, all characters except ( ) [ ] { } | \ ? * + . ^ and $ match themselves. If you want to match one of these special characters literally in a pattern, precede it with a backslash.

Note: When the pattern is delimited by forward slashes, a literal / cannot appear unescaped inside it either; escape it by preceding it with a backslash (\/).

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like {1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. An exception to this rule is the java.util.regex package which requires all literal braces to be escaped.
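To see why escaping matters, consider the following sketch in Python's re module. The price string is invented for the example, and re.escape is a Python convenience rather than something every flavor offers.

```python
import re

price_text = "Total: $5.00 (approx.)"

# Unescaped, "$" anchors the end of the string and "." matches any character,
# so this pattern cannot find the literal text "$5.00".
unescaped = re.search(r"$5.00", price_text)

# Escaping each metacharacter with a backslash makes it match literally.
escaped = re.search(r"\$5\.00", price_text)

# re.escape() adds the backslashes for you.
auto_escaped = re.search(re.escape("$5.00"), price_text)

print(unescaped)             # None
print(escaped.group())       # $5.00
print(auto_escaped.group())  # $5.00
```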

Escaping a Metacharacter:

The \ (backslash) is used to escape special characters and also to give special meaning to some normal characters. For example, \1 is a backreference to the first capturing group, \d means a digit character, and \D means a non-digit character. It is also used to specify non-printable characters such as \n (LF), \r (CR), and \t (tab).
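The special meanings described above can be tried out with a short Python sketch. The sample strings are assumptions for illustration.

```python
import re

# \d matches a digit, \D matches any character that is not a digit.
digits = re.findall(r"\d", "a1b22")
non_digits = re.findall(r"\D", "a1b22")

# \1 refers back to whatever the first capturing group matched,
# so (\w+) \1 finds a word that is immediately repeated.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")

print(digits)             # ['1', '2', '2']
print(non_digits)         # ['a', 'b']
print(repeated.group())   # is is
print(repeated.group(1))  # is
```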

Note: You can also escape backslash with backslash.

Escaping a single meta-character with a backslash works in all regular expression flavors.

All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. For example, \d will match a single digit from 0 to 9.

As a programmer, you may be surprised that characters like the single quote and double quote are not special characters.

Special characters and programming languages:

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string.
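Python is one language where this double processing is easy to observe. The sketch below assumes Python's string-literal rules, in which "\b" inside a normal string means a backspace character; other languages have their own escape rules.

```python
import re

# In a normal string literal Python itself processes "\b" first, turning it
# into a backspace character (0x08) before the regex library sees it.
processed_by_python = "\bcat\b"

# Doubling the backslashes, or using a raw string, passes \b through intact
# so the regex engine can read it as a word boundary.
doubled = "\\bcat\\b"
raw = r"\bcat\b"

no_match = re.search(processed_by_python, "a cat sat")
word_match = re.search(raw, "a cat sat")

print(doubled == raw)      # True
print(no_match)            # None
print(word_match.group())  # cat
```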

Non-Printable Characters:

You can use special character sequences to put non-printable characters in your regular expression.

  • Use **\t** to match a tab character (ASCII 0x09), **\r** for carriage return (0x0D) and **\n** for line feed (0x0A).
  • More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B).
  • Remember that Windows text files use **\r\n** to terminate lines, while UNIX (Linux and Mac OS X) text files use **\n** (LF), and older versions of Mac OS used \r (CR).
  • You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9.
  • If your regular expression engine supports Unicode, use \uFFFF rather than \xFF to insert a Unicode character. The euro currency sign occupies code point 0x20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC.
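The escape sequences from the list above can be exercised in Python, whose re module accepts \t, \r, \n, \xhh and \uhhhh in text patterns. The sample strings are made up for this sketch.

```python
import re

# Split on any of the three common line endings: \r\n (Windows),
# \n (UNIX) and \r (older Mac OS).
lines = re.split(r"\r\n|\n|\r", "one\r\ntwo\nthree\rfour")

# \t matches a tab character.
fields = re.split(r"\t", "name\tage")

# \xA9 is the copyright sign and \u20AC is the euro sign.
has_copyright = re.search(r"\xA9", "\u00a9 2020 Example") is not None
euro_amount = re.search(r"\u20AC\d+", "Price: \u20ac42").group()

print(lines)          # ['one', 'two', 'three', 'four']
print(fields)         # ['name', 'age']
print(has_copyright)  # True
```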

Basic vs. Extended Regular Expressions:

Refer: http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

Portable scripts should avoid { in **grep -E** patterns and should use [{] to match a literal {, because some implementations support \{ as the meta-character instead.


How Does a Regex Engine Work Internally?

Knowing how the regex engine works will enable you to craft better regexes more easily.

The regex-directed engines are more powerful:

There are two kinds of regular expression engines:

  • text-directed engines, and
  • regex-directed (important) engines.

Certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL, and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex /regex|regex not/ to the string "regex not". If the resulting match is only "regex", the engine is regex-directed. If the result is "regex not", then it is text-directed. The reason behind this is that the regex-directed engine is eager: it stops at the first alternative that produces a match.
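The test described above can be run against Python's built-in re module, used here as one example of a flavor that offers backreferences and lazy quantifiers.

```python
import re

# Apply the two-alternative pattern to the test string.
match = re.search(r"regex|regex not", "regex not")
matched_text = match.group()

# An eager, regex-directed engine stops at the first alternative that works.
print(matched_text)  # regex
```

The result is "regex" rather than "regex not", which by the test above marks the engine as regex-directed.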

The Regex-Directed Engine Always Returns the Leftmost Match:

This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
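A short Python sketch makes the leftmost-match rule concrete. The sample sentence is an assumption chosen so that a near miss ("cap" in "captured") appears before the first real match.

```python
import re

text = "He captured a catfish for his cat."

# The engine fails at "cap" (index 3), keeps moving right, and reports the
# first position where the whole pattern succeeds: the "cat" in "catfish".
first = re.search(r"cat", text)

# Listing "fish" first does not help it win: at index 14 "fish" fails but
# "cat" succeeds, and a match further left beats one further right.
alternation = re.search(r"fish|cat", text)

print(first.span())         # (14, 17)
print(alternation.group())  # cat
```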

#regex #patterns #programming #linux #regular-expressions

Mad Libs: Using regular expressions

From Tiny Python Projects by Ken Youens-Clark

Everyone loves Mad Libs! And everyone loves Python. This article shows you how to have fun with both and learn some programming skills along the way.


When I was a wee lad, we used to play at Mad Libs for hours and hours. This was before computers, mind you, before televisions or radio or even paper! No, scratch that, we had paper. Anyway, the point is we only had Mad Libs to play, and we loved it! And now you must play!

We’ll write a program called mad.py which reads a file given as a positional argument and finds all the placeholders noted in angle brackets, like <verb> or <adjective>. For each placeholder, we’ll prompt the user for the part of speech being requested, like “Give me a verb” and “Give me an adjective.” (Notice that you’ll need to use the correct article.) Each value from the user replaces the placeholder in the text, so if the user says “drive” for “verb,” then <verb> in the text is replaced with drive. When all the placeholders have been replaced with inputs from the user, print out the new text.
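The placeholder handling described above could be sketched roughly as follows. This is not the book's solution: the names PLACEHOLDER, prompt_for and fill_placeholders are hypothetical, file reading is left out, and the article is chosen with a simple first-letter vowel check.

```python
import re

# A placeholder is any run of characters other than angle brackets
# enclosed in < and >.
PLACEHOLDER = re.compile(r"<([^<>]+)>")


def prompt_for(part_of_speech):
    """Build the prompt, choosing "a" or "an" from the first letter."""
    article = "an" if part_of_speech[0].lower() in "aeiou" else "a"
    return f"Give me {article} {part_of_speech}: "


def fill_placeholders(text, ask=input):
    """Replace each <placeholder> with the answer returned by ask(prompt)."""
    return PLACEHOLDER.sub(lambda m: ask(prompt_for(m.group(1))), text)
```

Calling fill_placeholders("I <verb> a <adjective> car.") would prompt once per placeholder, in order. Passing a different ask function makes the sketch testable without keyboard input.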

#python #regular-expressions #python-programming #python3 #mad libs: using regular expressions #using regular expressions

Madyson Reilly

1601055000

Regular Expressions: What and Why?

Regular expressions are a powerful search-and-replace technique that you have probably used without even knowing it. Be it your text editor’s “Find and Replace” feature, validation of your HTTP request body using a third-party npm module, or your terminal’s ability to return a list of files based on some pattern, all of them use regular expressions in one way or another. It is not a concept that programmers must learn, but knowing it lets you reduce the complexity of your code in some cases.

In this tutorial we will be learning the key concepts as well as some use cases of regular expressions in JavaScript.

How do you write a Regular Expression?

There are two ways of writing regular expressions in JavaScript. One is by creating a **literal** and the other is by using the **RegExp** constructor.

// Literal
const literalRegex = /cat/ig;

// RegExp constructor (distinct name, since a const cannot be declared twice)
const constructedRegex = new RegExp('cat', 'ig');

While both forms return the same output when tested on a particular string, the benefit of using the RegExp constructor is that it is evaluated at runtime, which allows JavaScript variables to be used to build dynamic regular expressions. Moreover, in the benchmark test the author refers to, the RegExp constructor performs better than the literal regular expression in pattern matching.

The syntax in either type of expression consists of two parts:

  • pattern : The pattern that has to be matched in a string.
  • flags : these are modifiers which are rules that describe how pattern matching will be performed.

#regular-expressions #javascript #programming #js #regex #express

A Gentle Introduction to Regular Expressions with R

We live in a data-centric age. Data has been described as the new oil. But just like oil, data isn’t always useful in its raw form. One form of data that is particularly hard to use in its raw form is unstructured data.

A lot of data is unstructured data. Unstructured data doesn’t fit nicely into a format for analysis, like an Excel spreadsheet or a data frame. Text data is a common type of unstructured data and this makes it difficult to work with. Enter regular expressions, or regex for short. They may look a little intimidating at first, but once you get started, using them will be a picnic!

More comfortable with python? Try my tutorial for using regex with python instead:

A Gentle Introduction to Regular Expressions with Python

Regular expressions are the data scientist’s most formidable weapon against unstructured text

The stringr Library

We’ll use the stringr library. The stringr library is built off a C library, so all of its functions are very fast.

To install and load the stringr library in R, use the following commands:

## Install stringr
install.packages("stringr")

## Load stringr
library(stringr)

See how easy that is? To make things even easier, most function names in the stringr package start with str_. Let’s take a look at a couple of the functions we have available to us in this package:

  1. str_extract_all(string, pattern): This function returns a list with a vector containing all instances of pattern in string
  2. str_replace_all(string, pattern, replacement): This function returns string with instances of pattern in string replaced with replacement

You may have already used these functions. They have pretty straightforward applications without adding regex. Think back to the times before social distancing and imagine a nice picnic in the park, like the image above. Here’s an example string with what everyone is bringing to the picnic. We can use it to demonstrate the basic usage of the regex functions:

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

If I want to pull every instance of one person’s name from this string, I would simply pass basicString and the name to str_extract_all():

basicExtractAll <- str_extract_all(basicString, "Drew")
print(basicExtractAll)

The result will be a list with all occurrences of the pattern. Using this example, basicExtractAll will have the following list with 1 vector as output:

[[1]]
[1] "Drew"

Now let’s imagine that Alex left his 4 hamburgers unattended at the picnic and they were stolen by Shawn. str_replace_all can replace any instances of Alex with Shawn:

basicReplaceAll <- str_replace_all(basicString, "Alex", "Shawn")
print(basicReplaceAll)

The resulting string will show that Shawn now has 4 hamburgers. What a lucky guy 🍔.

"Drew has 3 watermelons, Shawn has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

The examples so far are pretty basic. There is a time and place for them, but what if we want to know how many total food items there are at the picnic? Who are all the people with items? What if we need this data in a data frame for further analysis? This is where you will start to see the benefits of regex.
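As a rough preview of where regex helps with those questions, here is a sketch written in Python rather than R. The patterns and variable names are assumptions for this illustration, not the tutorial's own solution.

```python
import re

basic_string = ("Drew has 3 watermelons, Alex has 4 hamburgers, "
                "Karina has 12 tamales, and Anna has 6 soft pretzels")

# \d+ matches each run of digits; summing them gives the total item count.
counts = [int(n) for n in re.findall(r"\d+", basic_string)]
total_items = sum(counts)

# A capitalized word followed by " has" is treated as a person's name.
people = re.findall(r"([A-Z][a-z]+) has", basic_string)

print(total_items)  # 25
print(people)       # ['Drew', 'Alex', 'Karina', 'Anna']
```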

#regex #regular-expressions #r #text-processing #unstructured-data #express

Regular Expression Complete Guide

Regular expressions, or regex, put a lot of people off at first glance just because of how they look. But once you master them, they open up a whole new level of string manipulation, and the best part is that they can be used with almost all programming languages as well as with Linux commands. A regex can find almost any kind of pattern that you can think of within a text, and once you find the text you can do pretty much whatever you want with it. This gives you an idea of how powerful and useful regex is.

What is Regex?

If you are reading this post then you most probably already know what a regex is. If you don’t, here is a quick and easy definition:

Regex stands for Regular Expression and is essentially an easy way to define a pattern of characters. The most common uses of regex are pattern identification, text mining, and input validation.

#regular-expressions #python #regex #python-regex #pattern-finding