Regex (Regular Expressions) Demystified

To fully utilize the power of shell scripting (and programming), one needs to master Regular Expressions. Certain commands and utilities commonly used in scripts, such as grep, expr, sed and awk, use REs.


In this article we are going to talk about Regular Expressions: what they are, which characters are special, and how a regex engine works internally.

What is Regex?

Regular Expressions are sets of characters and/or metacharacters that match (or specify) patterns. The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters — a string or a part of a string.

Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

Regex Pattern:

Generally, you define a regex pattern by enclosing it (without any additional quotes) between two forward slashes, for example _/\w/_ and _/[aeiou]/_.

Case Sensitivity:

Note that regex engines are case sensitive by default, unless you tell the regex engine to ignore the differences in case.
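As a small illustration, here is how the default and the "ignore case" behaviour differ. The article does not name a language, so Python's built-in re module is assumed here; other engines expose the same switch under a different name (often an `i` flag).

```python
import re

text = "Regex and REGEX and regex"

# Default: matching is case sensitive, so only the exact lowercase form matches.
sensitive = re.findall(r"regex", text)

# With the IGNORECASE flag, the engine ignores differences in case.
insensitive = re.findall(r"regex", text, flags=re.IGNORECASE)

print(sensitive)    # ['regex']
print(insensitive)  # ['Regex', 'REGEX', 'regex']
```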

Regex uses:

When you scan a string (which may span multiple lines) with a regex pattern, you can get the following information:

Whether there is any match or not
Matched substrings within given string
Position of these substrings within the given string
Group backreferences for every matched substring
When the pattern is anchored with \A and \Z, it must match the whole of the given string as a unit rather than a substring
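The list above can be sketched in a few lines. This is only an illustration using Python's re module (an assumption, since the article is not tied to one language); the sample text and the pattern are made up for the example.

```python
import re

text = "order 66 shipped, order 77 pending"
pattern = re.compile(r"order (\d+)")

# Whether there is any match or not
has_match = pattern.search(text) is not None

# Matched substring, its start position, and the first captured group
found = [(m.group(0), m.start(), m.group(1)) for m in pattern.finditer(text)]

# Anchored with \A and \Z, the pattern must cover the whole string
whole = re.search(r"\Aorder \d+\Z", "order 66") is not None
partial = re.search(r"\Aorder \d+\Z", text) is not None

print(has_match)  # True
print(found)      # [('order 66', 0, '66'), ('order 77', 18, '77')]
print(whole)      # True
print(partial)    # False
```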

Regex Metacharacters

Inside a pattern, all characters except (, ), [, ], {, }, |, \, ?, *, +, ., ^, and $ match themselves. If you want to match one of the special characters literally in a pattern, precede it with a backslash.

Note: Even _/_ cannot be used as-is inside a pattern that is delimited by forward slashes; you can escape it by preceding it with a backslash.

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like {1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. An exception to this rule is the java.util.regex package which requires all literal braces to be escaped.
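To see why escaping matters, consider the dot. The sketch below assumes Python's re module; `re.escape` is a helper from that module that backslash-escapes special characters in a string for you, and the sample text is invented for the example.

```python
import re

text = "Total: 3.50 (approx) or 3x50"

# Unescaped, the dot matches any character, so "3x50" matches too.
unescaped = re.findall(r"3.50", text)

# Preceded by a backslash, the dot only matches a literal dot.
escaped = re.findall(r"3\.50", text)

# re.escape() adds the backslashes needed to match a string literally.
literal_pattern = re.escape("(approx)")
found = re.findall(literal_pattern, text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
print(found)      # ['(approx)']
```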

Escaping a Metacharacter:

The _\_ (backslash) escapes special characters and also gives special meaning to some normal characters. For example, _\1_ is a backreference to the text matched by the first capturing group, _\d_ means a digit character, and _\D_ means a non-digit character. The backslash is also used to specify non-printable characters such as _\n_ (LF), _\r_ (CR), and _\t_ (tab).

Note: You can also escape a backslash with another backslash.

Escaping a single meta-character with a backslash works in all regular expression flavors.

All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. For example, \d will match a single digit from 0 to 9.
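A short sketch of the backslash in both of its roles, again assuming Python's re module (the sample strings are made up):

```python
import re

# \d matches a digit, \D matches any non-digit character.
digits = re.findall(r"\d", "a1b22")
non_digits = re.findall(r"\D", "a1b22")

# \1 refers back to whatever the first capturing group matched;
# here it finds a word that is immediately repeated.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")

# A backslash escaped with a backslash matches one literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\file")

print(digits)             # ['1', '2', '2']
print(non_digits)         # ['a', 'b']
print(repeated.group(0))  # is is
print(len(backslashes))   # 2
```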

As a programmer, you may be surprised that characters like the single quote and double quote are not special characters.

Special characters and programming languages:

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string.
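Python is a convenient place to see this effect (Python is an assumption here; the same issue exists in Java, C, and others). In a normal Python string literal, `\b` is turned into a backspace character before the regex library ever sees it, while a raw string (`r"..."`) passes the backslashes through untouched.

```python
import re

# Normal string: "\b" becomes a backspace character (0x08),
# so the regex engine never receives the word-boundary token.
normal = "\bcat\b"

# Raw string: the backslashes reach the regex engine as written.
raw = r"\bcat\b"

# Without raw strings, every backslash has to be doubled in source code.
doubled = "\\bcat\\b"

normal_hit = re.search(normal, "a cat sat") is not None
raw_hit = re.search(raw, "a cat sat") is not None
doubled_hit = re.search(doubled, "a cat sat") is not None

print(len(normal), len(raw))  # 5 7
print(normal_hit)             # False
print(raw_hit)                # True
print(doubled_hit)            # True
```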

Non-Printable Characters:

You can use special character sequences to put non-printable characters in your regular expression.

Use **\t** to match a tab character (ASCII 0x09), **\r** for carriage return (0x0D) and **\n** for line feed (0x0A).
More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B).
Remember that Windows text files use **\r\n** to terminate lines, while UNIX (Linux and Mac OS X) text files use **\n** (LF), and older versions of Mac OS use **\r** (CR).
You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9.
If your regular expression engine supports Unicode, use \uFFFF rather than \xFF to insert a Unicode character. The euro currency sign occupies code point 0x20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC.
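The escapes above can be tried out as follows. This sketch assumes Python's re module, which accepts `\t`, `\r`, `\n`, `\xhh`, and `\uhhhh` inside string patterns; other engines may differ in which of these they support. The sample strings are invented.

```python
import re

# A Windows-style line: a tab between two fields, terminated by CR LF.
line = "name\tvalue\r\n"

has_tab = re.search(r"\t", line) is not None
ends_with_crlf = re.search(r"\r\n\Z", line) is not None

# \xA9 is the copyright sign, \u20AC is the euro sign.
copyright_match = re.search(r"\xA9", "\u00a9 2020 Example")
euro_match = re.search(r"\u20AC", "price: 20\u20ac")

print(has_tab)                  # True
print(ends_with_crlf)           # True
print(copyright_match.start())  # 0
print(euro_match.start())       # 9
```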

Basic vs. Extended Regular Expressions:

Refer: http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

Portable scripts should avoid { in **grep -E** patterns and should use [{] to match a literal {, because some implementations support \{ as the meta-character instead.

How Does a Regex Engine Work Internally?

Knowing how the regex engine works will enable you to craft better regexes more easily.

The regex-directed engines are more powerful:

There are two kinds of regular expression engines:

text-directed engines, and
regex-directed (important) engines.

Certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can also test it by applying the regex _/regex|regex not/_ to the string _regex not_. If the resulting match is only _regex_, the engine is regex-directed. If the result is _regex not_, then it is text-directed. The reason is that the regex-directed engine is eager: it reports a match as soon as the first alternative succeeds, without checking whether a later alternative would match more text.
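The test described above can be run directly. The sketch below uses Python's re module as the engine under test (an assumption made for illustration); since it supports lazy quantifiers and backreferences, it should behave as a regex-directed engine.

```python
import re

# The alternation test: a regex-directed engine stops at the first
# alternative that succeeds, so only "regex" is reported.
alternation = re.search(r"regex|regex not", "regex not").group(0)

# Lazy quantifier: +? matches as little as possible.
lazy = re.search(r"<.+?>", "<a><b>").group(0)

# Backreference: \1 must repeat what group 1 matched.
backref = re.search(r"(ab)\1", "xxababxx").group(0)

print(alternation)  # regex
print(lazy)         # <a>
print(backref)      # abab
```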

The Regex-Directed Engine Always Returns the Leftmost Match:

This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
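A minimal sketch of the leftmost-match rule, assuming Python's re module and a made-up sentence. The engine fails at the "c" of "captured" (the third character is "p", not "t"), moves on, and reports the "cat" inside "catfish" even though a standalone word "cat" appears later.

```python
import re

text = "He captured a catfish for his cat."

# The plain pattern returns the leftmost match, inside "catfish".
leftmost = re.search(r"cat", text)

# Only a more specific pattern (word boundaries) reaches the later word.
whole_word = re.search(r"\bcat\b", text)

print(leftmost.start())    # 14
print(text[14:21])         # catfish
print(whole_word.start())  # 30
```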

