When processing raw data from any source, extracting the right information is essential for drawing meaningful insights from it. Extracting a specific pattern can be difficult, especially when the data is textual.

Textual data consists of paragraphs of information collected through survey forms, web scraping, and other sources. Chaining string accessors with pandas functions or other custom functions can get the work done, but what if a more specific pattern needs to be extracted? Regular expressions handle this job with ease.
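As a rough illustration of the difference, the sketch below pulls an ID out of a sentence in two ways: by chaining plain Python string methods (standing in here for pandas string accessors) and with a regular expression. The survey response and the "ORD-" ID format are invented for this example.

```python
import re

# Hypothetical survey response; the text and the "ORD-" format are made up.
response = "Delivery was late. My order number is ORD-48213, please check."

# Chained string methods: this works only because we know exactly
# which words come before the ID and which character comes after it.
chained = response.split("order number is ")[1].split(",")[0].strip()

# Regular expression: describes the shape of the ID itself
# ("ORD-" followed by one or more digits), wherever it appears.
match = re.search(r"ORD-\d+", response)
by_regex = match.group() if match else None

print(chained)   # ORD-48213
print(by_regex)  # ORD-48213
```

Both approaches give the same result here, but the chained version breaks as soon as the surrounding wording changes, while the pattern keeps working.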

What is a Regular Expression (RegEx)?

A regular expression is a sequence of characters that describes a set of strings. It gives a generalized formula for a particular pattern in strings, which helps in separating the right information from a pool of data. The expression usually consists of symbols or characters that together form the rule; at first glance it may seem odd and difficult to grasp. Each of these symbols has an associated meaning, described below.
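To make the idea of a "generalized formula" concrete, here is a minimal sketch using Python's built-in re module. It assumes dates are written as dd/mm/yyyy; the sample text is invented, and the symbols used in the pattern are among the meta-characters listed in the next section.

```python
import re

# Invented sample text for illustration.
text = "Forms received on 03/11/2021 and 17/12/2021; one entry said 'last week'."

# One pattern stands for every string shaped like dd/mm/yyyy:
# two digits, a slash, two digits, a slash, four digits.
date_pattern = r"\d{2}/\d{2}/\d{4}"

dates = re.findall(date_pattern, text)
print(dates)  # ['03/11/2021', '17/12/2021']
```

Note that the pattern checks only the shape of the string, not whether it is a valid calendar date.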

Meta-characters in RegEx

  1. . : a wildcard; matches any single character (by default, any character except a newline), exactly once
  2. ^: denotes start of the string
  3. $: denotes the end of the string
  4. [ ]: matches any one character from the set within [ ]
  5. [a-z]: matches one of the range of characters a,b,…,z
  6. [^abc] : matches a character that is not a,b or c.
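  7. a|b: matches either a or b, where a and b are patterns
  8. () : groups part of a pattern, providing scope for operators such as | and the quantifiers
  9. \ : escapes a special character so that it is matched literally (e.g. \. matches a dot) and introduces special sequences (\t, \n, \b)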
  10. \b: matches word boundary
  11. \d : any digit, equivalent to [0-9]
  12. \D: any non-digit, equivalent to [^0-9]
  13. \s : any whitespace, equivalent to [ \t\n\r\f\v]
  14. \S : any non-whitespace, equivalent to [^ \t\n\r\f\v]
  15. \w : any word character (alphanumeric or underscore), equivalent to [a-zA-Z0-9_]
  16. \W : any non-word character, equivalent to [^a-zA-Z0-9_]
  17. * : matches zero or more occurrences of the preceding element
  18. + : matches one or more occurrences of the preceding element
  19. ? : matches zero or one occurrence of the preceding element
  20. {n}: exactly n repetitions, n>=0
  21. {n,}: at least n repetitions
  22. {,n}: at most n repetitions
  23. {m,n}: at least m repetitions and at most n repetitions
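
The short sketch below tries several of the meta-characters listed above with Python's built-in re module. The sample strings are invented for illustration, and the comments show what each call returns for them.

```python
import re

sample = "cat cot c9t cut"

# . is a wildcard; [ ] limits the choice to a set; [^ ] excludes a set
any_middle = re.findall(r"c.t", sample)            # ['cat', 'cot', 'c9t', 'cut']
vowel_middle = re.findall(r"c[aou]t", sample)      # ['cat', 'cot', 'cut']
non_digit_middle = re.findall(r"c[^0-9]t", sample) # ['cat', 'cot', 'cut']

# ^ and $ anchor the pattern to the start and end of the string
only_digits = bool(re.search(r"^\d+$", "2021"))    # True
mixed = bool(re.search(r"^\d+$", "year 2021"))     # False

# | chooses between alternatives; ( ) limits its scope to "a" or "e"
greys = [m.group() for m in re.finditer(r"gr(a|e)y", "gray grey groy")]  # ['gray', 'grey']

# \b keeps "cat" from matching inside "concatenate"
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")  # ['cat', 'cat']

# \. matches a literal dot instead of acting as a wildcard
dots = re.findall(r"\.", "a.b.c")                  # ['.', '.']

# \s matches whitespace, \w matches word characters
tokens = re.split(r"\s+", "a  b\tc")               # ['a', 'b', 'c']
words = re.findall(r"\w+", "snake_case, x2!")      # ['snake_case', 'x2']

# Quantifiers apply to the element just before them
star = re.findall(r"ab*", "a ab abbb")             # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")             # ['ab', 'abbb']
optional = re.findall(r"ab?", "a ab abbb")         # ['a', 'ab', 'ab']
two_to_three = re.findall(r"\d{2,3}", "1 22 333 4444")  # ['22', '333', '444']

for name, value in [("any_middle", any_middle), ("greys", greys),
                    ("whole_words", whole_words), ("two_to_three", two_to_three)]:
    print(name, value)
```

The patterns are written as raw strings (the r prefix) so that Python passes backslashes such as \d and \b to the regex engine unchanged.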


Regular Expressions in Python [With Examples]: How to Implement? | upGrad blog