In this tutorial, you will learn about regular expressions (RegEx or regex for short) and use Python's re module to work with them. RegEx is incredibly useful, so it pays to get your head around it early. Regular expressions are one of the most common tools for data cleaning and wrangling in Python. Whether you are extracting specific parts of text from web pages, making sense of Twitter data, or preparing your data for text mining, regular expressions are a strong choice for these tasks.

What is a regular expression in Python?

You may be familiar with searching for text by pressing Ctrl + F and entering the text you are looking for. Regular expressions go one step further: they allow you to specify a pattern of text to search for. Essentially, a RegEx is a sequence of characters that defines a search pattern. Knowing regular expressions can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps.

For example, you may need to find a phone number in some text without knowing the number in advance. If you live in the USA or Canada, you know it will be three digits, followed by a hyphen, then another three digits followed by a hyphen, and then four more digits. Humans are good at recognising patterns, so you will know that 415-555-3456 is a phone number, but 6789,78564,67708879 is not.

Regular expressions are supported by most programming languages, including Python, Perl, R, Java and many others. In this post, you’ll explore regular expressions in Python only.

How do you use regular expressions in Python?

If you don't know how to use regexes and you want to find a phone number in a string, you will have to write a relatively complex function, and that code will be longer and harder to maintain than the equivalent regular expression. I hope that by now I have managed to convince you to learn regex and save yourself a ton of time.

Regular expressions are descriptions of a pattern of text. For instance, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. Python uses the regex \d\d\d-\d\d\d-\d\d\d\d to match a string of three digits, a hyphen, three more digits, another hyphen, and four digits. Anything else would not match the \d\d\d-\d\d\d-\d\d\d\d regex.
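As a minimal sketch of the phone number pattern in action, the snippet below uses re.search() on two sample strings. The variable names and the sample sentence are made up for illustration.

```python
import re

# Hypothetical sample strings, made up for this illustration.
text = "Call me at 415-555-3456 tomorrow."
not_a_phone = "6789,78564,67708879"

# Three digits, a hyphen, three digits, a hyphen, four digits.
phone_pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

match = re.search(phone_pattern, text)
found = match.group() if match else None
print(found)  # 415-555-3456

# No part of this string fits the pattern, so search() returns None.
no_match = re.search(phone_pattern, not_a_phone)
print(no_match)  # None
```

Checking the result against None before calling group() avoids an AttributeError when nothing matches.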

{} - Braces

The regular expression for the same pattern can also be written as \d{3}-\d{3}-\d{4}. Adding a 3 in curly brackets {3} after a pattern is like saying, "Match this pattern three times." So \d\d\d-\d\d\d-\d\d\d\d and \d{3}-\d{3}-\d{4} will find the same pattern, the phone number format.

Consider the form {n,m}. It means at least n and at most m repetitions of the pattern to its left. The RegEx [0-9]{2,4} matches at least two digits but not more than four digits. Do not put a space after the comma: Python does not read {2, 4} as a repetition count.
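The following sketch compares the long and short phone patterns and tries a {2,4} range on a made-up string of numbers. The sample inputs are assumptions chosen for illustration.

```python
import re

sample = "415-555-3456"

# Both spellings describe the same pattern.
long_form = re.search(r"\d\d\d-\d\d\d-\d\d\d\d", sample)
short_form = re.search(r"\d{3}-\d{3}-\d{4}", sample)
same_result = long_form.group() == short_form.group()
print(same_result)  # True

# {2,4}: runs of two to four digits. A lone digit is skipped, and a
# five-digit run yields its first four digits.
digit_runs = re.findall(r"[0-9]{2,4}", "7 42 123 98765")
print(digit_runs)  # ['42', '123', '9876']

# With a space inside the braces, Python treats the braces as literal
# text rather than a repetition count, so nothing matches here.
spaced = re.findall(r"[0-9]{2, 4}", "42 123")
print(spaced)  # []
```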

Character Classes

In the phone number regex example, you learned that \d could stand for any numeric digit. There are many such shorthand character classes, as shown below.

\d - Matches any decimal digit. Equivalent to any single numeral 0 to 9.

\D - Matches any character that is not a numeric digit from 0 to 9.

\s - Matches any whitespace character, such as a space, tab, or newline. (Think of this as matching "space" characters.)

\S - Matches any character that is not a whitespace character (not a space, tab, or newline).

\w - Matches any alphanumeric character (letters and digits) or the underscore character. For ASCII text, this is equivalent to [a-zA-Z0-9_].

\W - Matches any non-alphanumeric character, that is, any character that is not a letter, a digit, or the underscore character.

\Z - Matches only at the end of a string. The expression Python\Z will match the text "I love Python" but will not match "I like Python Programming".
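To see the shorthand classes side by side, here is a small sketch that runs each one over a made-up sample string with re.findall(). The sample text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample containing letters, digits, an underscore,
# spaces, a tab, and punctuation.
sample = "Room_42 opens at 9 am\t(sharp)"

digits = re.findall(r"\d", sample)
print(digits)  # ['4', '2', '9']

non_digits = re.findall(r"\D+", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']

spaces = re.findall(r"\s", sample)
print(spaces)  # [' ', ' ', ' ', ' ', '\t']

non_spaces = re.findall(r"\S+", sample)
print(non_spaces)  # ['Room_42', 'opens', 'at', '9', 'am', '(sharp)']

words = re.findall(r"\w+", sample)
print(words)  # ['Room_42', 'opens', 'at', '9', 'am', 'sharp']

non_words = re.findall(r"\W+", sample)
print(non_words)  # [' ', ' ', ' ', ' ', '\t(', ')']

# \Z anchors the match to the very end of the string.
ends_with = bool(re.search(r"Python\Z", "I love Python"))
not_at_end = bool(re.search(r"Python\Z", "I like Python Programming"))
print(ends_with, not_at_end)  # True False
```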

Square brackets - make your own character classes

From time to time, you will want to match a set of characters, but you will find that the shorthand character classes (\d, \w, \s, and so on) are too broad. In such a case, you can define your own character class using square brackets. As an illustration, the character class [aeiou] will match any lowercase vowel.

[] - Square brackets specify a set of characters you wish to match.
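A short sketch of custom character classes follows. The [aeiou] class comes from the text above. The hexadecimal class [0-9a-f] is an extra example added here for illustration, built from the same range syntax as [0-9].

```python
import re

# [aeiou] matches one lowercase vowel at a time; the capital E is skipped.
vowels = re.findall(r"[aeiou]", "Regular Expressions")
print(vowels)  # ['e', 'u', 'a', 'e', 'i', 'o']

# Ranges can be combined inside one class. This assumed example matches
# runs of lowercase hexadecimal digits.
hex_runs = re.findall(r"[0-9a-f]+", "ff 1a zz 09")
print(hex_runs)  # ['ff', '1a', '09']
```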

MetaCharacters

To define regular expressions, metacharacters are used. For example, \ and ? are metacharacters. Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters: [] . ^ $ * + ? {} () \ |

. Period/Dot - A period matches any single character (except newline '\n').

^ Caret - The caret symbol ^ is used to check if a string starts with a certain character or pattern.

$ Dollar Symbol - The dollar symbol $ is used to check if a string ends with a certain character or pattern.

* Star - The star symbol * matches zero or more occurrences of the pattern left to it.

+ Plus - The plus symbol + matches one or more occurrences of the pattern left to it.

? Question mark - The question mark symbol ? matches zero or one occurrence of the pattern left to it.

| Vertical bar - Vertical bar | is used for alternation (or operator).

() Parentheses - Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.

\ Backslash - The backslash \ is used to escape various characters, including all metacharacters. For example, \$a matches if a string contains $ followed by a. Here, $ is not specially interpreted by the RegEx engine. If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is not treated specially. Do not do this with letters or digits, because a backslash followed by a letter or digit can form a special sequence such as \d or \w.
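The metacharacters listed above can be tried out one at a time. The sketch below does so, and all of the sample strings are made up for illustration.

```python
import re

# . matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt")
print(dot)  # ['cat', 'cot']

# ^ anchors at the start of the string, $ at the end.
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))
print(starts, ends)  # True True

# * is zero or more, + is one or more, ? is zero or one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"colou?r", "color colour")
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']

# | is alternation.
either = re.findall(r"cat|dog", "cat dog bird")
print(either)  # ['cat', 'dog']

# () groups sub-patterns. finditer() is used here so that the whole
# match is collected rather than only the captured group.
grouped = [m.group() for m in re.finditer(r"(a|b|c)xz", "axz bxz dxz")]
print(grouped)  # ['axz', 'bxz']

# \ removes the special meaning of $.
escaped = re.search(r"\$a", "price: $a").group()
print(escaped)  # $a
```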

How to escape MetaCharacters in Regex using Python

If you need to define a simple pattern, as we did with the phone number example \d\d\d-\d\d\d-\d\d\d\d, then you don't need to worry about escaping as long as you pass a raw string (a string prefixed with r) to the re.compile() function. Remember that the underscore character is treated as a word character by \w, alongside letters and digits.

However, if you need to define a slightly more complex pattern that includes one or more metacharacters, then you need to know how to escape such characters in Python. This can be done by using the backslash \. In an ordinary Python string, the value \n represents a single newline character, not a backslash followed by a lowercase n. You need to enter the escape sequence \\ to get a single backslash. So \\n is the string that represents a backslash followed by a lowercase n.

Alternatively, you can use r to mark your string as a raw string, which does not process escape sequences, by putting it before the first quote of the string value. Since regexes frequently use backslashes, it is convenient to pass raw strings to the re.compile() function instead of typing extra backslashes. Entering r'\d\d\d-\d\d\d-\d\d\d\d' is easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.
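The difference between ordinary and raw strings can be checked directly. This is a small sketch that compares string lengths and shows that the doubled-backslash spelling and the raw-string spelling produce the same pattern. The variable names are illustrative.

```python
import re

# An ordinary "\n" is one character (a newline).
newline_len = len("\n")
# "\\n" and r"\n" are both two characters: a backslash and the letter n.
escaped_len = len("\\n")
raw_len = len(r"\n")
print(newline_len, escaped_len, raw_len)  # 1 2 2

# The same regex written with doubled backslashes and as a raw string.
escaped_pattern = "\\d{3}-\\d{3}-\\d{4}"
raw_pattern = r"\d{3}-\d{3}-\d{4}"
identical = escaped_pattern == raw_pattern
print(identical)  # True

found = re.search(raw_pattern, "415-555-3456").group()
print(found)  # 415-555-3456
```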

Python RegEx - module re

Python has a module named re to work with regular expressions. You can find all the regex functions in Python in the re module. To use it, we need to import the module:

import re

Passing a string value representing your regex to re.compile() returns a Regex object.
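As a rough sketch, compiling the phone number pattern once and reusing the resulting object might look like this. The name phone_regex and the sample sentence are assumptions for illustration.

```python
import re

# Compile the pattern once; the object can then be reused many times.
phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")

contact = "Office: 415-555-3456, home: 415-555-9999"

# search() on the compiled object returns the first match.
first = phone_regex.search(contact)
first_number = first.group() if first else None
print(first_number)  # 415-555-3456

# findall() on the same object returns every match as a list of strings.
all_numbers = phone_regex.findall(contact)
print(all_numbers)  # ['415-555-3456', '415-555-9999']
```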

The most common uses of regular expressions are:

  • Searching a string (search and match)
  • Finding all matches in a string (findall)
  • Breaking a string into substrings (split)
  • Replacing part of a string (sub)

Let’s look at the functions that the re module provides to perform these tasks.

  1. re.compile() - re.compile(<regex>, flags=0) compiles a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below. The expression’s behaviour can be modified by specifying a flags value. Flags such as re.IGNORECASE and re.MULTILINE can be combined using bitwise OR (the | operator).
  2. re.match() - re.match(<regex>, <string>, flags=0) returns a corresponding match object if zero or more characters at the beginning of the string match the regular expression pattern. It returns None if the beginning of the string does not match the pattern.
  3. re.search() - re.search(<regex>, <string>, flags=0) scans the provided string looking for the first location where the regex pattern matches. If a match is found, re.search() returns a match object. Otherwise, it returns None.
  4. re.findall() - re.findall(<regex>, <string>, flags=0) returns a list of strings containing all matches.
  5. re.split() - re.split(<regex>, <string>, maxsplit=0, flags=0) splits the string at each occurrence of the pattern and returns the pieces as a list of strings. If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
  6. re.sub() - re.sub(<regex>, <replace>, <string>, count=0, flags=0) returns a string in which matched occurrences are replaced with the content of the replace argument. If count is nonzero, at most count occurrences are replaced.
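The six functions listed above can be sketched together on one made-up sentence. The sample text and variable names are assumptions chosen for illustration, and re.IGNORECASE stands in for the flags argument.

```python
import re

# Hypothetical sample text for illustration.
text = "Order 66 shipped on 2024-01-15; order 67 ships on 2024-02-01."
date_pattern = r"\d{4}-\d{2}-\d{2}"

# compile() with a flag: match "order" regardless of case.
order_regex = re.compile(r"order", re.IGNORECASE)
orders = order_regex.findall(text)
print(orders)  # ['Order', 'order']

# match() only looks at the beginning of the string.
at_start = re.match(r"Order", text)
not_at_start = re.match(r"shipped", text)
print(at_start.group(), not_at_start)  # Order None

# search() scans the whole string for the first match.
position = re.search(r"shipped", text).start()
print(position)  # 9

# findall() returns every match as a list of strings.
dates = re.findall(date_pattern, text)
print(dates)  # ['2024-01-15', '2024-02-01']

# split() cuts the string at each match; maxsplit limits the cuts.
parts = re.split(r";\s*", text)
limited = re.split(r"\s+", "a b  c d", maxsplit=2)
with_groups = re.split(r"(,)", "a,b,c")
print(parts)        # ['Order 66 shipped on 2024-01-15', 'order 67 ships on 2024-02-01.']
print(limited)      # ['a', 'b', 'c d']
print(with_groups)  # ['a', ',', 'b', ',', 'c']

# sub() replaces matches; count limits how many are replaced.
masked = re.sub(date_pattern, "<date>", text)
first_two = re.sub(r"\d+", "#", "1 22 333", count=2)
print(masked)     # Order 66 shipped on <date>; order 67 ships on <date>.
print(first_two)  # # # 333
```

Passing maxsplit and count as keyword arguments keeps the calls readable and avoids confusing them with flags.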

In this article, I did not cover all the functions, constants, and the exception that the re module provides, but I will provide detailed walkthrough tutorials later in the Regex series of tutorials. If you want to learn more about the re module, check out its documentation.