Python RegEx Tutorial

In this article, we will be looking at the REGEX (Regular Expression) module in Python.

1. What is REGEX?

REGEX stands for regular expression. A regular expression is a string of characters that specifies a pattern, and we use regex for locating, matching, and managing text. Regular expressions describe regular languages, a concept that Stephen Cole Kleene first formalized in 1951. The Syntax for Regex is available here.

2. Regex module in Python

Python provides all of its regex-related functionality through the “re” module, which we need to import before using it. The “re” module provides Perl-like regex support in Python and its complete documentation is available here.

To understand how to use the “re” module, let us see a simple example. We will use one of the functions present in the module to see how regex works. The first step is to import re:

import re
re.search('1111','foofighters1111')

Here 1111 is the regex pattern, and foofighters1111 is the input string. Because the pattern occurs in the input, search returns a Match object rather than None.

Simple regex example
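As a minimal sketch of what can be done with the returned Match object, the snippet below reuses the same pattern and input string and reads the matched text and its position. The variable name result is only illustrative.

```python
import re

# Same pattern and input string as the example above.
result = re.search('1111', 'foofighters1111')

if result is not None:
    print(result.group())  # the matched text: 1111
    print(result.span())   # (start, end) indexes of the match: (11, 15)
else:
    print("no match")
```

Checking for None before calling group() avoids an AttributeError when the pattern is not found.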

3. Regex Functions

The “re” module has many functions; the entire list is available in the documentation under Module contents. Here we will look at some very commonly used ones.

3.1 findall()

findall returns all the non-overlapping matches in a string as a list. If it finds no matches, it returns an empty list. findall takes the following parameters:

 re.findall(pattern, string, flags=0)
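A short sketch of findall on a made-up sentence (the input strings and variable names are assumptions for illustration) shows the list result, the empty list for no matches, and what non-overlapping means in practice:

```python
import re

sentence = "The rain in Spain stays mainly in the plain"  # illustrative input

ain_matches = re.findall("ain", sentence)
no_matches = re.findall("xyz", sentence)
# Matches do not overlap: "aaaa" holds two separate "aa" matches, not three.
pairs = re.findall("aa", "aaaa")

print(ain_matches)  # ['ain', 'ain', 'ain', 'ain']
print(no_matches)   # []
print(pairs)        # ['aa', 'aa']
```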

3.2 search()

The search function scans the input string for the first location where the regex pattern matches. It returns a Match object if it finds a match and None when there are no matches. The syntax of search is:

re.search(pattern, string, flags=0)
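The following sketch, using an invented input string, illustrates that search reports only the first occurrence and returns None when the pattern is absent:

```python
import re

sentence = "one fish two fish"  # illustrative input

first_fish = re.search("fish", sentence)
missing = re.search("bird", sentence)

# Only the first "fish" is reported, even though the word appears twice.
print(first_fish.start(), first_fish.end())  # 4 8
print(missing)                               # None
```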

3.3 split()

The split function splits the string at each match of the given regex and returns a list. If the pattern does not match anywhere, it returns a list containing the original string as its only element. Its syntax is:

re.split(pattern, string, maxsplit=0, flags=0)

Besides the pattern and the string, we can also specify the maximum number of splits we want. Once it reaches the maximum number of splits, the rest of the text is returned as-is as the last element of the list.
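To make the effect of maxsplit concrete, here is a small sketch on a hypothetical record string. It splits on runs of commas, semicolons, or whitespace, then repeats the split with a limit, and finally uses a pattern that does not occur in the input:

```python
import re

record = "apple, banana;cherry  date"  # illustrative input

parts = re.split(r"[,;\s]+", record)
limited = re.split(r"[,;\s]+", record, maxsplit=2)
unsplit = re.split(r"\d", record)  # no digits in the input, so nothing to split on

print(parts)    # ['apple', 'banana', 'cherry', 'date']
print(limited)  # ['apple', 'banana', 'cherry  date']
print(unsplit)  # ['apple, banana;cherry  date']
```

Passing maxsplit as a keyword argument keeps the call readable and avoids confusing it with flags.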

3.4 sub()

To replace the matched pattern with another string, we use the sub() function. Sub has the following arguments:

re.sub(pattern, repl, string, count=0, flags=0)

We can optionally also set the count, i.e., the number of occurrences we want to replace. If the count is not specified, sub replaces all the occurrences it finds, starting with the leftmost.
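A brief sketch with an invented input shows the difference between replacing every occurrence and limiting the replacement with count:

```python
import re

text = "cat bat rat"  # illustrative input

all_replaced = re.sub(r"[cbr]at", "dog", text)
first_only = re.sub(r"[cbr]at", "dog", text, count=1)

print(all_replaced)  # dog dog dog
print(first_only)    # dog bat rat
```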

3.5 match()

The match function checks whether the pattern matches at the beginning of the string. If it finds a match, it returns a Match object. Otherwise, it returns None. The syntax is as follows:

re.match(pattern, string, flags=0)
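Since match only looks at the beginning of the string, it can return None where search would succeed. The sketch below, on an assumed log-style line, contrasts the two:

```python
import re

line = "error: disk full"  # illustrative input

at_start = re.match("error", line)    # pattern is at the beginning
not_at_start = re.match("disk", line) # pattern occurs later, so match fails
found_later = re.search("disk", line) # search scans the whole string

print(at_start is not None)  # True
print(not_at_start)          # None
print(found_later.group())   # disk
```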

4. Metacharacters

Before we look at examples of the functions, we need to understand metacharacters. Metacharacters act somewhat like keywords for a regex pattern: they are special characters that have specific meanings and are used to build the patterns to search for. The metacharacters that Python uses are as follows:

Character   Description                                                 Example
[]          A set of characters                                         “[m-p]” or “[A-D]”
\           Escapes a special character or starts a special sequence    “\d”
.           Any character except for the newline                        “b...lding”
^           The string starts with the pattern                          “^c”
*           Zero or more occurrences of the preceding element           “bid*”
$           The string ends with the pattern                            “goodbye$”
+           One or more occurrences of the preceding element            “aid+”
{}          Exactly the specified number of occurrences                 “b{2}”
|           Either-or                                                   “up|down”
()          Group and capture
Metacharacters
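The following sketch tries a few of these metacharacters with findall and search. The input strings are made up for illustration, and the patterns follow the table's examples where possible:

```python
import re

words = "bid bidd bi aid aidd up down"  # illustrative input

star_matches = re.findall(r"bid*", words)        # "bi" followed by zero or more "d"
plus_matches = re.findall(r"aid+", words)        # "ai" followed by one or more "d"
either_matches = re.findall(r"up|down", words)   # either "up" or "down"
dot_matches = re.findall(r"b.d", "bid bad bud")  # "." stands for any single character
exact_matches = re.findall(r"b{2}", "rabbit ribbon cab")  # exactly two "b" in a row
ends_with = re.search(r"goodbye$", "hello and goodbye")   # "$" anchors at the end

print(star_matches)    # ['bid', 'bidd', 'bi']
print(plus_matches)    # ['aid', 'aidd']
print(either_matches)  # ['up', 'down']
print(dot_matches)     # ['bid', 'bad', 'bud']
print(exact_matches)   # ['bb', 'bb']
print(ends_with is not None)  # True
```

Note that a quantifier such as * or + must follow the element it repeats; a pattern that begins with * is rejected by the re module.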

5. Special Sequences

Besides metacharacters, we also use special sequences. A special sequence consists of the \ metacharacter followed by a character. The sequences available are as follows:

Sequence    Description
\A          Matches only at the start of the string
\b          Matches at a word boundary, i.e., at the beginning or at the end of a word
\B          Matches anywhere except at the beginning or end of a word
\d          Matches a digit, i.e., 0 to 9
\D          Matches any character that is not a digit
\s          Matches a whitespace character
\S          Matches any character that is not a whitespace character
\w          Matches a word character: a to z, A to Z, 0 to 9, or the underscore character
\W          Matches any character that is not a word character
\Z          Matches only at the end of the string
Special Sequences
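As a rough illustration of the sequences above, the sketch below runs several of them against one hypothetical string. Raw strings (the r prefix) are used so that the backslash reaches the regex engine unchanged:

```python
import re

entry = "Room 42 opens at 9am"  # illustrative input

digits = re.findall(r"\d", entry)          # single digits
numbers = re.findall(r"\d+", entry)        # runs of digits
spaces = re.findall(r"\s", entry)          # whitespace characters
tokens = re.findall(r"\w+", entry)         # runs of word characters
word_start_o = re.findall(r"\bo", entry)   # "o" at the start of a word ("opens")
inner_o = re.findall(r"\Bo", entry)        # "o" inside a word (both in "Room")
at_start = re.findall(r"\ARoom", entry)    # only at the start of the string
at_end = re.findall(r"am\Z", entry)        # only at the end of the string

print(digits)        # ['4', '2', '9']
print(numbers)       # ['42', '9']
print(len(spaces))   # 4
print(tokens)        # ['Room', '42', 'opens', 'at', '9am']
print(word_start_o)  # ['o']
print(inner_o)       # ['o', 'o']
print(at_start)      # ['Room']
print(at_end)        # ['am']
```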

6. Sets

Besides special sequences, we also have sets. A set is a group of characters enclosed in [], and it matches any single character from that group. Some examples of sets are:

Set           Description
[bdf]         Matches any one of the specified characters b, d, or f
[a-n]         Matches any character between a and n. Only lowercase is considered.
[^are]        Matches any character except a, r, and e
[0123]        Matches any of the digits 0, 1, 2, or 3
[0-9]         Matches any digit between 0 and 9
[0-7][0-9]    Matches any two-digit number between 00 and 79
[a-zA-Z]      Matches any letter of the alphabet. Both lowercase and uppercase are considered.
[+]           Matches a literal + sign; inside a set, + has no special meaning
Sets
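A small sketch on an invented string shows how a few of these sets behave with findall. The negated set [^0-9a-z ] is an extra variation added for illustration and is not one of the rows in the table:

```python
import re

sample = "Bag 42 + box 7"  # illustrative input

low_range = re.findall(r"[a-n]", sample)        # lowercase letters a to n only
any_digit = re.findall(r"[0-9]", sample)        # single digits
two_digits = re.findall(r"[0-7][0-9]", sample)  # two-digit numbers 00 to 79
letters = re.findall(r"[a-zA-Z]", sample)       # letters in either case
plus_signs = re.findall(r"[+]", sample)         # a literal plus sign
others = re.findall(r"[^0-9a-z ]", sample)      # not a digit, lowercase letter, or space

print(low_range)   # ['a', 'g', 'b']
print(any_digit)   # ['4', '2', '7']
print(two_digits)  # ['42']
print(letters)     # ['B', 'a', 'g', 'b', 'o', 'x']
print(plus_signs)  # ['+']
print(others)      # ['B', '+']
```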

7. Examples

The examples below exercise the functions, special sequences, and sets we looked at above. We have added all our examples in a single Python script called regex_examples.py

regex_examples.py

import re

#findall method
txt = "By the pricking of my thumbs, Something wicked my way comes. Open, locks, Whoever knocks!"

#findall with sets
#Only lowercase characters will be considered
lowChar = re.findall("[p-t]", txt)
print("findall with lowercase::", lowChar)
print(" ")

#To ignore case we can add a flag to ignore the case of the string.
ignoreCaseChar = re.findall("[p-t]",txt,flags = re.I)
print("findall with the Ignore case flag:: ",ignoreCaseChar)
print("\n")

#search
searchString = re.search("my", txt)
print("Search output: ", searchString)
print("\n")

#Split
#The r prefix makes the pattern a raw string, so \s reaches the regex engine unchanged
splitString = re.split(r"\s", txt)
print("Split String on whitespaces output: ", splitString)
print("\n")


#Split with maxnumber. The rest of the string is returned as-is
splitMaxNum = re.split(r"\s", txt, maxsplit=3)
print("Split String with whitespace and max number output: ", splitMaxNum)
print("\n")

inputString = "2004-959-559 # Thorin Oakenshield # The King Under the Mountain"

#Substitute characters: remove everything from the first # to the end of the string
#The r at the start is to make sure that the raw string is considered.
substituteString = re.sub(r'#.*$', "", inputString)
print("Substituted String :: ", substituteString)
print("\n")

#Replace everything other than digits
onlyNumbers = re.sub(r'\D', "", inputString)    
print("Replace everything except numbers : ", onlyNumbers)
print("\n")


newString = "The lady doth protest too much, methinks.The better part of valor is discretion.The course of true love never did run smooth."

#FindAll using Special sequences
startOfString = re.findall(r"\AThe", newString)
print("findall with only at the start special sequences: ", startOfString)
print("\n")

#\B matches only where there is no word boundary. Will not return a result since every "The" starts a word.
neitherStartnorEnd = re.findall(r'\BThe', newString)
print("find all Not at Start or end, no output: ", neitherStartnorEnd)
print("\n")
print("\n")

newInput = "Lord, what fools these mortals be!.The fault, dear Brutus, lies not within the stars, but in ourselves, that we are underlings."

#This will return a list of every 'es' that is not at the start of a word
neitherStartnorEnd1 = re.findall(r"\Bes", newInput)
print("Not at Start or end: ", neitherStartnorEnd1)
print("\n")


#Match function
testString = 'Brevity is the soul of wit.'

#This will not return any result because the pattern is case-sensitive and, with ^ and $, would have to cover the whole string
matchResult = re.match('^b...y$', testString)

print("Match function output: ", matchResult)
print("\n")

#To match the pattern without case, we do
matchResultIgnoreCase = re.match(r"^b\w+", testString,flags=re.I)

print("Match function without case", matchResultIgnoreCase)
print("\n")
Program output

8. Summary

In this article, we looked at regex support that Python provides. Python also has a third-party module called regex which is available to download here.

9. Download the Source Code

Above we saw examples of using regex in Python.

Download
You can download the full source code of this example here: Python RegEx Tutorial

Last updated on May 18th, 2021

Reshma Sathe

I am a recent Master of Computer Science degree graduate from the University Of Illinois at Urbana-Champaign. I have previously worked as a Software Engineer with projects ranging from production support to programming and software engineering. I am currently working on self-driven projects in Java, Python and Angular and also exploring other frontend and backend technologies.