Python RegEx Explained: Patterns and Techniques

Explore the world of Python RegEx with our comprehensive guide. Master powerful patterns, unleash coding magic, and elevate your Python skills today!


What is Python RegEx?

Python RegEx is an effective pattern-matching tool. Patterns are written as strings, much like string literals, but they support constructs such as named groups and backreferences.

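One of the basic constructs is the pair of square brackets, [], which encloses a set of characters and matches any one character from that set. Followed by the repetition operator +, the set matches one or more such characters. Repetition is greedy by default; appending ? to the operator (as in +?) makes the match non-greedy.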
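As a rough illustration of how a bracketed set behaves on its own, with +, and with the non-greedy +?, the sketch below runs all three over a sample string. The string and variable names are made up for this example.

```python
import re

text = "aaa bbb"  # sample input chosen for illustration

# [ab] matches exactly one character that is either "a" or "b".
single = re.findall(r"[ab]", text)

# [ab]+ is greedy: it consumes as many set characters as it can.
greedy = re.findall(r"[ab]+", text)

# [ab]+? is non-greedy: it stops as soon as one character has matched.
lazy = re.findall(r"[ab]+?", text)

print(single)  # ['a', 'a', 'a', 'b', 'b', 'b']
print(greedy)  # ['aaa', 'bbb']
print(lazy)    # ['a', 'a', 'a', 'b', 'b', 'b']
```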

Character Class

A character class specifies a set of characters, any one of which may match at a given position in the pattern. Python also provides shorthand sequences for common classes and positions.

As an example, \w matches any word character (a letter, a digit, or the underscore), while \b matches a word boundary such as the beginning or end of a word; this distinguishes them from [a-z] and [A-Z], which match any single lowercase or uppercase letter, respectively.
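The difference between a word character, a word boundary, and an explicit letter range can be seen in a short sketch. The sample strings here are assumptions picked only to make the contrast visible.

```python
import re

# \w matches letters, digits, and the underscore, but not spaces or punctuation.
word_chars = re.findall(r"\w", "a_1 !")

# [a-z] matches lowercase ASCII letters only.
lowercase = re.findall(r"[a-z]", "aB_1")

# \b is a zero-width word boundary; \b\w picks the first character of each word.
first_chars = re.findall(r"\b\w", "Hello world_42")

print(word_chars)   # ['a', '_', '1']
print(lowercase)    # ['a']
print(first_chars)  # ['H', 'w']
```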

Quantifier metacharacters specify how many times the preceding element must be matched. In the {m,n} form, leaving out n means there is no upper bound. For instance, [0-9]+ matches one or more decimal digit characters.

Please be aware that Python's re module lets you precompile an RE with re.compile(), and that the module-level findall() and search() functions cache the patterns they compile, so the engine does not need to parse your RE each time it is used. Compiling once and reusing the pattern object reduces this overhead when the same RE is applied many times.
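A minimal sketch of quantifiers together with a precompiled pattern might look like the following. The sample lines and variable names are assumptions for illustration; re.compile() and the pattern object's findall() method are part of the standard re module.

```python
import re

# Compile once, then reuse the pattern object for every line.
digits = re.compile(r"[0-9]+")

lines = ["order 66", "no digits", "call 555 0199"]
found = [digits.findall(line) for line in lines]
print(found)  # [['66'], [], ['555', '0199']]

numbers = "7 42 555 0199"

# {3} means exactly three repetitions; \b keeps longer runs from matching.
exactly_three = re.findall(r"\b[0-9]{3}\b", numbers)

# {2,} means two or more repetitions, with no upper bound.
two_or_more = re.findall(r"[0-9]{2,}", numbers)

print(exactly_three)  # ['555']
print(two_or_more)    # ['42', '555', '0199']
```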


Captured Groups

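A group that does not participate in a match (for example, one inside an optional part of the pattern that was skipped) has no captured text, and a group written with non-capturing parentheses, (?:...), is never captured at all. In neither case is there anything to retrieve from the match object or to use in a backreference.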

Regular expressions (REs) that involve many groups can be cumbersome to write. Furthermore, complex REs often contain many capturing groups, which makes keeping track of their numbers difficult.

The solution to this problem is named groups. Writing (?P<name>...) gives a capturing group a symbolic name that maps to its group number; the name can then be used for backreferences and for retrieving the matched text. Non-capturing groups are written (?:...). Neither option affects how groups are matched: group numbers are still assigned left to right, and group 1 always remains group 1.
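Named, numbered, and non-capturing groups can be compared side by side in a small sketch. The date-like pattern and the group names year, month, and day are hypothetical, chosen only for this example.

```python
import re

# Three named groups; the (?:...) wrapper around the day is non-capturing
# and therefore does not receive a group number of its own.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(?P<day>\d{2}))?")

m = pattern.search("released 2024-05")

by_name = m.group("year")         # look the group up by its symbolic name
by_number = m.group(1)            # the same group, looked up by number
missing = m.group("day")          # the optional part was skipped, so this is None
index = dict(pattern.groupindex)  # mapping from names to group numbers

print(by_name, by_number)  # 2024 2024
print(missing)             # None
print(index)               # {'year': 1, 'month': 2, 'day': 3}
```

Using names keeps the code readable when groups are added or reordered, while the numbers remain available for code that expects them.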

Backreferences

Backreferences are special metacharacter sequences that match the contents of a previously captured group. You can use one later within the same regex, after the group's closing parenthesis, by writing a backslash followed by the group's number, such as \1. Python numbers named groups along with unnamed ones, from left to right, while .NET numbers named groups after the unnamed ones.

For example, in a regex such as (\w+),\1 the group (\w+) matches "foo" and stores it as a captured group in the RE engine's memory; \1 is then a backreference to that group's contents, so the regex matches "foo,foo" but fails on "foo,qux".
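A short sketch of numbered and named backreferences follows. The comma-separated pattern, the group name word, and the test strings are assumptions made for illustration, not taken from a specific source.

```python
import re

# \1 must repeat exactly what group 1 captured.
numbered = re.compile(r"(\w+),\1")

same = numbered.search("foo,foo")       # both halves are identical
different = numbered.search("foo,qux")  # second half differs, so no match

# The same idea with a named group and a named backreference.
named = re.compile(r"(?P<word>\w+),(?P=word)")
named_hit = named.search("qux,qux")

print(same.group())       # foo,foo
print(different)          # None
print(named_hit.group())  # qux,qux
```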

You can also use a named backreference, (?P=name), in place of a numbered one. In PCRE prior to version 6.7, backreferences pointed to any group with the same name that existed within a regex, regardless of whether or not it participated in the match; with PCRE 6.7 and later this behavior changed so that backreferences point only to groups that actually participated in the match.

Metacharacters

Python RegEx provides more than just grouping parentheses and backreferences; it also offers several metacharacters designed to augment how a regular expression works - known as enhanced grouping constructs.

Example: the (?P<name>...) metacharacter sequence creates a named group, which can later be referenced with the backreference (?P=name). A named group can be referenced any number of times within the regular expression, provided its name is a valid Python identifier and is defined only once in that regular expression.

Like its positive counterpart, the (?<!\d) metacharacter sequence creates a negative lookbehind assertion. In other words, it requires that what precedes the current position is not a digit; this contrasts with the positive lookbehind (?<=\d), which requires a digit immediately before that position. Lookahead assertions, (?=...) and (?!...), apply the same tests to what follows the position. For example, (?<!\d)\d matches the 1 in 'foo123bar' but not the 2 or the 3, because each of those is preceded by a digit.
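The four lookaround forms can be tried against the same string in a quick sketch. The variable names are invented for this example; the assertions themselves are standard re syntax.

```python
import re

text = "foo123bar"

# Negative lookbehind: a digit NOT preceded by another digit.
not_after_digit = re.findall(r"(?<!\d)\d", text)

# Positive lookbehind: a digit that IS preceded by another digit.
after_digit = re.findall(r"(?<=\d)\d", text)

# Negative lookahead: a digit NOT followed by another digit.
not_before_digit = re.findall(r"\d(?!\d)", text)

# Positive lookahead: a digit that IS followed by another digit.
before_digit = re.findall(r"\d(?=\d)", text)

print(not_after_digit)   # ['1']
print(after_digit)       # ['2', '3']
print(not_before_digit)  # ['3']
print(before_digit)      # ['1', '2']
```

Lookarounds are zero-width: they test the surrounding text without including it in the match, which is why each result is a single digit.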

RegEx Module

A RegEx, or regular expression, is a sequence of characters that forms a search pattern. A RegEx can be used to check whether a string contains the specified search pattern.

Python has a built-in module named re that can be used to work with regular expressions.

Import the re module:

import re

RegEx in Python

Once you have imported the re module, you can start using regular expressions.

Example

Search the string to see whether it starts with "The" and ends with "Spain":

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

Function   Description
findall    Returns a list containing all matches
search     Returns a Match object if there is a match anywhere in the string
split      Returns a list where the string has been split at each match
sub        Replaces one or many matches with a string

Metacharacters

Metacharacters are characters with a special meaning:

Character   Description                                                           Example
[]          A set of characters                                                   "[a-m]"
\           Signals a special sequence (also used to escape special characters)   "\d"
.           Any character (except the newline character)                          "he..o"
^           Starts with                                                           "^hello"
$           Ends with                                                             "planet$"
*           Zero or more occurrences                                              "he.*o"
+           One or more occurrences                                               "he.+o"
?           Zero or one occurrence                                                "he.?o"
{}          Exactly the specified number of occurrences                           "he.{2}o"
|           Either or                                                             "falls|stays"
()          Capture and group
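To see the metacharacters in action, the sketch below runs patterns modeled on the table's Example column against one sample string. The sample strings are assumptions chosen so that some patterns match and others do not.

```python
import re

txt = "hello planet"  # sample input for illustration

# Patterns modeled on the Example column of the metacharacter table.
patterns = [r"[a-m]", r"he..o", r"^hello", r"planet$",
            r"he.*o", r"he.+o", r"he.?o", r"he.{2}o", r"falls|stays"]

# Map each pattern to the list of matches it finds in txt.
results = {p: re.findall(p, txt) for p in patterns}

for pattern, found in results.items():
    print(f"{pattern:12} {found}")

# A backslash escapes a metacharacter so that it matches literally.
literal_dots = re.findall(r"\.", "v1.2.3")
print(literal_dots)  # ['.', '.']
```

Note that "he.?o" finds nothing here, because two characters sit between "he" and "o" and ? allows at most one.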

Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and it has a special meaning:

Character   Description   Example
\A   Returns a match if the specified characters are at the beginning of the string   "\AThe"
\b   Returns a match where the specified characters are at the beginning or at the end of a word (the "r" prefix makes sure the string is treated as a raw string)   r"\bain", r"ain\b"
\B   Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" prefix makes sure the string is treated as a raw string)   r"\Bain", r"ain\B"
\d   Matches any digit (0-9)   "\d"
\D   Matches any character that is NOT a digit   "\D"
\s   Matches any white-space character   "\s"
\S   Matches any character that is NOT a white-space character   "\S"
\w   Matches any word character (letters a to z and A to Z, digits 0 to 9, and the underscore character)   "\w"
\W   Matches any character that is NOT a word character   "\W"
\Z   Returns a match if the specified characters are at the end of the string   "Spain\Z"
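The behavior of several special sequences can be checked with a short sketch that reuses the article's sample sentence. The variable names are made up for this example.

```python
import re

txt = "The rain in Spain"

at_start = re.findall(r"\AThe", txt)       # "The" at the very beginning
word_start = re.findall(r"\bain", txt)     # "ain" at the start of a word
word_end = re.findall(r"ain\b", txt)       # "ain" at the end of a word
inside_start = re.findall(r"\Bain", txt)   # "ain" NOT at the start of a word
digits = re.findall(r"\d", txt)            # any digit characters
spaces = re.findall(r"\s", txt)            # any white-space characters
at_end = re.findall(r"Spain\Z", txt)       # "Spain" at the very end

print(at_start)      # ['The']
print(word_start)    # []
print(word_end)      # ['ain', 'ain']
print(inside_start)  # ['ain', 'ain']
print(digits)        # []
print(spaces)        # [' ', ' ', ' ']
print(at_end)        # ['Spain']
```

Raw strings (the r prefix) are used throughout so that Python passes the backslashes to the regex engine unchanged.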

Sets

A set is a group of characters inside a pair of square brackets, [], and it has a special meaning:

Set          Description
[arn]        Returns a match where one of the specified characters (a, r, or n) is present
[a-n]        Returns a match for any lowercase character, alphabetically between a and n
[^arn]       Returns a match for any character EXCEPT a, r, and n
[0123]       Returns a match where any of the specified digits (0, 1, 2, or 3) is present
[0-9]        Returns a match for any digit between 0 and 9
[0-5][0-9]   Returns a match for any two-digit number from 00 to 59
[a-zA-Z]     Returns a match for any letter alphabetically between a and z, lower case OR upper case
[+]          In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
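A few of the sets from the table can be tried out as follows. The sample strings are assumptions picked so that each set has something to match.

```python
import re

# [0-5][0-9] matches two-digit numbers from 00 to 59.
minutes = re.findall(r"[0-5][0-9]", "8 times before 11:45 AM")

# [^arn] matches every character except a, r, and n.
excluded = re.findall(r"[^arn]", "rain")

# Inside a set, + is an ordinary character.
plus_signs = re.findall(r"[+]", "1 + 2 + 3")

# [a-zA-Z] matches letters of either case and skips digits.
letters = re.findall(r"[a-zA-Z]", "A1b2")

print(minutes)     # ['11', '45']
print(excluded)    # ['i']
print(plus_signs)  # ['+', '+']
print(letters)     # ['A', 'b']
```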

The findall() Function

The findall() function returns a list containing all matches.

Example

Print a list of all matches:

import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

The list contains the matches in the order in which they are found.

If no matches are found, an empty list is returned:

Example

Return an empty list if no match was found:

import re

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

The search() Function

The search() function searches the string for a match and returns a Match object if there is one.

If there is more than one match, only the first occurrence is returned:

Example

Search for the first white-space character in the string:

import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)  # raw string keeps the backslash intact

print("The first white-space character is located in position:", x.start())

If no matches are found, the value None is returned:

Example

Make a search that returns no match:

import re

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

The split() Function

The split() function returns a list where the string has been split at each match:

Example

Split at each white-space character:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)  # raw string keeps the backslash intact
print(x)

You can control the number of splits by specifying the maxsplit parameter:

Example

Split the string only at the first occurrence:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)  # split at the first match only
print(x)

The sub() Function

The sub() function replaces the matches with the text of your choice:

Example

Replace every white-space character with the number 9:

import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)  # raw string keeps the backslash intact
print(x)

You can control the number of replacements by specifying the count parameter:

Example

Replace the first 2 occurrences:

import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)  # replace the first two matches only
print(x)

Match Object

A Match object is an object containing information about the search and the result.

Note: If there is no match, the value None is returned instead of the Match object.

Example

Do a search that returns a Match object:

import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

The Match object has properties and methods used to retrieve information about the search and the result:

.span() returns a tuple containing the start and end positions of the match.
.string returns the string passed into the function.
.group() returns the part of the string where there was a match.

Example

Print the position (start and end position) of the first match.

The regular expression looks for any word that starts with an upper-case "S":

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

Example

Print the string passed into the function:

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

Example

Print the part of the string where there was a match.

The regular expression looks for any word that starts with an upper-case "S":

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Note: Because search() returns None when there is no match, check the result before calling .span(), .group(), or any other Match method.
