regex
Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:
- Checking the presence of patterns: Verify if certain patterns exist within text data.
- Extracting instances of complex patterns: Retrieve specific data that matches a pattern.
- Cleaning and manipulating data: Perform operations like replacing, splitting, or formatting text.
What Are Regular Expressions?
A regular expression is a sequence of characters that defines a search pattern. Regular expressions enable you to:
- Find specific words or phrases: Such as all four-letter words in a text.
- Parse log files: Extract error messages, process IDs, or timestamps.
- Validate input data: Ensure data matches a required format, like email addresses or phone numbers.
Regular expressions are powerful tools used across various programming languages and command-line tools, making them invaluable for developers, data scientists, and system administrators.
Basic Matching with Regular Expressions
Using the re Module in Python
Python provides the re module for working with regular expressions.
Importing the Module
import re
Key Functions
- match(): Checks for a match at the beginning of the string.
- search(): Searches for a pattern anywhere in the string.
- findall(): Returns a list of all non-overlapping matches.
- split(): Splits the string by occurrences of the pattern.
- sub(): Replaces occurrences of the pattern with a replacement string.
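The five functions listed above can be compared side by side on one short string. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat bat cat"

# match() only succeeds when the pattern matches at position 0.
at_start = re.match(r"bat", text)        # None: the string starts with 'cat'
# search() scans forward and returns the first match anywhere.
anywhere = re.search(r"bat", text)       # match object with span (4, 7)
# findall() returns every non-overlapping match as a list of strings.
all_cats = re.findall(r"cat", text)      # ['cat', 'cat']
# split() cuts the string wherever the pattern matches.
words = re.split(r"\s+", text)           # ['cat', 'bat', 'cat']
# sub() replaces each match with the replacement string.
swapped = re.sub(r"cat", "dog", text)    # 'dog bat dog'

print(at_start, anywhere.span(), all_cats, words, swapped)
```

Note that match() and search() return a match object (or None), while findall(), split(), and sub() return plain strings or lists of strings.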
Basic Usage
text = "This is a good day."
if re.search("good", text):
    print("Wonderful!")
else:
    print("Alas :(")
Simple String Matching
Searching for Substrings
result = re.search(r"Python", "I love Python programming.")
print(result)
# Output: <re.Match object; span=(7, 13), match='Python'>
Case-Insensitive Matching
result = re.search(r"python", "I love Python programming.", re.IGNORECASE)
print(result)
# Output: <re.Match object; span=(7, 13), match='Python'>
Special Characters in Regular Expressions
Wildcards
- Dot (.): Matches any single character except a newline.
re.search(r"p.ng", "penguin")  # Matches 'peng'
Anchors
- Caret (^): Matches the start of a string.
- Dollar ($): Matches the end of a string.
re.search(r"^Start", "Start of the line")  # Matches 'Start'
re.search(r"end$", "This is the end") # Matches 'end'
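To see what the anchors change, the same word can be searched for with and without them. This is a small sketch with an invented sample string.

```python
import re

line = "Start of the end"

# Unanchored: 'the' is found in the middle of the string.
middle = re.search(r"the", line)
# '^the' fails because the string does not begin with 'the'.
anchored_start = re.search(r"^the", line)
# 'end$' succeeds because the string finishes with 'end'.
anchored_end = re.search(r"end$", line)
# Using both anchors forces the pattern to cover the whole string.
whole = re.search(r"^Start.*end$", line)

print(middle.span())         # (9, 12)
print(anchored_start)        # None
print(anchored_end.span())   # (13, 16)
print(whole.group())         # Start of the end
```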
Character Classes and Sets
Character Classes
- Square Brackets ([]): Define a set of characters to match.
re.findall(r"[aeiou]", "Python")  # Returns ['o']
- Ranges: A hyphen inside the brackets defines a range of characters.
re.findall(r"[a-z]", "Python")  # Returns the lowercase letters: ['y', 't', 'h', 'o', 'n']
- Negation (^): As the first character inside the brackets, matches any character not in the set.
re.findall(r"[^aeiou]", "Python")  # Returns ['P', 'y', 't', 'h', 'n']
Alternation
- Pipe (|): Matches one pattern or another.
re.search(r"cat|dog", "I have a dog.")  # Matches 'dog'
Quantifiers
Repetition Qualifiers
- Asterisk (*): Matches zero or more repetitions of the preceding element.
re.search(r"Py.*n", "Python")  # Matches 'Python'
- Plus (+): Matches one or more repetitions of the preceding element.
re.search(r"go+gle", "goooooogle")  # Matches 'goooooogle'
- Question Mark (?): Matches zero or one repetition of the preceding element.
re.search(r"colou?r", "color")  # Matches 'color'
re.search(r"colou?r", "colour") # Matches 'colour'
Specific Number of Repetitions
- Exact ({n}): Matches exactly n repetitions.
re.search(r"A{3}", "AAABC")  # Matches 'AAA'
- Range ({n,m}): Matches between n and m repetitions.
re.search(r"\d{2,4}", "12345")  # Matches '1234'
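Counted repetitions are a natural fit for fixed-width formats such as phone numbers. The sketch below assumes a hypothetical NNN-NNN-NNNN layout; the helper name looks_like_phone is invented for this example.

```python
import re

# Assumed format for illustration: three digits, three digits, four digits,
# separated by hyphens (for example 555-123-4567).
PHONE = r"^\d{3}-\d{3}-\d{4}$"

def looks_like_phone(value):
    """Return True if value fits the assumed NNN-NNN-NNNN layout."""
    return re.search(PHONE, value) is not None

print(looks_like_phone("555-123-4567"))   # True
print(looks_like_phone("55-123-4567"))    # False: first group is too short
print(looks_like_phone("555-123-45678"))  # False: last group is too long

# {2,4} is greedy: it takes as many digits as it can, up to four.
runs = re.findall(r"\d{2,4}", "1 22 333 4444 55555")
print(runs)  # ['22', '333', '4444', '5555']
```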
Metacharacters and Special Sequences
Metacharacters
- \d: Matches any digit (equivalent to [0-9] for ASCII text).
- \w: Matches any word character (letters, digits, underscore).
- \s: Matches any whitespace character (space, tab, newline).
- \b: Matches a word boundary.
re.findall(r"\b\w{5}\b", "These are some words")  # Returns ['These', 'words'], the words with exactly 5 characters
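The four sequences above can be tried on a single string to see how they differ. This is an illustrative sketch; the sample text is invented.

```python
import re

sample = "Order 66 shipped on 2024-01-15\tto user_42"

# \d+ collects runs of digits, wherever they appear.
digits = re.findall(r"\d+", sample)
# \w+ collects runs of letters, digits, and underscores.
words = re.findall(r"\w+", sample)
# \s+ treats spaces and the tab alike when splitting.
pieces = re.split(r"\s+", sample)
# \b42\b finds nothing: '_' is a word character, so there is
# no word boundary between '_' and '4' in 'user_42'.
standalone_42 = re.findall(r"\b42\b", sample)

print(digits)         # ['66', '2024', '01', '15', '42']
print(words)          # ['Order', '66', 'shipped', 'on', '2024', '01', '15', 'to', 'user_42']
print(pieces)         # ['Order', '66', 'shipped', 'on', '2024-01-15', 'to', 'user_42']
print(standalone_42)  # []
```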
Escaping Special Characters
- Use a backslash (\) to escape special characters.
re.search(r"\.com", "www.example.com")  # Matches '.com'
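When the text to match comes from a variable, escaping by hand is error-prone. The standard library's re.escape() adds the backslashes for you; the sketch below uses invented sample strings to show the difference.

```python
import re

domain = "example.com"

# Unescaped, the dot is a wildcard, so 'exampleXcom' also matches.
loose = re.search(domain, "visit exampleXcom today")

# re.escape() backslash-escapes characters that have a special meaning.
strict_pattern = re.escape(domain)
strict = re.search(strict_pattern, "visit exampleXcom today")
exact = re.search(strict_pattern, "visit example.com today")

print(loose.group())    # exampleXcom
print(strict_pattern)   # example\.com
print(strict)           # None
print(exact.group())    # example.com
```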
Grouping and Capturing
Grouping Patterns
- Parentheses (()): Group parts of the pattern and capture the text each group matches.
pattern = r"(\w+), (\w+)"
match = re.search(pattern, "Doe, John")
print(match.groups()) # Output: ('Doe', 'John')
Named Groups
- Syntax: (?P<name>pattern)
pattern = r"(?P<last>\w+), (?P<first>\w+)"
match = re.search(pattern, "Doe, John")
print(match.group('first')) # Output: 'John'
Backreferences
- Refer to captured groups later in the pattern: \1 matches the same text that group 1 captured.
pattern = r"(\b\w+)\s+\1"
re.search(pattern, "hello hello") # Matches 'hello hello'
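Backreferences are also available in the replacement string of re.sub(). A small sketch, using an invented sentence, that finds accidentally doubled words and collapses them:

```python
import re

sentence = "This is is a test of the the pattern."

# (\w+) captures a word; \1 requires the same text to appear again.
# The \b anchors stop 'is' inside 'This' from counting as a word.
doubled = re.findall(r"\b(\w+)\s+\1\b", sentence)
print(doubled)  # ['is', 'the']

# In the replacement string, \1 inserts the captured word once.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", sentence)
print(fixed)    # This is a test of the pattern.
```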
Look-ahead and Look-behind Assertions
Look-ahead
- Positive Look-ahead (?=...): Asserts that the given pattern follows, without including it in the match.
pattern = r"\w+(?=;)"
re.findall(pattern, "word1; word2; word3") # Matches ['word1', 'word2']
Look-behind
- Positive Look-behind (?<=...): Asserts that the given pattern precedes, without including it in the match.
pattern = r"(?<=\$)\d+"
re.findall(pattern, "Price: $100, Discount: $20") # Matches ['100', '20']
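Look-behind and look-ahead can be combined in one pattern. The following sketch, on an invented string, keeps only the amounts that have a dollar sign before them and " USD" after them:

```python
import re

text = "Paid $250 USD, refunded $40 CAD, owed $15 USD"

# (?<=\$) requires a '$' just before the digits;
# (?=\sUSD) requires whitespace and 'USD' just after them.
usd_pattern = r"(?<=\$)\d+(?=\sUSD)"

usd_amounts = re.findall(usd_pattern, text)
print(usd_amounts)  # ['250', '15']

# Neither assertion consumes text, so the match span covers only the digits.
first = re.search(usd_pattern, text)
print(first.group(), first.span())  # 250 (6, 9)
```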
Practical Examples
Extracting Process IDs from Logs
def extract_pid(log_line):
    pattern = r"\[(\d+)\]"
    match = re.search(pattern, log_line)
    if match:
        return match.group(1)
    return None
log = "Jul 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
pid = extract_pid(log)
print(pid) # Output: '12345'
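The same log line can be broken into several fields at once with named groups. This is a sketch only: the field names and the assumed line layout are illustrative, and real log formats vary.

```python
import re

# Field names (timestamp, host, process, pid, level, message) are assumptions
# made for this sketch, based on the single sample line above.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>\w+)\[(?P<pid>\d+)\]:\s"
    r"(?P<level>[A-Z]+)\s"
    r"(?P<message>.*)$"
)

def parse_log_line(log_line):
    """Return a dict of named fields, or None if the line does not fit."""
    match = LOG_PATTERN.search(log_line)
    return match.groupdict() if match else None

log = "Jul 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
fields = parse_log_line(log)
print(fields["timestamp"])  # Jul 31 07:51:48
print(fields["pid"])        # 12345
print(fields["level"])      # ERROR
print(fields["message"])    # Performing package upgrade
```

Returning None for lines that do not fit lets the caller skip or report them instead of failing.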
Validating Python Variable Names
def is_valid_variable(name):
    pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"
    return bool(re.match(pattern, name))
print(is_valid_variable("variable_1")) # Output: True
print(is_valid_variable("1_variable")) # Output: False
Extracting Email Addresses
text = "Please contact us at support@example.com or sales@example.org."
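# The domain must end in a label rather than a dot, so the sentence's
# final period is not captured as part of the second address.
emails = re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", text)
print(emails) # Output: ['support@example.com', 'sales@example.org']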
Redacting Sensitive Information
text = "User's SSN is 123-45-6789."
redacted = re.sub(r"\d{3}-\d{2}-\d{4}", "[REDACTED]", text)
print(redacted) # Output: "User's SSN is [REDACTED]."
Splitting and Replacing
Splitting Strings
- Using re.split():
text = "One sentence. Another one? And the last one!"
sentences = re.split(r"[.?!]\s*", text)
print(sentences)
# Output: ['One sentence', 'Another one', 'And the last one', '']
# The trailing empty string appears because the text ends with a delimiter.
Replacing Patterns
- Using re.sub():
text = "The price is $100."
new_text = re.sub(r"\$\d+", "[CENSORED]", text)
print(new_text) # Output: "The price is [CENSORED]."
Advanced Topics
Verbose Mode for Complex Patterns
- Using re.VERBOSE: Allows splitting the pattern into multiple lines with comments.
pattern = r"""
(?P<first_name>\w+) # First name
\s+ # Whitespace
(?P<last_name>\w+) # Last name
"""
match = re.search(pattern, "John Doe", re.VERBOSE)
print(match.groupdict()) # Output: {'first_name': 'John', 'last_name': 'Doe'}
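One consequence of verbose mode is that literal spaces in the pattern are ignored, which is why the example above uses \s+ between the names. A short sketch of the pitfall and two ways around it:

```python
import re

# In verbose mode the space in the pattern is dropped, so this is
# really the pattern 'JohnDoe' and does not match.
ignored = re.search(r"John Doe", "John Doe", re.VERBOSE)

# Write whitespace explicitly instead: with \s ...
explicit = re.search(r"John \s Doe", "John Doe", re.VERBOSE)
# ... or with a backslash-escaped space.
escaped = re.search(r"John\ Doe", "John Doe", re.VERBOSE)

print(ignored)            # None
print(explicit.group())   # John Doe
print(escaped.group())    # John Doe
```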
Working with Multiline Text
- Using re.MULTILINE: Changes the behavior of ^ and $ to match the start and end of each line.
text = """First line
Second line
Third line"""
lines = re.findall(r"^(\w+)", text, re.MULTILINE)
print(lines)  # Output: ['First', 'Second', 'Third']
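The effect of the flag is easiest to see by running the same pattern with and without it. A minimal sketch with invented three-line text:

```python
import re

text = "alpha one\nbeta two\ngamma three"

# Without the flag, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
each_line_start = re.findall(r"^\w+", text, re.MULTILINE)
# Likewise, $ then matches just before each newline and at the end.
each_line_end = re.findall(r"\w+$", text, re.MULTILINE)

print(first_only)        # ['alpha']
print(each_line_start)   # ['alpha', 'beta', 'gamma']
print(each_line_end)     # ['one', 'two', 'three']
```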
Tokenizing Strings
- Tokenizing: Separating a string into substrings based on patterns.
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
tokens = re.split(r"Amy", text)
print(tokens)
# Output: ['', ' works diligently. ', ' gets good grades. Our student ', ' is successful.']
- Finding All Occurrences:
names = re.findall(r"Amy", text)
print(names) # Output: ['Amy', 'Amy', 'Amy']
More Practical Examples
Extracting Headers from Wikipedia Data
- Example Pattern:
# wiki_text is assumed to hold the raw text of the page, in which each
# section header is immediately followed by an '[edit]' link.
pattern = r"([\w ]+)(?=\[edit\])"
matches = re.findall(pattern, wiki_text)
print(matches)
- Using Named Groups and Verbose Mode:
pattern = r"""
(?P<title>[\w ]+) # Article title
(?=\[edit\]) # Look-ahead for '[edit]'
"""
for match in re.finditer(pattern, wiki_text, re.VERBOSE):
    print(match.group('title'))
Extracting Hashtags from Tweets
- Pattern to Extract Hashtags:
pattern = r"#\w+(?=\s)"
hashtags = re.findall(pattern, tweet_text)
print(hashtags)
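Because the look-ahead requires whitespace after the tag, a hashtag at the very end of the text is not returned. The sketch below shows this on a hypothetical tweet (the text is invented for illustration) and compares it with the pattern without the look-ahead.

```python
import re

# Hypothetical tweet text, invented for this example.
tweet_text = "Learning #regex with #Python is fun! #coding"

# With (?=\s), the final tag is skipped: nothing follows '#coding'.
with_lookahead = re.findall(r"#\w+(?=\s)", tweet_text)
print(with_lookahead)  # ['#regex', '#Python']

# Without the look-ahead, every '#' followed by word characters is returned.
any_position = re.findall(r"#\w+", tweet_text)
print(any_position)    # ['#regex', '#Python', '#coding']
```

Which variant is appropriate depends on the data; the look-ahead version is stricter about what may follow a tag.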
Conclusion
Regular expressions are powerful tools for text processing, enabling complex pattern matching and manipulation. They are widely used in programming languages and command-line tools for tasks such as data validation, parsing logs, and extracting information. Mastery of regular expressions enhances your ability to efficiently handle and process textual data.
Many online resources are available for practicing regular expressions. Popular options for exercises and practice include:
Online Regex Practice Platforms
- Regex Tuesday - Challenges
- Python Regular Expression - Exercises, Practice, Solution - w3resource
- RegExr: Learn, Build, & Test RegEx
- Regex Learn - Step by step, from zero to advanced.
- RegexOne - Learn Regular Expressions - Lesson 1: An Introduction, and the ABCs
- regex101: build, test, and debug regex
- Regular expression exercises | Sketch Engine