regex

Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:

  • Checking the presence of patterns: Verify if certain patterns exist within text data.
  • Extracting instances of complex patterns: Retrieve specific data that matches a pattern.
  • Cleaning and manipulating data: Perform operations like replacing, splitting, or formatting text.

What Are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern. Regular expressions enable you to:

  • Find specific words or phrases: Such as all four-letter words in a text.
  • Parse log files: Extract error messages, process IDs, or timestamps.
  • Validate input data: Ensure data matches a required format, like email addresses or phone numbers.

Regular expressions are powerful tools used across various programming languages and command-line tools, making them invaluable for developers, data scientists, and system administrators.

Basic Matching with Regular Expressions

Using the re Module in Python

Python provides the re module for working with regular expressions.

Importing the Module

import re

Key Functions

  • match(): Checks for a match at the beginning of the string.
  • search(): Searches for a pattern anywhere in the string.
  • findall(): Returns a list of all non-overlapping matches.
  • split(): Splits the string by occurrences of the pattern.
  • sub(): Replaces occurrences of the pattern with a replacement string.
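
As a quick illustration of the five functions listed above, here is a minimal sketch that runs each one against the same input. The sample string and variable names are made up for this example.

```python
import re

text = "cat bat rat"

# match() only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"cat", text)      # a Match object
not_at_start = re.match(r"bat", text)  # None, "bat" is not at position 0

# search() scans the whole string and returns the first match it finds.
first = re.search(r"bat", text)
print(first.span())   # (4, 7)

# findall() returns every non-overlapping match as a list of strings.
all_words = re.findall(r"[cbr]at", text)
print(all_words)      # ['cat', 'bat', 'rat']

# split() cuts the string wherever the pattern matches.
parts = re.split(r"\s", text)
print(parts)          # ['cat', 'bat', 'rat']

# sub() replaces each match with the replacement string.
replaced = re.sub(r"[cbr]at", "pet", text)
print(replaced)       # pet pet pet
```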

Basic Usage

text = "This is a good day."
if re.search("good", text):
    print("Wonderful!")
else:
    print("Alas :(")

Simple String Matching

Searching for Substrings

result = re.search(r"Python", "I love Python programming.")
print(result)
# Output: <re.Match object; span=(7, 13), match='Python'>

Case-Insensitive Matching

result = re.search(r"python", "I love Python programming.", re.IGNORECASE)
print(result)
# Output: <re.Match object; span=(7, 13), match='Python'>

Special Characters in Regular Expressions

Wildcards

  • Dot .: Matches any single character except a newline.

    re.search(r"p.ng", "penguin")  # Matches 'peng'

Anchors

  • Caret ^: Matches the start of a string.

  • Dollar $: Matches the end of a string.

    re.search(r"^Start", "Start of the line")  # Matches 'Start'
    re.search(r"end$", "This is the end") # Matches 'end'
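
The effect of the anchors is easiest to see when the same word appears in more than one position. The following is a small sketch with an invented sample string; it also combines both anchors to require that the pattern cover the whole string.

```python
import re

line = "Start of the line, near the end"

# ^ ties the pattern to the start of the string, $ to the end.
starts_with_start = re.search(r"^Start", line) is not None
ends_with_end = re.search(r"end$", line) is not None

# "line" occurs in the string, but not at the very end, so this is False.
ends_with_line = re.search(r"line$", line) is not None

# With both anchors, the pattern has to account for the entire string.
exactly_end = re.search(r"^end$", "end") is not None
not_exactly_end = re.search(r"^end$", "the end") is not None

print(starts_with_start, ends_with_end, ends_with_line)  # True True False
print(exactly_end, not_exactly_end)                      # True False
```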

Character Classes and Sets

Character Classes

  • Square Brackets []: Define a set of characters to match.

    re.findall(r"[aeiou]", "Python")  # Matches ['o']
  • Ranges:

    re.findall(r"[a-z]", "Python")  # Matches ['y', 't', 'h', 'o', 'n']
  • Negation ^: Matches any character not in the set.

    re.findall(r"[^aeiou]", "Python")  # Matches ['P', 'y', 't', 'h', 'n']

Alternation

  • Pipe |: Matches one pattern or another.

    re.search(r"cat|dog", "I have a dog.")  # Matches 'dog'
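
Sets, ranges, negation, and alternation can be tried side by side. This is a minimal sketch using invented sample strings; note that ^ means negation only when it is the first character inside the brackets.

```python
import re

# A set containing two ranges: any single ASCII letter, upper or lower case.
letters = re.findall(r"[A-Za-z]", "R2-D2")
print(letters)      # ['R', 'D']

# A negated set: every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "R2-D2")
print(non_digits)   # ['R', '-', 'D']

# Alternation: findall() reports whichever alternative matches, in text order.
text = "Order A7 shipped; order b12 is pending."
statuses = re.findall(r"shipped|pending", text)
print(statuses)     # ['shipped', 'pending']
```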

Quantifiers

Repetition Quantifiers

  • Asterisk *: Matches zero or more repetitions.

    re.search(r"Py.*n", "Python")  # Matches 'Python'
  • Plus +: Matches one or more repetitions.

    re.search(r"go+gle", "goooooogle")  # Matches 'goooooogle'
  • Question Mark ?: Matches zero or one repetition.

    re.search(r"colou?r", "color")    # Matches 'color'
    re.search(r"colou?r", "colour") # Matches 'colour'

Specific Number of Repetitions

  • Exact {n}: Matches exactly n repetitions.

    re.search(r"A{3}", "AAABC")  # Matches 'AAA'
  • Range {n,m}: Matches between n and m repetitions.

    re.search(r"\d{2,4}", "12345")  # Matches '1234'
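
To compare the quantifiers directly, the sketch below tests one family of patterns against the same set of strings. The sample strings are invented, and re.fullmatch() (a standard re function not covered above) is used so that a pattern only counts when it matches the entire string.

```python
import re

samples = ["ggle", "gogle", "google", "gooogle"]

# Each pattern differs only in the quantifier applied to the letter 'o'.
patterns = {
    "o*": r"go*gle",          # zero or more 'o'
    "o+": r"go+gle",          # one or more 'o'
    "o?": r"go?gle",          # zero or one 'o'
    "o{2}": r"go{2}gle",      # exactly two 'o'
    "o{1,2}": r"go{1,2}gle",  # one or two 'o'
}

# For each quantifier, collect the samples that match from start to end.
results = {
    label: [s for s in samples if re.fullmatch(pattern, s)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(label, matched)
# o* ['ggle', 'gogle', 'google', 'gooogle']
# o+ ['gogle', 'google', 'gooogle']
# o? ['ggle', 'gogle']
# o{2} ['google']
# o{1,2} ['gogle', 'google']
```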

Metacharacters and Special Sequences

Metacharacters

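  • \d: Matches any digit (equivalent to [0-9] for ASCII text; in Python 3 string patterns it also matches other Unicode decimal digits unless re.ASCII is set).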

  • \w: Matches any word character (letters, digits, underscore).

  • \s: Matches any whitespace character (space, tab, newline).

  • \b: Matches a word boundary.

    re.findall(r"\b\w{5}\b", "These are some words")  # Matches ['These', 'words'], the words with exactly 5 word characters

Escaping Special Characters

  • Use backslash \ to escape special characters.

    re.search(r"\.com", "www.example.com")  # Matches '.com'
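
The difference between an escaped and an unescaped dot shows up when the pattern is tested against text that has no dot at all. The sketch below uses an invented file name; it also uses re.escape(), a standard re function that adds the backslashes automatically, which can be convenient when the literal text is not known in advance.

```python
import re

filename = "report.v2.txt"

# Unescaped: each '.' matches any character, so this string matches too.
loose = re.search(r"report.v2.txt", "reportXv2Ytxt") is not None

# Escaped by hand: each '\.' matches only a literal dot.
strict = re.search(r"report\.v2\.txt", "reportXv2Ytxt") is not None

# re.escape() escapes the special characters in a literal string.
escaped_pattern = re.escape(filename)
found = re.search(escaped_pattern, "see report.v2.txt for details")

print(loose)          # True
print(strict)         # False
print(found.group())  # report.v2.txt
```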

Grouping and Capturing

Grouping Patterns

  • Parentheses (): Group parts of the pattern.

    pattern = r"(\w+), (\w+)"
    match = re.search(pattern, "Doe, John")
    print(match.groups()) # Output: ('Doe', 'John')

Named Groups

  • Syntax: (?P<name>pattern)

    pattern = r"(?P<last>\w+), (?P<first>\w+)"
    match = re.search(pattern, "Doe, John")
    print(match.group('first')) # Output: 'John'

Backreferences

  • Refer to captured groups later in the pattern.

    pattern = r"(\b\w+)\s+\1"
    re.search(pattern, "hello hello") # Matches 'hello hello'
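
A common use of backreferences is finding and removing accidentally doubled words. The following is a minimal sketch with an invented sentence; the trailing \b is an addition that prevents a match when the second word merely starts with the first (as in "the then"), and \1 in the replacement string refers to the same captured group.

```python
import re

sentence = "This is is a test of the the parser."

# Capture a whole word, then require whitespace and the same word again.
doubled_word = r"\b(\w+)\s+\1\b"

# findall() returns the captured group for each match.
doubled = re.findall(doubled_word, sentence)
print(doubled)   # ['is', 'the']

# In the replacement, \1 inserts the captured word once.
cleaned = re.sub(doubled_word, r"\1", sentence)
print(cleaned)   # This is a test of the parser.
```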

Look-ahead and Look-behind Assertions

Look-ahead

  • Positive Look-ahead (?=...): Asserts that the given pattern follows.

    pattern = r"\w+(?=;)"
    re.findall(pattern, "word1; word2; word3") # Matches ['word1', 'word2']

Look-behind

  • Positive Look-behind (?<=...): Asserts that the given pattern precedes.

    pattern = r"(?<=\$)\d+"
    re.findall(pattern, "Price: $100, Discount: $20") # Matches ['100', '20']
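
Look-ahead and look-behind can be combined in one pattern. This sketch, using an invented price list, keeps only amounts that are preceded by "$" and followed by " USD"; neither piece of context becomes part of the match.

```python
import re

text = "Items: $15 USD, $20 CAD, 30 USD, $45 USD"

# (?<=\$) requires a '$' just before the digits,
# (?= USD) requires ' USD' just after them.
usd_pattern = r"(?<=\$)\d+(?= USD)"

usd_amounts = re.findall(usd_pattern, text)
print(usd_amounts)     # ['15', '45']

# The match itself covers only the digits, not the surrounding context.
first = re.search(usd_pattern, text)
print(first.group())   # 15
print(first.span())    # (8, 10)
```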

Practical Examples

Extracting Process IDs from Logs

def extract_pid(log_line):
    # The PID is the run of digits inside square brackets.
    pattern = r"\[(\d+)\]"
    match = re.search(pattern, log_line)
    if match:
        return match.group(1)
    return None

log = "Jul 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
pid = extract_pid(log)
print(pid) # Output: '12345'

Validating Python Variable Names

def is_valid_variable(name):
    pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"
    return bool(re.match(pattern, name))

print(is_valid_variable("variable_1")) # Output: True
print(is_valid_variable("1_variable")) # Output: False

Extracting Email Addresses

text = "Please contact us at support@example.com or sales@example.org."
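# The domain must end in a word character, so the sentence's final period is not captured.
emails = re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", text)
print(emails) # Output: ['support@example.com', 'sales@example.org']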

Redacting Sensitive Information

text = "User's SSN is 123-45-6789."
redacted = re.sub(r"\d{3}-\d{2}-\d{4}", "[REDACTED]", text)
print(redacted) # Output: "User's SSN is [REDACTED]."
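
Redaction does not have to replace the whole match. As a variation on the example above, this sketch captures the last four digits in a group and keeps them in the replacement; the "XXX-XX-" mask format is an arbitrary choice for illustration.

```python
import re

text = "User's SSN is 123-45-6789."

# Only the last four digits are captured; \1 puts them back in the output.
masked = re.sub(r"\d{3}-\d{2}-(\d{4})", r"XXX-XX-\1", text)
print(masked)   # User's SSN is XXX-XX-6789.
```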

Splitting and Replacing

Splitting Strings

  • Using re.split():

    text = "One sentence. Another one? And the last one!"
    sentences = re.split(r"[.?!]\s*", text)
    print(sentences)
    # Output: ['One sentence', 'Another one', 'And the last one', '']
    # The final '!' is also a delimiter, which leaves a trailing empty string.

Replacing Patterns

  • Using re.sub():

    text = "The price is $100."
    new_text = re.sub(r"\$\d+", "[CENSORED]", text)
    print(new_text) # Output: "The price is [CENSORED]."

Advanced Topics

Verbose Mode for Complex Patterns

  • Using re.VERBOSE: Allows splitting the pattern into multiple lines with comments.

    pattern = r"""
    (?P<first_name>\w+) # First name
    \s+ # Whitespace
    (?P<last_name>\w+) # Last name
    """
    match = re.search(pattern, "John Doe", re.VERBOSE)
    print(match.groupdict()) # Output: {'first_name': 'John', 'last_name': 'Doe'}

Working with Multiline Text

  • Using re.MULTILINE: Changes the behavior of ^ and $ to match the start and end of each line.

    text = "First line\nSecond line\nThird line"
    lines = re.findall(r"^(\w+)", text, re.MULTILINE)
    print(lines) # Output: ['First', 'Second', 'Third']
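
To see what re.MULTILINE changes, the same pattern can be run with and without the flag. This is a minimal sketch on a three-line string written with explicit newline characters.

```python
import re

text = "First line\nSecond line\nThird line"

# Without the flag, ^ matches only at the very start of the string.
without_flag = re.findall(r"^\w+", text)
print(without_flag)   # ['First']

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r"^\w+", text, re.MULTILINE)
print(with_flag)      # ['First', 'Second', 'Third']

# Likewise, $ now matches just before each newline as well as at the end.
last_words = re.findall(r"\w+$", text, re.MULTILINE)
print(last_words)     # ['line', 'line', 'line']
```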


Tokenizing Strings

  • Tokenizing: Separating a string into substrings based on patterns.

    text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
    tokens = re.split(r"Amy", text)
    print(tokens)
    # Output: ['', ' works diligently. ', ' gets good grades. Our student ', ' is successful.']
  • Finding All Occurrences:

    names = re.findall(r"Amy", text)
    print(names) # Output: ['Amy', 'Amy', 'Amy']

More Practical Examples

Extracting Headers from Wikipedia Data

  • Example Pattern:

    # wiki_text is assumed to hold the scraped article text, where each header is followed by "[edit]"
    pattern = r"([\w ]+)(?=\[edit\])"
    matches = re.findall(pattern, wiki_text)
    print(matches)
  • Using Named Groups and Verbose Mode:

    pattern = r"""
    (?P<title>[\w ]+) # Article title
    (?=\[edit\]) # Look-ahead for '[edit]'
    """
    for match in re.finditer(pattern, wiki_text, re.VERBOSE):
        print(match.group('title'))

Extracting Hashtags from Tweets

  • Pattern to Extract Hashtags:

    pattern = r"#\w+(?=\s)"
    hashtags = re.findall(pattern, tweet_text)
    print(hashtags)
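
One thing to watch with this pattern is that the look-ahead requires whitespace after the hashtag, so a tag at the very end of the text is skipped. The sketch below uses an invented tweet to show this, along with one possible variant that also accepts the end of the string.

```python
import re

# Invented sample tweet; the last hashtag has nothing after it.
tweet_text = "Learning #regex with #Python3 is fun #coding"

# Whitespace must follow the tag, so '#coding' is not returned.
followed_by_space = re.findall(r"#\w+(?=\s)", tweet_text)
print(followed_by_space)   # ['#regex', '#Python3']

# Variant: accept either whitespace or the end of the string.
space_or_end = re.findall(r"#\w+(?=\s|$)", tweet_text)
print(space_or_end)        # ['#regex', '#Python3', '#coding']
```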

Conclusion

Regular expressions are powerful tools for text processing, enabling complex pattern matching and manipulation. They are widely used in programming languages and command-line tools for tasks such as data validation, parsing logs, and extracting information. Mastery of regular expressions enhances your ability to efficiently handle and process textual data.

Many online resources are available for practicing regular expressions. Here are some popular options for regex exercises and practice:

Online Regex Practice Platforms