regex

Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:

Checking the presence of patterns: Verify if certain patterns exist within text data.
Extracting instances of complex patterns: Retrieve specific data that matches a pattern.
Cleaning and manipulating data: Perform operations like replacing, splitting, or formatting text.

What Are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern. Regular expressions enable you to:

Find specific words or phrases: Such as all four-letter words in a text.
Parse log files: Extract error messages, process IDs, or timestamps.
Validate input data: Ensure data matches a required format, like email addresses or phone numbers.

Regular expressions are powerful tools used across various programming languages and command-line tools, making them invaluable for developers, data scientists, and system administrators.

Basic Matching with Regular Expressions

Using the `re` Module in Python

Python provides the re module for working with regular expressions.

Importing the Module

import re

Key Functions

match(): Checks for a match at the beginning of the string.
search(): Searches for a pattern anywhere in the string.
findall(): Returns a list of all non-overlapping matches.
split(): Splits the string by occurrences of the pattern.
sub(): Replaces occurrences of the pattern with a replacement string.
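
To make the differences between these functions concrete, the following minimal sketch runs all five on one made-up string. The sample text and the variable names are illustrative assumptions.

```python
import re

# Hypothetical sample string used only for illustration.
text = "apple, banana; cherry"

# match() succeeds only if the pattern matches at position 0.
at_start = re.match(r"apple", text)
not_at_start = re.match(r"banana", text)

# search() scans the whole string and returns the first match.
anywhere = re.search(r"banana", text)

# findall() returns every non-overlapping match as a list of strings.
words = re.findall(r"[a-z]+", text)

# split() cuts the string wherever the pattern matches.
parts = re.split(r"[,;] ", text)

# sub() replaces every match with the replacement string.
replaced = re.sub(r"[,;]", " and", text)

print(at_start is not None, not_at_start is None)  # True True
print(anywhere.span())                             # (7, 13)
print(words)                                       # ['apple', 'banana', 'cherry']
print(parts)                                       # ['apple', 'banana', 'cherry']
print(replaced)                                    # apple and banana and cherry
```

Note that match() and search() return a match object or None, while findall() and split() return lists and sub() returns a new string.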

Basic Usage

text = "This is a good day."
if re.search("good", text):
    print("Wonderful!")
else:
    print("Alas :(")

Simple String Matching

Searching for Substrings

result = re.search(r"Python", "I love Python programming.")
print(result)
# Output: <re.Match object; span=(7, 13), match='Python'>

Case-Insensitive Matching

result = re.search(r"python", "I love Python programming.", re.IGNORECASE)
print(result)
# Output: <re.Match object; span=(7, 13), match='Python'>

Special Characters in Regular Expressions

Wildcards

Dot .: Matches any single character except a newline.

re.search(r"p.ng", "penguin")  # Matches 'peng'

Anchors

Caret ^: Matches the start of a string.

Dollar $: Matches the end of a string.

re.search(r"^Start", "Start of the line")  # Matches 'Start'
re.search(r"end$", "This is the end")      # Matches 'end'

Character Classes and Sets

Character Classes

Square Brackets []: Define a set of characters to match.
```
re.findall(r"[aeiou]", "Python")  # Matches ['o']
```

Ranges:

re.findall(r"[a-z]", "Python")  # Matches lowercase letters

Negation ^: Matches any character not in the set.

re.findall(r"[^aeiou]", "Python")  # Matches ['P', 'y', 't', 'h', 'n']

Alternation

Pipe |: Matches one pattern or another.

re.search(r"cat|dog", "I have a dog.")  # Matches 'dog'

Quantifiers

Repetition Quantifiers

Asterisk *: Matches zero or more repetitions.

re.search(r"Py.*n", "Python")  # Matches 'Python'

Plus +: Matches one or more repetitions.

re.search(r"go+gle", "goooooogle")  # Matches 'goooooogle'

Question Mark ?: Matches zero or one repetition.

re.search(r"colou?r", "color")    # Matches 'color'
re.search(r"colou?r", "colour")   # Matches 'colour'

Specific Number of Repetitions

Exact {n}: Matches exactly n repetitions.

re.search(r"A{3}", "AAABC")  # Matches 'AAA'

Range {n,m}: Matches between n and m repetitions.

re.search(r"\d{2,4}", "12345")  # Matches '1234'
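
As a rough illustration of how the quantifiers above differ, the sketch below tests four patterns against the same hypothetical sample strings. It uses re.fullmatch(), which is not covered above and succeeds only when the entire string matches the pattern.

```python
import re

# Hypothetical sample strings containing zero to three 'o' characters.
samples = ["ggle", "gogle", "google", "gooogle"]

patterns = [
    r"go*gle",      # zero or more 'o'
    r"go+gle",      # one or more 'o'
    r"go?gle",      # zero or one 'o'
    r"go{2,3}gle",  # two or three 'o'
]

# For each pattern, keep the samples that match in full.
results = {
    pattern: [s for s in samples if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)
# go*gle ['ggle', 'gogle', 'google', 'gooogle']
# go+gle ['gogle', 'google', 'gooogle']
# go?gle ['ggle', 'gogle']
# go{2,3}gle ['google', 'gooogle']
```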

Metacharacters and Special Sequences

Metacharacters

\d: Matches any digit (equivalent to [0-9]).
\w: Matches any word character (letters, digits, underscore).
\s: Matches any whitespace character (space, tab, newline).

\b: Matches a word boundary, the position between a word character and a non-word character.

re.findall(r"\b\w{5}\b", "These are some words")  # Matches ['These', 'words'], the words with exactly 5 characters
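
The sequences \d, \w, and \s are listed above without an example, so here is a small sketch of each, plus \b used to match a whole word. The sample strings are made up for illustration.

```python
import re

# Hypothetical sample line; the "\t" is a real tab character.
line = "Order 66 shipped\tto user_42 on 2024-01-15"

digits = re.findall(r"\d+", line)   # runs of digits
words = re.findall(r"\w+", line)    # runs of letters, digits, and underscores
fields = re.split(r"\s+", line)     # split on runs of spaces or tabs

# \b keeps 'cat' from matching inside 'concat'.
whole_cat = re.findall(r"\bcat\b", "cat concat cat's")

print(digits)     # ['66', '42', '2024', '01', '15']
print(words)      # ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2024', '01', '15']
print(fields)     # ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2024-01-15']
print(whole_cat)  # ['cat', 'cat']
```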

Escaping Special Characters

Use backslash \ to escape special characters.

re.search(r"\.com", "www.example.com")  # Matches '.com'

Grouping and Capturing

Grouping Patterns

Parentheses (): Group parts of the pattern.

pattern = r"(\w+), (\w+)"
match = re.search(pattern, "Doe, John")
print(match.groups())  # Output: ('Doe', 'John')

Named Groups

Syntax: (?P<name>pattern)

pattern = r"(?P<last>\w+), (?P<first>\w+)"
match = re.search(pattern, "Doe, John")
print(match.group('first'))  # Output: 'John'

Backreferences

Backreference \1: Refers to the text captured by group 1 later in the same pattern.

pattern = r"(\b\w+)\s+\1"
re.search(pattern, "hello hello")  # Matches 'hello hello'
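
A backreference can also appear in the replacement string of re.sub(). The following sketch, using a made-up sentence, collapses accidentally doubled words into one.

```python
import re

# Hypothetical sample sentence with two doubled words.
text = "This is is a test test sentence."

# \b(\w+) captures a whole word; \s+\1\b requires the same word to follow.
# In the replacement, \1 inserts the captured word once.
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(deduped)  # This is a test sentence.
```

The leading \b matters here: without it, the "is" at the end of "This" followed by " is" would also count as a repeated word.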

Look-ahead and Look-behind Assertions

Look-ahead

Positive Look-ahead (?=...): Asserts that the given pattern follows.

pattern = r"\w+(?=;)"
re.findall(pattern, "word1; word2; word3")  # Matches ['word1', 'word2']

Look-behind

Positive Look-behind (?<=...): Asserts that the given pattern precedes.

pattern = r"(?<=\$)\d+"
re.findall(pattern, "Price: $100, Discount: $20")  # Matches ['100', '20']

Practical Examples

Extracting Process IDs from Logs

def extract_pid(log_line):
    pattern = r"\[(\d+)\]"
    match = re.search(pattern, log_line)
    if match:
        return match.group(1)
    return None

log = "Jul 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
pid = extract_pid(log)
print(pid)  # Output: '12345'

Validating Python Variable Names

def is_valid_variable(name):
    pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"
    return bool(re.match(pattern, name))

print(is_valid_variable("variable_1"))  # Output: True
print(is_valid_variable("1_variable"))  # Output: False
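
One subtlety of the anchored pattern is that $ also matches just before a trailing newline, so a name such as "variable_1\n" passes the check. A possible stricter variant, sketched below, uses fullmatch() instead of anchors. The names IDENTIFIER and is_valid_variable_strict are hypothetical, and the pattern still covers only ASCII names and does not reject keywords such as "class".

```python
import re

# Compiled once so the pattern can be reused across calls.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_valid_variable_strict(name):
    # fullmatch() requires the entire string to match, so no anchors are needed.
    return IDENTIFIER.fullmatch(name) is not None

# The anchored version accepts a trailing newline; the strict one does not.
anchored_result = bool(re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", "variable_1\n"))
strict_result = is_valid_variable_strict("variable_1\n")

print(anchored_result)                         # True
print(strict_result)                           # False
print(is_valid_variable_strict("variable_1"))  # True
print(is_valid_variable_strict("1_variable"))  # False
```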

Extracting Email Addresses

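text = "Please contact us at support@example.com or sales@example.org."
# Each dot in the domain must be followed by more characters,
# so the sentence-ending period is not captured.
emails = re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", text)
print(emails)  # Output: ['support@example.com', 'sales@example.org']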

Redacting Sensitive Information

text = "User's SSN is 123-45-6789."
redacted = re.sub(r"\d{3}-\d{2}-\d{4}", "[REDACTED]", text)
print(redacted)  # Output: "User's SSN is [REDACTED]."

Splitting and Replacing

Splitting Strings

Using re.split():

text = "One sentence. Another one? And the last one!"
sentences = re.split(r"[.?!]\s*", text)
print(sentences)
# Output: ['One sentence', 'Another one', 'And the last one', '']
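
re.split() returns an empty string when the pattern matches at the very start or end of the input, which happens here because the text ends with "!". A minimal sketch of two common follow-ups is shown below: dropping the empty pieces, and keeping the delimiters by wrapping them in a capturing group.

```python
import re

text = "One sentence. Another one? And the last one!"

# The final '!' is a delimiter with nothing after it, so the last piece is ''.
raw = re.split(r"[.?!]\s*", text)

# Drop empty pieces.
sentences = [s for s in raw if s]

# A capturing group makes split() include the delimiters in the result.
with_delims = re.split(r"([.?!])\s*", text)

print(raw)          # ['One sentence', 'Another one', 'And the last one', '']
print(sentences)    # ['One sentence', 'Another one', 'And the last one']
print(with_delims)  # ['One sentence', '.', 'Another one', '?', 'And the last one', '!', '']
```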

Replacing Patterns

Using re.sub():

text = "The price is $100."
new_text = re.sub(r"\$\d+", "[CENSORED]", text)
print(new_text)  # Output: "The price is [CENSORED]."
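
The replacement passed to re.sub() does not have to be a fixed string. It can also be a function that receives each match object and returns the replacement text. The sketch below uses a hypothetical helper, double_price, and a made-up sentence.

```python
import re

# Hypothetical sample text.
text = "Items cost $100 and $25."

def double_price(match):
    # group(1) holds the digits captured after the dollar sign.
    amount = int(match.group(1))
    return f"${amount * 2}"

doubled = re.sub(r"\$(\d+)", double_price, text)

print(doubled)  # Items cost $200 and $50.
```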

Advanced Topics

Verbose Mode for Complex Patterns

Using re.VERBOSE: Allows splitting the pattern into multiple lines with comments.

pattern = r"""
    (?P<first_name>\w+)       # First name
    \s+                       # Whitespace
    (?P<last_name>\w+)        # Last name
"""
match = re.search(pattern, "John Doe", re.VERBOSE)
print(match.groupdict())  # Output: {'first_name': 'John', 'last_name': 'Doe'}

Working with Multiline Text

Using re.MULTILINE: Changes the behavior of ^ and $ to match the start and end of each line.
```
text = """First line
Second line
Third line"""
lines = re.findall(r"^(\w+)", text, re.MULTILINE)
print(lines)  # Output: ['First', 'Second', 'Third']
```
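
To see what the flag changes, the following sketch runs the same patterns with and without re.MULTILINE on a small three-line string. The variable names are illustrative.

```python
import re

text = "First line\nSecond line\nThird line"

# Without the flag, ^ matches only at the start of the whole string.
without_flag = re.findall(r"^\w+", text)

# With the flag, ^ also matches right after each newline.
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

# Likewise, $ matches at the end of every line.
last_words = re.findall(r"\w+$", text, re.MULTILINE)

print(without_flag)  # ['First']
print(with_flag)     # ['First', 'Second', 'Third']
print(last_words)    # ['line', 'line', 'line']
```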

Tokenizing Strings

Tokenizing: Separating a string into substrings based on patterns.

```python
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
tokens = re.split(r"Amy", text)
print(tokens)
# Output: ['', ' works diligently. ', ' gets good grades. Our student ', ' is successful.']
```

Finding All Occurrences:

names = re.findall(r"Amy", text)
print(names)  # Output: ['Amy', 'Amy', 'Amy']

Practical Examples

Extracting Headers from Wikipedia Data

Example Pattern:

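# wiki_text is assumed to hold page text in which each header ends with '$$edit$$'.
wiki_text = "Overview$$edit$$\nSome text.\nHistory$$edit$$"  # hypothetical sample
pattern = r"([\w ]+)(?=\$\$edit\$\$)"  # + rather than * avoids empty matches
matches = re.findall(pattern, wiki_text)
print(matches)  # Output: ['Overview', 'History']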

Using Named Groups and Verbose Mode:

pattern = r"""
    (?P<title>[\w ]+)  # Article title
    (?=\$\$edit\$\$)   # Look-ahead for '$$edit$$'
"""
for match in re.finditer(pattern, wiki_text, re.VERBOSE):
    print(match.group('title'))

Extracting Hashtags from Tweets

Pattern to Extract Hashtags:

pattern = r"#\w+(?=\s)"
hashtags = re.findall(pattern, tweet_text)
print(hashtags)
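
Because the look-ahead requires whitespace after the hashtag, a hashtag at the very end of the text is not matched. The sketch below shows this on a made-up tweet (tweet_text is a hypothetical sample, not real data) and compares it with a simpler pattern that has no look-ahead.

```python
import re

# Hypothetical sample tweet; the last hashtag has no trailing whitespace.
tweet_text = "Learning #regex with #Python3 is fun #coding"

# Requires a whitespace character right after the hashtag.
with_lookahead = re.findall(r"#\w+(?=\s)", tweet_text)

# No look-ahead: \w+ stops by itself at the first non-word character.
all_tags = re.findall(r"#\w+", tweet_text)

print(with_lookahead)  # ['#regex', '#Python3']
print(all_tags)        # ['#regex', '#Python3', '#coding']
```

Which variant is appropriate depends on the data. The look-ahead version skips hashtags that are immediately followed by punctuation or the end of the text.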

Conclusion

Regular expressions are powerful tools for text processing, enabling complex pattern matching and manipulation. They are widely used in programming languages and command-line tools for tasks such as data validation, parsing logs, and extracting information. Mastery of regular expressions enhances your ability to efficiently handle and process textual data.

Online Regex Practice Platforms

Many online resources offer exercises for practicing regular expressions (regex).
