Skip to main content

Querying `DataFrame`

Boolean Masking

Boolean masking is fundamental to fast and efficient querying in numpy and pandas. It allows you to filter data based on certain criteria using boolean conditions.

Boolean Mask

A Boolean mask is an array of the same shape as the data you are querying, containing True or False values. This mask is overlaid on the data, allowing the selection of elements aligned with True values while excluding those aligned with False values.

Example with Graduate Admission Dataset

Let's go through an example step-by-step:

  1. Import Pandas and Load Dataset

    import pandas as pd
    df = pd.read_csv('datasets/Admission_Predict.csv', index_col=0)
    df.columns = [x.lower().strip() for x in df.columns]
    df.head()
  2. Creating a Boolean Mask To filter students with a chance of admit higher than 0.7:

    admit_mask = df['chance of admit'] > 0.7
    admit_mask
  3. Applying the Boolean Mask Use the .where() method to apply the mask and retain the data that meets the condition:

    df.where(admit_mask).head()
  4. Dropping NaN Values To remove rows that do not meet the condition:

    df.where(admit_mask).dropna().head()
  5. Shorthand Syntax for Querying Pandas provides a shorthand syntax combining .where() and dropna() using the indexing operator:

    df[df['chance of admit'] > 0.7].head()

Indexing Operator

The indexing operator [] in pandas can be used in multiple ways:

  • Project a Single Column
    df["gre score"].head()
  • Project Multiple Columns
    df[["gre score", "toefl score"]].head()
  • Apply a Boolean Mask
    df[df["gre score"] > 320].head()

Combining Multiple Boolean Masks

Combining multiple criteria for filtering is often necessary. This can be done using logical operators:

  • Logical AND: Use &

    (df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)
  • Logical OR: Use |

    (df['chance of admit'] > 0.7) | (df['gre score'] > 320)

Important: Order of Operations

Always use parentheses to ensure the correct order of operations in boolean comparisons:

df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9  # Incorrect
df['chance of admit'] > 0.7 & (df['chance of admit'] < 0.9) # Correct

Built-in Functions for Comparisons

Pandas provides built-in functions for comparisons which can be more readable:

df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)
df['chance of admit'].gt(0.7).lt(0.9)