Missing Values

Missing values are common in data cleaning activities and can occur for various reasons. Understanding the nature and handling of missing data is crucial for accurate data analysis.

Types of Missing Data

Missing at Random (MAR): Missing values can be predicted by other variables in the dataset.
Missing Completely at Random (MCAR): Missing values have no relationship to any other variables.

Reasons for Missing Data

Survey omissions: Respondents might skip questions.
Data collection errors: Some data might not be collected due to process errors.
Irrelevant data: Certain data points might not make sense to collect (e.g., students generally don't have office locations in a university database).

Handling Missing Data in Pandas

Detecting Missing Values

Pandas can detect missing values from various data formats like CSV files, which often label missing values as NaN, NULL, None, or N/A. However, some datasets might use unconventional labels (e.g., 99 for missing binary categories). The read_csv function's na_values parameter can specify these forms.

df = pd.read_csv('datasets/class_grades.csv', na_values=[99])

Checking for Missing Values

The .isnull() function creates a boolean mask indicating the presence of missing values.

mask = df.isnull()
print(mask.head(10))

Dropping Missing Values

The dropna() function removes rows with any missing data.

df_clean = df.dropna()
print(df_clean.head(10))

Filling Missing Values

The fillna() function fills missing values with specified values.

df.fillna(0, inplace=True)
print(df.head(10))

Forward and Backward Filling

Forward fill (ffill): Replaces NaN with the previous row's value.
Backward fill (bfill): Replaces NaN with the next row's value.

df = df.fillna(method='ffill')

Sorting Data for Filling

Data needs to be sorted for ffill and bfill to work as intended.

df = df.set_index('time')
df = df.sort_index()
df = df.reset_index()
df = df.set_index(['time', 'user'])

Using `replace()` for Custom Fill-ins

The replace() function allows various approaches for value replacement, including using regex.

# Replace 1 with 100
df.replace(1, 100)

# Replace multiple values
df.replace([1, 3], [100, 300])

# Replace using regex
df.replace(to_replace=".*.html$", value="webpage", regex=True)

Ignoring Missing Values in Statistical Functions

Pandas statistical functions typically ignore missing values, which is often desirable but should be taken into account based on the specific problem being solved.

Example DataFrame Operations

# Loading data
df = pd.read_csv("datasets/log.csv")
print(df.head(20))

# Replace HTML pages with 'webpage'
df.replace(to_replace=".*.html$", value="webpage", regex=True)

Types of Missing Data​

Reasons for Missing Data​

Handling Missing Data in Pandas​

Detecting Missing Values​

Checking for Missing Values​

Dropping Missing Values​

Filling Missing Values​

Forward and Backward Filling​

Sorting Data for Filling​

Using replace() for Custom Fill-ins​

Ignoring Missing Values in Statistical Functions​

Example DataFrame Operations​