Missing Values
Missing values are common in data cleaning activities and can occur for various reasons. Understanding the nature and handling of missing data is crucial for accurate data analysis.
Types of Missing Data
- Missing at Random (MAR): Missing values can be predicted by other variables in the dataset.
- Missing Completely at Random (MCAR): Missing values have no relationship to any other variables.
Reasons for Missing Data
- Survey omissions: Respondents might skip questions.
- Data collection errors: Some data might not be collected due to process errors.
- Irrelevant data: Certain data points might not make sense to collect (e.g., students generally don't have office locations in a university database).
Handling Missing Data in Pandas
Detecting Missing Values
Pandas can detect missing values from various data formats like CSV files, which often label missing values as NaN, NULL, None, or N/A. However, some datasets might use unconventional labels (e.g., 99 for missing binary categories). The read_csv function's na_values parameter can specify these forms.
df = pd.read_csv('datasets/class_grades.csv', na_values=[99])
Checking for Missing Values
The .isnull() function creates a boolean mask indicating the presence of missing values.
mask = df.isnull()
print(mask.head(10))
Dropping Missing Values
The dropna() function removes rows with any missing data.
df_clean = df.dropna()
print(df_clean.head(10))
Filling Missing Values
The fillna() function fills missing values with specified values.
df.fillna(0, inplace=True)
print(df.head(10))
Forward and Backward Filling
- Forward fill (ffill): Replaces NaN with the previous row's value.
- Backward fill (bfill): Replaces NaN with the next row's value.
df = df.fillna(method='ffill')
Sorting Data for Filling
Data needs to be sorted for ffill and bfill to work as intended.
df = df.set_index('time')
df = df.sort_index()
df = df.reset_index()
df = df.set_index(['time', 'user'])
Using replace() for Custom Fill-ins
The replace() function allows various approaches for value replacement, including using regex.
# Replace 1 with 100
df.replace(1, 100)
# Replace multiple values
df.replace([1, 3], [100, 300])
# Replace using regex
df.replace(to_replace=".*.html$", value="webpage", regex=True)
Ignoring Missing Values in Statistical Functions
Pandas statistical functions typically ignore missing values, which is often desirable but should be taken into account based on the specific problem being solved.
Example DataFrame Operations
# Loading data
df = pd.read_csv("datasets/log.csv")
print(df.head(20))
# Replace HTML pages with 'webpage'
df.replace(to_replace=".*.html$", value="webpage", regex=True)