Idioms
Python programmers tend to prefer certain idioms because they are clearer and often more efficient than the alternatives. Pandas, a data-analysis library for Python, has its own set of idioms, and code that follows them is sometimes called "pandorable." These idioms improve readability and, in some cases, performance. Here are some key features that make your code pandorable.
Method Chaining
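Method chaining in pandas allows multiple operations on a DataFrame to be condensed into a single statement. It works because most DataFrame methods return a new DataFrame, so the next method can be called directly on the result.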
Pandorable Example
This example filters for county-level rows (SUMLEV equal to 50), drops the rows that the filter blanked out, sets the state and county names as a multi-index, and renames a column:
(df.where(df['SUMLEV'] == 50)
   .dropna()
   .set_index(['STNAME', 'CTYNAME'])
   .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
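To try the chain without the census file, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration; only the column names come from the example above. One thing it makes visible: where replaces non-matching rows with NaN before dropna removes them, so integer columns come back as floats.

```python
import pandas as pd

# Toy stand-in for the census data; the values are invented for illustration.
toy = pd.DataFrame({
    'SUMLEV': [40, 50, 50],
    'STNAME': ['Alabama', 'Alabama', 'Alabama'],
    'CTYNAME': ['Alabama', 'Autauga County', 'Baldwin County'],
    'ESTIMATESBASE2010': [4780127, 54571, 182265],
})

chained = (toy.where(toy['SUMLEV'] == 50)   # rows that fail the test become NaN
              .dropna()                      # remove those NaN rows
              .set_index(['STNAME', 'CTYNAME'])
              .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

print(chained)
print(chained.dtypes)  # inspect the dtypes after the NaN round trip
```

If the float conversion matters, filtering with a boolean mask (toy[toy['SUMLEV'] == 50]) keeps the original dtypes.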
Non-Pandorable Example
A more traditional, non-pandorable approach:
df = df[df['SUMLEV'] == 50]  # Filter for county-level rows
df.set_index(['STNAME', 'CTYNAME'], inplace=True)  # Set the new index in place
df = df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})  # rename returns a new DataFrame, so assign it back
Performance Comparison
Compare the performance of both approaches using the timeit module:
First Approach
import timeit
import pandas as pd

def first_approach():
    global df
    return (df.where(df['SUMLEV'] == 50)
              .dropna()
              .set_index(['STNAME', 'CTYNAME'])
              .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

df = pd.read_csv('datasets/census.csv')
timeit.timeit(first_approach, number=10)  # Total seconds for 10 runs
Second Approach
def second_approach():
    global df
    new_df = df[df['SUMLEV'] == 50]
    new_df.set_index(['STNAME', 'CTYNAME'], inplace=True)
    return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

df = pd.read_csv('datasets/census.csv')
timeit.timeit(second_approach, number=10)
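The comparison can also be run without the census file. The following is a sketch on synthetic data: the frame bench_df, its size, and the function names chained and stepwise are all assumptions made for illustration, and the stepwise version assigns the result of set_index instead of using inplace=True. Timings depend on the machine and the pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the census file: 10,000 rows, half with SUMLEV == 50.
n = 10_000
bench_df = pd.DataFrame({
    'SUMLEV': np.where(np.arange(n) % 2 == 0, 50, 40),
    'STNAME': ['State %d' % (i % 50) for i in range(n)],
    'CTYNAME': ['County %d' % i for i in range(n)],
    'ESTIMATESBASE2010': np.arange(n),
})

def chained():
    # where + dropna, all in one chained expression
    return (bench_df.where(bench_df['SUMLEV'] == 50)
                    .dropna()
                    .set_index(['STNAME', 'CTYNAME'])
                    .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

def stepwise():
    # boolean-mask filter, then one statement per step
    new_df = bench_df[bench_df['SUMLEV'] == 50]
    new_df = new_df.set_index(['STNAME', 'CTYNAME'])
    return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

t_chained = timeit.timeit(chained, number=10)
t_stepwise = timeit.timeit(stepwise, number=10)
print(f'chained:  {t_chained:.4f} s for 10 runs')
print(f'stepwise: {t_stepwise:.4f} s for 10 runs')
```

Both functions select the same rows, so any difference in time comes from how the filtering is done (where followed by dropna versus a boolean mask), not from chaining as such.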
Apply Function
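The apply function calls a function along one axis of a DataFrame. With axis='columns' (equivalently axis=1), the function receives each row as a Series, which makes row-wise operations straightforward.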
Example: Calculating Min and Max
Define a function to calculate the minimum and maximum values for population estimates:
import numpy as np

def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

df.apply(min_max, axis='columns').head()
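As a small, self-contained illustration of what this returns, the sketch below runs the same idea on an invented two-row frame with only three of the estimate columns. Because the function returns a Series with the labels 'min' and 'max', apply assembles the results into a DataFrame with those two columns.

```python
import numpy as np
import pandas as pd

# Invented population figures; only the column names follow the census file.
toy = pd.DataFrame({
    'POPESTIMATE2010': [100, 250],
    'POPESTIMATE2011': [110, 240],
    'POPESTIMATE2012': [105, 260],
})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

def min_max(row):
    data = row[cols]  # the estimate values for this one row
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

summary = toy.apply(min_max, axis='columns')
print(summary)
```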
Adding New Columns
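Modify the function to add the new values to each row and return the whole row. apply then returns a new DataFrame that includes the extra columns; the original df is unchanged unless the result is assigned back to it: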
def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

df.apply(min_max, axis='columns')
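A point that is easy to miss here is that apply does not modify the frame it is called on. The sketch below, on an invented two-row frame, shows the new columns appearing only in the returned DataFrame.

```python
import numpy as np
import pandas as pd

# Invented figures; only the column names follow the census file.
toy = pd.DataFrame({
    'POPESTIMATE2010': [100, 250],
    'POPESTIMATE2011': [110, 240],
})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011']

def min_max(row):
    data = row[cols]
    row['max'] = np.max(data)  # extends the row with a new label
    row['min'] = np.min(data)
    return row

result = toy.apply(min_max, axis='columns')

print(list(toy.columns))     # the original frame keeps its two columns
print(list(result.columns))  # the result carries the added 'max' and 'min'
```

To keep the new columns, assign the result back, for example toy = toy.apply(min_max, axis='columns').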
Using Lambdas
For a one-off calculation, a lambda can replace the named function. This one computes the maximum across a list of column names for each row:
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df.apply(lambda x: np.max(x[cols]), axis=1).head()
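For a simple reduction like this one, pandas also offers a built-in that gives the same values without calling a Python function for every row. The sketch below compares the two on an invented frame; the OTHER column is hypothetical and is there only to show that unselected columns are ignored.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'POPESTIMATE2010': [100, 250],
    'POPESTIMATE2011': [110, 240],
    'OTHER': [999, 999],  # hypothetical extra column, not part of the selection
})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011']

via_apply = toy.apply(lambda x: np.max(x[cols]), axis=1)  # row-by-row lambda
via_builtin = toy[cols].max(axis=1)                       # built-in row-wise max

print(via_apply.tolist())
print(via_builtin.tolist())
```

apply remains the more general tool when the per-row logic cannot be expressed with a built-in method.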
Custom Functions with Apply
Custom functions can categorize or manipulate data based on specific logic.
Example: Categorizing States into Regions
Define a function to determine the region of a state:
def get_state_region(x):
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island','Vermont','New York','New Jersey','Pennsylvania']
    midwest = ['Illinois','Indiana','Michigan','Ohio','Wisconsin','Iowa','Kansas','Minnesota','Missouri','Nebraska','North Dakota','South Dakota']
    south = ['Delaware','Florida','Georgia','Maryland','North Carolina','South Carolina','Virginia','District of Columbia','West Virginia','Alabama','Kentucky','Mississippi','Tennessee','Arkansas','Louisiana','Oklahoma','Texas']
    west = ['Arizona','Colorado','Idaho','Montana','Nevada','New Mexico','Utah','Wyoming','Alaska','California','Hawaii','Oregon','Washington']
    if x in northeast:
        return "Northeast"
    elif x in midwest:
        return "Midwest"
    elif x in south:
        return "South"
    else:
        # Any name not in the three lists above is labeled West
        return "West"
Applying the Function
Use the apply function to create a new column for state regions:
df['state_region'] = df['STNAME'].apply(get_state_region)
df[['STNAME','state_region']].head()
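Because the function's else branch labels every unlisted name as "West", a misspelled or unexpected state name would be classified silently. One possible alternative, sketched here and not part of the original example, is a dictionary lookup with Series.map, which leaves unknown names as NaN. The region_of table is deliberately partial and the states Series is made up for illustration.

```python
import pandas as pd

# Deliberately partial lookup table; a real one would list every state.
region_of = {
    'Maine': 'Northeast', 'Vermont': 'Northeast',
    'Ohio': 'Midwest', 'Iowa': 'Midwest',
    'Texas': 'South', 'Florida': 'South',
    'Utah': 'West', 'Oregon': 'West',
}

# Made-up input, including one name that has no entry in the table.
states = pd.Series(['Ohio', 'Texas', 'Maine', 'Oregon', 'Puerto Rico'], name='STNAME')

# Series.map looks each value up in the dict; names without an entry become NaN.
regions = states.map(region_of)
print(regions)
```

Whether NaN or a default label is the better behavior depends on how much the input names can be trusted.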