Skip to main content

Pivot Tables

A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation function. A pivot table is itself a DataFrame where the rows represent one variable, the columns represent another, and the cells contain some aggregate value. Pivot tables often include marginal values, which are the sums for each column and row. This allows for an easy visual representation of the relationship between two variables.

Categorizing Data

You can create a new column to categorize data based on certain conditions. For example, categorizing universities based on their world ranking:

def create_category(ranking):
if (ranking >= 1) & (ranking <= 100):
return "First Tier Top University"
elif (ranking >= 101) & (ranking <= 200):
return "Second Tier Top University"
elif (ranking >= 201) & (ranking <= 300):
return "Third Tier Top University"
return "Other Top University"

df['Rank_Level'] = df['world_rank'].apply(lambda x: create_category(x))
df.head()

Creating a Basic Pivot Table

To create a pivot table comparing rank levels versus country of universities in terms of overall score:

pivot = df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=np.mean)
pivot.head()

This creates a hierarchical DataFrame where the index (rows) are by country and the columns have two levels: the mean value and the rank levels.

Handling NaN Values

NaN values indicate missing data. For instance, if a country has observations only in the "Other Top Universities" category, it will have NaN values for the other categories.

Using Multiple Aggregation Functions

You can pass multiple functions to the aggfunc parameter:

pivot_multi = df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max])
pivot_multi.head()

This will add hierarchical column names showing both the mean and the maximum values.

Adding Marginal Values

To include marginal values for overall averages and maximums:

pivot_margins = df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], margins=True)
pivot_margins.head()

Accessing Data in Pivot Tables

You can access specific data in the pivot table similar to a regular DataFrame. For example, to get the average scores of First Tier Top Universities in each country:

first_tier_mean = pivot_multi['mean']['First Tier Top University']
first_tier_mean.head()

To find the country with the maximum average score for First Tier Top Universities:

first_tier_mean.idxmax()

Reshaping Pivot Tables with Stack and Unstack

Stacking pivots the lowermost column index to become the innermost row index. Unstacking is the inverse operation.

stacked_df = pivot_margins.stack()
stacked_df.head()

unstacked_df = stacked_df.unstack()
unstacked_df.head()

Unstacking twice in a row returns a series object, as it unravels all hierarchical indices.

unstacked_twice = stacked_df.unstack().unstack()
unstacked_twice.head()