Skip to main content

Merging DataFrames

Relational Theory Concepts

Database Terminology

  • Full Outer Join (Union): Includes all individuals from both sets (students and staff).
  • Inner Join (Intersection): Includes only individuals who are both students and staff.
  • Left Join: Includes all records from the left set (e.g., students) and the matched records from the right set (e.g., staff). If there is no match, the result will have NULL on the side of the right set.
  • Right Join: Includes all records from the right set (e.g., staff) and the matched records from the left set (e.g., students). If there is no match, the result will have NULL on the side of the left set.

Merging DataFrames in Pandas

Creating Sample DataFrames

import pandas as pd

# Staff DataFrame
staff_df = pd.DataFrame([
{'Name': 'Kelly', 'Role': 'Director of HR'},
{'Name': 'Sally', 'Role': 'Course liaison'},
{'Name': 'James', 'Role': 'Grader'}
]).set_index('Name')

# Student DataFrame
student_df = pd.DataFrame([
{'Name': 'James', 'School': 'Business'},
{'Name': 'Mike', 'School': 'Law'},
{'Name': 'Sally', 'School': 'Engineering'}
]).set_index('Name')

print(staff_df.head())
print(student_df.head())

Full Outer Join (Union)

# Union of the DataFrames
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

Inner Join (Intersection)

# Intersection of the DataFrames
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

Left Join

# Left Join (all staff and their student details if applicable)
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

Right Join

# Right Join (all students and their staff details if applicable)
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

Merging on Columns

# Reset index for merging on columns
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()

# Merge using the 'Name' column
pd.merge(staff_df, student_df, how='right', on='Name')

Handling Conflicts with Suffixes

# New DataFrames with Location information
staff_df = pd.DataFrame([
{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
{'Name': 'Sally', 'Role': 'Course liaison', 'Location': 'Washington Avenue'},
{'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}
])
student_df = pd.DataFrame([
{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
{'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
{'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}
])

# Merge with conflicts resolved by suffixes
pd.merge(staff_df, student_df, how='left', on='Name')

Multi-Indexing and Multiple Columns

# DataFrames with First and Last Names
staff_df = pd.DataFrame([
{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
{'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liaison'},
{'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}
])
student_df = pd.DataFrame([
{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
{'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
{'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}
])

# Merge on multiple columns
pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name'])

Concatenating DataFrames

Example with College Scorecard Data

# Suppressing warnings for demonstration purposes
%%capture
df_2011 = pd.read_csv("datasets/college_scorecard/MERGED2011_12_PP.csv", error_bad_lines=False)
df_2012 = pd.read_csv("datasets/college_scorecard/MERGED2012_13_PP.csv", error_bad_lines=False)
df_2013 = pd.read_csv("datasets/college_scorecard/MERGED2013_14_PP.csv", error_bad_lines=False)

# Concatenate DataFrames
frames = [df_2011, df_2012, df_2013]
combined_df = pd.concat(frames)

# Adding keys to keep track of the source DataFrame
combined_df_with_keys = pd.concat(frames, keys=['2011', '2012', '2013'])

Handling Columns with Inner and Outer Methods

# Inner and Outer Concatenation
inner_concat = pd.concat([df_2011, df_2012], join='inner')
outer_concat = pd.concat([df_2011, df_2012], join='outer')