`DataFrame` data structure

The DataFrame data structure is the heart of the pandas library. It is the primary object used in data analysis and cleaning tasks. Conceptually, a DataFrame is a two-dimensional Series object with an index and multiple columns of content, each column having a label. In effect, a DataFrame is a two-axis labeled array.
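To make the two axes concrete, here is a minimal sketch. The sample data is hypothetical and chosen only for illustration; it shows the row labels, the column labels, and that each column is itself a Series.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the two axes.
df = pd.DataFrame(
    {'Name': ['Alice', 'Jack'], 'Score': [85, 82]},
    index=['school1', 'school2'],
)

print(df.index.tolist())    # row labels (axis 0)
print(df.columns.tolist())  # column labels (axis 1)
print(df.shape)             # (number of rows, number of columns)
print(type(df['Score']))    # each column is a pandas Series
```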

Creating a DataFrame

Importing Pandas

import pandas as pd

Creating DataFrame from Series

Example of creating three school records for students and their class grades using pd.Series:

record1 = pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85})
record2 = pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82})
record3 = pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})

df = pd.DataFrame([record1, record2, record3], index=['school1', 'school2', 'school1'])
df.head()

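This creates a DataFrame in which each Series becomes one row of data and the keys of the Series become the column labels. Note that index labels may repeat, as 'school1' does here. The head() method shows the first five rows of the DataFrame by default.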
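As a small sketch of how head() behaves, it also accepts the number of rows to return. The variable name first_two is introduced here only for illustration.

```python
import pandas as pd

record1 = pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85})
record2 = pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82})
record3 = pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})

df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school1'])

# head(n) returns the first n rows; with no argument it returns up to five.
first_two = df.head(2)
print(first_two)

# The keys of the Series have become the column labels.
print(df.columns.tolist())
```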

Creating DataFrame from List of Dictionaries

An alternative method using a list of dictionaries:

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
{'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
{'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()
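The two construction methods should produce the same table. The following sketch builds the DataFrame both ways and compares labels and cell values; the names df_from_series and df_from_dicts are introduced here for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
index = ['school1', 'school2', 'school1']

# Build the same table from a list of Series and from a list of dictionaries.
df_from_series = pd.DataFrame([pd.Series(s) for s in students], index=index)
df_from_dicts = pd.DataFrame(students, index=index)

# Compare the labels on both axes and the cell values.
same_columns = df_from_series.columns.tolist() == df_from_dicts.columns.tolist()
same_index = df_from_series.index.tolist() == df_from_dicts.index.tolist()
same_values = df_from_series.values.tolist() == df_from_dicts.values.tolist()
print(same_columns, same_index, same_values)
```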

Extracting Data from DataFrame

Using .loc and .iloc

  • Single Row Selection: Using .loc with a single label that appears only once in the index returns a Series.

    df.loc['school2']
    type(df.loc['school2'])
  • Multiple Rows Selection: Using .loc with a label that appears more than once in the index returns a DataFrame.

    df.loc['school1']
    type(df.loc['school1'])
  • Selecting Specific Column for Specific Rows:

    df.loc['school1', 'Name']
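The examples above use .loc, which selects by label. Its positional counterpart, .iloc, selects by integer position starting at 0, which is useful when index labels repeat. A minimal sketch, rebuilding the same DataFrame:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

second_row = df.iloc[1]            # one position: returns a Series
first_and_third = df.iloc[[0, 2]]  # list of positions: returns a DataFrame
top_left = df.iloc[0, 0]           # row position, column position: 'Alice'

print(type(second_row))
print(type(first_and_third))
print(top_left)
```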

Using Transpose (.T)

Transposing the DataFrame with .T swaps rows and columns, so the column labels become the index and can be selected with .loc:

df.T.loc['Name']
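A short sketch of what the transpose does to the labels on each axis, using the same DataFrame as above. The variable names are introduced here for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

transposed = df.T
print(transposed.index.tolist())    # the former column labels
print(transposed.columns.tolist())  # the former row labels

# Selecting the 'Name' row of the transpose yields the original 'Name' column.
names = transposed.loc['Name']
print(names.tolist())
```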

Column Selection

Directly using the indexing operator for column selection:

df['Name']
type(df['Name'])

Avoid Chaining Operations

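Chaining indexing operations, as in the first line below, can cause pandas to return a copy instead of a view, which can be slower and can prevent later modifications from reaching the original DataFrame. Prefer a single .loc call with two parameters, df.loc['school1', 'Name'], which is clearer and more efficient. The chained form and the types of its intermediate results:

df.loc['school1']['Name'] # chained: avoid, especially when assigning
print(type(df.loc['school1'])) # DataFrame
print(type(df.loc['school1']['Name'])) # Series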
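A minimal sketch comparing the chained form with a single .loc call. Both read the same values, but only the single call is a dependable way to assign. The new score of 100 is a made-up value for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

chained = df.loc['school1']['Name']  # two steps: rows first, then column
single = df.loc['school1', 'Name']   # one step: rows and column together
print(chained.tolist() == single.tolist())

# Assignment through a single .loc call updates the original DataFrame.
# The value 100 is hypothetical.
df.loc['school1', 'Score'] = 100
print(df['Score'].tolist())
```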

Slicing and Selecting Multiple Columns

Using .loc to select all rows and specific columns:

df.loc[:, ['Name', 'Score']]
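The column selector in .loc can take several forms. The sketch below contrasts a list of labels, a single label, and a label slice, which in pandas includes both endpoints. The variable names are introduced here for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

subset = df.loc[:, ['Name', 'Score']]  # list of labels: DataFrame
one_column = df.loc[:, 'Name']         # single label: Series
ranged = df.loc[:, 'Name':'Class']     # label slice: both endpoints included

print(subset.columns.tolist())
print(type(one_column))
print(ranged.columns.tolist())
```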

Dropping Data

Using .drop()

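The .drop() method removes rows or columns from a DataFrame. By default it returns a modified copy and leaves the original unchanged; passing inplace=True modifies the original instead.

df.drop('school1') # returns a copy without the 'school1' rows
df # the original is unchanged

# With inplace=True, the original DataFrame is modified
df.drop('school1', inplace=True)
df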
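The difference between the two calls can be checked directly. In this sketch, the names remaining, rows_before, and result are introduced for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Without inplace, drop() returns a new DataFrame and df keeps all its rows.
remaining = df.drop('school1')
rows_before = len(df)

# With inplace=True, df itself is modified and the call returns None.
result = df.drop('school1', inplace=True)
rows_after = len(df)

print(remaining.index.tolist())
print(rows_before, rows_after, result)
```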

Dropping Columns

Two methods to drop columns:

  1. Using .drop() with axis parameter:
    copy_df = df.copy()
    copy_df.drop("Name", inplace=True, axis=1)
  2. Using the del keyword, which removes the column immediately and does not return a copy:
    del copy_df['Class']
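As a sketch of the two methods side by side, the following also uses the columns keyword of .drop(), which is equivalent to passing axis=1. The name without_name is introduced here for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

copy_df = df.copy()

# drop(columns=...) is equivalent to drop(..., axis=1); it returns a new
# DataFrame and leaves copy_df unchanged.
without_name = copy_df.drop(columns=['Name'])

# del removes the column from copy_df itself, immediately.
del copy_df['Class']

print(without_name.columns.tolist())
print(copy_df.columns.tolist())
print(df.columns.tolist())  # the original is untouched because we worked on a copy
```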

Adding New Columns

A new column is added by assigning to it with the indexing operator. A scalar value such as None is broadcast to every row:

df['ClassRanking'] = None
df
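Besides a scalar, the assigned value can be computed from existing columns. A minimal sketch follows; the 'Passed' column and its threshold of 85 are hypothetical and used only for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# A scalar is broadcast: every row receives the same value.
df['ClassRanking'] = None

# A column can also be derived from another column.
# 'Passed' and the threshold 85 are hypothetical.
df['Passed'] = df['Score'] >= 85

print(df.columns.tolist())
print(df['Passed'].tolist())
```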