
Scales

Creating a DataFrame with Letter Grades

We will create a DataFrame of letter grades in descending order and set an index based on human judgment of how good a student was.

import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=["Grades"])
df

Checking the Data Type of the Column

The column holds string values, so pandas reports its data type as object (recent pandas releases that infer a dedicated string dtype may report str instead).

df.dtypes

Converting Column to Categorical Type

We can convert the column to the category type with the astype() method. It returns a new Series and leaves df unchanged.

df["Grades"].astype("category").head()
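As a minimal sketch of what the conversion produces, the snippet below uses a shortened, made-up Series in place of the Grades column (the data and variable names are illustrative assumptions). It shows that an unordered categorical stores the sorted unique values once and represents each row as an integer code.

```python
import pandas as pd

# A shortened stand-in for the Grades column (illustrative data only).
s = pd.Series(["B", "A", "C", "A", "B"])
cat = s.astype("category")

# Without an explicit order, the categories are the sorted unique values.
categories = list(cat.cat.categories)
# Each row is stored as an integer code pointing into the categories.
codes = list(cat.cat.codes)
# The default categorical dtype carries no ordering.
is_ordered = cat.cat.ordered

print(categories, codes, is_ordered)
```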

Creating an Ordered Categorical Data Type

To make the data ordered, we create a new categorical data type with the list of categories in order and the ordered=True flag.

my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)

Applying the Ordered Categorical Type

We then pass this new type to the astype() function.

grades = df["Grades"].astype(my_categories)
grades.head()
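Once a Series carries an ordered categorical type, order-aware operations such as min(), max(), and sort_values() follow the declared category order rather than alphabetical order. The sketch below rebuilds the same dtype under the assumed names grade_order and grade_dtype and applies it to a few made-up grades.

```python
import pandas as pd

# Same category order as above, lowest grade first (names are illustrative).
grade_order = ["D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
grade_dtype = pd.CategoricalDtype(categories=grade_order, ordered=True)

s = pd.Series(["B", "A+", "C-", "A"]).astype(grade_dtype)

# min/max and sorting use the category order, not string order.
lowest = s.min()
highest = s.max()
ranked = s.sort_values().tolist()

print(lowest, highest, ranked)
```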

Using Ordered Categorical for Comparisons

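Comparisons and boolean masking respect the grade order only when the column is an ordered categorical. In the original DataFrame the grades are plain strings, so > compares them lexicographically and selects the wrong rows.

# Original DataFrame: lexicographic string comparison, returns C+, C-, D+ and D
df[df["Grades"] > "C"]

# Ordered categorical: compares by category rank, returns every grade above C
grades[grades > "C"]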
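The difference between the two comparisons can be checked on a standalone Series. This sketch rebuilds the grade list under assumed names (letters, raw, ordered) and collects the rows each mask selects.

```python
import pandas as pd

letters = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D"]
raw = pd.Series(letters)

# Reverse the list so the categories run from lowest (D) to highest (A+).
ordered = raw.astype(pd.CategoricalDtype(categories=letters[::-1], ordered=True))

# Plain strings compare character by character, so "D" > "C" but "A+" < "C".
lexicographic = raw[raw > "C"].tolist()
# Ordered categoricals compare by position in the category list.
by_rank = ordered[ordered > "C"].tolist()

print(lexicographic)
print(by_rank)
```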

Dummy Variables

get_dummies() converts categorical values into dummy (indicator) variables: one new column per category, marking whether each row has that value.

pd.get_dummies(df, columns=["Grades"]).head()
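To make the shape of the result concrete, here is a small sketch on a made-up four-row frame (toy is an illustrative name, not part of the original data). Each distinct grade becomes its own column, prefixed with the source column name, and every row has exactly one indicator set.

```python
import pandas as pd

# Tiny illustrative frame with three distinct grades.
toy = pd.DataFrame({"Grades": ["A", "B", "A", "C"]})
dummies = pd.get_dummies(toy, columns=["Grades"])

# One indicator column per category, named <column>_<value>.
columns = list(dummies.columns)
# Exactly one indicator is set in each row.
row_sums = dummies.sum(axis=1).tolist()

print(columns, row_sums)
```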

Converting Numerical Scale to Categorical

We can convert a numerical scale to a categorical one, useful in visualizing frequencies or applying machine learning classification.

Reading Census Data

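df = pd.read_csv("datasets/census.csv")
# Keep only the county-level rows (SUMLEV == 50)
df = df[df['SUMLEV'] == 50]
# Average 2010 county population per state; the result is a Series indexed by state name
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].mean()
df.head()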

Creating Bins with cut()

cut() splits the range of the values into equal-width intervals and assigns each value to one of them. Here we ask for 10 bins.

pd.cut(df, 10)
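Because the census file is not included here, the following sketch applies cut() to a short made-up Series instead (values is an assumption for illustration). It shows that the bins have equal width, so the number of items per bin depends on how the data is distributed.

```python
import pandas as pd

# Made-up values spanning 0 to 100.
values = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 100])

# Four equal-width intervals covering the full range of the data.
binned = pd.cut(values, 4)
n_bins = len(binned.cat.categories)

# Count the items in each interval, listed from lowest to highest interval.
counts = binned.value_counts().sort_index().tolist()

print(binned.cat.categories)
print(counts)
```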

Creating Bins Based on Frequency

To form categories based on frequency instead, use qcut(). It places the bin edges at quantiles, so each bin holds roughly the same number of items.

pd.qcut(df, 10)
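The contrast with cut() is clearest on skewed data. This sketch uses a made-up, heavily skewed Series (values is an illustrative assumption) and compares the per-bin counts from both functions.

```python
import pandas as pd

# Made-up skewed data: many small values and a few very large ones.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100, 1000, 5000, 10000])

# Equal-width bins: most items pile into the first interval.
by_width = pd.cut(values, 4).value_counts().sort_index().tolist()
# Quantile-based bins: edges are chosen so the counts come out even.
by_quantile = pd.qcut(values, 4).value_counts().sort_index().tolist()

print(by_width)
print(by_quantile)
```

With equal-width bins, one interval can even end up empty, while the quantile-based bins stay balanced at the cost of having very different widths.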

Summary

  • Categorical Data: Converting a column to category type can help with memory efficiency and comparisons.
  • Ordered Categories: Use CategoricalDtype to specify and enforce order in categories.
  • Dummy Variables: get_dummies can convert categorical variables into multiple columns with binary values.
  • Binning: Use cut() for equal-width bins and qcut() for bins with roughly equal frequency.

These techniques are essential for preprocessing data, feature extraction, and improving the performance of machine learning models.