Scales
Creating a DataFrame with Letter Grades
We create a DataFrame of letter grades in descending order and use a human judgment of how good each student was as the index.
import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=["Grades"])
df
Checking the Data Type of the Column
The data type of the column is object because the values are strings. Recent pandas releases with the dedicated string dtype enabled report str instead.
df.dtypes
Converting Column to Categorical Type
We can convert the column to the category type with the astype() method. It returns a new Series and leaves df unchanged.
df["Grades"].astype("category").head()
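As a minimal sketch of what the conversion does internally, the example below uses a made-up Series of repeated letters (not the grades DataFrame above). It shows that each distinct label is stored once while the rows hold small integer codes, and it compares the memory footprint of the two representations. The exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical sample: a few distinct labels repeated many times
letters = pd.Series(["A", "B", "C", "B", "A"] * 1000)
as_category = letters.astype("category")

# Each distinct label is stored once in .cat.categories
labels = as_category.cat.categories.tolist()
print(labels)  # ['A', 'B', 'C']

# Every row stores only an integer code pointing at its label
first_codes = as_category.cat.codes.head().tolist()
print(first_codes)  # [0, 1, 2, 1, 0]

# Compare the memory used by the string and categorical versions
string_bytes = letters.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)
```

The saving is largest when a column has few distinct values relative to its length, as in this sample.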
Creating an Ordered Categorical Data Type
To make the data ordered, we create a new categorical data type with the list of categories in order and the ordered=True flag.
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)
Applying the Ordered Categorical Type
We then pass this new type to the astype() method.
grades = df["Grades"].astype(my_categories)
grades.head()
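To illustrate what the ordered flag buys, here is a small sketch on a hypothetical three-level scale (the names size_type and sizes are made up for this example). Sorting follows the declared order instead of alphabetical order, and min() and max() become available.

```python
import pandas as pd

# Hypothetical scale, listed from lowest to highest
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "large", "small", "medium"]).astype(size_type)

# Sorting uses the declared category order, not alphabetical order
sorted_sizes = sizes.sort_values().tolist()
print(sorted_sizes)  # ['small', 'medium', 'medium', 'large']

# min() and max() are defined because the categories are ordered
smallest, largest = sizes.min(), sizes.max()
print(smallest, largest)  # small large
```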
Using Ordered Categorical for Comparisons
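With ordered categories, comparisons and boolean masking follow the declared order. On the original string column the same comparison is lexicographic, so it returns C+, C-, D+ and D rather than the grades better than C.
# Original DataFrame: strings are compared lexicographically
df[df["Grades"] > "C"]
# Ordered categorical: the comparison follows the category order
grades[grades > "C"]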
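The difference between comparing raw strings and comparing an ordered categorical can be checked on a short, made-up Series. This is only a sketch; the names raw, grade_type and ranked are introduced here for illustration.

```python
import pandas as pd

raw = pd.Series(["A", "B+", "C+", "C", "C-", "D"])
grade_type = pd.CategoricalDtype(categories=["D", "C-", "C", "C+", "B+", "A"],
                                 ordered=True)
ranked = raw.astype(grade_type)

# Plain strings compare character by character, so "D" sorts after "C"
lexicographic = raw[raw > "C"].tolist()
print(lexicographic)  # ['C+', 'C-', 'D']

# The ordered categorical compares positions in the declared order
by_rank = ranked[ranked > "C"].tolist()
print(by_rank)  # ['A', 'B+', 'C+']
```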
Dummy Variables
get_dummies converts categorical values into dummy/indicator variables, producing one column per category.
pd.get_dummies(df, columns=["Grades"]).head()
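A smaller sketch may make the output shape easier to see. The pets DataFrame below is made up for this example. Each distinct value of the encoded column becomes its own indicator column, named with the original column name as a prefix.

```python
import pandas as pd

# Hypothetical data: one column to keep, one column to encode
pets = pd.DataFrame({"owner": ["ann", "bob", "cy"],
                     "pet": ["dog", "cat", "dog"]})

dummies = pd.get_dummies(pets, columns=["pet"])

# The encoded column is replaced by one indicator column per value
dummy_columns = dummies.columns.tolist()
print(dummy_columns)  # ['owner', 'pet_cat', 'pet_dog']

# The indicator dtype varies across pandas versions, so cast to int for display
dog_flags = dummies["pet_dog"].astype(int).tolist()
print(dog_flags)  # [1, 0, 1]
```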
Converting Numerical Scale to Categorical
We can convert a numerical scale to a categorical one, useful in visualizing frequencies or applying machine learning classification.
Reading Census Data
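import numpy as np

df = pd.read_csv("datasets/census.csv")
# Keep only the rows where SUMLEV is 50
df = df[df['SUMLEV'] == 50]
# Average 2010 population per state; the result is a Series indexed by STNAME
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
df.head()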
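For readers without the census file, the same filter-then-average pattern can be sketched on a tiny made-up table. The rows below are invented and are not real census figures. The sketch groups by the STNAME column directly and calls mean(), which is an equivalent alternative to setting the index first.

```python
import pandas as pd

# Invented rows that mimic the three census columns used above
toy = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 40, 50],
    "STNAME": ["Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "CENSUS2010POP": [300, 100, 200, 50, 50],
})

# Keep only the SUMLEV == 50 rows, as in the cell above
level_50 = toy[toy["SUMLEV"] == 50]

# Average the population within each state
avg_pop = level_50.groupby("STNAME")["CENSUS2010POP"].mean()
print(avg_pop.to_dict())  # {'Alpha': 150.0, 'Beta': 50.0}
```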
Creating Bins with cut()
cut() splits the range of the values into a given number of equal-width bins. Here we ask for 10.
pd.cut(df, 10)
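Because cut() divides the value range rather than the observations, skewed data can leave most items in one bin. The sketch below uses a made-up Series with a single large outlier to show this.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Two bins of equal width across the range 1 to 100
width_bins = pd.cut(values, 2)

# Count the items per bin, in bin order
width_counts = width_bins.value_counts().sort_index().tolist()
print(width_counts)  # [9, 1]
```

The outlier stretches the range, so the lower bin absorbs nine of the ten values.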
Creating Bins Based on Frequency
To form categories based on frequency instead, use qcut(). It picks the bin edges from quantiles, so each bin holds roughly the same number of items.
pd.qcut(df, 10)
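As a quick check of the equal-frequency behaviour, the sketch below applies qcut() to a made-up skewed Series. The edges are placed at quantiles, so the single outlier does not crowd the other values into one bin.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Two bins split at the median
freq_bins = pd.qcut(values, 2)

# Count the items per bin, in bin order
freq_counts = freq_bins.value_counts().sort_index().tolist()
print(freq_counts)  # [5, 5]
```

With heavily tied data the quantile edges can coincide, in which case the counts will not be exactly equal.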
Summary
- Categorical Data: Converting a column to category type can help with memory efficiency and comparisons.
- Ordered Categories: Use CategoricalDtype to specify and enforce order in categories.
- Dummy Variables: get_dummies can convert categorical variables into multiple columns with binary values.
- Binning: Use cut() for equal-width bins and qcut() for bins with equal frequency.
These techniques are common steps in data preprocessing and feature extraction, and they prepare categorical and numerical scales for use in machine learning models.