How to plot a correlation matrix/heatmap with categorical and numerical variables
I have 4 variables, of which 2 are nominal (dtype=object) and 2 are numeric (dtypes int and float).
df.head(1)
OUT:
OS_type|Week_day|clicks|avg_app_speed
iOS|Monday|400|3.4
Now, I want to throw the dataframe into a seaborn heatmap visualization.
import numpy as np
import seaborn as sns
ax = sns.heatmap(df)
But I get an error indicating I cannot use categorical variables, only numbers. How do I process this correctly and then feed it back into the heatmap?
Solution 1:[1]
The heatmap needs a purely numeric matrix, so rather than passing the raw dataframe, compute a pairwise association matrix first and plot that. For two numerical variables you can use Pearson's r (ranges from -1 to 1), for two categorical variables the (bias-corrected) Cramér's V (ranges from 0 to 1), and for a categorical and a numerical variable the correlation ratio (ranges from 0 to 1).
As for creating numerical representations of categorical variables, there are a number of ways to do that:
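A minimal sketch of how these three measures could be combined into one matrix and plotted. The helper names (`cramers_v`, `correlation_ratio`, `association_matrix`, `plot_association_heatmap`) and the sample rows are made up for illustration; only the column names come from the question. It assumes no missing values and treats every non-numeric column as nominal.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Bias-corrected Cramér's V between two categorical series (0..1)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence, then the chi-squared statistic.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    phi2 = chi2 / n
    r, k = table.shape
    # Bias correction (Bergsma, 2013).
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return float(np.sqrt(phi2_corr / denom)) if denom > 0 else 0.0


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) of a numeric series grouped by a categorical one (0..1)."""
    groups = values.groupby(categories)
    grand_mean = values.mean()
    # Between-group sum of squares divided by the total sum of squares.
    between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0


def association_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pick the measure for each column pair based on the two dtypes."""
    cols = list(df.columns)
    is_num = {c: pd.api.types.is_numeric_dtype(df[c]) for c in cols}
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if is_num[a] and is_num[b]:
                val = df[a].corr(df[b])  # Pearson's r, in [-1, 1]
            elif not is_num[a] and not is_num[b]:
                val = cramers_v(df[a], df[b])
            else:
                cat, num = (b, a) if is_num[a] else (a, b)
                val = correlation_ratio(df[cat], df[num])
            out.loc[a, b] = out.loc[b, a] = val
    return out


def plot_association_heatmap(matrix: pd.DataFrame):
    """Draw the matrix with seaborn (imported here so the rest runs without it)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    ax = sns.heatmap(matrix, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f", square=True)
    plt.tight_layout()
    return ax


# Hypothetical sample rows; replace with your own dataframe.
df = pd.DataFrame({
    "OS_type": ["iOS", "Android", "iOS", "Android", "iOS", "Android", "iOS", "Android"],
    "Week_day": ["Monday", "Monday", "Tuesday", "Tuesday", "Monday", "Tuesday", "Tuesday", "Monday"],
    "clicks": [400, 250, 380, 270, 420, 240, 390, 260],
    "avg_app_speed": [3.4, 2.1, 3.2, 2.4, 3.6, 2.0, 3.3, 2.2],
})
assoc = association_matrix(df)
```

Calling `plot_association_heatmap(assoc)` then draws the heatmap. Keep in mind that only the numeric-numeric cells can be negative; Cramér's V and the correlation ratio measure strength of association without a sign, so the cells are not all on the same footing.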
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('some_source.csv') # has categorical var 'categ_var'
# method 1: uses pandas
df['numerized1'] = df['categ_var'].astype('category').cat.codes
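# method 2: uses pandas; ranks values by descending frequency (0 = most frequent)
# value_counts() is computed once here instead of once per row
freq_rank = {val: rank for rank, val in enumerate(df['categ_var'].value_counts().index)}
df['numerized2'] = df['categ_var'].map(freq_rank)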
# method 3: uses sklearn, result is the same as method 1
lbl = LabelEncoder()
df['numerized3'] = lbl.fit_transform(df['categ_var'])
# method 4: uses pandas; xyz captures a list of the unique values
df['numerized4'], xyz = pd.factorize(df['categ_var'])
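To make the difference between these encodings concrete, here is a small sketch on a made-up series (the values are hypothetical, not from the question's data). The methods differ only in how they order the labels: alphabetically, by frequency, or by first appearance. In every case the integers are arbitrary labels for a nominal variable, so their magnitudes carry no meaning on their own.

```python
import pandas as pd

# Hypothetical values, chosen so that the three orderings all differ.
s = pd.Series(["Monday", "Friday", "Monday", "Tuesday", "Monday", "Tuesday"])

# Method 1: category codes follow the sorted order of the labels
# (Friday=0, Monday=1, Tuesday=2). For data without missing values,
# sklearn's LabelEncoder (method 3) assigns the same codes.
by_sorted_label = s.astype("category").cat.codes.tolist()

# Method 2: rank by descending frequency (Monday=0, Tuesday=1, Friday=2).
freq_rank = {val: rank for rank, val in enumerate(s.value_counts().index)}
by_frequency = s.map(freq_rank).tolist()

# Method 4: factorize numbers labels in order of first appearance
# (Monday=0, Friday=1, Tuesday=2) and also returns the unique labels.
codes, uniques = pd.factorize(s)
by_first_seen = codes.tolist()

print(by_sorted_label)  # [1, 0, 1, 2, 1, 2]
print(by_frequency)     # [0, 2, 0, 1, 0, 1]
print(by_first_seen)    # [0, 1, 0, 2, 0, 2]
```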
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Taq Seorangpun |