'Calculating correlation of two text columns in Pandas
I'm good in Python but new to Pandas and know almost nothing about stats so forgive me if this is a simple or ignorant question.
Say I have a dataframe with two columns like, e.g., Jobs and Cars where the entries in both are one of a finite set of strings, e.g. [Software Engineer, Sysadmin, Product Manager] and [Tesla, Hummer, Ford Focus].
I want to produce a table of some sort showing the correlation between jobs and cars, like this:
| | Tesla | Hummer | Ford Focus |
| SWE | ### | ### | ### |
| SA | ### | ### | ### |
| PM | ### | ### | ### |
What's the most pythonic way of doing this? Honestly this is a one off query and my data set is pretty small so it doesn't need to be the most efficient.
Edit: A sample dataframe could be generated like this
from random import choice
jobs = ['SWE', 'Data Scientist', 'Product Manager', 'Sysadmin', 'Data Engineer']
cars = ['Tesla', 'Hummer', 'Ford Focus', 'Chevy Volt', 'Toyota Tercel']
df = pd.DataFrame({
'jobs': [choice(jobs) for _ in range(1000) ],
'cars': [choice(cars) for _ in range(1000) ]
})
The expected output would be similar to that of DataFrame.corr() but that function only operates on numbers and these are strings.
Solution 1:[1]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | onyambu |
