'count frequency of items in one column in relation to criteria in another column in python
I have a data frame that looks something like this:
| Category | Topic |
|---|---|
| Category1 | Topic1 |
| Category2 | Topic2 |
| Category1 | Topic2 |
| Category3 | Topic3 |
| Category2 | Topic3 |
| Category3 | Topic3 |
And I want an output like this:
| Category | Topic | Frequency |
|---|---|---|
| Category1 | Topic1 | |
| Topic2 | ||
| Topic3 | ||
| Catgeory2 | Topic1 | |
| Topic2 | ||
| Topic3 | ||
| Category3 | Topic1 | |
| Topic2 | ||
| Topic3 |
I am just starting out with python and I'd really appreciate it if someone could help me out with this.
Solution 1:[1]
If the frequency is meant to capture the frequency of topic within each category, then, a basic approch involves:
df.groupby('Category')['Topic'].value_counts(normalize=True)
Which is a Series. For example, on your input data, we get:
Category Topic
Category1 Topic1 0.5
Topic2 0.5
Category2 Topic2 0.5
Topic3 0.5
Category3 Topic3 1.0
Name: Topic, dtype: float64
For an output organized as per your example, that appears to be a DataFrame with three columns:
out = (
df
.groupby('Category')['Topic']
.value_counts(normalize=True)
.to_frame('frequency')
.reset_index()
)
Again, on your input sample:
>>> out
Category Topic frequency
0 Category1 Topic1 0.5
1 Category1 Topic2 0.5
2 Category2 Topic2 0.5
3 Category2 Topic3 0.5
4 Category3 Topic3 1.0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Pierre D |
