'How to efficiently plot stacked count-of-values of selected rows from a MultiIndex-DataFrame with different selection-conditions per stack?
I am new here and not a native English speaker, but I'll try my best.
I want to plot data from a complex pandas DataFrame which has both MultiIndex and nested columns. The data has to be selected according to conditions on multiple sub-columns and then another column is to be plotted as a stacked bar plot by count of its values (see attached image).
I did a few days of research and documentation-reading and tried around with multiple ideas, but accomplished it only using blunt force by saving the counts in another DataFrame (see code below).
This is mostly due to the fact that this is my first serious attempt at using pandas and I am still a bit too overwhelmed to be able to transpose everything I see into my use case.
I experimented with groupby(), unstack(), pivot_table(), crosstab(), query()... but to no avail.
The real dataset has around 200 000 rows and 4 condition-groups (not only "online" and "book" like here) so I see performance as an issue when using the iteration over ["online","book"] like I did in my attempt below.
Therefore, my question is:
- How can I achieve the following plot in a way that corresponds better to the design of pandas (and that would probably be more resource efficient)?
The resource aspect is secondary to the correct usage of pandas, I am not asking for benchmarks.
(In a possible expansion on the question, I am also curious if the same could be achieved not using the values of the column "Category", but the values of the index "Month" instead.)
(Baking the example: MWE)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(158)
Names= ["Urs", "Evangelina", "Éimhear", "Abdul Hamid", "Doris", "Hande", "Nuur", "Meurig"]
Categories= ["bread", "cake", "pancake"]
columns = [("Author", ""), ("Category", ""),
('online', 'uses_eggs'), ('online', 'uses_flour'), ('online', 'uses_sugar'),
('book', 'uses_eggs'), ('book', 'uses_flour'), ('book', 'uses_sugar'),]
indices = [("2021", "June", "1"), ("2021", "June", "2"),
("2021", "September", "1"), ("2021", "September", "2"), ("2021", "September", "3"),
("2022", "May", "1"), ("2022", "May", "2"),
("2022", "July", "1"),]
df = pd.DataFrame(
index =pd.MultiIndex.from_tuples(indices, names=["Year", "Month", "Number"]),
columns =pd.MultiIndex.from_tuples(columns),)
df["Author"] = np.random.choice(Names,len(indices))
df["Category"] = np.random.choice(Categories, len(indices))
df["online"] = np.transpose([
[1,0,1,3,3,1,1,1],
[300,400,0,300,100,400,300,0],
[True, False, True, True, True, True,False,True] ])
df["book"] = np.transpose(
[ np.random.randint(4, size=len(indices)),
np.random.randint(5, size=len(indices)) * 100,
np.random.choice([True, False], len(indices)) ])
#df
Resulting test-data
Author Category online book
uses_eggs uses_flour uses_sugar uses_eggs uses_flour uses_sugar
Year Month Number
2021 June 1 Hande bread 1 300 1 3 0 0
2 Nuur cake 0 400 0 2 0 1
September 1 Urs pancake 1 0 1 1 100 1
2 Nuur bread 3 300 1 1 300 1
3 Hande bread 3 100 1 0 300 0
2022 May 1 Éimhear cake 1 400 1 2 0 1
2 Evangelina bread 1 300 0 3 0 0
July 1 Hande cake 1 0 1 2 200 1
Desired plot
"Bar plot of the number of recipes using eggs AND flour AND sugar, by publication-place of recipes, stacked by category"
Imgur-Link to the plot (I don't have enough reputation yet)
My best attempt so far
## The ugly way:
other_df = pd.DataFrame()
for pub in ["online", "book"]:
other_df[pub] = df[(
(df[pub]["uses_eggs"] > 0) & (df[pub]["uses_flour"] >= 100) & (df[pub]["uses_sugar"] == 1)
)]["Category"].value_counts()
other_df.T.plot(kind="bar", stacked=True)
plt.title("Number of recipes\nusing eggs AND flour AND sugar,
\n by publication-place of recipe, stacked by category")
Thank you very much for reading through. I'll gladly edit the post if deemed helfpul and am willing to read more of course, should you refer me to other links.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
