'How to efficiently plot stacked count-of-values of selected rows from a MultiIndex-DataFrame with different selection-conditions per stack?

I am new here and not a native English speaker, but I'll try my best.

I want to plot data from a complex pandas DataFrame which has both MultiIndex and nested columns. The data has to be selected according to conditions on multiple sub-columns and then another column is to be plotted as a stacked bar plot by count of its values (see attached image).

I did a few days of research and documentation-reading and tried around with multiple ideas, but accomplished it only using blunt force by saving the counts in another DataFrame (see code below).
This is mostly due to the fact that this is my first serious attempt at using pandas and I am still a bit too overwhelmed to be able to transpose everything I see into my use case. I experimented with groupby(), unstack(), pivot_table(), crosstab(), query()... but to no avail.

The real dataset has around 200 000 rows and 4 condition-groups (not only "online" and "book" like here) so I see performance as an issue when using the iteration over ["online","book"] like I did in my attempt below.

Therefore, my question is:

  • How can I achieve the following plot in a way that corresponds better to the design of pandas (and that would probably be more resource efficient)?
    The resource aspect is secondary to the correct usage of pandas, I am not asking for benchmarks.

(In a possible expansion on the question, I am also curious if the same could be achieved not using the values of the column "Category", but the values of the index "Month" instead.)

(Baking the example: MWE)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(158)

Names= ["Urs", "Evangelina", "Éimhear", "Abdul Hamid", "Doris", "Hande", "Nuur", "Meurig"]
Categories= ["bread", "cake", "pancake"]

columns = [("Author", ""), ("Category", ""),
           ('online', 'uses_eggs'), ('online', 'uses_flour'), ('online', 'uses_sugar'),
           ('book', 'uses_eggs'), ('book', 'uses_flour'), ('book', 'uses_sugar'),]

indices = [("2021", "June", "1"), ("2021", "June", "2"),
           ("2021", "September", "1"), ("2021", "September", "2"), ("2021", "September", "3"),
           ("2022", "May", "1"), ("2022", "May", "2"),
           ("2022", "July", "1"),]

df = pd.DataFrame(
    index   =pd.MultiIndex.from_tuples(indices, names=["Year", "Month", "Number"]),
    columns =pd.MultiIndex.from_tuples(columns),)

df["Author"]   = np.random.choice(Names,len(indices))
df["Category"] = np.random.choice(Categories, len(indices))
df["online"] = np.transpose([
    [1,0,1,3,3,1,1,1],
    [300,400,0,300,100,400,300,0],
    [True, False, True, True, True, True,False,True]    ])
df["book"]   = np.transpose(
    [ np.random.randint(4, size=len(indices)),
     np.random.randint(5, size=len(indices)) * 100,
     np.random.choice([True, False], len(indices))    ])

#df

Resulting test-data

                           Author Category    online                            book                      
                                           uses_eggs uses_flour uses_sugar uses_eggs uses_flour uses_sugar
Year Month     Number                                                                                     
2021 June      1            Hande    bread         1        300          1         3          0          0
               2             Nuur     cake         0        400          0         2          0          1
     September 1              Urs  pancake         1          0          1         1        100          1
               2             Nuur    bread         3        300          1         1        300          1
               3            Hande    bread         3        100          1         0        300          0
2022 May       1          Éimhear     cake         1        400          1         2          0          1
               2       Evangelina    bread         1        300          0         3          0          0
     July      1            Hande     cake         1          0          1         2        200          1

Desired plot

"Bar plot of the number of recipes using eggs AND flour AND sugar, by publication-place of recipes, stacked by category"

Imgur-Link to the plot (I don't have enough reputation yet)

My best attempt so far

## The ugly way:

other_df = pd.DataFrame()

for pub in ["online", "book"]:
    other_df[pub] = df[(
        (df[pub]["uses_eggs"] > 0) & (df[pub]["uses_flour"] >= 100) & (df[pub]["uses_sugar"] == 1)
    )]["Category"].value_counts()

other_df.T.plot(kind="bar", stacked=True)

plt.title("Number of recipes\nusing eggs AND flour AND sugar,
\n by publication-place of recipe, stacked by category")

Thank you very much for reading through. I'll gladly edit the post if deemed helfpul and am willing to read more of course, should you refer me to other links.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source