'Pandas multiindex dataframe: calculation applied to all columns in an index level
I am working with a large Pandas dataframe that has multiple index levels in both rows and columns, something like:
df
Metric | Population Homes
Year | 2018 2019 2020 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ---- ---- ---- ----
A 1 | 100 110 120 50 52 52
2 | 200 205 210 80 80 80
3 | 300 300 300 100 100 100
B 1 | 50 60 70 20 22 24
2 | 100 100 100 40 40 40
3 | 150 140 130 50 47 44
I need to perform calculations for groups of columns, eg. find ratio between population and homes.
Step by step that would be:
# 1. Calculation
R = df["Population"] / df["Homes"]
R
Year | 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ----
A 1 | 2.0 2.1 2.3
2 | 2.5 2.6 2.6
3 | 3.0 3.0 3.0
B 1 | 2.5 2.7 2.9
2 | 2.5 2.5 2.5
3 | 3.0 3.0 3.0
# 2. Re-build multiindex for columns (there are various levels, showing only one here)
R = pd.concat([R],keys=["Ratio"],axis=1)
R
| Ratio
Year | 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ----
A 1 | 2.0 2.1 2.3
2 | 2.5 2.6 2.6
3 | 3.0 3.0 3.0
B 1 | 2.5 2.7 2.9
2 | 2.5 2.5 2.5
3 | 3.0 3.0 3.0
# 3. Concat previous calculation to the main dataframe
df = pd.concat([df,R],axis=1,sort=True) # I need the sort=True to avoid a performance Warning
df
Metric | Population Homes Ratio
Year | 2018 2019 2020 2018 2019 2020 2018 2019 2020
Town Sector |
---- ------ | ---- ---- ---- ---- ---- ---- ---- ---- ----
A 1 | 100 110 120 50 52 52 2.0 2.1 2.3
2 | 200 205 210 80 80 80 2.5 2.6 2.6
3 | 300 300 300 100 100 100 3.0 3.0 3.0
B 1 | 50 60 70 20 22 24 2.5 2.7 2.9
2 | 100 100 100 40 40 40 2.5 2.5 2.5
3 | 150 140 130 50 47 44 3.0 3.0 3.0
I can write the above expresions in a single line, but as I mentioned I have various index levels and it becomes complicated... Is there a way to do something simpler?
I would have guessed:
df["Ratio"] = df["Population"]/df["Homes"]
But it throws a "ValueError: Item must have length equal to number of levels."
Thanks!
Solution 1:[1]
Let's do some dataframe reshaping like this:
import pandas as pd
from io import StringIO
# Create your input dataframe
csvdata = StringIO("""100 110 120 50 52 52
200 205 210 80 80 80
300 300 300 100 100 100
50 60 70 20 22 24
100 100 100 40 40 40
150 140 130 50 47 44""")
df = pd.read_csv(csvdata, sep='\s\s+', header=None, engine='python')
df.index = pd.MultiIndex.from_product([[*'AB'],[1,2,3]])
df.columns = pd.MultiIndex.from_product(['Population Homes'.split(' '),
[2018, 2019, 2020]])
df_out=df.stack().eval('Ratio = Population / Homes').unstack().round(1)
print(df_out)
Output:
Homes Population Ratio
2018 2019 2020 2018 2019 2020 2018 2019 2020
A 1 50 52 52 100 110 120 2.0 2.1 2.3
2 80 80 80 200 205 210 2.5 2.6 2.6
3 100 100 100 300 300 300 3.0 3.0 3.0
B 1 20 22 24 50 60 70 2.5 2.7 2.9
2 40 40 40 100 100 100 2.5 2.5 2.5
3 50 47 44 150 140 130 3.0 3.0 3.0
Using stack, eval and unstack.
Solution 2:[2]
I've later found this thread, and it's led me in the right direction (or at least what I had in my mind...):
pandas dataframe select columns in multiindex
In order to operate with all columns under the same index level, I had in mind something like this (doesn't work):
df["Ratio"] = df["Population"]/df["Homes"]
From the above thread and pandas doc on slicers (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#using-slicers) I got to the following expression, which does work:
df.loc[:,("Ratio",slice(None)] = df.loc[:,("Population",slice(None)] / df.loc[:,("Homes",slice(None)]
Changes needed were:
- use .loc
- within the .loc[...], first need to get all rows with the colon ":"
- then use brackets to indicate the multiindex levels
- and use slide(None) to get all values of the index at the last level (for the example above...)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Scott Boston |
| Solution 2 | ruben |
