'Pandas data frame calculate difference from first rows
I have a dataset where I have the results of a model under many different scenarios and I want to capture the difference between each scenario and the baseline.
Example input data:
| Scenario | ID | Value |
|---|---|---|
| Baseline | 1 | 10 |
| Baseline | 2 | 5 |
| Baseline | 3 | 1 |
| A | 1 | 10 |
| A | 2 | 0 |
| A | 3 | 5 |
Desired result:
| Scenario | ID | Difference from Baseline |
|---|---|---|
| A | 1 | 0 |
| A | 2 | -5 |
| A | 3 | 4 |
I have done this in the past manually in cases where there are 1-2 values which I want to subtract, but now I want to be able to do this for a variable number of columns. I think there will be 20-40 "Value" columns I want to subtract, anywhere from 2-50 unique IDs, and 60+ scenarios.
I can think of two approaches to this but both seem inefficient:
- Start by separating into "baseline" and "all_scenarios" dfs. Get list of columns. Loop through each column in baseline df to rename. Join dataframes on ID. Loop through column list to perform subtraction.
- Start by separating into "baseline" and "all_scenarios" dfs. Get unique list of IDs. Loop through each ID and create a new baseline and all_scenario dfs with only that ID. Use df.subtract() to get the difference. Re-join all ID dataframes back together.
Is there a better way to approach this? How should I think about which version would be more efficient for large datasets?
Solution 1:[1]
If possible match by ID filtered rows by condition for compare by Baseline use:
m = df['Scenario'].eq('Baseline')
df1 = df[m].set_index('ID').select_dtypes(np.number)
df2 = df[~m]
df = (df2.join(df2.set_index('ID').select_dtypes(np.number)
.sub(df1)
.add_prefix('Difference with '), 'ID')
.drop(df1.columns, axis=1)
)
print (df)
Scenario ID Difference with Value
3 A 1 0
4 A 2 -5
5 A 3 4
Solution 2:[2]
considering the following dataframe -
>>> df = pd.DataFrame(
{'col1': ['Baseline','Baseline','Baseline', 'A', 'A','A'],
'col2': [1,2,3,1,2,3],
'col3': [10, 5, 1, 10, 0, 5]}
)
>>> print(df)
col1 col2 col3
0 Baseline 1 10
1 Baseline 2 5
2 Baseline 3 1
3 A 1 10
4 A 2 0
5 A 3 5
First you can create a pivot_table with col2
>>> df1 = df.pivot_table(index = 'col1', columns='col2', values='col3')
>>> print(df1)
col2 1 2 3
col1
A 10 0 5
Baseline 10 5 1
We can transpose df1 -
>>> df2 = df1.transpose()
>>> print(df2)
col1 A Baseline
col2
1 10 10
2 0 5
3 5 1
Now add the difference in a new column
>>> df2['diff'] = df2[df2.columns[0]] - df2[df2.columns[1]]
>>> print(df2)
col1 A Baseline diff
col2
1 10 10 0
2 0 5 -5
3 5 1 4
Combining all of them in one place -
>>> dfPivot = df.pivot_table(index = 'col1', columns='col2', values='col3').transpose()
>>> dfPivot['diff'] = dfPivot[dfPivot.columns[0]] - dfPivot[dfPivot.columns[1]]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Rajarshi Ghosh |
