'Pandas data frame calculate difference from first rows

I have a dataset where I have the results of a model under many different scenarios and I want to capture the difference between each scenario and the baseline.

Example input data:

Scenario ID Value
Baseline 1 10
Baseline 2 5
Baseline 3 1
A 1 10
A 2 0
A 3 5

Desired result:

Scenario ID Difference from Baseline
A 1 0
A 2 -5
A 3 4

I have done this in the past manually in cases where there are 1-2 values which I want to subtract, but now I want to be able to do this for a variable number of columns. I think there will be 20-40 "Value" columns I want to subtract, anywhere from 2-50 unique IDs, and 60+ scenarios.

I can think of two approaches to this but both seem inefficient:

  1. Start by separating into "baseline" and "all_scenarios" dfs. Get list of columns. Loop through each column in baseline df to rename. Join dataframes on ID. Loop through column list to perform subtraction.
  2. Start by separating into "baseline" and "all_scenarios" dfs. Get unique list of IDs. Loop through each ID and create a new baseline and all_scenario dfs with only that ID. Use df.subtract() to get the difference. Re-join all ID dataframes back together.

Is there a better way to approach this? How should I think about which version would be more efficient for large datasets?



Solution 1:[1]

If possible match by ID filtered rows by condition for compare by Baseline use:

m = df['Scenario'].eq('Baseline')
df1 = df[m].set_index('ID').select_dtypes(np.number)
df2 = df[~m]

df = (df2.join(df2.set_index('ID').select_dtypes(np.number)
                  .sub(df1)
                  .add_prefix('Difference with '), 'ID')
                  .drop(df1.columns, axis=1)
         )
print (df)
  Scenario  ID  Difference with Value
3        A   1                      0
4        A   2                     -5
5        A   3                      4

Solution 2:[2]

considering the following dataframe -

>>> df = pd.DataFrame(
    {'col1': ['Baseline','Baseline','Baseline', 'A', 'A','A'],
    'col2': [1,2,3,1,2,3],
    'col3': [10, 5, 1, 10, 0, 5]}
)
>>> print(df)
col1  col2  col3
0  Baseline     1    10
1  Baseline     2     5
2  Baseline     3     1
3         A     1    10
4         A     2     0
5         A     3     5

First you can create a pivot_table with col2

>>> df1 = df.pivot_table(index = 'col1', columns='col2', values='col3')
>>> print(df1)
col2       1  2  3
col1              
A         10  0  5
Baseline  10  5  1

We can transpose df1 -

>>> df2 = df1.transpose()
>>> print(df2)
col1   A  Baseline
col2              
1     10        10
2      0         5
3      5         1

Now add the difference in a new column

>>> df2['diff'] = df2[df2.columns[0]] - df2[df2.columns[1]]
>>> print(df2)

col1   A  Baseline  diff
col2                    
1     10        10     0
2      0         5    -5
3      5         1     4

Combining all of them in one place -

>>> dfPivot = df.pivot_table(index = 'col1', columns='col2', values='col3').transpose()
>>> dfPivot['diff'] = dfPivot[dfPivot.columns[0]] - dfPivot[dfPivot.columns[1]]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Rajarshi Ghosh