'Pivot wider to longer with value and unit (currency) columns

My data is in wide format as such:

data = [{
id: '1',
'timestamp': '2021-10-01',
'product_type' : 'Quarterly',
'applicable_tariff_per_local_currency_kwh_d_value' : 1.11,
'applicable_tariff_per_local_currency_kwh_d_unit' : 'CZK/(kWh/d)/q',

'applicable_tariff_per_local_currency_kwh_h_value' : 11.11,
'applicable_tariff_per_local_currency_kwh_h_unit' : 'CZK/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_h_value' : 0.1111,
'applicable_tariff_per_eur_kwh_h_unit' : 'Euro/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_d_value' : 0.0011,
'applicable_tariff_per_eur_kwh_d_unit' : 'Euro/(kWh/d)/q',

'applicable_tariff_in_common_unit_value' : 0.0001111,
'applicable_tariff_in_common_unit_unit' : 'Euro/(kWh/h)/d'
},
{
'id': '2',
'timestamp': '2021-10-01',
'product_type' : 'Quarterly',
'applicable_tariff_per_local_currency_kwh_d_value' : 2.22,
'applicable_tariff_per_local_currency_kwh_d_unit' : 'CZK/(kWh/d)/q',

'applicable_tariff_per_local_currency_kwh_h_value' : 22.22,
'applicable_tariff_per_local_currency_kwh_h_unit' : 'CZK/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_h_value' : 0.2222,
'applicable_tariff_per_eur_kwh_h_unit' : 'Euro/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_d_value' : 0.0022,
'applicable_tariff_per_eur_kwh_d_unit' : 'Euro/(kWh/d)/q',

'applicable_tariff_in_common_unit_value' : 0.0002222,
'applicable_tariff_in_common_unit_unit' : 'Euro/(kWh/h)/d'
}
]

df = pd.DataFrame(data)

I would like to pivot longer this data into the following specified format:

pivot_data = [{
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'CZK', # Can be obtained through regex on code
'unit': 'kWh/h', # Can be obtained through regex on code
'product_code': 'q', # Can be obtained through regex on code
'code': 'CZK/(kWh/d)/q',
'value': 1.11
},
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'CZK', 
'unit': 'kWh/h',
'product_code': 'q', 
'code': 'CZK/(kWh/h)/q',
'value': 11.11
},
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'Euro',
'unit': 'kWh/h', 
'product_code': 'q', 
'code': 'Euro/(kWh/h)/q',
'value': 0.1111
},
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'Euro', 
'unit': 'kWh/d', 
'product_code': 'q', 
'code': 'Euro/(kWh/d)/q',
'value': 0.001111
},
# This one is extra and not super neccesary...
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'common_unit',
'currency': 'Euro', 
'unit': 'kWh/h', 
'product_code': 'd', 
'code': 'Euro/(kWh/d)/d',
'value': 0.001111
}]

df_pivoted = pd.DataFrame(pivot_data)

I know I can do this with pd.pivot_wider_to_longer() or pandas.merge, however I cannot seem to get it to work. Is anyone able to shed more light on this? I've added this example as it was the data I am struggling with, however if you have a more simple dataset you could show me an example of this as well!

I did not add id 2 as it would be too much data in the pivoted_longer dataset.

Thanks in advance.



Solution 1:[1]

Because you require two related columns (value, unit) in the output, you need to run melt twice:

In     : d1 = pd.melt(
    ...:     df,
    ...:     id_vars=["id"],
    ...:     value_vars=[
    ...:         "applicable_tariff_per_local_currency_kwh_d_value",
    ...:         "applicable_tariff_per_local_currency_kwh_h_value",
    ...:         "applicable_tariff_per_eur_kwh_h_value",
    ...:         "applicable_tariff_per_eur_kwh_d_value",
    ...:         "applicable_tariff_in_common_unit_value",
    ...:     ],
    ...: )

In     : d1.head(n=2)
Out    :
  id                                          variable  value
0  1  applicable_tariff_per_local_currency_kwh_d_value   1.11
1  2  applicable_tariff_per_local_currency_kwh_d_value   2.22

and same thing for for the unit columns:

In     : d2 = pd.melt(
    ...:     df,
    ...:     id_vars=["id"],
    ...:     value_vars=[
    ...:         "applicable_tariff_per_local_currency_kwh_d_unit",
    ...:         "applicable_tariff_per_local_currency_kwh_h_unit",
    ...:         "applicable_tariff_per_eur_kwh_h_unit",
    ...:         "applicable_tariff_per_eur_kwh_d_unit",
    ...:         "applicable_tariff_in_common_unit_unit",
    ...:     ],
    ...: )

You can now merge these two dataframes to get to your desired result:

d1["variable"]= d1["variable"].str.replace("_value", "")
d2["variable"] = d2["variable"].str.replace("_unit$", "", regex=True)
d2 = d2.rename(columns={"value": "code"})
df_pivoted = d1.merge(d2, on=["id", "variable"]).drop(columns=["variable"])

For the rest you can use pandas string operations to split out the code columns and you can merge with the original dataframe to get the timestamp and product_type info back but this should be clear enough.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1