Pandas: Three listed columns to wide format

How can I expand a set of columns, using the first column's values as headers for the other columns?

For example:

x = pd.DataFrame({'id':[11,998,3923], 'count':[7,7,7],
  'attributes':['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
  'attribute_values':['0,4,2','2,3,0','0,9'],
  'attribute_years':['2000,2001,2002','2001,2002,2003','2008,2009']})

(Edit: note that attributes could be out of order or missing.)

index id count attributes attribute_values attribute_years
0 11 7 VIS,TEMP,MIN 0,4,2 2000,2001,2002
1 998 7 MIN,VIS,TEMP 2,3,0 2001,2002,2003
2 3923 7 MIN,VIS 0,9 2008,2009

In this case, the values in the attributes column should be used to make new columns from the attribute_values and attribute_years columns.

Ideal output:

index id count attribute_values_VIS attribute_values_TEMP attribute_values_MIN attribute_years_VIS attribute_years_TEMP attribute_years_MIN
0 11 7 0 4 2 2000 2001 2002
1 998 7 3 0 2 2002 2003 2001
2 3923 7 9 NaN 0 2009 NaN 2008


Solution 1:[1]

I'm not aware of a single built-in pandas function that splits these strings and spreads them into new columns in one step. However, you can easily write your own small function. Assuming your example:

import pandas as pd

x = pd.DataFrame({'id': [1,2,3],
'attributes': ['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
'attribute_values': ['0,4,2','2,3,0','0,9'],
'attribute_years': ['2000,2001,2002','2001,2002,2003','2008,2009']})

You can loop over the rows and split the strings in the columns x['attributes'], x['attribute_values'], and x['attribute_years'] (assuming that the three lists in a row always have the same length and that ',' is the separator), then build a new dictionary per row. After collecting all dictionaries with the new key-value pairs, you can build a new pandas.DataFrame and join it to the original if you like:

data = []
for i, row in x.iterrows():
    # extract data
    att = row['attributes'].split(',')
    val = list(map(int, row['attribute_values'].split(',')))
    yrs = list(map(int, row['attribute_years'].split(',')))
    # create new dictionaries
    row_new = {f'attribute_values_{a}': v for a, v in zip(att, val)}
    row_new.update({f'attribute_years_{a}': y for a, y in zip(att, yrs)})
    # concatenate dictionaries and append to list
    data.append(row_new)
# create table from list of rows
pd.DataFrame(data, index=x.index)

output:

index attribute_values_VIS attribute_values_TEMP attribute_values_MIN attribute_years_VIS attribute_years_TEMP attribute_years_MIN
0 0 4.0 2 2000 2001.0 2002
1 3 0.0 2 2002 2003.0 2001
2 9 NaN 0 2009 NaN 2008
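
The TEMP columns come out as floats because a missing value forces pandas to use NaN. If you want whole numbers and the new columns attached to the original frame, something like the following might work. This is only a sketch: the helper name widen and its parameters key and cols are my own invention for illustration, and it makes the same assumptions as the loop above (equal-length lists per row, ',' as the separator, integer values).

```python
import pandas as pd

x = pd.DataFrame({
    'id': [1, 2, 3],
    'attributes': ['VIS,TEMP,MIN', 'MIN,VIS,TEMP', 'MIN,VIS'],
    'attribute_values': ['0,4,2', '2,3,0', '0,9'],
    'attribute_years': ['2000,2001,2002', '2001,2002,2003', '2008,2009'],
})


def widen(df, key='attributes', cols=('attribute_values', 'attribute_years')):
    """Hypothetical helper: one new column per (column, attribute) pair."""
    rows = []
    for _, row in df.iterrows():
        names = row[key].split(',')
        new_row = {}
        for col in cols:
            # pair each attribute name with the item at the same position
            for name, item in zip(names, row[col].split(',')):
                new_row[f'{col}_{name}'] = int(item)
        rows.append(new_row)
    wide = pd.DataFrame(rows, index=df.index)
    # nullable integer dtype: keeps integers, shows <NA> where a value is missing
    wide = wide.astype('Int64')
    # drop the comma-separated source columns and attach the new ones by index
    return df.drop(columns=[key, *cols]).join(wide)


result = widen(x)
print(result)
```

The join is done on the index, so the rows stay matched even if the original frame does not use a default 0..n-1 index.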

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1