Pandas: Three listed columns to wide format
How to expand a set of columns using the first column's values as headers for the other columns?
For example:
x = pd.DataFrame({'id':[11,998,3923], 'count':[7,7,7],
'attributes':['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
'attribute_values':['0,4,2','2,3,0','0,9'],
'attribute_years':['2000,2001,2002','2001,2002,2003','2008,2009']})
(Edit: note that attributes could be out of order or missing.)
| index | id | count | attributes | attribute_values | attribute_years |
|---|---|---|---|---|---|
| 0 | 11 | 7 | VIS,TEMP,MIN | 0,4,2 | 2000,2001,2002 |
| 1 | 998 | 7 | MIN,VIS,TEMP | 2,3,0 | 2001,2002,2003 |
| 2 | 3923 | 7 | MIN,VIS | 0,9 | 2008,2009 |
In this case, the values in the attributes column should be used to build new columns from the attribute_values and attribute_years columns.
Ideal output:
| index | id | count | attribute_values_VIS | attribute_values_TEMP | attribute_values_MIN | attribute_years_VIS | attribute_years_TEMP | attribute_years_MIN |
|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 7 | 0 | 4 | 2 | 2000 | 2001 | 2002 |
| 1 | 998 | 7 | 3 | 0 | 2 | 2002 | 2003 | 2001 |
| 2 | 3923 | 7 | 9 | NaN | 0 | 2009 | NaN | 2008 |
Solution 1:[1]
I'm not aware of a single built-in pandas function that splits these strings and creates the new columns in one step. However, you can easily write a small function of your own. Assuming your example:
import pandas as pd
x = pd.DataFrame({'id': [1,2,3],
'attributes': ['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
'attribute_values': ['0,4,2','2,3,0','0,9'],
'attribute_years': ['2000,2001,2002','2001,2002,2003','2008,2009']})
You can loop over the rows, split the strings in x['attributes'], x['attribute_values'], and x['attribute_years'] (assuming that the three lists in a row always have the same length and that ',' is the separator), and build a new dictionary for each row. After collecting all the dictionaries with the new key-value pairs, you can build a new pandas.DataFrame from them and attach it to the original if you like:
data = []
for i, row in x.iterrows():
    # extract data: split each comma-separated string into a list
    att = row['attributes'].split(',')
    val = list(map(int, row['attribute_values'].split(',')))
    yrs = list(map(int, row['attribute_years'].split(',')))
    # create a new dictionary keyed by the target column names
    row_new = {f'attribute_values_{a}': v for a, v in zip(att, val)}
    row_new.update({f'attribute_years_{a}': y for a, y in zip(att, yrs)})
    # append the combined dictionary to the list of rows
    data.append(row_new)
# create table from list of rows
pd.DataFrame(data, index=x.index)
Output:
| index | attribute_values_VIS | attribute_values_TEMP | attribute_values_MIN | attribute_years_VIS | attribute_years_TEMP | attribute_years_MIN |
|---|---|---|---|---|---|---|
| 0 | 0 | 4.0 | 2 | 2000 | 2001.0 | 2002 |
| 1 | 3 | 0.0 | 2 | 2002 | 2003.0 | 2001 |
| 2 | 9 | NaN | 0 | 2009 | NaN | 2008 |
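As a possible alternative to the explicit loop, the same reshaping can be sketched with str.split, DataFrame.explode, and unstack. This is only a sketch under a few assumptions: pandas 1.3 or newer (needed to explode several columns at once), the three lists in each row have equal length, and no attribute repeats within a row. The names list_cols, long_df, wide, and result are illustrative and not part of the original answer. It uses the data from the question so that id and count are kept in the result.

```python
import pandas as pd

x = pd.DataFrame({
    'id': [11, 998, 3923],
    'count': [7, 7, 7],
    'attributes': ['VIS,TEMP,MIN', 'MIN,VIS,TEMP', 'MIN,VIS'],
    'attribute_values': ['0,4,2', '2,3,0', '0,9'],
    'attribute_years': ['2000,2001,2002', '2001,2002,2003', '2008,2009'],
})

list_cols = ['attributes', 'attribute_values', 'attribute_years']

# Split every comma-separated string into a list, then explode the three
# list columns together so each attribute gets its own row.
# (Multi-column explode assumes pandas >= 1.3 and equal list lengths per row.)
long_df = x[list_cols].apply(lambda s: s.str.split(',')).explode(list_cols)

# The exploded cells are still strings; convert the numeric ones.
long_df = long_df.astype({'attribute_values': int, 'attribute_years': int})

# Move the attribute names into the column axis. Attributes missing from a
# row become NaN.
wide = long_df.set_index('attributes', append=True).unstack('attributes')

# Flatten the two-level column index into names like 'attribute_values_VIS'.
wide.columns = [f'{col}_{att}' for col, att in wide.columns]

# Attach the new columns to the untouched id/count columns via the row index.
result = x[['id', 'count']].join(wide)
print(result)
```

Note that unstack sorts the attribute names, so the columns come out as MIN, TEMP, VIS rather than in the order of the ideal output; reorder them with result[[...]] if the order matters.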
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
