'Python dataframe group and create columns
I have CSV file it has 5 million lines, it is like this
id, name, house
1,a,house1,
1,aa,house2
1,aaa,house3
2,b,house4
2,bb,house5
2,bbb, house6
3,c, house7
3,cc, house8
3,ccc, house9
4,d, house10
4,dd, house11
4,ddd, house12
4,dddd, house13
I want to group them by id and for result, I need to create this
1,a,house1,aa,house2,aaa,houes3
2,b,house4,bb,house5,bbb,houes6
3,c,house7,cc,house8,ccc,houes9
4,d,house10,dd,house11,ddd,house12
also as you can see that I just try to get first 3,I have a little script which can do what I want, its too slow and not clean, so how can I do better,
import pandas as pd
import progressbar
df = pd.read_csv('test.csv')
print(df.columns)
df = df.reset_index()
l = []
bar = progressbar.ProgressBar(maxval=len(df.index) +1,
widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
print(len(df.index))
bar.start()
for i, row in df.iterrows():
tmp_sub_list = [row['id']]
bar.update(i+1)
for _, row2 in df.iterrows():
if row['id'] == row2['id']:
tmp_sub_list.append(row2['name'])
tmp_sub_list.append(row2['house'])
if len(tmp_sub_list) >= 7:
break
l.append(tmp_sub_list)
df_result = pd.DataFrame(l, columns=[
'id', 'name_first', 'house_first', 'name_second', 'house_second', 'name_threeth', 'house_threeth'])
bar.finish()
also, I am trying to use the progress bar,
Solution 1:[1]
Assuming dataframe is already sorted.
You can use groupby and head to take top 3 rows and aggregate into string list with join.
In the sample data, you have spaces for house columns so I added strip but this is the artifact of copy and paste, take out the strip part.
Also, there are spaces for columns, and if your original csv has space for columns, rename the column to no-space string before next code.
df = df.groupby('id').head(3).groupby('id').agg({
'name': lambda x: ','.join(x),
'house': lambda x: ','.join([y.strip() for y in x])
})
This makes the dataframe to
name house
id
1 a,aa,aaa house1,house2,house3
2 b,bb,bbb house4,house5,house6
3 c,cc,ccc house7,house8,house9
4 d,dd,ddd house10,house11,house12
You can use str.split to turns these to columns.
df[['name_first', 'name_second', 'name_third']] = df.name.str.split(',', expand=True)
df[['house_first', 'house_second', 'house_third']] = df.house.str.split(',', expand=True)
df = df.reset_index().drop(['name', 'house'], axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Emma |
