'Python dataframe group and create columns

I have CSV file it has 5 million lines, it is like this

id, name, house 
1,a,house1,
1,aa,house2
1,aaa,house3
2,b,house4
2,bb,house5
2,bbb, house6
3,c, house7
3,cc, house8
3,ccc, house9
4,d, house10
4,dd, house11
4,ddd, house12
4,dddd, house13

I want to group them by id and for result, I need to create this

1,a,house1,aa,house2,aaa,houes3
2,b,house4,bb,house5,bbb,houes6
3,c,house7,cc,house8,ccc,houes9
4,d,house10,dd,house11,ddd,house12

also as you can see that I just try to get first 3,I have a little script which can do what I want, its too slow and not clean, so how can I do better,

    import pandas as pd
    import progressbar

    df = pd.read_csv('test.csv')
    print(df.columns)
    df = df.reset_index()
    l = []
    bar = progressbar.ProgressBar(maxval=len(df.index) +1,
                                  widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    print(len(df.index))
    bar.start()

    for i, row in df.iterrows():
        tmp_sub_list = [row['id']]
        bar.update(i+1)
        for _, row2 in df.iterrows():
            if row['id'] == row2['id']:
                tmp_sub_list.append(row2['name'])
                tmp_sub_list.append(row2['house'])
            if len(tmp_sub_list) >= 7:
                break
        l.append(tmp_sub_list)

    df_result = pd.DataFrame(l, columns=[
        'id', 'name_first', 'house_first', 'name_second', 'house_second', 'name_threeth', 'house_threeth'])
    bar.finish()

also, I am trying to use the progress bar,



Solution 1:[1]

Assuming dataframe is already sorted.

You can use groupby and head to take top 3 rows and aggregate into string list with join.

In the sample data, you have spaces for house columns so I added strip but this is the artifact of copy and paste, take out the strip part.

Also, there are spaces for columns, and if your original csv has space for columns, rename the column to no-space string before next code.

df = df.groupby('id').head(3).groupby('id').agg({
         'name': lambda x: ','.join(x),
         'house': lambda x: ','.join([y.strip() for y in x])
     })

This makes the dataframe to

        name                    house
id
1   a,aa,aaa     house1,house2,house3
2   b,bb,bbb     house4,house5,house6
3   c,cc,ccc     house7,house8,house9
4   d,dd,ddd  house10,house11,house12

You can use str.split to turns these to columns.

df[['name_first', 'name_second', 'name_third']] = df.name.str.split(',', expand=True)
df[['house_first', 'house_second', 'house_third']] = df.house.str.split(',', expand=True)

df = df.reset_index().drop(['name', 'house'], axis=1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Emma