How to concatenate columns from multiple rows that share the same email address, using Python?

I need to concatenate two or more rows of a CSV file that share the same email id, where some columns have different values for that email id across rows. Below are a sample input file, the required output, what I have tried, and the result I am getting.

sample i/p file:


email,real_email,policy,fr_details,fr_subduction,date
[email protected],[email protected],12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
[email protected],[email protected],12345678,ID Proof,Photograph is not available,10.1.2022
[email protected],[email protected],12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
[email protected],[email protected],12345678,Pan Card / Form 60,Pan Card / Form 60 is not Submitted,10.1.2022

sample required o/p:


"email","real_email","policy","fr_details","fr_subduction","fr_details1","fr_subduction1","date"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire","Please provide Covid_19 questionnaire","ID Proof","Photograph is not available","10.1.2022"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire","Please provide Covid_19 questionnaire","Pan Card / Form 60","Pan Card / Form 60 is not Submitted","10.1.2022"

(So, for example, if [email protected] is repeated 3 times with 3 different fr_* values, then the output will have one more pair of columns: fr_details2, fr_subduction2.)

Code I have written:

import csv
import pandas  # was missing: pandas.read_csv is called below

def get_elem(x):
    # x is the Series of values one column takes within a group
    x = x.tolist()
    if x[0] == x[1]:
        return x[0]
    else:
        return x[0] + "," + x[1]


df = pandas.read_csv('in.csv')
df1 = df.groupby('email', as_index=False).agg(get_elem)
df1 = df1.reset_index(drop=True)  # reset_index returns a new frame; assign it back
df1.to_csv('out.csv', sep=',', index=False, header=True, quoting=csv.QUOTE_ALL)

output for the above code:


"email","real_email","policy","fr_details","fr_subduction","date"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire,ID Proof","Please provide Covid_19 questionnaire,Photograph is not available","10.1.2022"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire,Pan Card / Form 60","Please provide Covid_19 questionnaire,Pan Card / Form 60 is not Submitted","10.1.2022"

With my code, the fr_* values end up concatenated inside the same column. Instead, I need them split into separate columns, repeated as many times as they occur for the same email id: fr_details(n), fr_subduction(n).



Solution 1:[1]

Filter the fr_-like columns, then drop them to get the remaining columns, which can be used for grouping. Now create a sequential counter per group using cumcount to identify unique rows, then pivot the dataframe to reshape it, and finally flatten the MultiIndex columns if required:

fr = df.filter(like='fr_')                    # the fr_details / fr_subduction columns
keys = df.columns.drop(fr.columns).tolist()   # the remaining columns, used for grouping

# Number the rows within each group: 1, 2, ... (as strings, for the column suffix)
df['col'] = df.groupby(keys).cumcount().add(1).astype(str)
out = df.pivot(index=keys, columns='col')     # keyword args are required in pandas >= 2.0
out.columns = out.columns.map('_'.join)       # ('fr_details', '1') -> 'fr_details_1'

Result

>>> out
                                                          fr_details_1        fr_details_2                        fr_subduction_1                      fr_subduction_2
email       real_email      policy   date                                                                                                                             
[email protected] [email protected] 12345678 10.1.2022  Covid_19 Questionnaire            ID Proof  Please provide Covid_19 questionnaire          Photograph is not available
[email protected] [email protected] 12345678 10.1.2022  Covid_19 Questionnaire  Pan Card / Form 60  Please provide Covid_19 questionnaire  Pan Card / Form 60 is not Submitted
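Note that the pivoted frame keeps the grouping keys in its index. To get a flat CSV like the required output, reset the index back into ordinary columns before writing. A minimal end-to-end sketch of this approach, using placeholder emails (a@example.com, b@example.com) since the real addresses are redacted in the question:

```python
import csv
import io

import pandas as pd

# Hypothetical sample with the same shape as the question's input.
raw = """email,real_email,policy,fr_details,fr_subduction,date
a@example.com,a@example.com,12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
a@example.com,a@example.com,12345678,ID Proof,Photograph is not available,10.1.2022
b@example.com,b@example.com,12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
b@example.com,b@example.com,12345678,Pan Card / Form 60,Pan Card / Form 60 is not Submitted,10.1.2022
"""
df = pd.read_csv(io.StringIO(raw))

fr = df.filter(like='fr_')                    # fr_details, fr_subduction
keys = df.columns.drop(fr.columns).tolist()   # grouping columns

df['col'] = df.groupby(keys).cumcount().add(1).astype(str)
out = df.pivot(index=keys, columns='col')
out.columns = out.columns.map('_'.join)       # ('fr_details', '1') -> 'fr_details_1'

# Turn the MultiIndex rows back into ordinary columns and render the CSV.
flat = out.reset_index()
csv_text = flat.to_csv(index=False, quoting=csv.QUOTE_ALL)
print(csv_text)
```

With only two rows per email this yields one set of `_1`/`_2` columns; an email repeated n times would produce suffixes up to `_n` automatically, with NaN filling the gaps for emails that occur fewer times.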

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Shubham Sharma