How to concatenate columns from rows that share the same email address, using Python?
I need to concatenate two or more rows of a CSV file that have the same email id, where some columns hold different values for that email id across rows. Below are a sample input file, the required output, what I have tried, and the result I am getting.
sample i/p file:
email,real_email,policy,fr_details,fr_subduction,date
[email protected],[email protected],12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
[email protected],[email protected],12345678,ID Proof,Photograph is not available,10.1.2022
[email protected],[email protected],12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
[email protected],[email protected],12345678,Pan Card / Form 60,Pan Card / Form 60 is not Submitted,10.1.2022
sample required o/p:
"email","real_email","policy","fr_details","fr_subduction","fr_details1","fr_subduction1","date"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire","Please provide Covid_19 questionnaire","ID Proof","Photograph is not available","10.1.2022"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire","Please provide Covid_19 questionnaire","Pan Card / Form 60","Pan Card / Form 60 is not Submitted","10.1.2022"
(So, for example: if [email protected] is repeated 3 times with 3 different fr_* values, then the output will have one more additional pair of columns: fr_details2, fr_subduction2.)
code I have written:
import csv
import pandas

def get_elem(x):
    x = x.tolist()
    if x[0] == x[1]:
        return x[0]
    else:
        return x[0] + "," + x[1]

df = pandas.read_csv('in.csv')
df1 = df.groupby('email', as_index=False).agg(get_elem)
df1 = df1.reset_index(drop=True)
df1.to_csv('out.csv', sep=',', index=False, header=True, quoting=csv.QUOTE_ALL)
output for the above code:
"email","real_email","policy","fr_details","fr_subduction","date"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire,ID Proof","Please provide Covid_19 questionnaire,Photograph is not available","10.1.2022"
"[email protected]","[email protected]","12345678","Covid_19 Questionnaire,Pan Card / Form 60","Please provide Covid_19 questionnaire,Pan Card / Form 60 is not Submitted","10.1.2022"
With my code, the fr_* values are concatenated inside the same column. Instead, I need them split into separate columns, repeated as many times as they occur in sequence — fr_details(n), fr_subduction(n) — for the same email id.
Solution 1:[1]
Filter the fr_-like columns, then drop them from the column list; the remaining columns identify a logical record and are used for grouping. Create a sequential counter with cumcount to number the duplicate rows within each group, pivot the dataframe to reshape it, and finally flatten the MultiIndex columns if required:
fr = df.filter(like='fr_')                   # columns whose values vary per duplicate
keys = df.columns.drop(fr.columns).tolist()  # remaining columns identify a record

# Number the duplicate rows within each group: 1, 2, ...
df['col'] = df.groupby(keys).cumcount().add(1).astype(str)

# Spread the fr_* columns out, one set per counter value
out = df.pivot(index=keys, columns='col')
out.columns = out.columns.map('_'.join)
Result
>>> out
fr_details_1 fr_details_2 fr_subduction_1 fr_subduction_2
email real_email policy date
[email protected] [email protected] 12345678 10.1.2022 Covid_19 Questionnaire ID Proof Please provide Covid_19 questionnaire Photograph is not available
[email protected] [email protected] 12345678 10.1.2022 Covid_19 Questionnaire Pan Card / Form 60 Please provide Covid_19 questionnaire Pan Card / Form 60 is not Submitted
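As an end-to-end check, here is a self-contained sketch of the same pipeline run on placeholder data (the a@example.com / b@example.com addresses are stand-ins, not the asker's actual values), finishing with reset_index so the key columns become ordinary columns again:

```python
import io

import pandas as pd

# Placeholder data mirroring the question's structure; emails are hypothetical.
raw = """email,real_email,policy,fr_details,fr_subduction,date
a@example.com,a@example.com,12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
a@example.com,a@example.com,12345678,ID Proof,Photograph is not available,10.1.2022
b@example.com,b@example.com,12345678,Covid_19 Questionnaire,Please provide Covid_19 questionnaire,10.1.2022
b@example.com,b@example.com,12345678,Pan Card / Form 60,Pan Card / Form 60 is not Submitted,10.1.2022
"""
df = pd.read_csv(io.StringIO(raw))

fr = df.filter(like='fr_')                   # columns whose values vary per duplicate
keys = df.columns.drop(fr.columns).tolist()  # remaining columns identify a record

# Counter 1, 2, ... for the duplicate rows within each group
df['col'] = df.groupby(keys).cumcount().add(1).astype(str)

# One row per key combination; fr_* columns repeated per counter value
out = df.pivot(index=keys, columns='col')
out.columns = out.columns.map('_'.join)
out = out.reset_index()
```

Note that the resulting columns come out grouped by source column (fr_details_1, fr_details_2, fr_subduction_1, fr_subduction_2) rather than interleaved as in the required o/p; a final reindex of the columns can restore the interleaved order. Writing `out.to_csv('out.csv', index=False, quoting=csv.QUOTE_ALL)` then produces the quoted CSV.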
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Shubham Sharma |