How do I combine two datasets in pandas and keep unique rows only?

I have a dataset with a product list that gets a new row every time a customer makes a purchase. As a result, the same customer number occurs in multiple rows, because customers purchase multiple products. I have managed to create a new df grouped per customer number, which reduces the dataset from 16,000 rows to around 3,000. Now, when I combine the two, I want to keep the grouping by party number so the data stays organized. But for some reason the result keeps going back to 16,000 rows.

My code is as follows:

#Create a pandas DataFrame with the useful variables
prod = pt[['Party Nbr', 'Product Nm', 'Category Desc', 'Group Desc']].copy()

#Add a column with the number of products per customer
prod['Product Cnt'] = prod.groupby('Party Nbr')['Party Nbr'].transform('count')
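
A note on the line above: `transform('count')` returns one value per original row, so `prod` keeps all of its purchase rows and the count is simply repeated for each customer. The sketch below contrasts that with `size()`, which returns one row per group. The toy data is hypothetical and only stands in for `pt`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real purchase table
prod = pd.DataFrame({
    "Party Nbr": [1.0, 1.0, 1.0, 2.0, 2.0],
    "Product Nm": ["Betaalpas", "PIN Pakket", "Mijn ING.nl",
                   "Betaalpas", "ING Business Card"],
})

# transform: result is aligned to the original rows (5 values here)
prod["Product Cnt"] = prod.groupby("Party Nbr")["Party Nbr"].transform("count")

# size: result has one row per customer (2 rows here)
per_customer = prod.groupby("Party Nbr").size().reset_index(name="Product Cnt")

print(len(prod), len(per_customer))
```

So a frame built with `transform` still has one row per purchase, while the `size()` version already has one row per customer.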

This is a sample of rows from the result:

Party Nbr	Product Nm	Category Desc	Group Desc	Product Cnt
79695728.0	Betaalpas	Betaaldiensten	Pas	14
79741169.0	ING Business Card	Betaaldiensten	Creditcard	21
79907032.0	Mijn ING.nl	Betaaldiensten	Beheerfaciliteit	4
80139442.0	Zakelijke Oranje Spaarrekening	Sparen	Giraal sparen	7
80193730.0	PIN Pakket	Betaaldiensten	Betaalfaciliteit	5

The result has 16,000 rows.

Then I grouped on party number to get a column of grouped categories, like this:

pf = prod.groupby(['Party Nbr'])['Category Desc'].apply(list).reset_index().rename(columns={'Category Desc': 'Categories'})
pf['Categories'] = pf['Categories'].apply(set).apply(tuple)
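
As an aside, the list, set, tuple chain above can be collapsed into one step that also drops missing values and gives a stable order. This is only a sketch on hypothetical toy data; whether `nan` should be excluded from the categories is an assumption, not something the code above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing category
prod = pd.DataFrame({
    "Party Nbr": [1.0, 1.0, 1.0, 2.0, 2.0],
    "Category Desc": ["Sparen", "Betaaldiensten", np.nan,
                      "Betaaldiensten", "Betaaldiensten"],
})

# One tuple of unique, sorted, non-missing categories per customer
pf = (
    prod.groupby("Party Nbr")["Category Desc"]
    .apply(lambda s: tuple(sorted(s.dropna().unique())))
    .reset_index()
    .rename(columns={"Category Desc": "Categories"})
)

print(pf)
```

Sorting makes the tuples comparable between customers, which the unordered `set` version does not guarantee.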

This gives me the following, with 3,000 rows:

Party Nbr	Categories
79687857.0	(Betaaldiensten, Sparen)
79687954.0	(nan, Betaaldiensten, Sparen)
79688233.0	(Betaaldiensten,)
79688438.0	(Betaaldiensten, Sparen)
79688845.0	(Betaaldiensten, Sparen)

How can I combine the two and keep one row per party number, as in the second table?
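
One possible explanation, assuming the two frames were combined with a merge on `Party Nbr`: merging the grouped table onto `prod` is many-to-one, so every purchase row is kept and the row count returns to the size of `prod`. A minimal sketch of an alternative is to reduce `prod` to its per-customer columns first and then merge one-to-one. The toy data and the names `many`, `per_customer` and `combined` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real purchase table
prod = pd.DataFrame({
    "Party Nbr": [1.0, 1.0, 1.0, 2.0, 2.0],
    "Product Nm": ["Betaalpas", "PIN Pakket", "Mijn ING.nl",
                   "Betaalpas", "ING Business Card"],
    "Category Desc": ["Sparen", "Betaaldiensten", np.nan,
                      "Betaaldiensten", "Betaaldiensten"],
})
prod["Product Cnt"] = prod.groupby("Party Nbr")["Party Nbr"].transform("count")

# Grouped table: one row per customer
pf = (
    prod.groupby("Party Nbr")["Category Desc"]
    .apply(lambda s: tuple(sorted(s.dropna().unique())))
    .reset_index()
    .rename(columns={"Category Desc": "Categories"})
)

# Merging onto prod keeps every purchase row (5 rows here)
many = prod.merge(pf, on="Party Nbr", how="left")

# Keep only columns that are constant per customer, one row each
per_customer = prod[["Party Nbr", "Product Cnt"]].drop_duplicates("Party Nbr")

# One-to-one merge; validate raises if either side has duplicate keys
combined = pf.merge(per_customer, on="Party Nbr", how="left",
                    validate="one_to_one")

print(len(many), len(combined))
```

Columns that differ per purchase, such as `Product Nm`, cannot be carried over as single values; they would need to be aggregated per customer in the same way as `Categories`.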

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
