'How can I combine these database rows using Python Pandas?
Inside this .csv file, there is information on Target stores. However, each Target store has 2 accounts, you can distinguish it based on the 071 and 212 at the beginning of the Sold-to Name. You can see for example, Super Target 2169 has 2 rows. I want to clean this sheet up for report purposes and condense both Super Target 2169 rows into 1 Uniquely Identified row, that combined both dollar amounts to be able to easily see how much money that store has done overall.
I have tried iterating even though I was told not to in Pandas. I have also tried using the groupby function, but I just cannot manipulate it in the way I want it. I am not expecting someone to tell me the code, but just a point in the right direction or function in which I can start researching would be amazing.
Solution 1:[1]
You'd want to use groupby, but omit the prefix for each store (so that the actual pair of store rows can be combined). Consider a simplified version of your dataframe:
df = pd.DataFrame({'a':['071 SPR 2169', '071 SPR 21690', '212 SPR 2169', '212 SPR 21690'], 'b':[21.21, 23.21,22.22, 33.43]})
Just stores and sales. You can split out the store prefix and the store name like this:
df[['a1', 'a2']] = df.apply(lambda x: pd.Series([x.a.split(' ')[0], ' '.join(x.a.split(' ')[1:])]), axis=1)
Leaving you with this:
a b a1 a2
0 071 SPR 2169 21.21 071 SPR 2169
1 071 SPR 21690 23.21 071 SPR 21690
2 212 SPR 2169 22.22 212 SPR 2169
3 212 SPR 21690 33.43 212 SPR 21690
Then, groupby on the store name only, using aggregations to sum your sales:
total_df = df.groupby('a2')['b'].sum()
Giving you:
a2
SPR 2169 43.43
SPR 21690 56.64
If you need to include your store information, you can use aggregations with something like:
total_df = df.groupby('a2').agg({'b':'sum', 'city':'first', 'state':'first'})
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AlecZ |
