'Python - Compare values in multiple rows and columns
I am new to Python, and I have a dataframe looks like below (see the raw data table).
The ultimate goal is to reconcile records (ie. id, rg, prd, and amount) from 2 system (sys1 vs sys2, which is labeled in column 'source').
How can i use 1 function (yes, 1 function please) to compare 2 rows of records:
- sort the table by (in this order) id and source
- if the id are same (e.g. row 0 and row 1, or row 2 & row 3), then compare the values in columns, 'RG', 'Prd', and 'amount'
** the below contains the new ask **
3. create a new column ('compare_results') to label the comparison result
a. values in all column are the same (amt diff within 1)
b. when the absolute value of difference in amount columns are greater than 1, then label it as "2_amt_" + any other column name + "diff" (examples, row 2-3 and rows 4-5)
c. when values in columns other than the amount columns are diff, then label it as "3_" + column name + "diff" (examples, row 6-7)
(see the final data table)
raw data table:
row |source| id | RG | Prd | amt_1 | amt_2
0 | sys1 | 001yVwFGA| rg_01| Add On | 234577 | 235
1 | sys2 | 001yVwFGA| rg_01| Add On | 234577 | 235
2 | sys1 | 0020G6ECQ| rg_02| New | 5746 | 987
3 | sys2 | 0020G6ECQ| rg_01| New | 5789 | 987
4 | sys1 | 0020G6EKQ| rg_03| New | 95692 | 90
5 | sys2 | 0020G6EKQ| rg_03| Add On | 98692 | 90
6 | sys1 | 0C1vWwiwA| rg_01| New | 1673 | 50
7 | sys2 | 0C1vWwiwA| rg_02| Add On | 1673 | 50
8 | sys1 | 0S5555gBB| rg_01| Renewal | 5000 | 0
9 | sys1 | 0C1vWwiwA| rg_02| New | 73 | 100.0
10 | sys2 | 0C1vWwiwA| rg_02| New | 73 | 100.5
final data table (new request on May 3rd, 2022):
row |source| id | RG | Prd |amt_1 | amt_2 | compare_results 0 | sys1 | 001yVwFGA|rg_01| Add On|234577 | 235 | 1_same 1 | sys2 | 001yVwFGA|rg_01| Add On|234577 | 235 | 1_same 2 | sys1 | 0020G6ECQ|rg_02| New| 5746 | 987 | 2_amt1_amt2_rg_diff 3 | sys2 | 0020G6ECQ|rg_01| New| 5789 | 987 | 2_amt1_amt2_rg_diff 4 | sys1 | 0020G6EKQ|rg_03| New| 95692 | 90 | 2_amt1_Prd_diff 5 | sys2 | 0020G6EKQ|rg_03| Add On| 98692 | 90 | 2_amt1_Prd_diff 6 | sys1 | 0C1vWwiwA|rg_01| New| 1673 | 50 | 3_rg_prd_diff 7 | sys2 | 0C1vWwiwA|rg_02| Add On| 1673 | 50 | 3_rg_prd_diff 8 | sys1 | 0S5555gBB|rg_01|Renewal| 5000 | 0 | 5_unique 9 | sys1 | 0C1vWwiwA|rg_02| New| 73 | 100.0 | 1_same 10 | sys2 | 0C1vWwiwA|rg_02| New| 73 | 100.5 | 1_same
final data table (requested & answered on Mar 21, 2022):
row |source| id | RG | Prd | amount | compare_results
0 | sys1 | 001yVwFGA| rg_01| Add On | 234577 | 1_same
1 | sys2 | 001yVwFGA| rg_01| Add On | 234577 | 1_same
2 | sys1 | 0020G6ECQ| rg_02| New | 5746 | 2_rg_diff
3 | sys2 | 0020G6ECQ| rg_01| New | 5789 | 2_rg_diff
4 | sys1 | 0020G6EKQ| rg_03| New | 95692 | 3_prd_diff
5 | sys2 | 0020G6EKQ| rg_03| Add On | 98692 | 3_prd_diff
6 | sys1 | 0C1vWwiwA| rg_01| New | 1673 | 4_rg_prd_diff
7 | sys2 | 0C1vWwiwA| rg_02| Add On | 1673 | 4_rg_prd_diff
8 | sys1 | 0S5555gBB| rg_01| Renewal | 5000 | 5_unique
Solution 1:[1]
This will do what you're asking:
records = [
{'row': 0, 'source': 'sys1', 'id': '001yVwFGA', 'RG': 'rg_01', 'Prd': 'Add On', 'amount': 234577},
{'row': 1, 'source': 'sys2', 'id': '001yVwFGA', 'RG': 'rg_01', 'Prd': 'Add On', 'amount': 234577},
{'row': 2, 'source': 'sys1', 'id': '0020G6ECQ', 'RG': 'rg_02', 'Prd': 'New', 'amount': 5746},
{'row': 3, 'source': 'sys2', 'id': '0020G6ECQ', 'RG': 'rg_01', 'Prd': 'New', 'amount': 5789},
{'row': 4, 'source': 'sys1', 'id': '0020G6EKQ', 'RG': 'rg_03', 'Prd': 'New', 'amount': 95692},
{'row': 5, 'source': 'sys2', 'id': '0020G6EKQ', 'RG': 'rg_03', 'Prd': 'Add On', 'amount': 98692},
{'row': 6, 'source': 'sys1', 'id': '0C1vWwiwA', 'RG': 'rg_01', 'Prd': 'New', 'amount': 1673},
{'row': 7, 'source': 'sys2', 'id': '0C1vWwiwA', 'RG': 'rg_02', 'Prd': 'Add On', 'amount': 1673},
{'row': 8, 'source': 'sys1', 'id': '0S5555gBB', 'RG': 'rg_01', 'Prd': 'Renewal', 'amount': 5000}
]
'''
Question asks how to:
1. sort the table by (in this order) id and source
2. if the id are same (e.g. row 0 and row 1, or row 2 & row 3), then compare the values in columns, 'RG', 'Prd', and 'amount'
3. create a new column ('compare_results') to label the comparison result (see the final data table)
'''
import pandas as pd
df = pd.DataFrame(records)
def reconcile(df):
df.sort_values(by=['id', 'source'], inplace=True)
grps = df.groupby(['id'])
grpOffsetById = {}
i = 0
for id, _ in grps.groups.items():
i = i + 1
grpOffsetById[id] = i
resultById = {}
def compareIfPair(dfGrp):
nRows = dfGrp.shape[0]
row1 = dfGrp.iloc[0]
i = grpOffsetById[row1['id']]
diffStr = ''
if nRows != 2:
diffStr = str(i) + '_unique'
else:
row2 = dfGrp.iloc[1]
diffList = []
for col in ('RG', 'Prd', 'amount'):
if row1[col] != row2[col]:
diffList.append(col.lower())
if diffList:
diffStr = str(i) + '_' + '_'.join(diffList) + '_diff'
else:
diffStr = str(i) + '_same'
resultById[row1['id']] = diffStr
return diffStr
grps.apply(compareIfPair).reset_index()
df['compare_results'] = df.apply(lambda x: resultById[x['id']], axis=1)
return df
reconcile(df)
print(df.to_string(index=False))
Output:
row source id RG Prd amount compare_results
0 sys1 001yVwFGA rg_01 Add On 234577 1_same
1 sys2 001yVwFGA rg_01 Add On 234577 1_same
2 sys1 0020G6ECQ rg_02 New 5746 2_rg_amount_diff
3 sys2 0020G6ECQ rg_01 New 5789 2_rg_amount_diff
4 sys1 0020G6EKQ rg_03 New 95692 3_prd_amount_diff
5 sys2 0020G6EKQ rg_03 Add On 98692 3_prd_amount_diff
6 sys1 0C1vWwiwA rg_01 New 1673 4_rg_prd_diff
7 sys2 0C1vWwiwA rg_02 Add On 1673 4_rg_prd_diff
8 sys1 0S5555gBB rg_01 Renewal 5000 5_unique
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | constantstranger |