'Python - Compare values in multiple rows and columns

I am new to Python, and I have a dataframe looks like below (see the raw data table).

The ultimate goal is to reconcile records (ie. id, rg, prd, and amount) from 2 system (sys1 vs sys2, which is labeled in column 'source').

How can i use 1 function (yes, 1 function please) to compare 2 rows of records:

  1. sort the table by (in this order) id and source
  2. if the id are same (e.g. row 0 and row 1, or row 2 & row 3), then compare the values in columns, 'RG', 'Prd', and 'amount'

** the below contains the new ask ** 3. create a new column ('compare_results') to label the comparison result a. values in all column are the same (amt diff within 1)
b. when the absolute value of difference in amount columns are greater than 1, then label it as "2_amt_" + any other column name + "diff" (examples, row 2-3 and rows 4-5) c. when values in columns other than the amount columns are diff, then label it as "3_" + column name + "diff" (examples, row 6-7)

(see the final data table)

raw data table:

row |source| id | RG | Prd | amt_1 | amt_2
0 | sys1 | 001yVwFGA| rg_01| Add On | 234577 | 235 1 | sys2 | 001yVwFGA| rg_01| Add On | 234577 | 235 2 | sys1 | 0020G6ECQ| rg_02| New | 5746 | 987 3 | sys2 | 0020G6ECQ| rg_01| New | 5789 | 987 4 | sys1 | 0020G6EKQ| rg_03| New | 95692 | 90 5 | sys2 | 0020G6EKQ| rg_03| Add On | 98692 | 90 6 | sys1 | 0C1vWwiwA| rg_01| New | 1673 | 50 7 | sys2 | 0C1vWwiwA| rg_02| Add On | 1673 | 50 8 | sys1 | 0S5555gBB| rg_01| Renewal | 5000 | 0 9 | sys1 | 0C1vWwiwA| rg_02| New | 73 | 100.0 10 | sys2 | 0C1vWwiwA| rg_02| New | 73 | 100.5

final data table (new request on May 3rd, 2022):

row |source| id | RG | Prd |amt_1 | amt_2 | compare_results 0 | sys1 | 001yVwFGA|rg_01| Add On|234577 | 235 | 1_same 1 | sys2 | 001yVwFGA|rg_01| Add On|234577 | 235 | 1_same 2 | sys1 | 0020G6ECQ|rg_02| New| 5746 | 987 | 2_amt1_amt2_rg_diff 3 | sys2 | 0020G6ECQ|rg_01| New| 5789 | 987 | 2_amt1_amt2_rg_diff 4 | sys1 | 0020G6EKQ|rg_03| New| 95692 | 90 | 2_amt1_Prd_diff 5 | sys2 | 0020G6EKQ|rg_03| Add On| 98692 | 90 | 2_amt1_Prd_diff 6 | sys1 | 0C1vWwiwA|rg_01| New| 1673 | 50 | 3_rg_prd_diff 7 | sys2 | 0C1vWwiwA|rg_02| Add On| 1673 | 50 | 3_rg_prd_diff 8 | sys1 | 0S5555gBB|rg_01|Renewal| 5000 | 0 | 5_unique 9 | sys1 | 0C1vWwiwA|rg_02| New| 73 | 100.0 | 1_same 10 | sys2 | 0C1vWwiwA|rg_02| New| 73 | 100.5 | 1_same

final data table (requested & answered on Mar 21, 2022):

row |source| id | RG | Prd | amount | compare_results

0 | sys1 | 001yVwFGA| rg_01| Add On | 234577 | 1_same
1 | sys2 | 001yVwFGA| rg_01| Add On | 234577 | 1_same
2 | sys1 | 0020G6ECQ| rg_02| New | 5746 | 2_rg_diff
3 | sys2 | 0020G6ECQ| rg_01| New | 5789 | 2_rg_diff
4 | sys1 | 0020G6EKQ| rg_03| New | 95692 | 3_prd_diff
5 | sys2 | 0020G6EKQ| rg_03| Add On | 98692 | 3_prd_diff
6 | sys1 | 0C1vWwiwA| rg_01| New | 1673 | 4_rg_prd_diff
7 | sys2 | 0C1vWwiwA| rg_02| Add On | 1673 | 4_rg_prd_diff
8 | sys1 | 0S5555gBB| rg_01| Renewal | 5000 | 5_unique



Solution 1:[1]

This will do what you're asking:

records = [
    {'row': 0, 'source': 'sys1', 'id': '001yVwFGA', 'RG': 'rg_01', 'Prd': 'Add On', 'amount': 234577},
    {'row': 1, 'source': 'sys2', 'id': '001yVwFGA', 'RG': 'rg_01', 'Prd': 'Add On', 'amount': 234577},
    {'row': 2, 'source': 'sys1', 'id': '0020G6ECQ', 'RG': 'rg_02', 'Prd': 'New', 'amount': 5746},
    {'row': 3, 'source': 'sys2', 'id': '0020G6ECQ', 'RG': 'rg_01', 'Prd': 'New', 'amount': 5789},
    {'row': 4, 'source': 'sys1', 'id': '0020G6EKQ', 'RG': 'rg_03', 'Prd': 'New', 'amount': 95692},
    {'row': 5, 'source': 'sys2', 'id': '0020G6EKQ', 'RG': 'rg_03', 'Prd': 'Add On', 'amount': 98692},
    {'row': 6, 'source': 'sys1', 'id': '0C1vWwiwA', 'RG': 'rg_01', 'Prd': 'New', 'amount': 1673},
    {'row': 7, 'source': 'sys2', 'id': '0C1vWwiwA', 'RG': 'rg_02', 'Prd': 'Add On', 'amount': 1673},
    {'row': 8, 'source': 'sys1', 'id': '0S5555gBB', 'RG': 'rg_01', 'Prd': 'Renewal', 'amount': 5000}
]
'''
Question asks how to:
1. sort the table by (in this order) id and source
2. if the id are same (e.g. row 0 and row 1, or row 2 & row 3), then compare the values in columns, 'RG', 'Prd', and 'amount'
3. create a new column ('compare_results') to label the comparison result (see the final data table)
'''

import pandas as pd
df = pd.DataFrame(records)

def reconcile(df):
    df.sort_values(by=['id', 'source'], inplace=True)
    grps = df.groupby(['id'])
    grpOffsetById = {}
    i = 0
    for id, _ in grps.groups.items():
        i = i + 1
        grpOffsetById[id] = i

    resultById = {}
    def compareIfPair(dfGrp):
        nRows = dfGrp.shape[0]
        row1 = dfGrp.iloc[0]
        i = grpOffsetById[row1['id']]
        diffStr = ''
        if nRows != 2:
            diffStr = str(i) + '_unique'
        else:
            row2 = dfGrp.iloc[1]
            diffList = []
            for col in ('RG', 'Prd', 'amount'):
                if row1[col] != row2[col]:
                    diffList.append(col.lower())
            if diffList:
                diffStr = str(i) + '_' + '_'.join(diffList) + '_diff'
            else:
                diffStr = str(i) + '_same'
        resultById[row1['id']] = diffStr
        return diffStr

    grps.apply(compareIfPair).reset_index()
    df['compare_results'] = df.apply(lambda x: resultById[x['id']], axis=1)
    return df

reconcile(df)
print(df.to_string(index=False))

Output:

 row source        id    RG     Prd  amount   compare_results
   0   sys1 001yVwFGA rg_01  Add On  234577            1_same
   1   sys2 001yVwFGA rg_01  Add On  234577            1_same
   2   sys1 0020G6ECQ rg_02     New    5746  2_rg_amount_diff
   3   sys2 0020G6ECQ rg_01     New    5789  2_rg_amount_diff
   4   sys1 0020G6EKQ rg_03     New   95692 3_prd_amount_diff
   5   sys2 0020G6EKQ rg_03  Add On   98692 3_prd_amount_diff
   6   sys1 0C1vWwiwA rg_01     New    1673     4_rg_prd_diff
   7   sys2 0C1vWwiwA rg_02  Add On    1673     4_rg_prd_diff
   8   sys1 0S5555gBB rg_01 Renewal    5000          5_unique

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 constantstranger