'Find differences between a set of csv files in folder 1 against a set of csv files in folder 2?

There are a number a files that need to be compared for differences in their rows; difference not as in subtraction but as in what values are different for each row of a column. Each file is named with its timestamp in it along with information about client as follows: Timecard-MC1010-19-20220507140122-Reported. All the files have a group of columns that are similar for both groups of files. The groups are labeled OlderVersion and NewVersion with a large number of files each. The files are named by their timestamp and in the output when one is showing which record is different, I need to display the timestamp for the file along with the information of the rows that are different for each file as follows: So let's say I have two folders - Folder A and Folder B, with two files in A and one in B. After running the program it should display the following: Folder A had a record in file "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341

I have managed to show in concatenated dataframe the differenecs in files using the following code. But I don't know how to extract the highlighted differences except for the NAN fields and the timestamp fields from the dataframe. After running the program it should display the following: Folder A had a record in file "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341

import pandas as pd
import os

path1 = r"C:\\Users\\Bilal\\Python\\Task1\\NewVersionFiles\\"
path2 = r"C:\\Users\\Bilal\\Python\\Task1\\OlderVersionFiles\\"
files1 = [os.path.join(path1, x) for x in os.listdir(path1) if '.csv' in str(x)]
files2 = [os.path.join(path2, x) for x in os.listdir(path2) if '.csv' in str(x)]

li1= []
li2 = []

for filename in files1:
    df1 = pd.read_csv(filename, index_col=None, header=0)
    li1.append(df1)
frame1 = pd.concat(li, axis=0, ignore_index=True)

for filename in files2:
    df2 = pd.read_csv(filename, index_col=None, header=0)
    li2.append(df2)   
frame2 = pd.concat(li, axis=0, ignore_index=True)

#here 45844 is the len of files in folder A
seq = list(range(45844))
frame['idx'] = seq
frame.head()

#here 29166 is the len of files in folder B
seq2 = list(range(29166))
frame2['idx'] = seq2
frame2.head()

#Then I add index column for both dataframes
# shift column 'Name' to first position
first_column = frame2.pop('idx')
# insert column using insert(position,column_name,
# first_column) function
frame2.insert(0, 'idx', first_column)
frame2.head()

# shift column 'Name' to first position
first_column = frame.pop('idx')
  
# insert column using insert(position,column_name,
# first_column) function
frame.insert(0, 'idx', first_column)
frame.head()

import numpy as np
df_all = pd.concat([frame.set_index('idx'), frame2.set_index('idx')], 
                   axis='columns', keys=['First', 'Second'])

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

Now this code highlights the differences between the columns for First and Second Folder files. However, I am trying to extract the information instead of highlighting on dataframe. What code can I write to display the following: "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source