'Find differences between a set of csv files in folder 1 against a set of csv files in folder 2?
There are a number a files that need to be compared for differences in their rows; difference not as in subtraction but as in what values are different for each row of a column. Each file is named with its timestamp in it along with information about client as follows: Timecard-MC1010-19-20220507140122-Reported. All the files have a group of columns that are similar for both groups of files. The groups are labeled OlderVersion and NewVersion with a large number of files each. The files are named by their timestamp and in the output when one is showing which record is different, I need to display the timestamp for the file along with the information of the rows that are different for each file as follows: So let's say I have two folders - Folder A and Folder B, with two files in A and one in B. After running the program it should display the following: Folder A had a record in file "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341
I have managed to show in concatenated dataframe the differenecs in files using the following code. But I don't know how to extract the highlighted differences except for the NAN fields and the timestamp fields from the dataframe. After running the program it should display the following: Folder A had a record in file "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341
import pandas as pd
import os
path1 = r"C:\\Users\\Bilal\\Python\\Task1\\NewVersionFiles\\"
path2 = r"C:\\Users\\Bilal\\Python\\Task1\\OlderVersionFiles\\"
files1 = [os.path.join(path1, x) for x in os.listdir(path1) if '.csv' in str(x)]
files2 = [os.path.join(path2, x) for x in os.listdir(path2) if '.csv' in str(x)]
li1= []
li2 = []
for filename in files1:
df1 = pd.read_csv(filename, index_col=None, header=0)
li1.append(df1)
frame1 = pd.concat(li, axis=0, ignore_index=True)
for filename in files2:
df2 = pd.read_csv(filename, index_col=None, header=0)
li2.append(df2)
frame2 = pd.concat(li, axis=0, ignore_index=True)
#here 45844 is the len of files in folder A
seq = list(range(45844))
frame['idx'] = seq
frame.head()
#here 29166 is the len of files in folder B
seq2 = list(range(29166))
frame2['idx'] = seq2
frame2.head()
#Then I add index column for both dataframes
# shift column 'Name' to first position
first_column = frame2.pop('idx')
# insert column using insert(position,column_name,
# first_column) function
frame2.insert(0, 'idx', first_column)
frame2.head()
# shift column 'Name' to first position
first_column = frame.pop('idx')
# insert column using insert(position,column_name,
# first_column) function
frame.insert(0, 'idx', first_column)
frame.head()
import numpy as np
df_all = pd.concat([frame.set_index('idx'), frame2.set_index('idx')],
axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final
def highlight_diff(data, color='yellow'):
attr = 'background-color: {}'.format(color)
other = data.xs('First', axis='columns', level=-1)
return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
index=data.index, columns=data.columns)
df_final.style.apply(highlight_diff, axis=None)
Now this code highlights the differences between the columns for First and Second Folder files. However, I am trying to extract the information instead of highlighting on dataframe. What code can I write to display the following: "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|