Compare 2 CSV files and remove the common lines from the 1st file | python

I want to compare two CSV files, master.csv and exclude.csv, remove from master.csv all lines whose column 1 matches an entry in exclude.csv, and write the final output back to master.csv.

master.csv

abc,xyz
cde,fgh
ijk,lmn

exclude.csv

###Exclude list###
cde
####

Expected output (it should overwrite master.csv):

abc,xyz
ijk,lmn

What I have tried so far:

with open('exclude.csv','r') as in_file, open('master.csv','w') as out_file:
    seen = set()
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)


Solution 1:[1]

There are probably pandas or other module-based approaches, but here is a pure Python one:

with open("master.csv") as f:
    master = f.read()
with open("exclude.csv") as f:
    exclude = f.read()

master = master.strip().split("\n")
exclude = exclude.strip().split("\n")

# Keep a master line only if no exclude entry occurs anywhere in it
returnList = []
for line in master:
    check = True
    for exc in exclude:
        if exc in line:
            check = False
            break
    if check:
        returnList.append(line)

with open("master.csv", "w") as f:
    f.write("\n".join(returnList))

Output of master.csv

abc,xyz
ijk,lmn
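Note that the substring test `if exc in line` can over-match: an exclude key such as cde would also drop a line like abcde,qrs, and the ### banner lines take part in the comparison too. A stricter variant (a sketch, with sample lists standing in for the file contents) compares only the first column exactly:

```python
# Sample data standing in for the file contents
master = ["abc,xyz", "abcde,qrs", "cde,fgh", "ijk,lmn"]
exclude = ["###Exclude list###", "cde", "####"]

# Build the exclusion set once, skipping the '#' banner lines
exclude_keys = {exc for exc in exclude if not exc.startswith("#")}

# Keep a line only when its first column is not an excluded key
kept = [line for line in master if line.split(",")[0] not in exclude_keys]
print(kept)  # ['abc,xyz', 'abcde,qrs', 'ijk,lmn']
```

With the substring approach, abcde,qrs would have been dropped as well, since "cde" occurs inside it.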

Solution 2:[2]

The simplest way is to use pandas:

import pandas as pd

# Reading the csv file
df_new = pd.read_csv('Names.csv')

# saving xlsx file
GFG = pd.ExcelWriter('Names.xlsx')
df_new.to_excel(GFG, index=False)

GFG.save()
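As posted, this snippet only converts a CSV to an Excel file and does not perform the exclusion the question asks about. A pandas approach that does filter master.csv against exclude.csv might look like the following sketch (here `io.StringIO` objects stand in for the two files; in practice you would pass the file paths instead):

```python
import io
import pandas as pd

# Stand-ins for master.csv and exclude.csv
master_csv = io.StringIO("abc,xyz\ncde,fgh\nijk,lmn\n")
exclude_csv = io.StringIO("###Exclude list###\ncde\n####\n")

# header=None: the files have no header row; comment='#' skips the
# "###Exclude list###" and "####" banner lines in exclude.csv
master = pd.read_csv(master_csv, header=None)
exclude = pd.read_csv(exclude_csv, header=None, comment='#')

# Keep only rows whose first column does not appear in the exclude list
result = master[~master[0].isin(exclude[0])]
print(result.to_csv(header=False, index=False))
# abc,xyz
# ijk,lmn
```

To overwrite the original file, replace the `print` with `result.to_csv('master.csv', header=False, index=False)`.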

Solution 3:[3]

A purely pythonic answer leveraging list comprehensions:

with open('master.csv', 'r') as f:
    keep_lines = f.readlines()

with open('exclude.csv', 'r') as f:
    # Collect the exclude entries into a set; matching by value rather than
    # by line position makes the result independent of line order
    drop_keys = {line.strip() for line in f}

write_lines = [line for line in keep_lines if line.strip().split(',')[0] not in drop_keys]

with open('master.csv', 'w') as f:
    f.writelines(write_lines)

Solution 4:[4]

You can use pandas like this:

import pandas as pd

master_df = pd.read_csv('master.csv')
exclude_df = pd.read_csv('exclude.csv')
conc = pd.concat([master_df, exclude_df])  # concatenate the two dataframes
conc.drop_duplicates(subset=['col1'], inplace=True, keep=False)
print(conc)

drop_duplicates with subset=['col1'] checks for duplicates in col1 only.

The keep parameter allows three values: 'first', 'last', and False. I have chosen keep=False so that no copy of a duplicate is kept at all.
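One caveat with this approach (an edge case not covered in the answer): if exclude.csv lists a value that does not occur in master.csv, that row is not a duplicate, so it survives the concat with NaN in col2. A small demonstration, using a hypothetical stray key 'zzz':

```python
import pandas as pd

master_df = pd.DataFrame({'col1': ['abc', 'cde', 'ijk'],
                          'col2': ['xyz', 'fgh', 'lmn']})
# 'zzz' is in the exclude list but absent from master
exclude_df = pd.DataFrame({'col1': ['cde', 'zzz']})

conc = pd.concat([master_df, exclude_df])
conc = conc.drop_duplicates(subset=['col1'], keep=False)
print(conc)
# 'cde' is gone from both frames, but the stray 'zzz' row
# remains, with NaN in col2
```

Filtering with `master_df[~master_df['col1'].isin(exclude_df['col1'])]` avoids this, since it never adds the exclude rows to the result.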

Dataset:

master.csv:

col1,col2
abc,xyz
cde,fgh
ijk,lmn

exclude.csv:

col1
cde

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Sameer Gaikwad
Solution 3 Michael
Solution 4 Tomato Master