'Compare two files and output results according to conditionals
I need to compare two input files (file1.csv and file2.csv) and store the results and a third file (file3.csv).
The conditions are as follows:
- If entry matches file1 and file2, store it on file3
- If entry is found on file1, store it on file3
It looks like this:
input: file1.csv
"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"
input: file2.csv
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"
results: file3.csv
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"switch6"
"switch10"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
I have tried lots of combinations using 'awk'; however I cannot construct the conditionals.
I was wondering if someone could help me construct the conditional(s) to build this results file3.csv.
Assistance is much appreciated.
Solution 1:[1]
I would use GNU AWK for this task following way, let file1.txt content be
"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"
and file2.txt content be
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"
then
awk 'BEGIN{FS=";"}(NR==FNR){arr[$1]=$0}((NR!=FNR)&&($1 in arr)){arr[$1]=$0}END{for(i in arr){print arr[i]}}' file1.txt file2.txt
output
"switch6"
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch10"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
Disclaimer: I assume that you are able to accept any order of lines in output file. Explanation: I inform GNU AWK that ; is used as field separator (FS) then when processing first file (NR==FNR) I do fill array arr by putting whole current line ($0) under key which is content of first field ($1). When processing next files (NR!=FNR) if there is value corresponding to first field of line in array arr then I update its value to whole current line. After all files are processed I do print all values from array arr using for.
(tested in gawk 4.2.1)
Solution 2:[2]
If awk is not mandatory, you can use the great and simple Miller
Running
mlr --csv --fs ";" -N join --ul -j 1 -f input_01.csv then unsparsify then sort -f 1 input_02.csv
you have
router12;;
router44;DatacenterC - cab2;Test - Tenant2
router51;DatacenterA - cab1;Prod - Tenant12
switch10;;
switch2;DatacenterA - cab3;Dev - Tenant5
switch33;DatacenterB - cab14;Prod - Tenant4
switch6;;
Some notes:
--csv --fs ";"to set that the input and the output is a CSV in which the field separator is;-Nto set that the input and the output do not have header row;join --ul -j 1to run a left join based on the first column
Solution 3:[3]
Similar idea as @Daweo but:
- use the file names provided
- hard-code file names in script instead of arguments as they are not interchangeable
- yield result in requested order
You would save the script as task, chmod 755 task then run it as ./task > file3.csv:
#!/usr/bin/env -S awk -F; -f
BEGIN {
while ( (getline <"file2.csv") > 0 ) {
lookup[$1] = $0
}
while ( ("sort -k1.1,1.7 -k1.8n file1.csv" | getline) > 0 ) {
print ($0 in lookup) ? lookup[$0] : $0
}
}
Note: sort used here is fragile as it relies on fixed size prefix (router or switch) to identify the two parts of the compound key.
Using gawk's PROCINFO["sorted_in"] feature to return keys in a specify order and using a custom compare function str_num_cmp() makes it a bit more robust:
#!/usr/bin/env -S awk -F; -f
function str_num_to_a(i, a) {
match(i, /"([^0-9]*)([0-9]+)"/, a)
}
# a1/a2 are specified as arguments to make them local variables
function str_num_cmp(i1, v1, i2, v2, a1, a2) {
str_num_to_a(i1, a1)
str_num_to_a(i2, a2)
if(a1[1] < a2[1])
return -1
if(a1[1] == a2[1]) {
if(a1[2] < a2[2])
return -1
if(a1[2] == a2[2])
return 0
}
return 1
}
BEGIN {
while ( (getline <"file1.csv") > 0 ) {
keys[$1] = ""
}
while ( (getline <"file2.csv") > 0 ) {
lookup[$1] = $0
}
PROCINFO["sorted_in"] = "str_num_cmp"
for (k in keys) {
print (k in lookup) ? lookup[k] : k
}
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Daweo |
| Solution 2 | aborruso |
| Solution 3 |
