'Compare two files and output results according to conditionals

I need to compare two input files (file1.csv and file2.csv) and store the results and a third file (file3.csv).

The conditions are as follows:

  1. If entry matches file1 and file2, store it on file3
  2. If entry is found on file1, store it on file3

It looks like this:

input: file1.csv

"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"

input: file2.csv

"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"

results: file3.csv

"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"switch6"
"switch10"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"

I have tried lots of combinations using 'awk'; however I cannot construct the conditionals.

I was wondering if someone could help me construct the conditional(s) to build this results file3.csv.

Assistance is much appreciated.



Solution 1:[1]

I would use GNU AWK for this task following way, let file1.txt content be

"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"

and file2.txt content be

"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"

then

awk 'BEGIN{FS=";"}(NR==FNR){arr[$1]=$0}((NR!=FNR)&&($1 in arr)){arr[$1]=$0}END{for(i in arr){print arr[i]}}' file1.txt file2.txt

output

"switch6"
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch10"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"

Disclaimer: I assume that you are able to accept any order of lines in output file. Explanation: I inform GNU AWK that ; is used as field separator (FS) then when processing first file (NR==FNR) I do fill array arr by putting whole current line ($0) under key which is content of first field ($1). When processing next files (NR!=FNR) if there is value corresponding to first field of line in array arr then I update its value to whole current line. After all files are processed I do print all values from array arr using for.

(tested in gawk 4.2.1)

Solution 2:[2]

If awk is not mandatory, you can use the great and simple Miller

Running

mlr --csv --fs ";" -N join --ul -j 1 -f input_01.csv then unsparsify then sort -f 1 input_02.csv

you have

router12;;
router44;DatacenterC - cab2;Test - Tenant2
router51;DatacenterA - cab1;Prod - Tenant12
switch10;;
switch2;DatacenterA - cab3;Dev - Tenant5
switch33;DatacenterB - cab14;Prod - Tenant4
switch6;;

Some notes:

  • --csv --fs ";" to set that the input and the output is a CSV in which the field separator is ;
  • -N to set that the input and the output do not have header row;
  • join --ul -j 1 to run a left join based on the first column

Solution 3:[3]

Similar idea as @Daweo but:

  1. use the file names provided
  2. hard-code file names in script instead of arguments as they are not interchangeable
  3. yield result in requested order

You would save the script as task, chmod 755 task then run it as ./task > file3.csv:

#!/usr/bin/env -S awk -F; -f

BEGIN {
    while ( (getline <"file2.csv") > 0 ) {
        lookup[$1] = $0
    }
    while ( ("sort -k1.1,1.7 -k1.8n file1.csv" | getline) > 0 ) {
        print ($0 in lookup) ? lookup[$0] : $0
    }
}

Note: sort used here is fragile as it relies on fixed size prefix (router or switch) to identify the two parts of the compound key.

Using gawk's PROCINFO["sorted_in"] feature to return keys in a specify order and using a custom compare function str_num_cmp() makes it a bit more robust:

#!/usr/bin/env -S awk -F; -f

function str_num_to_a(i, a) {
    match(i, /"([^0-9]*)([0-9]+)"/, a)
}

# a1/a2 are specified as arguments to make them local variables
function str_num_cmp(i1, v1, i2, v2, a1, a2) {
    str_num_to_a(i1, a1)
    str_num_to_a(i2, a2)
    if(a1[1] < a2[1])
        return -1
    if(a1[1] == a2[1]) {
        if(a1[2] < a2[2])
            return -1
        if(a1[2] == a2[2])
            return 0
    }
    return 1
}

BEGIN {
    while ( (getline <"file1.csv") > 0 ) {
        keys[$1] = ""
    }
    while ( (getline <"file2.csv") > 0 ) {
        lookup[$1] = $0
    }
    PROCINFO["sorted_in"] = "str_num_cmp"
    for (k in keys) {
        print (k in lookup) ? lookup[k] : k
    }
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Daweo
Solution 2 aborruso
Solution 3