'How to outer-join two CSV files, using shell script?

I have two CSV files, like the following:

file1.csv

label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"

file2.csv

label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"

Desired Output.csv

label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

Currently with the below awk command i am able to combine the matching one's which have entries for label in both the files like PQR and XYZ but unable to append the ones that are not having label values present in both the files:

awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv


Solution 1:[1]

We suggest gawk script which is standard Linux awk:

script.awk

NR == FNR {
  valsStr = sprintf("%s,%s", $2, "na");
  rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
  split(rowsArr[$1],valsArr);
  valsStr = sprintf("%s,%s", valsArr[1], $2);
  rowsArr[$1] = valsStr;
  next;
}
NR != FNR {
  valsStr = sprintf("%s,%s", "na", $2);
  rowsArr[$1] = valsStr;
}
END {
  printf("%s,%s\n", "label", rowsArr["label"]);
  for (rowName in rowsArr) {
     if (rowName == "label") continue;
     printf("%s,%s\n", rowName, rowsArr[rowName]);
  }
}

output:

awk -F, -f script.awk input.{1,2}.txt

label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4

Solution 2:[2]

This solution prints exactly the wished result with any AWK. Please note that the sorting algorithm is taken from the mawk manual.

# SO71053039.awk

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++)
  {
    hold = A_SWAP[j = i]
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","
  out = "Output.csv"

  # read file 1
  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", LABEL_KEY_SWAP[i])
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
    print result_string > out
  }
}

Call:

awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

Solution 3:[3]

I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive, put the mlr executable somewhere (preferably in your PATH) and you're done with the installation.

mlr --csv \
    join -f file1.csv -j 'label' --ul --ur \
    then \
    unsparsify --fill-with 'NA' \
    then \
    sort -f 'label' \
    file2.csv

Command parts:

  • mlr --csv
    means that you want to read CSV files and output a CSV format. As an other example, if you want to read CSV files and output a JSON format it would be mlr --icsv --ojson
  • join -f file1.csv -j 'label' --ul --ur ...... file2.csv
    means to join file1.csv and file2.csv on the field label and emit the unmatching records of both files
  • then is Miller's way of chaining operations
  • unsparsify --fill-with 'NA'
    means to create the fields that didn't exist in each file and fill them with NA. It's needed for the records that had a uniq label
  • then sort -f 'label'
    means to sort the records on the field label

Regarding the updated question: mlr handles the CSV quoting on its own. The only difference with your new expected output is that it removes the superfluous quotes:

label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

Solution 4:[4]

awk -v OFS=, '{
        if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
    } 
    END{
        for(v in o1) { print v, o1[v], o2[v] }
    }' file{1,2}

## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4

I think this will do nicely.

Solution 5:[5]

Since your question was titled with "how to do ... in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).

It doesn't have a single command that can accomplish what you need, but you can compose a number of commands to get the correct result.

The core of this solution is the join command which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements:

gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC   | 2      |       |        |
+-------+--------+-------+--------+
| XYZ   | 3      | XYZ   | 4      |
+-------+--------+-------+--------+
| PQR   | 6      | PQR   | 6      |
+-------+--------+-------+--------+
|       |        | LMN   | 8      |
+-------+--------+-------+--------+
|       |        | EFG   | 1      |
+-------+--------+-------+--------+

The data-part is correct, but it'll take some work to get the columns correct, and to get the NA values in there.

Here's a complete pipeline:

gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv

echo 'Final'
gocsv view final.csv

which gets us the correct, final, file:

Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC   | 2      | NA     |
+-------+--------+--------+
| EFG   | NA     | 1      |
+-------+--------+--------+
| LMN   | NA     | 8      |
+-------+--------+--------+
| PQR   | 6      | 6      |
+-------+--------+--------+
| XYZ   | 3      | 4      |
+-------+--------+--------+

There's a lot going on in that pipeline, the high points are:

Merge the the two label fields

| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \

Pare-down to just the 3 columns you want

| gocsv select -c 'label','Part-A','Part-B' \

Add the NA values and sort by label

| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \

I've made a step-by-step explanation at this Gist.

Solution 6:[6]

You mentioned join in the comment on my other answer, and I'd forgotten about this utility:

#!/bin/sh
rm -f *sorted.csv

# Join two files, normally inner-join only, but
# -  `-a 1 -a 2`:    include "unpaired lines" from file 1 and file 2
# -  `-1 1 -2 1`:    the first column from each is the "join column"
# -  `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2

join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv 

# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv

# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv

# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv

And, here's sorted.csv

+--------+--------+--------+
| label  | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0    | NA     |
+--------+--------+--------+
| EFG    | NA     | 1.0    |
+--------+--------+--------+
| LMN Wv | NA     | 8      |
+--------+--------+--------+
| PQR SN | 6      | 6      |
+--------+--------+--------+
| XYZ    | 3.0    | 4.0    |
+--------+--------+--------+

Solution 7:[7]

As @Fravadona stated correctly in his comment, for CSV files that can contain the delimiter, a newline or double quotes inside a field a proper CSV parser is needed.

Actually, only two functions are needed: One for unquoting CSV fields to normal AWK fields and one for quoting the AWK fields to write the data back to CSV fields.

I have written a variant of my previous answer (https://stackoverflow.com/a/71056926/18135892) that uses Ed Morton's CSV parser (https://stackoverflow.com/a/45420607/18135892 with the gsub variant which works with any AWK version) to give an example of proper CSV parsing:

This solution prints the wished result correctly sorted with any AWK. Please note that the sorting algorithm is taken from the mawk manual.

# SO71053039_2.awk

# unquote CSV:
# Ed Morton's CSV parser: https://stackoverflow.com/a/45420607/18135892
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
  if ((field ~ sep) || (field ~ /["\r\n]/)) {
    gsub(/"/, "\"\"", field)
    field = "\"" field "\""
  }
  return field
}

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++)
  {
    hold = A_SWAP[j = i]
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","

  # read file 1
  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    if (! buildRec())
      continue

    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    if (! buildRec())
      continue

    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "Part-A" OFS "Part-B"

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
    print result_string
  }
}

Call:

awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4
Solution 5
Solution 6
Solution 7