'AWK: How to get count of column couples?

I have a CSV file made of many couples of columns, each couple has code_### and name_###.

code_boat|name_boat|year|code_color|name_color|code_size|name_size
1|jeanneau|2000|#00f|blue|5|small
2|bavaria|2005|#00f|blue|10|big
1|jeanneau|2010|#f00|red|10|big
2|bavaria|2008|#000|white|5|small
3|fountaine-pajot|2005|#f00|red|5|small
1|jeanneau|2012|#000|white|5|small
code_boat │       name_boat │ year │ code_color │ name_color │ code_size │ name_size
──────────┼─────────────────┼──────┼────────────┼────────────┼───────────┼───────────
        1 │        jeanneau │ 2000 │       #00f │       blue │         5 │     small
        2 │         bavaria │ 2005 │       #00f │       blue │        10 │       big
        1 │        jeanneau │ 2010 │       #f00 │        red │        10 │       big
        2 │         bavaria │ 2008 │       #000 │      white │         5 │     small
        3 │ fountaine-pajot │ 2005 │       #f00 │        red │         5 │     small
        1 │        jeanneau │ 2012 │       #000 │      white │         5 │     small

I need to count how many times these couples are used, and keep the couple index:

couple_index │  code │            name │ count
─────────────┼───────┼─────────────────┼───────
           0 │     1 │        jeanneau │     3
           0 │     2 │         bavaria │     2
           0 │     3 │ fountaine-pajot │     1
           2 │  #000 │           white │     2
           2 │  #f00 │             red │     2
           2 │  #00f │            blue │     2
           4 │     5 │           small │     4
           4 │    10 │             big │     2
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#f00|red|2
2|#00f|blue|2
4|5|small|4
4|10|big|2

I know how to do it couple by couple with awk, but I'd like to do all at once, because the csv files are pretty big.

awk -F'|' '{c[$39" "$40]++} END{for (i in c) {if (c[i]>0) print i,c[i]}}' myfile.csv
awk


Solution 1:[1]

Assumptions/Understandings:

  • from OP's comments the actual data file is pipe-delimited with no leading/trailing spaces in fields (see modified input file - below)
  • output is to be generated in the same format (ie, pipe-delimited with no leading/trailing spaces in fields)

Sample input file:

$ cat myfile.csv
boat_CODE|boat_NAME|color_CODE|color_NAME|size_CODE|size_NAME
1|jeanneau|#00f|blue|5|small
2|bavaria|#00f|blue|10|big
1|jeanneau|#f00|red|10|big
2|bavaria|#000|white|5|small
3|fountaine-pajot|#f00|red|5|small
1|jeanneau|#000|white|5|small

NOTE: will need to come back and modify the code depending on what, if any, header record(s) actually exists in the file


One GNU awk idea making use of arrays of arrays (aka multi-dimensional arrays):

awk '
BEGIN { FS=OFS="|" }
NR>1  { for (i=1;i<=NF;i+=2)
            counts[(i-1)][$i][$(i+1)]++
      }
END   { print "couple_index","CODE","NAME","count"
        for (ndx=0;ndx<NF;ndx+=2)
            for (code in counts[ndx])
                for (name in counts[ndx][code])
                    print ndx,code,name,counts[ndx][code][name]
      }
' myfile.csv

This generates:

couple_index|CODE|NAME|count
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#00f|blue|2
2|#f00|red|2
4|5|small|4
4|10|big|2

OP has mentioned in comments they are running on macOS; assuming GNU awk is not available we can use a multi-value hash as the index for a single-dimensional array, eg:

awk '
BEGIN { FS=OFS="|" }
NR>1  { for (i=1;i<=NF;i+=2)
            counts[(i-1) FS $i FS $(i+1)]++
      }
END   { print "couple_index","CODE","NAME","count"
        for (i in counts)
            print i,counts[i]
      }
' myfile.csv

This generates:

couple_index|CODE|NAME|count
0|3|fountaine-pajot|1
2|#f00|red|2
4|5|small|4
0|1|jeanneau|3
4|10|big|2
2|#000|white|2
0|2|bavaria|2
2|#00f|blue|2

Sorting:

If the result needs to be sorted this will probably be easier in bash via the sort command:

  • remove the print "couple_index","CODE","NAME","count" from both awk solutions; instead move this up to the command line
  • pipe the awk results to sort

One idea:

echo "couple_index|CODE|NAME|count"                      > result.csv
awk '.....' myfile.csv | sort -t'|' -k1,1n -k2,2V -k3,3 >> result.csv

Both awk solutions generate:

$ cat result.csv
couple_index|CODE|NAME|count
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#00f|blue|2
2|#f00|red|2
4|5|small|4
4|10|big|2

Solution 2:[2]

If the couples are always one next to another, you can easily do it with a loop:

awk 'BEGIN{FS=OFS="|"}
     (FNR>2){for(i=1;i<=NF;i+=2) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
     END{for (k in c) print d[k],k,c[k] }' file

This does not take care of issues that could be the result misalignment or typos.

If the table has intermediate columns that are of no interest to the problem at hand, it is paramount to process the header first:

awk 'BEGIN{FS=OFS="|"}
     (FNR==1) { for(i=1;i<=NF;++i) if ($i ~ /_CODE *$/) { idx[i] } }
     (FNR>2)  { for(i in idx) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
     END{for (k in c) print d[k],k,c[k] }' file

Solution 3:[3]

I would implement counting of pairs as follows, let file.txt content be

boat_CODE |         boat_NAME | color_CODE | color_NAME | size_CODE | size_NAME
        1 |          jeanneau |       #00f |       blue |         5 |     small
        2 |           bavaria |       #00f |       blue |        10 |       big
        1 |          jeanneau |       #f00 |        red |        10 |       big
        2 |           bavaria |       #000 |      white |         5 |     small
        3 |   fountaine-pajot |       #f00 |        red |         5 |     small
        1 |          jeanneau |       #000 |      white |         5 |     small

then

awk 'BEGIN{FPAT="[^[:space:]|]+"}NR>1{for(i=1;i<=NF;i+=2){c[$i" "$(i+1)]+=1}}END{for(i in c){printf "%-25s%s\n",i,c[i]}}' file.txt

output

10 big                   2
#00f blue                2
#f00 red                 2
2 bavaria                2
5 small                  4
3 fountaine-pajot        1
#000 white               2
1 jeanneau               3

Explanation: I inform GNU AWK that field consist of one or more (+) characters which are not (^) whitespace ([:space:]) and |. Then for each row after first (NR>1) I iterate using for loop with step being 2, and increase value in array c under key being concatenation of this column value and space and next column value. After all lines are processing I printf key-value pairs from array c, with key being leftjusted in string of length 25 (feel free to change this to fit your needs). Disclaimer: This solution assume there is never whitespace inside value.

(tested in gawk 4.2.1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Daweo