'AWK: How to get count of column couples?
I have a CSV file made of many couples of columns, each couple has code_### and name_###.
code_boat|name_boat|year|code_color|name_color|code_size|name_size
1|jeanneau|2000|#00f|blue|5|small
2|bavaria|2005|#00f|blue|10|big
1|jeanneau|2010|#f00|red|10|big
2|bavaria|2008|#000|white|5|small
3|fountaine-pajot|2005|#f00|red|5|small
1|jeanneau|2012|#000|white|5|small
code_boat │ name_boat │ year │ code_color │ name_color │ code_size │ name_size
──────────┼─────────────────┼──────┼────────────┼────────────┼───────────┼───────────
1 │ jeanneau │ 2000 │ #00f │ blue │ 5 │ small
2 │ bavaria │ 2005 │ #00f │ blue │ 10 │ big
1 │ jeanneau │ 2010 │ #f00 │ red │ 10 │ big
2 │ bavaria │ 2008 │ #000 │ white │ 5 │ small
3 │ fountaine-pajot │ 2005 │ #f00 │ red │ 5 │ small
1 │ jeanneau │ 2012 │ #000 │ white │ 5 │ small
I need to count how many times these couples are used, and keep the couple index:
couple_index │ code │ name │ count
─────────────┼───────┼─────────────────┼───────
0 │ 1 │ jeanneau │ 3
0 │ 2 │ bavaria │ 2
0 │ 3 │ fountaine-pajot │ 1
2 │ #000 │ white │ 2
2 │ #f00 │ red │ 2
2 │ #00f │ blue │ 2
4 │ 5 │ small │ 4
4 │ 10 │ big │ 2
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#f00|red|2
2|#00f|blue|2
4|5|small|4
4|10|big|2
I know how to do it couple by couple with awk, but I'd like to do all at once, because the csv files are pretty big.
awk -F'|' '{c[$39" "$40]++} END{for (i in c) {if (c[i]>0) print i,c[i]}}' myfile.csv
Solution 1:[1]
Assumptions/Understandings:
- from OP's comments the actual data file is pipe-delimited with no leading/trailing spaces in fields (see modified input file - below)
- output is to be generated in the same format (ie, pipe-delimited with no leading/trailing spaces in fields)
Sample input file:
$ cat myfile.csv
boat_CODE|boat_NAME|color_CODE|color_NAME|size_CODE|size_NAME
1|jeanneau|#00f|blue|5|small
2|bavaria|#00f|blue|10|big
1|jeanneau|#f00|red|10|big
2|bavaria|#000|white|5|small
3|fountaine-pajot|#f00|red|5|small
1|jeanneau|#000|white|5|small
NOTE: will need to come back and modify the code depending on what, if any, header record(s) actually exists in the file
One GNU awk idea making use of arrays of arrays (aka multi-dimensional arrays):
awk '
BEGIN { FS=OFS="|" }
NR>1 { for (i=1;i<=NF;i+=2)
counts[(i-1)][$i][$(i+1)]++
}
END { print "couple_index","CODE","NAME","count"
for (ndx=0;ndx<NF;ndx+=2)
for (code in counts[ndx])
for (name in counts[ndx][code])
print ndx,code,name,counts[ndx][code][name]
}
' myfile.csv
This generates:
couple_index|CODE|NAME|count
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#00f|blue|2
2|#f00|red|2
4|5|small|4
4|10|big|2
OP has mentioned in comments they are running on macOS; assuming GNU awk is not available we can use a multi-value hash as the index for a single-dimensional array, eg:
awk '
BEGIN { FS=OFS="|" }
NR>1 { for (i=1;i<=NF;i+=2)
counts[(i-1) FS $i FS $(i+1)]++
}
END { print "couple_index","CODE","NAME","count"
for (i in counts)
print i,counts[i]
}
' myfile.csv
This generates:
couple_index|CODE|NAME|count
0|3|fountaine-pajot|1
2|#f00|red|2
4|5|small|4
0|1|jeanneau|3
4|10|big|2
2|#000|white|2
0|2|bavaria|2
2|#00f|blue|2
Sorting:
If the result needs to be sorted this will probably be easier in bash via the sort command:
- remove the
print "couple_index","CODE","NAME","count"from bothawksolutions; instead move this up to the command line - pipe the
awkresults tosort
One idea:
echo "couple_index|CODE|NAME|count" > result.csv
awk '.....' myfile.csv | sort -t'|' -k1,1n -k2,2V -k3,3 >> result.csv
Both awk solutions generate:
$ cat result.csv
couple_index|CODE|NAME|count
0|1|jeanneau|3
0|2|bavaria|2
0|3|fountaine-pajot|1
2|#000|white|2
2|#00f|blue|2
2|#f00|red|2
4|5|small|4
4|10|big|2
Solution 2:[2]
If the couples are always one next to another, you can easily do it with a loop:
awk 'BEGIN{FS=OFS="|"}
(FNR>2){for(i=1;i<=NF;i+=2) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' file
This does not take care of issues that could be the result misalignment or typos.
If the table has intermediate columns that are of no interest to the problem at hand, it is paramount to process the header first:
awk 'BEGIN{FS=OFS="|"}
(FNR==1) { for(i=1;i<=NF;++i) if ($i ~ /_CODE *$/) { idx[i] } }
(FNR>2) { for(i in idx) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' file
Solution 3:[3]
I would implement counting of pairs as follows, let file.txt content be
boat_CODE | boat_NAME | color_CODE | color_NAME | size_CODE | size_NAME
1 | jeanneau | #00f | blue | 5 | small
2 | bavaria | #00f | blue | 10 | big
1 | jeanneau | #f00 | red | 10 | big
2 | bavaria | #000 | white | 5 | small
3 | fountaine-pajot | #f00 | red | 5 | small
1 | jeanneau | #000 | white | 5 | small
then
awk 'BEGIN{FPAT="[^[:space:]|]+"}NR>1{for(i=1;i<=NF;i+=2){c[$i" "$(i+1)]+=1}}END{for(i in c){printf "%-25s%s\n",i,c[i]}}' file.txt
output
10 big 2
#00f blue 2
#f00 red 2
2 bavaria 2
5 small 4
3 fountaine-pajot 1
#000 white 2
1 jeanneau 3
Explanation: I inform GNU AWK that field consist of one or more (+) characters which are not (^) whitespace ([:space:]) and |. Then for each row after first (NR>1) I iterate using for loop with step being 2, and increase value in array c under key being concatenation of this column value and space and next column value. After all lines are processing I printf key-value pairs from array c, with key being leftjusted in string of length 25 (feel free to change this to fit your needs). Disclaimer: This solution assume there is never whitespace inside value.
(tested in gawk 4.2.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | Daweo |
