awk - understanding how FS works
I know that the default is FS=" ", so why am I seeing different results from the following awk commands? Please help me understand.
>echo " ABC DEF XYZ \n abc def,ghi xyz \n" | awk '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 3 1:ABC line: ABC DEF XYZ
nf: 3 1:abc line: abc def,ghi xyz
nf: 0 1: line:
>echo " ABC DEF XYZ \n abc def,ghi xyz \n" | awk -F" " '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 3 1:ABC line: ABC DEF XYZ
nf: 3 1:abc line: abc def,ghi xyz
nf: 0 1: line:
>echo " ABC DEF XYZ \n abc def,ghi xyz \n" | awk -F"[ ]" '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 10 1: line: ABC DEF XYZ
nf: 17 1: line: abc def,ghi xyz
nf: 0 1: line:
>echo " ABC DEF XYZ \n abc def,ghi xyz \n" | awk -F"[ ]*" '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 5 1: line: ABC DEF XYZ
nf: 5 1: line: abc def,ghi xyz
nf: 0 1: line:
I want to understand why there are no empty tokens in the 1st and 2nd examples, but there are in the 3rd and 4th.
Update: To explain my question further, awk seems to behave inconsistently between the default FS and a custom FS. See the examples below.
>printf "ab cd\nef gh\n" | awk -F" " '{ printf("nf: %d\t", NF); for (i=1;i<=NF;i++) printf("%02d:%s\t", i, $i); print ""}'
nf: 2 01:ab 02:cd
nf: 2 01:ef 02:gh
>printf "ab::cd\nef:gh\n" | awk -F":" '{ printf("nf: %d\t", NF); for (i=1;i<=NF;i++) printf("%02d:%s\t", i, $i); print ""}'
nf: 3 01:ab 02: 03:cd
nf: 2 01:ef 02:gh
Solution 1:[1]
By default, awk uses a single space as FS. This is a special case: fields are separated by runs of whitespace (spaces, tabs, and newlines), and leading and trailing whitespace is ignored. Two or more spaces therefore act as a single separator instead of producing empty fields. Any other single character is taken literally, and every occurrence of it is a separator. So with ':' as FS, ":::my" is split into four fields (empty, empty, empty, "my"). See: GNU Awk User's Guide - 4.5.1 Whitespace Normally Separates Fields.
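A minimal sketch of the difference described above, using made-up input strings (they are not taken from the question):

```shell
# Default FS: a run of blanks is one separator, and leading/trailing blanks are ignored.
printf '  a   b  \n' | awk '{ print NF }'     # prints 2

# Literal single-character FS: every ':' is a separator, so empty fields appear.
printf ':::my\n' | awk -F: '{ print NF }'     # prints 4
```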
When FS is a regular expression, every match of it is a field separator, even when the match is a single space, so adjacent separators produce empty fields. See GNU Awk User's Guide - 4.5.2 Using Regular Expressions to Separate Fields.
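To treat every character as a separate field, set FS to the empty string (null), for example with FS = "" in a BEGIN block or with -v FS= on the command line. Note that -F"" does not work from a shell: the empty quotes disappear, so awk sees a bare -F and takes the next argument as the separator. Splitting on an empty FS is an extension supported by gawk and several other awks; POSIX leaves the behavior unspecified.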
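As a hedged illustration, setting FS inside a BEGIN block sidesteps shell quoting entirely. This sketch assumes an awk that supports an empty FS (gawk and mawk do; POSIX does not require it), and the input string is invented:

```shell
# With an empty FS, each character of the record becomes its own field.
printf 'abc\n' | awk 'BEGIN { FS = "" } { print NF, $2 }'
```

With gawk this should print `3 b`.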
In your example with -F"[ ]", FS is a regex and not the default case, so each space is a separate field separator. It is simply a regex whose only character happens to be a space.
With the * repetition (zero or more occurrences), FS becomes ambiguous: it can match nothing (the null string) or a whole run of consecutive spaces. In your output each run of spaces acts as one separator, but the leading and trailing runs still count as separators, which is why $1 is empty and NF is 5. I do not recommend messing with spaces and FS in this manner.
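A small sketch of the single-space regex case, with an invented input that has one leading space, two spaces in the middle, and one trailing space:

```shell
# FS is a regex here, so each of the four spaces is its own separator:
# the fields are "", "a", "", "b", "".
printf ' a  b \n' | awk -F'[ ]' '{ print NF ": [" $1 "] [" $2 "]" }'   # prints 5: [] [a]
```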
awk understands Extended Regular Expression (ERE) syntax, so you can use the '+' repetition specifier for one-or-more occurrences of the character.
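For comparison, here is a sketch of '+' against the default FS on the same invented input. The regex collapses each run of spaces into one separator, but it does not strip leading or trailing runs the way the default does:

```shell
# '[ ]+' treats each run of spaces as one separator, but the leading and
# trailing runs still produce an empty first and last field.
printf ' a  b \n' | awk -F'[ ]+' '{ print NF ": [" $1 "] [" $2 "]" }'   # prints 4: [] [a]

# Default FS on the same input: leading and trailing blanks are ignored.
printf ' a  b \n' | awk '{ print NF ": [" $1 "] [" $2 "]" }'            # prints 2: [a] [b]
```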
Keep the GNU Awk User's Guide handy. It is a good reference for gawk as well as the other flavors of awk. If something is unique to gawk, the guide marks it with a '#'. It usually explains (sometimes in a footnote) how the gawk behavior differs from POSIX awk, mawk, etc.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
