Calculate correct number of characters using Awk gsub
I'm trying to count how often a specific character pattern occurs in a sequence (FASTA format). In my case I want to count how often the context "CC" is present in a sequence. The script as a whole works fine, but I ran into one small problem.
To count the "CC" context I use the following part of my script:
CC=gsub(/CC/,"CC");
print CC
I experience a problem when I have a fasta sequence like this:
>name_sequence_1
CCCCC
In this case, the number of CC should be 4 (positions 1-2, 2-3, 3-4, and 4-5), but gsub gives me the number 2, because after substituting the first CC, it jumps to the 3rd C and so on.
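A minimal reproduction of that behavior may help. It wraps the same gsub call in a small shell function; the name `count_nonoverlapping` is hypothetical and only there for illustration. Since gsub resumes scanning after the end of each match, it reports non-overlapping matches only:

```shell
# Hypothetical helper: prints how many times gsub() matched /CC/ in the
# string given as $1. gsub() resumes scanning after each match, so
# overlapping occurrences are not counted.
count_nonoverlapping() {
    printf '%s\n' "$1" | awk '{ n = gsub(/CC/, "CC"); print n }'
}

count_nonoverlapping 'CCCCC'    # prints 2, not the desired 4
```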
Is there a way to fix this using gsub, or is there other code I can use to count such contexts?
Thanks!
Solution 1:[1]
This MAY be what you're trying to do, assuming the expected output you stated is wrong:
$ echo 'CCCCC' |
awk '{
    str = $0
    cnt = 0
    while ( sub(/CC/,"C",str) ) {
        cnt++
    }
    print cnt
}'
4
$ echo 'CCCACCCCC' |
awk '{
    str = $0
    cnt = 0
    while ( sub(/CC/,"C",str) ) {
        cnt++
    }
    print cnt
}'
6
but here's a more robust general solution that'll work even when the target string isn't a repetition of 1 character and/or it contains regexp or backreference metachars:
$ echo 'CCCCC' |
awk '{
    cnt = 0
    for ( i=1; i<length($0); i++ ) {
        cnt += ( substr($0,i,2) == "CC" )
    }
    print cnt
}'
4
$ echo 'CCCACCCCC' |
awk '{
    cnt = 0
    for ( i=1; i<length($0); i++ ) {
        cnt += ( substr($0,i,2) == "CC" )
    }
    print cnt
}'
6
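Since the question mentions FASTA input, the substr approach could be extended roughly as follows. This is only a sketch under stated assumptions: the function names `count_overlapping`, `count_in`, and `report` are made up for illustration, header lines are assumed to start with `>`, wrapped sequence lines are joined so that matches spanning a line break are counted, and sequence lines that appear before any header are ignored. The target is compared literally with substr, so its length is no longer hard-coded to 2.

```shell
# count_overlapping TARGET [FILE...]
# Hypothetical helper: for each FASTA record, prints the record name, a
# tab, and the number of (possibly overlapping) occurrences of TARGET.
# Reads standard input when no FILE is given.
count_overlapping() {
    target=$1
    shift
    awk -v target="$target" '
        # Count occurrences of t in s, allowing overlaps, by comparing
        # the substring starting at every position with t.
        function count_in(s, t,    i, n, len) {
            len = length(t)
            if (len == 0) return 0
            for (i = 1; i + len - 1 <= length(s); i++)
                n += (substr(s, i, len) == t)
            return n + 0
        }
        # Print the result for the record collected so far, if any.
        function report() {
            if (name != "") print name "\t" count_in(seq, target)
        }
        /^>/ { report(); name = substr($0, 2); seq = ""; next }
             { seq = seq $0 }   # join wrapped sequence lines
        END  { report() }
    ' "$@"
}

printf '>name_sequence_1\nCCCCC\n' | count_overlapping CC
```

Note that awk interprets backslash escapes in values assigned with `-v`, so a target containing backslashes would need different handling; for nucleotide strings such as "CC" this does not matter.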
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
