Calculate correct number of characters using Awk gsub

I'm trying to count how often a specific character pattern occurs in a sequence (FASTA format). In my case, I want to count how often the context "CC" occurs in a sequence. The script as a whole works fine, but I've run into one small problem.

To count the "CC" context I use the following part of my script:

CC=gsub(/CC/,"CC");
print CC

The problem appears when I have a FASTA sequence like this:

>name_sequence_1
CCCCC

In this case, the number of CC should be 4 (positions 1-2, 2-3, 3-4, and 4-5), but gsub returns 2, because after substituting the first CC it continues from the 3rd C, so matches never overlap.
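
A quick way to reproduce the behavior described above is the small sketch below. The function name count_nonoverlapping is made up for illustration and is not part of the original script; it only wraps the same gsub call so the count can be checked from the shell.

```shell
# Hypothetical helper (name is an assumption): count matches of CC the way
# gsub does, i.e. left to right and without overlaps.
count_nonoverlapping() {
    printf '%s\n' "$1" |
    awk '{
        s = $0
        # gsub returns the number of substitutions it made
        print gsub(/CC/, "CC", s)
    }'
}

count_nonoverlapping 'CCCCC'    # prints 2, not the 4 overlapping pairs
```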

Is there any way to fix that using gsub, or is there other code I can use to count such contexts?

Thanks!



Solution 1:[1]

This MAY be what you're trying to do, assuming the expected output you stated is wrong:

$ echo 'CCCCC' |
    awk '{
        str = $0
        cnt = 0
        while ( sub(/CC/,"C",str) ) {
            cnt++
        }
        print cnt
    }'
4

$ echo 'CCCACCCCC' |
    awk '{
        str = $0
        cnt = 0
        while ( sub(/CC/,"C",str) ) {
            cnt++
        }
        print cnt
    }'
6

but here's a more robust general solution that'll work even when the target string isn't a repetition of 1 character and/or it contains regexp or backreference metachars:

$ echo 'CCCCC' |
    awk '{
        cnt = 0
        for ( i=1; i<length($0); i++ ) {
            cnt += ( substr($0,i,2) == "CC" )
        }
        print cnt
    }'
4

$ echo 'CCCACCCCC' |
    awk '{
        cnt = 0
        for ( i=1; i<length($0); i++ ) {
            cnt += ( substr($0,i,2) == "CC" )
        }
        print cnt
    }'
6
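
The substr() loop above hard-codes the target "CC" and its length 2. As a rough sketch of how it might be parameterized, the version below passes the target in with -v and skips FASTA header lines. The function name count_overlapping and the variable names tgt and len are assumptions for illustration only. It also assumes the target is non-empty, contains no backslashes (awk processes escape sequences in -v assignments), and that each sequence sits on a single line, since matches spanning a line break would not be counted.

```shell
# Hypothetical helper (name is an assumption): for each sequence line on
# stdin, print the number of overlapping occurrences of the literal string $1.
count_overlapping() {
    awk -v tgt="$1" '
        /^>/ { next }    # skip FASTA header lines
        {
            cnt = 0
            len = length(tgt)
            # slide a window of length(tgt) across the line, one position at a time
            for ( i = 1; i <= length($0) - len + 1; i++ ) {
                cnt += ( substr($0, i, len) == tgt )
            }
            print cnt
        }
    '
}

printf '>name_sequence_1\nCCCCC\n' | count_overlapping CC
```

Because the comparison is a plain string comparison rather than a regexp match, regexp metacharacters in the target are taken literally.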

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1