'How to generate string with a char repeated for n times?

I can use the following code to generate a string.

$ awk -e 'BEGIN { for(i=1;i<=10;++i) s = s "x"; print s }'
xxxxxxxxxx

But its complexity is super-linear wrt the string length.

$ time awk -e 'BEGIN { for(i=1;i<=10000000;++i) s = s "x" }'

real    0m0.868s
user    0m0.857s
sys 0m0.008s
$ time awk -e 'BEGIN { for(i=1;i<=100000000;++i) s = s "x" }'

real    0m9.886s
user    0m9.801s
sys 0m0.065s
$ time awk -e 'BEGIN { for(i=1;i<=1000000000;++i) s = s "x" }'

real    1m46.074s
user    1m45.171s
sys 0m0.760s

Is there a better way to repeat a char n times and assign the result to a variable?

awk


Solution 1:[1]

Use sprintf to create a string of spaces, then use gsub to replace each space with an x:

$  time awk  'BEGIN {s = sprintf("%*s", 100000000, ""); gsub(".", "x", s)}'

real    0m1.744s
user    0m1.645s
sys 0m0.098s

This can be wrapped in an awk function:

function mkStr(c, n,       s) { 
  s = sprintf("%*s", n, "");
  gsub(".", c, s);
  return s;
}

(s is a parameter simply to scope the variable to the function; it needs no argument, and indeed, any argument passed will be ignored.)


Update: there appears to be a significant difference in performance depending on which version of awk you are using. The above test used 20070501, the BSD(?) awk that ships with macOS. gawk-5.1.0 takes significantly longer.

I don't know what accounts for the difference; perhaps there is a solution that is fast in both versions.

Update 2: Ed Morton (in the comments) has verified that gsub is responsible for the slow running time in gawk, could not find a workaround, and has filed a bug report with the maintainers.

Solution 2:[2]

function loop(n){
        for(i=1;i<=n;i++) s = s "x";
        return s;
}

function repl(n){
        s = sprintf("%*s", n, "");
        gsub(/ /, "x", s);
    return s;
}

function recStack(n,    h){
        switch( n ){
                case 0:
                        return "";
                default:
                        if( n % 2 == 1 ){
                                h = recStack( int((n-1)/2) )
                                return h h "x";
                    } else {
                                h = recStack( int(n/2) )
                                return h h;
                        }
        }
}

function recStackIf(n,    h){
        if( n == 0 ) return "";
        if( n % 2 == 1 ){
                h = recStackIf( int((n-1)/2) ); # create first half
                return h h "x";                 # duplicate + one "x"
        } else {
                h = recStackIf( int(n/2) );  # create first half
                return h h;                  # duplicate
        }
}

function recArray(n,    h, n2){
    if( n in a ) return a[n];
        switch( n ){
                case 0:
                        return a[0] = "";
                default:
                        if( n % 2 == 1 ){
                                n2 = int((n-1)/2);
                                h = recArray( n2 );
                                return a[n] = h h "x";
                        } else {
                                n2 = int(n/2);
                                h = recArray( n2 );
                                return a[n] = h h;
                        }
        }
}

function recArrayIf(n,    h, n2){
    if( n in a ) return a[n];
        if( n == 0 ) return a[0] = "";
        if( n % 2 == 1 ){
                n2 = int((n-1)/2);
                h = recArrayIf( n2 );
                return a[n] = h h "x";
        } else {
                n2 = int(n/2);
                h = recArrayIf( n2 );
                return a[n] = h h;
        }
}


function concat(n){
    exponent = log(n)/log(2)
    m = int(exponent)                   # floor
    m += (m < exponent ? 1 : 0)         # ceiling
    s = "x"
    for (i=1; i<=m; i++) {
        s = s s
    }
    s = substr(s,1,n)
    return s
}

BEGIN {
        switch( F ){
                case "recStack":
                        xxx = recStack( 100000000 );
                        break;
                case "recStackIf":
                        xxx = recStackIf( 100000000 );
                        break;
                case "recArray":
                        xxx = recArray( 100000000 );
                        break;
                case "recArrayIf":
                        xxx = recArrayIf( 100000000 );
                        break;
                case "loop":
                        xxx = loop( 100000000 );
                        break;
                case "repl":
                        xxx = repl( 100000000 );
                        break;
                case "concat":
                        xxx = concat( 100000000 );
                        break;
        }

print length(xxx);
##      xloop = loop(100000000 );
##      if( xxx == xloop ) print "Match";
}

Times are:

# loop :            real 0m5,405s, user 0m5,243s, sys 0m0,160s
# repl :            real 0m7,670s, user 0m7,506s, sys 0m0,164s
# recArray:         real 0m0,302s, user 0m0,141s, sys   0m0,161s
# recArrayIf:       real 0m0,309s, user 0m0,168s, sys   0m0,141s
# recStack:         real 0m0,316s, user 0m0,124s, sys   0m0,192s
# recStackIf:       real 0m0,305s, user 0m0,152s, sys   0m0,152s
# concat:           real 0m0,664s, user 0m0,300s, sys   0m0,364s

There's not much difference between the 5 versions of binary decomposition: a bunch of heap memory is used in all cases. Having the global array hang around until the end of all times isn't good and therefore I'd prefer either stack version.

wlaun@terra:/tmp$ gawk -V GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0) wlaun@terra:/tmp$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit

Note that the above timing has been done with a statement printing the resulting string's length, which adds about 0.2 s to each version. Also, /usr/bin/time isn't reliable. Here are the relative "real" values from time without the print length(xxx):

# loop:       0m5,248s
# repl:       0m7,705s
# recStack:   0m0,103s
# recStackIf: 0m0,097s
# recArray:   0m0,103s
# recArrayIf: 0m0,099s
# concat:     0m0,455s

Added on request of Ed Morton:

Why is any of the recursive functions faster than any of the linear functions that iterate over O(N) elements? (The "O(N)" is the "big oh" symbol and is used to indicate a value that is N, possibly multiplied and/or incremented by some constant. A circle's circumference is O(r), it's area is O(r²).)

The answer is simple: By dividing N by 2, we get two strings of length O(N/2). This provides the possibility of re-using the result for the first half (no matter how we obtain it) for the second half! Thus, we'll get the second half of the result for free (except for the string copy operation, which is basically a machine instruction on most popular architectures). There is no reason why this great idea should not be applied for creating the first half as well, which means that we get three quarters of the result for free (except - see above). A little overhead results from the single "x" we have to throw in to cater for odd subdivisions of N.

There are many other recursive algorithms along the idea of halving and dealing with both sections individually, the most famous of them are Binary Search and Quicksort.

Solution 3:[3]

Here's a fast solution using any POSIX awk (tested on an average 8G RAM laptop running cygwin with GNU awk 5.1.0):

time awk -v n=100000000 'BEGIN {
    exponent = log(n)/log(2)
    m = int(exponent)                   # floor
    m += (m < exponent ? 1 : 0)         # ceiling
    s = "x"
    for (i=1; i<=m; i++) {
        s = s s
    }
    s = substr(s,1,n)
}'

real    0m0.665s
user    0m0.328s
sys     0m0.343s

The above just appends a copy of the string to itself until it's at least as big as the target string length and finally truncates it to exactly the desired length. The only somewhat tricky part is calculating how many times to append s to itself and that's just a case of solving 2^m = n for m given you already know n (100000000 in this case), see https://www.calculatorsoup.com/calculators/algebra/exponentsolve.php.

Obviously you could make the loop while ( length(s) < n ) instead of calculating m and that'd make the script briefer but would slow it down a bit (but it'd still be pretty fast):

$ time awk -v n=100000000 'BEGIN{s="x"; while (length(s) < n) s=s s; s=substr(s,1,n)}'

real    0m1.072s
user    0m0.562s
sys     0m0.483s

@JamesBrown had a better idea than calling length() on each iteration that also avoids having to calculate m while being about as fast:

$ time awk -v n=100000000 'BEGIN{s="x"; for (i=1; i<n; i*=2) s=s s; s=substr(s,1,n)}'

real    0m0.710s
user    0m0.281s
sys     0m0.390s

Originally I had the following thinking doubling strings of "x"s would be a faster approach than doubling individual "x"s on each iteration of the loop but it was a bit slower:

$ time awk -v n=100000000 '
    BEGIN {
        d = 1000000
        m = 7
        s = sprintf("%*s",d,"")
        gsub(/ /,"x",s)
        for (i=1; i<=m; i++) {
            s = s s
        }
        s = substr(s,1,n)
    }
'

real    0m1.030s
user    0m0.625s
sys     0m0.375s

The idea in that 2nd script was to generate a string long enough to be meaningful but short enough that gsub() can convert all blanks to "x"s quickly (which I've found to be 1000000 chars by trial and error), then just repeat the above process with fewer iterations.

I opened a bug report with the gawk providers about gsub() being slow for this case, see https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00030.html if you're interested, but it looks like the outcome of that will just be that gsub() is a heavy tool, it's not surprising that it takes a long time for 100 million regexp matches, but you can speed it up considerable by setting LC_ALL=C before it runs so gawk doesn't have to worry about multibyte locales.

Solution 4:[4]

If you are okay with perl:

$ perl -e 'print "x" x 10'
xxxxxxxxxx

Note that there's no newline for the above code, use perl -le if you want one.

Solution 5:[5]

Is there a better way to repeat a char n times and assign the result to a variable?

I propose following solution limited to case where n is equal to 2^x and x is integer equal or greater 1. For example to get x repeated 32 i.e. 2^5 times

awk 'BEGIN{s="x";for(i=1;i<=5;i+=1)s=s s;print s}' emptyfile.txt

output

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

(tested in gawk 4.2.1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4 Sundeep
Solution 5 Daweo