'Go CountBits exercise: n strings slices performing a lot better than log2n math package calls

I'm trying to solve an exercise where you have to count the number of '1' bits of any number. I came up with 4 main ideas:

  1. Make a recursive function countBitsString, where at each iteration it gets the string corresponding to the base2 number removing the first character. If this is 1, increment a counter and continue.
  2. The same as above but with a pointer to that string instead of the value (TODO)
  3. Use logarithm logic to understand how many power of 2 are inside this number, for each of them we have a '1' bit up so increment the counter. Increment one more if the number is odd (countBitsSmart)
  4. Work with binary logic to check the bit and then shift left (TODO)

Now, here's my current code and later the results

package main

import (
    "fmt"
    "strconv"
    "math/rand"
    "math"
    "time"
)

/*
 * implement a function that counts the number of set of bits in binary representation
 * Complete the 'countBits' function below.
 * The function is expected to return an int32.
 * The function accepts unit32 num as parameter.
*/
func countBitsString(num uint32) int32 {

    num_64     := int64(num)
    num_2      := strconv.FormatInt(num_64, 2)
    _, counter := recursiveCount(num_2, 0)
    return counter
}

func recursiveCount(s string, counter int32) (string, int32) {
    if len(s) < 1 {
        return "", counter
        }
    
    if s[:1] == "1" {
        counter+=1
    }
    //fmt.Printf("Current counter: %d\n", counter)
    //fmt.Printf("Current char %s\n", s[:1])
    
    return recursiveCount(s[1:], counter)
}


func countBitsBinary(num uint32) int32 {
    fmt.Println("TODO")
    // convert uint23 in binary form
    // for binary_number > 0 
    //      if current bit == 1 => counter++
    //      shift left binary_number
    return 0
}

func countBitsSmart(num uint32) int32{
    var upBits          int32
    var highestPower    uint32 = math.MaxUint32

    if num % 2 >0  {
        upBits+=1
        // fmt.Printf("\tUP bits = %d\n", upBits)
    }
                                                                            
    for ; (num > 2 && highestPower > 1); upBits++ {                         
        highestPower = uint32(math.Log2(float64(num)))                      
        num          = num - uint32(math.Pow(2, float64(highestPower)))     
        // fmt.Printf("\tlog2 = %d\n", highestPower)                            
        // fmt.Printf("\tnum %d rest %d\n", num, num%2)
        // fmt.Printf("\tUP bits = %d\n", upBits)
    }
    
    return upBits
}

// Profiling with execution time and calling
func invoker(numInput []uint32, funcName string, countFunc func(num uint32) int32) {
    fmt.Println("\n=========================================================")
    start := time.Now()

    for _, numTemp := range numInput{
        // fmt.Printf("> Number:  %d (%s)\tUp bits: %d\n", numTemp, strconv.FormatInt(int64(numTemp), 2), countFunc( uint32(numTemp) ))
        countFunc( uint32(numTemp))
    }
    fmt.Printf("Time elapsed for %v func= %vs", funcName, time.Since(start))
}

func main() {
    const   testSize = 10000000
    var     numInput = make([]uint32, testSize)
    for i:=0; i<testSize; i++ {
        numInput[i] = uint32(rand.Intn(500))
    }
    
    fmt.Printf("\n Test size = %d\n", testSize)

    invoker(numInput, "countBitsString", countBitsString)
    
    invoker(numInput, "countBitsSmart", countBitsSmart)

    fmt.Println()
}

enter image description here

How can the second function perform 5 times worse than the previous one? Is it just for the calls to math package?

Thanks



Solution 1:[1]

Using floating point Log2/Pow calculations is a very expensive way to calculate bits. Converting to a binary string is much faster since it can be done with bit shifts/integer calculations (it's still expensive). The Wikipedia Hamming Weight page has some explanations of very fast popcount calculations via bitmasks.

You can see the relative performance via Go's builtin benchmark support. It can help understand how code performs:

// main_test.go
package main

import (
    "math/bits"
    "testing"
)   

// Simple benchmarks to get started.
// You could also try benchmarking different numbers to understand 
// performance differences. There is some inherit bias in benchmarking
// sequential numbers starting from 0. You could try a list of 256 preset
// random numbers. There are many options..

func BenchmarkCountBitsString(b *testing.B) {
    for i := 0; i < b.N; i++ {
        dummyInt32 = countBitsString(uint32(i))
    }
}

func BenchmarkCountBitsSmart(b *testing.B) {
    for i := 0; i < b.N; i++ {
        dummyInt32 = countBitsSmart(uint32(i))
    }
}

func BenchmarkOnesCount(b *testing.B) {
    for i := 0; i < b.N; i++ {
        dummyInt = bits.OnesCount32(uint32(i))
    }
}   

// Ensure some optimisations don't occur by assigning to a global.
var (
    dummyInt   int
    dummyInt32 int32
)   

Then you can easily understand the relative performance:

$ go test -bench .
goos: linux
goarch: amd64
pkg: stack/bench
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkCountBitsString-8      10388419            99.97 ns/op
BenchmarkCountBitsSmart-8        2036588           617.6 ns/op
BenchmarkOnesCount-8            1000000000           0.5540 ns/op
PASS
ok      stack/bench 3.625s

Inspecting performance with pprof requires some extra parameters:

  • Disable tests with -run XXX (XXX doesn't match any tests)
  • Limit to a single function benchmark, otherwise the faster function will be run more often until it consumes as much time as the slower function (~5s in the final run).
  • Use a fixed number of iterations for the benchmark time so the profiles are comparable between functions
  • Save CPU profile. This also causes the test binary to be kept as DIR.test (bench.test in this example).

Benchmark details for countBitsSmart show it spends ~55% time in math.Log2 and ~38% time in math.Pow:

$ go test -bench CountBitsSmart -run XXX -benchtime 10000000x -cpuprofile cpu.prof
goos: linux
goarch: amd64
pkg: stack/bench
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkCountBitsSmart-8       10000000           703.0 ns/op
PASS
ok      stack/bench 7.226s
$ go tool pprof -top -cum bench.test cpu.prof 
File: bench.test
Type: cpu
Time: Apr 17, 2022 at 11:19pm (AEST)
Duration: 7.22s, Total samples = 7.01s (97.10%)
Showing nodes accounting for 7s, 99.86% of 7.01s total
Dropped 1 node (cum <= 0.04s)
      flat  flat%   sum%        cum   cum%
     0.01s  0.14%  0.14%      7.01s   100%  stack/bench.BenchmarkCountBitsSmart
         0     0%  0.14%      7.01s   100%  testing.(*B).launch
         0     0%  0.14%      7.01s   100%  testing.(*B).runN
     0.43s  6.13%  6.28%         7s 99.86%  stack/bench.countBitsSmart
     0.07s     1%  7.28%      3.88s 55.35%  math.Log2 (inline)
     0.42s  5.99% 13.27%      3.81s 54.35%  math.log2
     0.02s  0.29% 13.55%      3.07s 43.79%  math.Log (inline)
     3.05s 43.51% 57.06%      3.05s 43.51%  math.archLog
     0.15s  2.14% 59.20%      2.69s 38.37%  math.Pow (inline)
     1.20s 17.12% 76.32%      2.54s 36.23%  math.pow
     0.07s     1% 77.32%      0.57s  8.13%  math.Frexp (inline)
[...]
$ go tool pprof -list bench.countBitsSmart bench.test cpu.prof 
Total: 7.01s
ROUTINE ======================== stack/bench.countBitsSmart in /home/.../stack/bench/main.go
     430ms         7s (flat, cum) 99.86% of Total
         .          .     54:   if num%2 > 0 {
         .          .     55:       upBits += 1
         .          .     56:       // fmt.Printf("\tUP bits = %d\n", upBits)
         .          .     57:   }
         .          .     58:
      40ms       40ms     59:   for ; num > 2 && highestPower > 1; upBits++ {
     250ms      4.13s     60:       highestPower = uint32(math.Log2(float64(num)))
      90ms      2.78s     61:       num = num - uint32(math.Pow(2, float64(highestPower)))
         .          .     62:       // fmt.Printf("\tlog2 = %d\n", highestPower)
         .          .     63:       // fmt.Printf("\tnum %d rest %d\n", num, num%2)
         .          .     64:       // fmt.Printf("\tUP bits = %d\n", upBits)
         .          .     65:   }
         .          .     66:
      50ms       50ms     67:   return upBits
         .          .     68:}
         .          .     69:
         .          .     70:// Profiling with execution time and calling
         .          .     71:func invoker(numInput []uint32, funcName string, countFunc func(num uint32) int32) {
         .          .     72:   fmt.Println("\n=========================================================")

The countBitsString benchmark shows about half the time is spent in recursiveCount and half in strconv.FormatInt. This indicates recursiveCount could be a good target for optimisation.

$ go test -bench CountBitsString -run XXX -benchtime 10000000x -cpuprofile cpu.profgoos: linux
goarch: amd64
pkg: stack/bench
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkCountBitsString-8      10000000            99.13 ns/op
PASS
ok      stack/bench 1.111s
$ go tool pprof -top -cum bench.test cpu.prof 
File: bench.test
Type: cpu
Time: Apr 17, 2022 at 11:15pm (AEST)
Duration: 1.10s, Total samples = 1.05s (95.18%)
Showing nodes accounting for 1.05s, 100% of 1.05s total
      flat  flat%   sum%        cum   cum%
     0.02s  1.90%  1.90%      1.01s 96.19%  stack/bench.BenchmarkCountBitsString
         0     0%  1.90%      1.01s 96.19%  testing.(*B).launch
         0     0%  1.90%      1.01s 96.19%  testing.(*B).runN
     0.01s  0.95%  2.86%      0.99s 94.29%  stack/bench.countBitsString
     0.51s 48.57% 51.43%      0.51s 48.57%  stack/bench.recursiveCount
         0     0% 51.43%      0.47s 44.76%  strconv.FormatInt
     0.21s 20.00% 71.43%      0.47s 44.76%  strconv.formatBits
     0.03s  2.86% 74.29%      0.26s 24.76%  runtime.slicebytetostring
     0.14s 13.33% 87.62%      0.22s 20.95%  runtime.mallocgc
     0.04s  3.81% 91.43%      0.04s  3.81%  runtime.nextFreeFast (inline)
[...]
$ go tool pprof -list bench.countBitsString bench.test cpu.prof 
Total: 1.05s
ROUTINE ======================== stack/bench.countBitsString in /home/.../stack/bench/main.go
      10ms      990ms (flat, cum) 94.29% of Total
         .          .     14: * implement a function that counts the number of set of bits in binary representation
         .          .     15: * Complete the 'countBits' function below.
         .          .     16: * The function is expected to return an int32.
         .          .     17: * The function accepts unit32 num as parameter.
         .          .     18: */
      10ms       10ms     19:func countBitsString(num uint32) int32 {
         .          .     20:
         .          .     21:   num_64 := int64(num)
         .      470ms     22:   num_2 := strconv.FormatInt(num_64, 2)
         .      510ms     23:   _, counter := recursiveCount(num_2, 0)
         .          .     24:   return counter
         .          .     25:}
         .          .     26:
         .          .     27:func recursiveCount(s string, counter int32) (string, int32) {
         .          .     28:   if len(s) < 1 {

These benchmarks show math.Log2+math.Pow (6.57s) is ~14x slower than strconv.FormatInt (0.47s).

I also highly recommend using the pprof web interface to interactively explore performance. This is a little non-obvious, but can be started with:

$ go tool pprof -http : bench.test cpu.prof

Dave Cheney also has an instructive blog on benchmarking with Go: https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1