Iterate through True Bits of Binary Number (Fast)

So I'm trying to make a function in Python that returns all of the powers of two used in the binary representation of a number.

For example: 123 in binary is 1111011. I want my function to return the powers of two corresponding to the True bits of 123 (1, 2, 8, 16, 32 and 64) as fast as possible.

Right now the best I've found is:

def true_bits(num):
    while num:
        temp = num & -num
        num -= temp
        yield temp
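For context, the `num & -num` step works because Python's bitwise operators treat negative integers as two's complement, so `-num` flips every bit above the lowest set bit and the AND isolates exactly that bit. A quick check:

```python
num = 123                  # 0b1111011
low = num & -num           # isolate the lowest set bit
print(bin(num), bin(low))  # 0b1111011 0b1

# another example: 88 = 0b1011000, whose lowest set bit is 8
print(88 & -88)            # 8
```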


Solution 1:[1]

Some (non-faster) alternatives:

Using numpy and assuming 8-bit unsigned integers:

import numpy as np
def numpy_bits(num):
    bits = np.unpackbits(np.uint8(num), bitorder='little')
    return 2**np.arange(8)[bits.astype(bool)]

numpy_bits(123)
# array([ 1,  2,  8, 16, 32, 64])
# 6.8 µs ± 293 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
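The same idea extends past 8 bits if the value is first viewed as bytes. A sketch for 64-bit inputs (the name `numpy_bits64` is mine, not from the answer):

```python
import numpy as np

def numpy_bits64(num):
    # View the 64-bit value as 8 little-endian bytes, unpack to 64 bits
    # (LSB first), then select the matching powers of two.
    bits = np.unpackbits(np.array([num], dtype='<u8').view(np.uint8),
                         bitorder='little')
    powers = np.uint64(1) << np.arange(64, dtype=np.uint64)
    return powers[bits.astype(bool)]

print(numpy_bits64(123).tolist())  # [1, 2, 8, 16, 32, 64]
```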

Using a python loop (decreasing bit order):

def python_bits(num):
    for i in range(7,-1,-1):
        if num >= (x:=2**i):
            yield x
            num -= x

list(python_bits(123))
# [64, 32, 16, 8, 2, 1]
# 2.26 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
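If the 8-bit limit matters, a variant of this loop (my naming) can derive the starting exponent from `int.bit_length()` and handle arbitrarily large integers:

```python
def python_bits_any(num):
    # Walk from the highest possible bit position down to bit 0.
    for i in range(num.bit_length() - 1, -1, -1):
        if num >= (x := 1 << i):
            yield x
            num -= x

print(list(python_bits_any(123)))  # [64, 32, 16, 8, 2, 1]
```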

OP's approach:

list(true_bits(123))
# [1, 2, 8, 16, 32, 64]
# 1.14 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Solution 2:[2]

Benchmark with a bunch of random 64-bit numbers with 24 true bits (based on what you said in the comments):

47.7 ms  true_bits_original
16.0 ms  true_bits_Kelly

45.6 ms  true_bits_original
15.7 ms  true_bits_Kelly

47.4 ms  true_bits_original
16.3 ms  true_bits_Kelly

I'm using eight lookup tables, each responsible for eight of the bits. Full code with benchmark:

intern = {2**i: 2**i for i in range(64)}.get
table0 = [()]
for i in range(8):
    table0 += [(*bits, intern(2**i)) for bits in table0]
table0 = tuple(table0)
table1 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table0)
table2 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table1)
table3 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table2)
table4 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table3)
table5 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table4)
table6 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table5)
table7 = tuple(tuple(intern(bit << 8) for bit in bits) for bits in table6)

def true_bits_Kelly(num):
    return chain(table0[num & 0xff],
                 table1[(num >> 8) & 0xff],
                 table2[(num >> 16) & 0xff],
                 table3[(num >> 24) & 0xff],
                 table4[(num >> 32) & 0xff],
                 table5[(num >> 40) & 0xff],
                 table6[(num >> 48) & 0xff],
                 table7[num >> 56])

def true_bits_original(num):
    while num:
        temp = num & -num
        num -= temp
        yield temp

funcs = true_bits_original, true_bits_Kelly

import timeit
from itertools import repeat, chain
from random import getrandbits, sample
from collections import deque

# Correctness
for _ in range(1000):
    num = getrandbits(64)
    expect = list(funcs[0](num))
    for func in funcs:
        assert list(func(num)) == expect

# Speed
for _ in range(3):
    nums = [sum(2**i for i in sample(range(64), 24))
            for _ in range(10000)]
    for func in funcs:
        def run():
            gens = map(func, nums)
            consumes = map(deque, gens, repeat(0))
            deque(consumes, 0)
        t = min(timeit.repeat(run, number=1))
        print('%4.1f ms ' % (t * 1e3), func.__name__)
    print()
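To illustrate the table idea in isolation, here is a condensed sketch (my simplification, not the answer's code): it builds one shared 256-entry table mapping each byte value to its ascending set-bit values, then shifts that table's entries per byte slice, instead of precomputing eight separate interned tables as above:

```python
from itertools import chain

# table[b] = ascending powers of two present in byte value b (0..255)
table = [()]
for i in range(8):
    table += [(*bits, 1 << i) for bits in table]

def true_bits_lookup(num):
    # Process a 64-bit num one byte at a time, low byte first,
    # shifting each byte's precomputed bits into position.
    return chain.from_iterable(
        (bit << shift for bit in table[(num >> shift) & 0xff])
        for shift in range(0, 64, 8))

print(list(true_bits_lookup(123)))  # [1, 2, 8, 16, 32, 64]
```

Shifting at lookup time is simpler but loses the interning and zero-work indexing that make the eight-table version above fast.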

Solution 3:[3]

Your initial code is already pretty efficient; it is the CPython interpreter that makes it slow.

Indeed, the interpreter uses reference-counted, variable-sized integers that are expensive to manage: -num, num & ..., and num -= temp each allocate a new integer object, so three expensive allocations are done per iteration. The yield is also a fairly expensive operation (it causes a low-level context switch).

Such overhead can be mostly removed by just-in-time (JIT) compilers. The PyPy general-purpose JIT-based interpreter, for example, is able to mostly remove the overhead of object allocations (thanks also to a fast garbage collector), though PyPy does not yet optimize the yield very well. Alternatively, Numba can be used here. Numba is a JIT compiler meant to optimize numerical code, and it can be called from code executed by CPython. For example, the following code is a bit faster:

import numba as nb

@nb.njit('(uint64,)')
def true_bits(num):
    while num:
        temp = num & -num
        num -= temp
        yield temp

That being said, it is limited to 64-bit (or smaller) integers, like NumPy. Cython could also help by compiling the code ahead of time with a conventional compiler. This is similar to writing your own C module, except you do not need to write C code: Cython makes the process much easier.

If you want to optimize the code further, then you certainly need to use such tools in the caller function too, so as not to pay the expensive overhead of function calls from the CPython interpreter (which are at least 10 times slower than native ones).


If this is not possible, you can use the following approach with Numba:

import numpy as np  # needed for the buffer below

@nb.njit('(uint64,uint64[::1])')
def true_bits(num, buffer):
    cur = 0
    buffer.fill(0)
    while num:
        temp = num & -num
        num -= temp
        buffer[cur] = temp
        cur += 1
    return cur

buffer = np.empty(64, dtype=np.uint64)
written_items = true_bits(154781, buffer)
# Result stored in buffer[:written_items]

The idea is to write the result into a pre-allocated buffer (since creating NumPy arrays is slow in this context). The function writes each value into the buffer and returns the number of items written. You can get the actual items with buffer[:written_items], or you can iterate over the array, but be aware that doing so is almost as expensive as the computation itself (again due to the CPython interpreter). Still, it is faster than the initial solution.
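One way to sanity-check the buffer pattern: the returned slice should sum back to the original number, since it holds exactly the set-bit values. A plain-Python stand-in (same body, without the @nb.njit decorator, so no speedup) for illustration:

```python
import numpy as np

def true_bits_buffered(num, buffer):
    # Fill the preallocated buffer with the isolated lowest set bits
    # and report how many slots were used.
    cur = 0
    while num:
        temp = num & -num
        num -= temp
        buffer[cur] = temp
        cur += 1
    return cur

buffer = np.empty(64, dtype=np.uint64)
n = true_bits_buffered(123, buffer)
print(buffer[:n].tolist())    # [1, 2, 8, 16, 32, 64]
print(int(buffer[:n].sum()))  # 123 -- the bits sum back to the input
```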

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
