'Speed up bitstring/bit operations in Python?

I wrote a prime number generator using Sieve of Eratosthenes and Python 3.1. The code runs correctly and gracefully at 0.32 seconds on ideone.com to generate prime numbers up to 1,000,000.

# from bitstring import BitString

def prime_numbers(limit=1000000):
    '''Prime number generator. Yields the series
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29 ...    
    using Sieve of Eratosthenes.
    '''
    yield 2
    sub_limit = int(limit**0.5) 
    flags = [False, False] + [True] * (limit - 2)   
#     flags = BitString(limit)
    # Step through all the odd numbers
    for i in range(3, limit, 2):       
        if flags[i] is False:
#        if flags[i] is True:
            continue
        yield i
        # Exclude further multiples of the current prime number
        if i <= sub_limit:
            for j in range(i*3, limit, i<<1):
                flags[j] = False
#                flags[j] = True

The problem is, I run out of memory when I try to generate numbers up to 1,000,000,000.

    flags = [False, False] + [True] * (limit - 2)   
MemoryError

As you can imagine, allocating 1 billion boolean values (1 byte 4 or 8 bytes (see comment) each in Python) is really not feasible, so I looked into bitstring. I figured, using 1 bit for each flag would be much more memory-efficient. However, the program's performance dropped drastically - 24 seconds runtime, for prime number up to 1,000,000. This is probably due to the internal implementation of bitstring.

You can comment/uncomment the three lines to see what I changed to use BitString, as the code snippet above.

My question is, is there a way to speed up my program, with or without bitstring?

Edit: Please test the code yourself before posting. I can't accept answers that run slower than my existing code, naturally.

Edit again:

I've compiled a list of benchmarks on my machine.



Solution 1:[1]

OK, so this is my second answer, but as speed is of the essence I thought that I had to mention the bitarray module - even though it's bitstring's nemesis :). It's ideally suited to this case as not only is it a C extension (and so faster than pure Python has a hope of being), but it also supports slice assignments.

I haven't even tried to optimise this, I just rewrote the bitstring version. On my machine I get 0.16 seconds for primes under a million.

For a billion, it runs perfectly well and completes in 2 minutes 31 seconds.

import bitarray

def prime_bitarray(limit=1000000):
    yield 2
    flags = bitarray.bitarray(limit)
    flags.setall(False)
    sub_limit = int(limit**0.5)
    for i in range(3, limit, 2):
        if not flags[i]:
            yield i
            if i <= sub_limit:
                flags[3*i:limit:i*2] = True

Solution 2:[2]

Okay, here's a (near complete) comprehensive benchmarking I've done tonight to see which code runs the fastest. Hopefully someone will find this list useful. I omitted anything that takes more than 30 seconds to complete on my machine.

I would like to thank everyone that put in an input. I've gained a lot of insight from your efforts, and I hope you have too.

My machine: AMD ZM-86, 2.40 Ghz Dual-Core, with 4GB of RAM. This is a HP Touchsmart Tx2 laptop. Note that while I may have linked to a pastebin, I benchmarked all of the following on my own machine.

I will add the gmpy2 benchmark once I am able to build it.

All of the benchmarks are tested in Python 2.6 x86

Returning a list of prime numbers n up to 1,000,000: (Using Python generators)

Sebastian's numpy generator version (updated) - 121 ms @

Mark's Sieve + Wheel - 154 ms

Robert's version with slicing - 159 ms

My improved version with slicing - 205 ms

Numpy generator with enumerate - 249 ms @

Mark's Basic Sieve - 317 ms

casevh's improvement on my original solution - 343 ms

My modified numpy generator solution - 407 ms

My original method in the question - 409 ms

Bitarray Solution - 414 ms @

Pure Python with bytearray - 1394 ms @

Scott's BitString solution - 6659 ms @

'@' means this method is capable of generating up to n < 1,000,000,000 on my machine setup.

In addition, if you don't need the generator and just want the whole list at once:

numpy solution from RosettaCode - 32 ms @

(The numpy solution is also capable of generating up to 1 billion, which took 61.6259 seconds. I suspect the memory was swapped once, hence the double time.)

Solution 3:[3]

Related question: Fastest way to list all primes below N in python.

Hi, i am too looking for a code in Python to generate primes up to 10^9 as fast as i can, which is difficult because of the memory problem. up to now i came up with this to generate primes up to 10^6 & 10^7 (clocking 0,171s & 1,764s respectively on my old machine), which seems to be slightly faster (at least in my computer) than "My improved version with slicing" from previous post, probably because i use n//i-i +1 (see below) instead of the analogous len(flags[i2::i<<1]) in the other code. please correct me if i am wrong. Any suggestions for improvement are very welcome.

def primes(n):
    """ Returns  a list of primes < n """
    sieve = [True] * n
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            sieve[i*i::2*i]=[False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3,n,2) if sieve[i]]

In one of his codes Xavier uses flags[i2::i<<1] and len(flags[i2::i<<1]). I computed the size explicitly, using the fact that between i*i..n we have (n-i*i)//2*i multiples of 2*i because we want to count i*i also we sum 1 so len(sieve[i*i::2*i]) equals (n-i*i)//(2*i) +1. This makes the code faster. A basic generator based on the code above would be:

def primesgen(n):
    """ Generates all primes <= n """
    sieve = [True] * n
    yield 2
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            yield i
            sieve[i*i::2*i] = [False]*((n-i*i-1)/(2*i)+1)
    for i in xrange(i+2,n,2):
        if sieve[i]: yield i

with a bit of modification one can write a slightly slower version of the code above that starts with a sieve half of the size sieve = [True] * (n//2) and works for the same n. not sure how that will reflect in the memory issue. As an example of implementation here is the modified version of the numpy rosetta code (maybe faster) starting with a sieve half of the size.

import numpy
def primesfrom3to(n):
    """ Returns a array of primes, 3 <= p < n """
    sieve = numpy.ones(n/2, dtype=numpy.bool)
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i/2]: sieve[i*i/2::i] = False
    return 2*numpy.nonzero(sieve)[0][1::]+1

A Faster & more memory-wise generator would be:

import numpy
def primesgen1(n):
""" Input n>=6, Generates all primes < n """
sieve1 = numpy.ones(n/6+1, dtype=numpy.bool)
sieve5 = numpy.ones(n/6  , dtype=numpy.bool)
sieve1[0] = False
yield 2; yield 3;
for i in xrange(int(n**0.5)/6+1):
    if sieve1[i]:
        k=6*i+1; yield k;
        sieve1[ ((k*k)/6) ::k] = False
        sieve5[(k*k+4*k)/6::k] = False
    if sieve5[i]:
        k=6*i+5; yield k;
        sieve1[ ((k*k)/6) ::k] = False
        sieve5[(k*k+2*k)/6::k] = False
for i in xrange(i+1,n/6):
        if sieve1[i]: yield 6*i+1
        if sieve5[i]: yield 6*i+5
if n%6 > 1:
    if sieve1[i+1]: yield 6*(i+1)+1

or with a bit more code:

import numpy
def primesgen(n):
    """ Input n>=30, Generates all primes < n """
    size = n/30 + 1
    sieve01 = numpy.ones(size, dtype=numpy.bool)
    sieve07 = numpy.ones(size, dtype=numpy.bool)
    sieve11 = numpy.ones(size, dtype=numpy.bool)
    sieve13 = numpy.ones(size, dtype=numpy.bool)
    sieve17 = numpy.ones(size, dtype=numpy.bool)
    sieve19 = numpy.ones(size, dtype=numpy.bool)
    sieve23 = numpy.ones(size, dtype=numpy.bool)
    sieve29 = numpy.ones(size, dtype=numpy.bool)
    sieve01[0] = False
    yield 2; yield 3; yield 5;
    for i in xrange(int(n**0.5)/30+1):
        if sieve01[i]:
            k=30*i+1; yield k;
            sieve01[     (k*k)/30::k] = False
            sieve07[(k*k+ 6*k)/30::k] = False
            sieve11[(k*k+10*k)/30::k] = False
            sieve13[(k*k+12*k)/30::k] = False
            sieve17[(k*k+16*k)/30::k] = False
            sieve19[(k*k+18*k)/30::k] = False
            sieve23[(k*k+22*k)/30::k] = False
            sieve29[(k*k+28*k)/30::k] = False
        if sieve07[i]:
            k=30*i+7; yield k;
            sieve01[(k*k+ 6*k)/30::k] = False
            sieve07[(k*k+24*k)/30::k] = False
            sieve11[(k*k+16*k)/30::k] = False
            sieve13[(k*k+12*k)/30::k] = False
            sieve17[(k*k+ 4*k)/30::k] = False
            sieve19[     (k*k)/30::k] = False
            sieve23[(k*k+22*k)/30::k] = False
            sieve29[(k*k+10*k)/30::k] = False
        if sieve11[i]:
            k=30*i+11; yield k;
            sieve01[     (k*k)/30::k] = False
            sieve07[(k*k+ 6*k)/30::k] = False
            sieve11[(k*k+20*k)/30::k] = False
            sieve13[(k*k+12*k)/30::k] = False
            sieve17[(k*k+26*k)/30::k] = False
            sieve19[(k*k+18*k)/30::k] = False
            sieve23[(k*k+ 2*k)/30::k] = False
            sieve29[(k*k+ 8*k)/30::k] = False
        if sieve13[i]:
            k=30*i+13; yield k;
            sieve01[(k*k+24*k)/30::k] = False
            sieve07[(k*k+ 6*k)/30::k] = False
            sieve11[(k*k+ 4*k)/30::k] = False
            sieve13[(k*k+18*k)/30::k] = False
            sieve17[(k*k+16*k)/30::k] = False
            sieve19[     (k*k)/30::k] = False
            sieve23[(k*k+28*k)/30::k] = False
            sieve29[(k*k+10*k)/30::k] = False
        if sieve17[i]:
            k=30*i+17; yield k;
            sieve01[(k*k+ 6*k)/30::k] = False
            sieve07[(k*k+24*k)/30::k] = False
            sieve11[(k*k+26*k)/30::k] = False
            sieve13[(k*k+12*k)/30::k] = False
            sieve17[(k*k+14*k)/30::k] = False
            sieve19[     (k*k)/30::k] = False
            sieve23[(k*k+ 2*k)/30::k] = False
            sieve29[(k*k+20*k)/30::k] = False
        if sieve19[i]:
            k=30*i+19; yield k;
            sieve01[     (k*k)/30::k] = False
            sieve07[(k*k+24*k)/30::k] = False
            sieve11[(k*k+10*k)/30::k] = False
            sieve13[(k*k+18*k)/30::k] = False
            sieve17[(k*k+ 4*k)/30::k] = False
            sieve19[(k*k+12*k)/30::k] = False
            sieve23[(k*k+28*k)/30::k] = False
            sieve29[(k*k+22*k)/30::k] = False
        if sieve23[i]:
            k=30*i+23; yield k;
            sieve01[(k*k+24*k)/30::k] = False
            sieve07[(k*k+ 6*k)/30::k] = False
            sieve11[(k*k+14*k)/30::k] = False
            sieve13[(k*k+18*k)/30::k] = False
            sieve17[(k*k+26*k)/30::k] = False
            sieve19[     (k*k)/30::k] = False
            sieve23[(k*k+ 8*k)/30::k] = False
            sieve29[(k*k+20*k)/30::k] = False
        if sieve29[i]:
            k=30*i+29; yield k;
            sieve01[     (k*k)/30::k] = False
            sieve07[(k*k+24*k)/30::k] = False
            sieve11[(k*k+20*k)/30::k] = False
            sieve13[(k*k+18*k)/30::k] = False
            sieve17[(k*k+14*k)/30::k] = False
            sieve19[(k*k+12*k)/30::k] = False
            sieve23[(k*k+ 8*k)/30::k] = False
            sieve29[(k*k+ 2*k)/30::k] = False
    for i in xrange(i+1,n/30):
            if sieve01[i]: yield 30*i+1
            if sieve07[i]: yield 30*i+7
            if sieve11[i]: yield 30*i+11
            if sieve13[i]: yield 30*i+13
            if sieve17[i]: yield 30*i+17
            if sieve19[i]: yield 30*i+19
            if sieve23[i]: yield 30*i+23
            if sieve29[i]: yield 30*i+29
    i += 1
    mod30 = n%30
    if mod30 > 1:
        if sieve01[i]: yield 30*i+1
    if mod30 > 7:
        if sieve07[i]: yield 30*i+7
    if mod30 > 11:
        if sieve11[i]: yield 30*i+11
    if mod30 > 13:
        if sieve13[i]: yield 30*i+13
    if mod30 > 17:
        if sieve17[i]: yield 30*i+17
    if mod30 > 19:
        if sieve19[i]: yield 30*i+19
    if mod30 > 23:
        if sieve23[i]: yield 30*i+23

Ps: If you have any suggestions about how to speed up this code, anything from changing the order of operations to pre-computing anything, please comment.

Solution 4:[4]

Here's a version that I wrote a while back; it might be interesting to compare with yours for speed. It doesn't do anything about the space problems, though (in fact, they're probably worse than with your version).

from math import sqrt

def basicSieve(n):
    """Given a positive integer n, generate the primes < n."""
    s = [1]*n
    for p in xrange(2, 1+int(sqrt(n-1))):
        if s[p]:
            a = p*p
            s[a::p] = [0] * -((a-n)//p)
    for p in xrange(2, n):
        if s[p]:
            yield p 

I have faster versions, using a wheel, but they're much more complicated. Original source is here.

Okay, here's the version using a wheel. wheelSieve is the main entry point.

from math import sqrt
from bisect import bisect_left

def basicSieve(n):
    """Given a positive integer n, generate the primes < n."""
    s = [1]*n
    for p in xrange(2, 1+int(sqrt(n-1))):
        if s[p]:
            a = p*p
            s[a::p] = [0] * -((a-n)//p)
    for p in xrange(2, n):
        if s[p]:
            yield p

class Wheel(object):
    """Class representing a wheel.

    Attributes:
       primelimit -> wheel covers primes < primelimit.
       For example, given a primelimit of 6
       the wheel primes are 2, 3, and 5.
       primes -> list of primes less than primelimit
       size -> product of the primes in primes;  the modulus of the wheel
       units -> list of units modulo size
       phi -> number of units

    """
    def __init__(self, primelimit):
        self.primelimit = primelimit
        self.primes = list(basicSieve(primelimit))

        # compute the size of the wheel
        size = 1
        for p in self.primes:
            size *= p
        self.size = size

        # find the units by sieving
        units = [1] * self.size
        for p in self.primes:
            units[::p] = [0]*(self.size//p)
        self.units = [i for i in xrange(self.size) if units[i]]

        # number of units
        self.phi = len(self.units)

    def to_index(self, n):
        """Compute alpha(n), where alpha is an order preserving map
        from the set of units modulo self.size to the nonnegative integers.

        If n is not a unit, the index of the first unit greater than n
        is given."""
        return bisect_left(self.units, n % self.size) + n // self.size * self.phi

    def from_index(self, i):
        """Inverse of to_index."""

        return self.units[i % self.phi] + i // self.phi * self.size

def wheelSieveInner(n, wheel):
    """Given a positive integer n and a wheel, find the wheel indices of
    all primes that are less than n, and that are units modulo the
    wheel modulus.
    """

    # renaming to avoid the overhead of attribute lookups
    U = wheel.units
    wS = wheel.size
    # inverse of unit map
    UI = dict((u, i) for i, u in enumerate(U))
    nU = len(wheel.units)

    sqroot = 1+int(sqrt(n-1)) # ceiling of square root of n

    # corresponding index (index of next unit up)
    sqrti = bisect_left(U, sqroot % wS) + sqroot//wS*nU
    upper = bisect_left(U, n % wS) + n//wS*nU
    ind2 = bisect_left(U, 2 % wS) + 2//wS*nU

    s = [1]*upper
    for i in xrange(ind2, sqrti):
        if s[i]:
            q = i//nU
            u = U[i%nU]
            p = q*wS+u
            u2 = u*u
            aq, au = (p+u)*q+u2//wS, u2%wS
            wp = p * nU
            for v in U:
                # eliminate entries corresponding to integers
                # congruent to p*v modulo p*wS
                uvr = u*v%wS
                m = aq + (au > uvr)
                bot = (m + (q*v + u*v//wS - m) % p) * nU + UI[uvr]
                s[bot::wp] = [0]*-((bot-upper)//wp)
    return s

def wheelSieve(n, wheel=Wheel(10)):
    """Given a positive integer n, generate the list of primes <= n."""
    n += 1
    wS = wheel.size
    U = wheel.units
    nU = len(wheel.units)
    s = wheelSieveInner(n, wheel)
    return (wheel.primes[:bisect_left(wheel.primes, n)] +
            [p//nU*wS + U[p%nU] for p in xrange(bisect_left(U, 2 % wS)
             + 2//wS*nU, len(s)) if s[p]])

As to what a wheel is: well, you know that (apart from 2), all primes are odd, so most sieves miss out all the even numbers. Similarly, you can go a bit further and notice that all primes (except 2 and 3) are congruent to 1 or 5 modulo 6 (== 2 * 3), so you can get away with only storing entries for those numbers in your sieve. The next step up is to note that all primes (except 2, 3 and 5) are congruent to one of 1, 7, 11, 13, 17, 19, 23, 29 (modulo 30) (here 30 == 2*3*5), and so on.

Solution 5:[5]

One speed improvement you can make using bitstring is to use the 'set' method when you exclude multiples of the current number.

So the vital section becomes

for i in range(3, limit, 2):
    if flags[i]:
        yield i
        if i <= sub_limit:
            flags.set(1, range(i*3, limit, i*2))    

On my machine this runs about 3 times faster than the original.

My other thought was to use the bitstring to represent only the odd numbers. You could then find the unset bits using flags.findall([0]) which returns a generator. Not sure if that would be much faster (finding bit patterns isn't terribly easy to do efficiently).

[Full disclosure: I wrote the bitstring module, so I've got some pride at stake here!]


As a comparison I've also taken the guts out of the bitstring method so that it's doing it in the same way, but without range checking etc. I think this gives a reasonable lower limit for pure Python that works for a billion elements (without changing the algorithm, which I think is cheating!)

def prime_pure(limit=1000000):
    yield 2
    flags = bytearray((limit + 7) // 8)
    sub_limit = int(limit**0.5)
    for i in xrange(3, limit, 2):
        byte, bit = divmod(i, 8)
        if not flags[byte] & (128 >> bit):
            yield i
            if i <= sub_limit:
                for j in xrange(i*3, limit, i*2):
                    byte, bit = divmod(j, 8)
                    flags[byte] |= (128 >> bit)

On my machine this runs in about 0.62 seconds for a million elements, which means it's about a quarter of the speed of the bitarray answer. I think that's quite reasonable for pure Python.

Solution 6:[6]

Python's integer types can be of arbitrary size, so you shouldn't need a clever bitstring library to represent a set of bits, just a single number.

Here's the code. It handles a limit of 1,000,000 with ease, and can even handle 10,000,000 without complaining too much:

def multiples_of(n, step, limit):
    bits = 1 << n
    old_bits = bits
    max = 1 << limit
    while old_bits < max:
        old_bits = bits
        bits += bits << step
        step *= 2
    return old_bits

def prime_numbers(limit=10000000):
    '''Prime number generator. Yields the series                                                                        
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29 ...                                                                              
    using Sieve of Eratosthenes.                                                                                        
    '''
    yield 2
    sub_limit = int(limit**0.5)
    flags = ((1 << (limit - 2)) - 1) << 2
    # Step through all the odd numbers                                                                                  
    for i in xrange(3, limit, 2):
        if not (flags & (1 << i)):
            continue
        yield i
        # Exclude further multiples of the current prime number                                                         
        if i <= sub_limit:
            flags &= ~multiples_of(i * 3, i << 1, limit)

The multiples_of function avoids the cost of setting every single multiple individually. It sets the initial bit, shifts it by the initial step (i << 1) and adds the result to itself. It then doubles the step, so that the next shift copies both bits, and so on until it reaches the limit. This sets all the multiples of a number up to the limit in O(log(limit)) time.

Solution 7:[7]

One option you may want to look at is just compiling a C/C++ module so you have direct access to the bit-twiddling features. AFAIK you could write something of that nature and only return the results on completion of the calculations performed in C/C++. Now that I'm typing this you may also look at numpy which does store arrays as native C for speed. I don't know if that will be any faster than the bitstring module, though!

Solution 8:[8]

Another really stupid option, but that can be of help if you really need a large list of primes number available very fast. Say, if you need them as a tool to answer project Euler's problems (if the problem is just optimizing your code as a mind game it's irrelevant).

Use any slow solution to generate list and save it to a python source file, says primes.py that would just contain:

primes = [ a list of a million primes numbers here ]

Now to use them you just have to do import primes and you get them with minimal memory footprint at the speed of disk IO. Complexity is number of primes :-)

Even if you used a poorly optimized solution to generate this list, it will only be done once and it does not matter much.

You could probably make it even faster using pickle/unpickle, but you get the idea...

Solution 9:[9]

You could use a segmented Sieve of Eratosthenes. The memory used for each segment is reused for the next segment.

Solution 10:[10]

Here is some Python3 code that uses less memory than the bitarray/bitstring solutions posted to date and about 1/8 the memory of Robert William Hanks's primesgen(), while running faster than primesgen() (marginally faster at 1,000,000, using 37KB of memory, 3x faster than primesgen() at 1,000,000,000 using under 34MB). Increasing the chunk size (variable chunk in the code) uses more memory but speeds up the program, up to a limit -- I picked a value so that its contribution to memory is under about 10% of sieve's n//30 bytes. It's not as memory-efficient as Will Ness's infinite generator (see also https://stackoverflow.com/a/19391111/5439078) because it records (and returns at the end, in compressed form) all calculated primes.

This should work correctly as long as the square root calculation comes out accurate (about 2**51 if Python uses 64-bit doubles). However you should not use this program to find primes that large!

(I didn't recalculate the offsets, just took them from code of Robert William Hanks. Thanks Robert!)

import numpy as np
def prime_30_gen(n):
    """ Input n, int-like.  Yields all primes < n as Python ints"""
    wheel = np.array([2,3,5])
    yield from wheel[wheel < n].tolist()
    powers = 1 << np.arange(8, dtype='u1')
    odds = np.array([1, 7, 11, 13, 17, 19, 23, 29], dtype='i8')
    offsets=np.array([[0,6,10,12,16,18,22,28],[6,24,16,12,4,0,22,10],
                      [0,6,20,12,26,18,2,8],  [24,6,4,18,16,0,28,10],
                      [6,24,26,12,14,0,2,20], [0,24,10,18,4,12,28,22],
                      [24,6,14,18,26,0,8,20], [0,24,20,18,14,12,8,2]],
                     dtype='i8')
    offsets = {d:f for d,f in zip(odds, offsets)}
    sieve = 255 * np.ones((n + 28) // 30, dtype='u1')
    if n % 30 > 1: sieve[-1] >>= 8 - odds.searchsorted(n % 30)
    sieve[0] &= ~1 
    for i in range((int((n - 1)**0.5) + 29) // 30):
        for odd in odds[(sieve[i] & powers).nonzero()[0]]:
            k = i * 30 + odd
            yield int(k)
            for clear_bit, off in zip(~powers, offsets[odd]): 
                sieve[(k * (k + off)) // 30 :: k] &= clear_bit
    chunk = min(1 + (n >> 13), 1<<15)
    for i in range(i + 1, len(sieve), chunk):
        a = (sieve[i : i + chunk, np.newaxis] & powers)
        a = np.flatnonzero(a.astype('bool')) + i * 8
        a = (a // 8 * 30 + odds[a & 7]).tolist()
        yield from a
    return sieve

Side note: If you want real speed and have the 2GB of RAM required for the list of primes to 10**9, then you should use pyprimesieve (on https://pypi.python.org/, using primesieve http://primesieve.org/).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4
Solution 5
Solution 6
Solution 7 Wayne Werner
Solution 8 kriss
Solution 9 user448810
Solution 10 Will Ness