Fast divisibility tests (by 2,3,4,5,.., 16)?
What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how to calculate very fast that a number is divisible by 2,3,4,5,... up to 16?
WARNING: the code given is an EXAMPLE only. Every line is independent! The obvious solution using the modulo operation is slow on many processors that lack DIV hardware (like many ARMs). Some compilers also cannot make such optimizations (say, if the divisor is a function argument or depends on something else).
Divisible_by_1 = do();
Divisible_by_2 = if (!(number & 1)) do();
Divisible_by_3 = ?
Divisible_by_4 = ?
Divisible_by_5 = ?
Divisible_by_6 = ?
Divisible_by_7 = ?
Divisible_by_8 = ?
Divisible_by_9 = ?
Divisible_by_10 = ?
Divisible_by_11 = ?
Divisible_by_12 = ?
Divisible_by_13 = ?
Divisible_by_14 = ?
Divisible_by_15 = ?
Divisible_by_16 = if(!(number & 0x0000000F)) do();
and special cases:
Divisible_by_2k = if(!(number & (tk-1))) do(); //tk=2**k=(2*2*2*...) k times
Solution 1:[1]
In every case (including divisible by 2):
if (number % n == 0) do();
Anding with a mask of low order bits is just obfuscation, and with a modern compiler will not be any faster than writing the code in a readable fashion.
If you have to test all of the cases, you might improve performance by nesting some of the cases inside the if for another: there's no point in testing for divisibility by 4 if divisibility by 2 has already failed, for example.
Solution 2:[2]
As @James mentioned, let the compiler simplify it for you. If n is a constant, any decent compiler is able to recognize the pattern and change it to a more efficient equivalent.
For example, the code
#include <stdio.h>
int main() {
size_t x;
scanf("%zu", &x);
__asm__ volatile ("nop;nop;nop;nop;nop;");
const char* volatile foo = (x%3 == 0) ? "yes" : "no";
__asm__ volatile ("nop;nop;nop;nop;nop;");
printf("%s\n", foo);
return 0;
}
compiled with g++-4.5 -O3, the relevant part of x%3 == 0 will become
mov rcx,QWORD PTR [rbp-0x8] # rbp-0x8 = &x
mov rdx,0xaaaaaaaaaaaaaaab
mov rax,rcx
mul rdx
lea rax,"yes"
shr rdx,1
lea rdx,[rdx+rdx*2]
cmp rcx,rdx
lea rdx,"no"
cmovne rax,rdx
mov QWORD PTR [rbp-0x10],rax
which, translated back to C code, means
(hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
// equivalent to: x % 3 == 0 ? "yes" : "no"
No division is involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3)
Edit:
- The magic constant 0xaaaaaaaaaaaaaaab can be computed at http://www.hackersdelight.org/magic.htm
- For divisors of the form 2^n - 1, check http://graphics.stanford.edu/~seander/bithacks.html#ModulusDivision
Solution 3:[3]
A bit tongue in cheek, but assuming you get the rest of the answers:
Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
Solution 4:[4]
Assume number is unsigned (32-bits). Then the following are very fast ways to compute divisibility up to 16. (I haven't measured but the assembly code indicates so.)
bool divisible_by_2 = number % 2 == 0;
bool divisible_by_3 = number * 2863311531u <= 1431655765u;
bool divisible_by_4 = number % 4 == 0;
bool divisible_by_5 = number * 3435973837u <= 858993459u;
bool divisible_by_6 = divisible_by_2 && divisible_by_3;
bool divisible_by_7 = number * 3067833783u <= 613566756u;
bool divisible_by_8 = number % 8 == 0;
bool divisible_by_9 = number * 954437177u <= 477218588u;
bool divisible_by_10 = divisible_by_2 && divisible_by_5;
bool divisible_by_11 = number * 3123612579u <= 390451572u;
bool divisible_by_12 = divisible_by_3 && divisible_by_4;
bool divisible_by_13 = number * 3303820997u <= 330382099u;
bool divisible_by_14 = divisible_by_2 && divisible_by_7;
bool divisible_by_15 = number * 4008636143u <= 286331153u;
bool divisible_by_16 = number % 16 == 0;
Regarding divisibility by d the following rules hold:
- When d is a power of 2:
As pointed out by James Kanze, you can use is_divisible_by_d = (number % d == 0). Compilers are clever enough to implement this as (number & (d - 1)) == 0 which is very efficient but obfuscated.
However, when d is not a power of 2 it looks like the obfuscations shown above are more efficient than what current compilers do. (More on that later).
- When d is odd:
The technique takes the form is_divisible_by_d = number * a <= b where a and b are cleverly obtained constants. Notice that all we need is 1 multiplication and 1 comparison:
- When d is even but not a power of 2:
Then, write d = p * q where p is a power of 2 and q is odd, and use the "tongue in cheek" suggestion by unpythonic, that is, is_divisible_by_d = is_divisible_by_p && is_divisible_by_q. Again, only 1 multiplication (in the calculation of is_divisible_by_q) is performed.
Many compilers (I've tested clang 5.0.0, gcc 7.3, icc 18 and msvc 19 using godbolt) replace number % d == 0 by (number / d) * d == number. They use a clever technique (see references in Olof Forshell's answer) to replace the division by a multiplication and a bit shift. They end up doing 2 multiplications. In contrast the techniques above perform only 1 multiplication.
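The constants a and b above can be derived rather than looked up. Here is a minimal sketch, assuming a is the multiplicative inverse of d modulo 2^32 and b is floor((2^32 - 1) / d), which matches the values listed (for d = 3: 2863311531 and 1431655765). The function names are made up for illustration. Multiplying by a permutes the 32-bit values and sends each multiple n = d * m to m, which is at most b; every other n lands above b.

```c
#include <stdint.h>

/* Multiplicative inverse of an odd d modulo 2^32 (Newton's iteration).
   Each step doubles the number of correct low bits; d itself is already
   correct to 3 bits because d * d == 1 (mod 8) for every odd d. */
uint32_t inverse_mod_2_32(uint32_t d)
{
    uint32_t x = d;
    for (int i = 0; i < 5; ++i)
        x *= 2u - d * x;
    return x;
}

/* Divisibility of n by an odd d. In real code a and b would be
   precomputed constants, leaving one multiplication and one comparison. */
int is_divisible_by_odd(uint32_t n, uint32_t d)
{
    uint32_t a = inverse_mod_2_32(d);  /* e.g. 2863311531u for d == 3 */
    uint32_t b = UINT32_MAX / d;       /* e.g. 1431655765u for d == 3 */
    return n * a <= b;
}
```

Calling is_divisible_by_odd(21, 7) gives 1 and is_divisible_by_odd(22, 7) gives 0; the wrap-around of the unsigned multiplication is intentional.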
Update 01-Oct-2018
Looks like the algorithm above is coming to GCC soon (already in trunk):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853
GCC's implementation seems even more efficient. Indeed, the implementation above has three parts: 1) divisibility by the divisor's even part; 2) divisibility by the divisor's odd part; 3) && to connect the results of the two previous steps. By using an assembler instruction which is not efficiently available in standard C++ (ror), GCC wraps up the three parts into a single one which is very similar to that of divisibility by the odd part. Great stuff! With this implementation available, it's better (for both clarity and performance) to fall back to % at all times.
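To make the rotate idea concrete, here is a sketch of how I understand it; this is not GCC's actual code, and the function names are mine. Write d = q * 2^k with q odd, multiply n by the inverse of q modulo 2^32, rotate right by k, and compare against floor((2^32 - 1) / d). If any of the low k bits of the product are set, the rotation moves them to the top and the comparison fails.

```c
#include <stdint.h>

/* Rotate v right by k bits, 0 <= k < 32. */
uint32_t rotate_right(uint32_t v, unsigned k)
{
    return k == 0 ? v : (v >> k) | (v << (32u - k));
}

/* Divisibility of n by d = q * 2^k (q odd, d != 0).
   k, the inverse, and the bound depend only on d, so for a constant
   divisor they would all be compile-time constants. */
int is_divisible(uint32_t n, uint32_t d)
{
    unsigned k = 0;
    uint32_t q = d;
    while ((q & 1u) == 0) {   /* split off the power of two */
        q >>= 1;
        ++k;
    }

    uint32_t inv = q;         /* inverse of q modulo 2^32 (Newton) */
    for (int i = 0; i < 5; ++i)
        inv *= 2u - q * inv;

    return rotate_right(n * inv, k) <= UINT32_MAX / d;
}
```

For an odd d the rotation count is 0 and this reduces to the multiply-and-compare test; for a power of 2 the inverse is 1 and only the rotation and the comparison remain.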
Update 05-May-2020
My articles on the subject have been published:
Quick Modular Calculations (Part 1), Overload Journal 154, December 2019, pages 11-15.
Quick Modular Calculations (Part 2), Overload Journal 155, February 2020, pages 14-17.
Quick Modular Calculations (Part 3), Overload Journal 156, April 2020, pages 10-13.
Solution 5:[5]
First of all, I remind you that a number in the form bn...b2b1b0 in binary has value:
number = bn*2^n+...+b2*4+b1*2+b0
Now, when you say number%3, you have:
number%3 =3= bn*(2^n % 3)+...+b2*1+b1*2+b0
(I used =3= to indicate congruence modulo 3). Note also that b1*2 =3= -b1*1
Now I will write all the 16 divisions using + and - and possibly multiplication (note that multiplication could be written as shift or sum of same value shifted to different locations. For example 5*x means x+(x<<2) in which you compute x once only)
Let's call the number n and let's say Divisible_by_i is a boolean value. As an intermediate value, imagine Congruence_by_i is a value congruent to n modulo i.
Also, let's say n0 means bit zero of n, n1 means bit 1, etc., that is
ni = (n >> i) & 1;
Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = n0-n1+n2-n3+n4-n5+n6-n7+n8-n9+n10-n11+n12-n13+n14-n15+n16-n17+n18-n19+n20-n21+n22-n23+n24-n25+n26-n27+n28-n29+n30-n31
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+2*n1-n2-2*n3+n4+2*n5-n6-2*n7+n8+2*n9-n10-2*n11+n12+2*n13-n14-2*n15+n16+2*n17-n18-2*n19+n20+2*n21-n22-2*n23+n24+2*n25-n26-2*n27+n28+2*n29-n30-2*n31
Congruence_by_7 = n0+2*n1+4*n2+n3+2*n4+4*n5+n6+2*n7+4*n8+n9+2*n10+4*n11+n12+2*n13+4*n14+n15+2*n16+4*n17+n18+2*n19+4*n20+n21+2*n22+4*n23+n24+2*n25+4*n26+n27+2*n28+4*n29+n30+2*n31
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+2*n1+4*n2-n3-2*n4-4*n5+n6+2*n7+4*n8-n9-2*n10-4*n11+n12+2*n13+4*n14-n15-2*n16-4*n17+n18+2*n19+4*n20-n21-2*n22-4*n23+n24+2*n25+4*n26-n27-2*n28-4*n29+n30+2*n31
Congruence_by_11 = n0+2*n1+4*n2+8*n3+5*n4-n5-2*n6-4*n7-8*n8-5*n9+n10+2*n11+4*n12+8*n13+5*n14-n15-2*n16-4*n17-8*n18-5*n19+n20+2*n21+4*n22+8*n23+5*n24-n25-2*n26-4*n27-8*n28-5*n29+n30+2*n31
Congruence_by_13 = n0+2*n1+4*n2+8*n3+3*n4+6*n5-n6-2*n7-4*n8-8*n9-3*n10-6*n11+n12+2*n13+4*n14+8*n15+3*n16+6*n17-n18-2*n19-4*n20-8*n21-3*n22-6*n23+n24+2*n25+4*n26+8*n27+3*n28+6*n29-n30-2*n31
Congruence_by_16 = n&0xF
Or when factorized:
Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = (n0+n2+n4+n6+n8+n10+n12+n14+n16+n18+n20+n22+n24+n26+n28+n30)-(n1+n3+n5+n7+n9+n11+n13+n15+n17+n19+n21+n23+n25+n27+n29+n31)
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+n4+n8+n12+n16+n20+n24+n28-(n2+n6+n10+n14+n18+n22+n26+n30)+2*(n1+n5+n9+n13+n17+n21+n25+n29-(n3+n7+n11+n15+n19+n23+n27+n31))
Congruence_by_7 = n0+n3+n6+n9+n12+n15+n18+n21+n24+n27+n30+2*(n1+n4+n7+n10+n13+n16+n19+n22+n25+n28+n31)+4*(n2+n5+n8+n11+n14+n17+n20+n23+n26+n29)
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+n6+n12+n18+n24+n30-(n3+n9+n15+n21+n27)+2*(n1+n7+n13+n19+n25+n31-(n4+n10+n16+n22+n28))+4*(n2+n8+n14+n20+n26-(n5+n11+n17+n23+n29))
// and so on
If these values end up being negative, add i to them until they become non-negative.
Now recursively feed these values through the same process until Congruence_by_i becomes less than i (and obviously >= 0). This is similar to what we do when we want the remainder of a decimal number modulo 3 or 9, remember? Sum up the digits; if the result has more than one digit, sum up the digits of the result again until you get only one digit.
Now for i = 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 16:
Divisible_by_i = (Congruence_by_i == 0);
And for the rest:
Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
Edit: Note that some of the additions can be avoided from the very beginning. For example, n0+2*n1+4*n2 is the same as n&0x7, and similarly n3+2*n4+4*n5 is (n>>3)&0x7, so you don't have to extract each bit individually for each formula; I wrote it that way for the sake of clarity and similarity between operations. To optimize each formula, work on it yourself: group the operands and factorize the operations.
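For i = 3, the whole procedure (factorized sum, fixing up a negative result, feeding the value back in) could look like the sketch below. It assumes an unsigned 32-bit input, and count_bits is a hypothetical helper standing in for whatever bit-counting primitive your platform offers.

```c
#include <stdint.h>

/* Number of set bits (portable loop; a hardware popcount would also do). */
int count_bits(uint32_t v)
{
    int c = 0;
    while (v != 0) {
        v &= v - 1;   /* clear the lowest set bit */
        ++c;
    }
    return c;
}

/* Returns a value in {0, 1, 2} congruent to n modulo 3. */
uint32_t congruence_by_3(uint32_t n)
{
    while (n >= 3) {
        /* even-position bits weigh +1, odd-position bits weigh -1 */
        int s = count_bits(n & 0x55555555u) - count_bits(n & 0xAAAAAAAAu);
        while (s < 0)
            s += 3;           /* make it non-negative */
        n = (uint32_t)s;      /* feed the result through the same process */
    }
    return n;
}

int divisible_by_3(uint32_t n)
{
    return congruence_by_3(n) == 0;
}
```

The loop always terminates because for n >= 3 the new value is strictly smaller than n.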
Solution 6:[6]
The LCM of these numbers seems to be 720720. It's quite small, so you can perform a single modulus operation and use the remainder as the index into a precomputed LUT.
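A minimal sketch of such a table, assuming an unsigned input (a signed n would first need its remainder made non-negative) and a layout of my own choosing: one 16-bit entry per remainder, with bit d - 1 set when the remainder is divisible by d. The names are hypothetical.

```c
#include <stdint.h>

enum { LCM_1_TO_16 = 720720 };

/* lut[r] has bit (d - 1) set when r is divisible by d, for d = 1..16. */
static uint16_t lut[LCM_1_TO_16];

/* Fill the table once at start-up; this is the only place that uses
   % with the small divisors. */
void init_divisibility_lut(void)
{
    for (uint32_t r = 0; r < LCM_1_TO_16; ++r) {
        uint16_t flags = 0;
        for (uint32_t d = 1; d <= 16; ++d)
            if (r % d == 0)
                flags |= (uint16_t)(1u << (d - 1));
        lut[r] = flags;
    }
}

/* All sixteen answers for n from one modulus and one table load. */
uint16_t divisibility_flags(uint32_t n)
{
    return lut[n % LCM_1_TO_16];
}
```

After init_divisibility_lut(), testing for divisibility by 6 is (divisibility_flags(n) >> 5) & 1. The table takes about 1.4 MB, so whether this beats the arithmetic tricks probably depends on how well it stays in cache.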
Solution 7:[7]
You should just use (i % N) == 0 as your test.
My compiler (a fairly old version of gcc) generated good code for all the cases I tried. Where bit tests were appropriate it did that. Where N was a constant it didn't generate the obvious "divide" for any case, it always used some "trick".
Just let the compiler generate the code for you, it will almost certainly know more about the architecture of the machine than you do :) And these are easy optimisations where you are unlikely to think up something better than the compiler does.
It's an interesting question though. I can't list the tricks used by the compiler for each constant, as I have to compile on a different computer. But I'll update this reply later on if nobody beats me to it :)
Solution 8:[8]
This probably won't help you in code, but there's a neat trick which can help do this in your head in some cases:
For divide by 3: For a number represented in decimal, you can sum all the digits, and check if the sum is divisible by 3.
Example: 12345 => 1+2+3+4+5 = 15 => 1+5 = 6, which is divisible by 3 (3 x 4115 = 12345).
More interestingly the same technique works for all factors of X-1, where X is the base in which the number is represented. So for decimal number, you can check divide by 3 or 9. For hex, you can check divide by 3,5 or 15. And for octal numbers, you can check divide by 7.
Solution 9:[9]
In a previous question, I showed a fast algorithm to check in base N for divisors that are factors of N-1. Base transformations between different powers of 2 are trivial; that's just bit grouping.
Therefore, checking for 3 is easy in base 4; checking for 5 is easy in base 16, and checking for 7 (and 9) is easy in base 64.
Non-prime divisors are trivial, so only 11 and 13 are hard cases. For 11, you could use base 1024, but at that point it's not really efficient for small integers.
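As an illustration of the bit-grouping idea, here is a sketch for 7 in base 64 (since 63 = 7 * 9, the same digit sum also serves for 9). It assumes an unsigned 32-bit input; the function names are mine and this is not the code from the previous question.

```c
#include <stdint.h>

/* Sum of the base-64 digits of n; congruent to n modulo 63. */
uint32_t digit_sum_base64(uint32_t n)
{
    uint32_t sum = 0;
    while (n != 0) {
        sum += n & 63u;   /* lowest base-64 digit (6 bits) */
        n >>= 6;          /* drop it */
    }
    return sum;
}

int divisible_by_7(uint32_t n)
{
    /* Reduce to a single base-64 digit; the value modulo 63,
       and therefore modulo 7, is preserved at every step. */
    while (n > 63)
        n = digit_sum_base64(n);

    /* Only ten multiples of 7 remain possible: 0, 7, ..., 63. */
    for (uint32_t m = 0; m <= 63; m += 7)
        if (n == m)
            return 1;
    return 0;
}
```

The final loop could be replaced by a 64-entry table or a bitmask lookup; it is spelled out here only to keep the sketch obvious.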
Solution 10:[10]
A method that can help modulo reduction of all integer values uses bit-slicing and popcount.
mod3 = pop(x & 0x55555555) + (pop(x & 0xaaaaaaaa) << 1); // <- one term is shared!
mod5 = pop(x & 0x99999999) + (pop(x & 0xaaaaaaaa) << 1) + (pop(x & 0x44444444) << 2);
mod7 = pop(x & 0x49249249) + (pop(x & 0x92492492) << 1) + (pop(x & 0x24924924) << 2);
modB = pop(x & 0x5d1745d1) + (pop(x & 0xba2e8ba2) << 1) +
       (pop(x & 0x294a5294) << 2) + (pop(x & 0x0681a068) << 3);
modD = pop(x & 0x91b91b91) + (pop(x & 0xb2cb2cb2) << 1) +
       (pop(x & 0x64a64a64) << 2) + (pop(x & 0xc85c85c8) << 3);
The maximum values for these variables are 48, 80, 73, 168 and 203, which all fit into 8-bit variables. The second round can be carried out in parallel (or some LUT method can be applied):
var   mod3 mod3 mod5 mod5 mod5 mod7 mod7 mod7 modB modB modB modB modD modD modD modD
mask  0x55 0xaa 0x99 0xaa 0x44 0x49 0x92 0x24 0xd1 0xa2 0x94 0x68 0x91 0xb2 0x64 0xc8
shift *1   *2   *1   *2   *4   *1   *2   *4   *1   *2   *4   *8   *1   *2   *4   *8
sum   <-------> <------------> <------------> <-----------------> <----------------->
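Putting the rounds together for 11, one possible sketch is below. It simply reapplies the 32-bit masks from above until the value fits in four bits instead of running the parallel 8-bit second round, and pop is written as a portable loop; both are my simplifications.

```c
#include <stdint.h>

/* Population count (portable loop; use a hardware popcount if available). */
int pop(uint32_t v)
{
    int c = 0;
    while (v != 0) {
        v &= v - 1;
        ++c;
    }
    return c;
}

/* One round: every bit i contributes 2^i mod 11, split by the
   binary digits of that weight. The result is congruent to x mod 11. */
uint32_t mod11_round(uint32_t x)
{
    return (uint32_t)(pop(x & 0x5d1745d1u)
                   + (pop(x & 0xba2e8ba2u) << 1)
                   + (pop(x & 0x294a5294u) << 2)
                   + (pop(x & 0x0681a068u) << 3));
}

int divisible_by_11(uint32_t x)
{
    /* For x > 15 a round strictly decreases x, so this terminates. */
    while (x > 15)
        x = mod11_round(x);
    return x == 0 || x == 11;
}
```

The first round brings any 32-bit input down to at most 168, so only a few rounds are ever needed.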
Solution 11:[11]
You can replace division by a non-power-of-two constant by a multiplication, essentially multiplying by the reciprocal of your divisor. The details to get the exact result by this method are complicated.
Hacker's Delight discusses this at length in chapter 10 (unfortunately not available online).
From the quotient you can get the modulus by another multiplication and a subtraction.
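For the divisor 3 and an unsigned 32-bit input, the whole recipe fits in a few lines. This is a sketch using the well-known constant 0xAAAAAAAB, which is ceil(2^33 / 3); other divisors need their own constant and shift, which is where the complicated details come in.

```c
#include <stdint.h>

/* n / 3 without a divide: multiply by ceil(2^33 / 3), keep the top bits. */
uint32_t div3(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33);
}

/* Remainder from the quotient: one more multiplication and a subtraction. */
uint32_t mod3(uint32_t n)
{
    return n - div3(n) * 3u;
}
```

The divisibility test is then mod3(n) == 0, at the cost of two multiplications.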
Solution 12:[12]
One thing to consider: since you only care about divisibility up to 16, you really only need to check divisibility by the primes up to 16. These are 2, 3, 5, 7, 11, and 13.
Divide your number by each of the primes, keeping track with a boolean (such as div2 = true). The numbers two and three are special cases. If div3 is true, try dividing by 3 again, setting div9. Two and its powers are very simple (note: '&' is one of the fastest things a processor can do):
if (n & 1) == 0:
    div2 = true
if (n & 3) == 0:
    div4 = true
if (n & 7) == 0:
    div8 = true
if (n & 15) == 0:
    div16 = true
You now have the booleans div2, div3, div4, div5, div7, div8, div9, div11, div13, and div16. All other numbers are combinations; for instance div6 is the same as (div2 && div3)
So, you only need to do either 5 or 6 actual divisions (6 only if your number is divisible by 3).
For myself, I would probably use bits in a single register for my booleans; for instance bit_0 means div2. I can then use masks:
if (flags & (div2+div3)) == (div2 + div3): do_6()
note that div2+div3 can be a precomputed constant. If div2 is bit0, and div3 is bit1, then div2+div3 == 3. This makes the above 'if' optimize to:
if (flags & 3) == 3: do_6()
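In C, the flag-register idea might be sketched like this. The bit assignments and function names are arbitrary choices of mine, and plain % is used for the odd primes only as a placeholder for whichever remainder routine you prefer.

```c
#include <stdint.h>

/* One bit per basic test; the positions are an arbitrary choice. */
enum {
    DIV2  = 1 << 0, DIV3  = 1 << 1, DIV4  = 1 << 2, DIV5  = 1 << 3,
    DIV7  = 1 << 4, DIV8  = 1 << 5, DIV9  = 1 << 6, DIV11 = 1 << 7,
    DIV13 = 1 << 8, DIV16 = 1 << 9
};

unsigned basic_flags(uint32_t n)
{
    unsigned f = 0;
    if ((n & 1u) == 0)  f |= DIV2;    /* powers of two: masks only */
    if ((n & 3u) == 0)  f |= DIV4;
    if ((n & 7u) == 0)  f |= DIV8;
    if ((n & 15u) == 0) f |= DIV16;
    if (n % 3u == 0) {
        f |= DIV3;
        if ((n / 3u) % 3u == 0)       /* the sixth division, only when 3 divides n */
            f |= DIV9;
    }
    if (n % 5u == 0)  f |= DIV5;
    if (n % 7u == 0)  f |= DIV7;
    if (n % 11u == 0) f |= DIV11;
    if (n % 13u == 0) f |= DIV13;
    return f;
}

/* True when every bit in wanted is set in flags. */
int has_all(unsigned flags, unsigned wanted)
{
    return (flags & wanted) == wanted;
}
```

Divisibility by 6 is then has_all(f, DIV2 | DIV3), by 12 has_all(f, DIV3 | DIV4), and so on, with the masks folding to constants at compile time.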
So now... mod without a divide:
def mod(n,m):
    i = 0
    while m <= n:        # shift m up until it exceeds n
        m <<= 1
        i += 1
    while i > 0:         # subtract the shifted copies back down
        m >>= 1
        if m <= n: n -= m
        i -= 1
    return n
div3 = mod(n,3) == 0
...
btw: the worst case for the above code is 31 times through either loop for a 32-bit number
FYI: Just looked at Msalter's post, above. His technique can be used instead of mod(...) for some of the primes.
Solution 13:[13]
Fast tests for divisibility depend heavily on the base in which the number is represented. When the base is 2, I think you can only do "fast tests" for divisibility by powers of 2. A binary number is divisible by 2^n iff the last n binary digits of that number are 0. For other tests I don't think you can generally find anything faster than %.
Solution 14:[14]
A bit of evil, obfuscated bit-twiddling can get you divisibility by 15.
For a 32-bit unsigned number:
unsigned int mod_15ish(unsigned int x) {
// returns a number between 0 and 21 that is either x % 15
// or 15 + (x % 15), and returns 0 only for x == 0
x = (x & 0xF0F0F0F) + ((x >> 4) & 0xF0F0F0F);
x = (x & 0xFF00FF) + ((x >> 8) & 0xFF00FF);
x = (x & 0xFFFF) + ((x >> 16) & 0xFFFF);
// *1
x = (x & 0xF) + ((x >> 4) & 0xF);
return x;
}
int Divisible_by_15(unsigned int x) {
return ((x == 0) || (mod_15ish(x) == 15));
}
You can build similar divisibility routines for 3 and 5 based on mod_15ish.
If you have 64-bit unsigned ints to deal with, extend each constant above the *1 line in the obvious way, and add a line above the *1 line to do a right shift by 32 bits with a mask of 0xFFFFFFFF. (The last two lines can stay the same) mod_15ish then obeys the same basic contract, but the return value is now between 0 and 31. (so what's maintained is that x % 15 == mod_15ish(x) % 15)
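One way the routines for 3 and 5 could be built (a sketch, not necessarily what was originally intended): since 15 is a multiple of both, the value returned by mod_15ish is also congruent to x modulo 3 and modulo 5, and with only 22 possible results a constant bitmask can act as the lookup table. The two mask constants are mine.

```c
#include <stdint.h>

/* Same reduction as mod_15ish above (32-bit version): the result is
   in 0..21 and congruent to x modulo 15. */
uint32_t mod_15ish(uint32_t x)
{
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
    x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
    x = (x & 0xFu) + ((x >> 4) & 0xFu);
    return x;
}

/* Bit r of 0x249249 is set for r = 0, 3, 6, ..., 21. */
int divisible_by_3(uint32_t x)
{
    return (int)((0x249249u >> mod_15ish(x)) & 1u);
}

/* Bit r of 0x108421 is set for r = 0, 5, 10, 15, 20. */
int divisible_by_5(uint32_t x)
{
    return (int)((0x108421u >> mod_15ish(x)) & 1u);
}
```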
Solution 15:[15]
Here are some tips I haven't seen anyone else suggest yet:
One idea is to use a switch statement, or precompute some array. Then, any decent optimizer can simply index each case directly. For example:
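// tests for (2,3,4,5,6,7); valid as written only for 0 <= n < 8, where n % 8 == n
// (a general n would need a table indexed by n % 420, the LCM of 2..7)
switch (n % 8)
{
case 0: break;
case 1: break;
case 2: do(2); break;
case 3: do(3); break;
case 4: do(2); do(4); break;
case 5: do(5); break;
case 6: do(2); do(3); do(6); break;
case 7: do(7); break;
}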
Your application is a bit ambiguous, but you may only need to check the prime numbers less than n=16. This is because every other number up to 16 is a product of smaller primes (or a power of one). So for n=16, you might be able to get away with only checking 2, 3, 5, 7, 11, 13 somehow, plus the prime powers 4, 8, 9 and 16. Just a thought.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
