'Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:

unsigned div3(unsigned x) {
    return x / 3;
}

When compiled using clang 10 at -O3, this results in:

div3(unsigned int):
        mov     ecx, edi         # tmp = x
        mov     eax, 2863311531  # result = 3^-1
        imul    rax, rcx         # result *= tmp
        shr     rax, 33          # result >>= 33
        ret

What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3-1 mod 232 which is 2863311531.

There are some things that I don't understand though:

  1. Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
  2. Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
  3. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
  4. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

Edit 1

For those who don't understand what I mean by 3-1 mod 232, I am talking about the multiplicative inverse here. For example:

// multiplying with inverse of 3:
15 * 2863311531      = 42949672965
42949672965 mod 2^32 = 5

// using fixed-point multiplication
15 * 2863311531      = 42949672965
42949672965 >> 33    = 5

// simply dividing by 3
15 / 3               = 5

So multiplying with 42949672965 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.

Edit 2

I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3-1 is equal to 3-1, not zero. Only fixed point arithmetic has correct rounding.

Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.

unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}

(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)



Solution 1:[1]

What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.

Instead of 3^(-1) mod 3 you have to think more about 0.3333333 where the 0 before the . is located in the upper 32 bit and the the 3333 is located in the lower 32 bit. This fixed point operation works fine, but the result is obviously shifted to the upper part of rax, therefor the CPU must shift the result down again after the operation.

Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.

There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:

a <= a * b

There is no MUL instruction that does that. MUL instructions are more expensive because they store the result as 128 Bit in two registers. Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.

Solution 2:[2]

If you look at my answer to the prior question:

Why does GCC use multiplication by a strange number in implementing integer division?

It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):

https://gmplib.org/~tege/divcnst-pldi94.pdf

Note that one extra bit of precision is needed for some divisors, such as 7, the multiplier would normally require 33 bits, and the product would normally require 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions as shown in my prior answer and below.

Take a look at the generated code if you change to

unsigned div7(unsigned x) {
    return x / 7;
}

So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.

To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.

mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L)        )/3 = 5726623061

Then a check is made to see if the number of required bits can be reduced:

while((L > 0) && ((mhi>>1) > (mlo>>1))){
    mhi = mhi>>1;
    mlo = mlo>>1;
    L   = L-1;
}
if(mhi >= 2^32){
    mhi = mhi-2^32
    L   = L-1;
    ; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530  (mhi>>1) > (mlo>>1)
... mhi    = mhi>>1 = 2863311531
... mlo    = mhi>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)

So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.

On an modern X86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to less than 32 bits, so that while(...) above is changed to an if(...).

In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:

L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757

Let ecx = dividend, the simple approach could overflow on the add:

mov eax, 613566757             ; eax = mhi
mul ecx                        ; edx:eax = ecx*mhi
add edx, ecx                   ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3

To avoid the potential overflow, note that eax = eax*2 - eax:

(ecx*eax)           =   (ecx*eax)<<1)             -(ecx*eax)
(ecx*(eax+2^32))    =   (ecx*eax)<<1)+  (ecx*2^32)-(ecx*eax)
(ecx*(eax+2^32))>>3 =  ((ecx*eax)<<1)+  (ecx*2^32)-(ecx*eax)     )>>3
                    = (((ecx*eax)   )+(((ecx*2^32)-(ecx*eax))>>1))>>2

so the actual code, using u32() to mean upper 32 bits:

...                 visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx                        ; edx = u32( (ecx*eax) )
sub ecx, edx                   ; ecx = u32(            ((ecx*2^32)-(ecx*eax))        )
shr ecx, 1                     ; ecx = u32(           (((ecx*2^32)-(ecx*eax))>>1)    )
lea eax, DWORD PTR [edx+ecx]   ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1)    )
shr eax, 2                     ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)

If a remainder is wanted, then the following steps can be used:

mhi and L are generated based on divisor during compile time
...
quotient  = (x*mhi)>>(32+L)
product   = quotient*divisor
remainder = x - product

Solution 3:[3]

x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.

There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ? (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 gnasher729