Implementing SHLD/SHRD instructions in C

I'm trying to implement the x86 SHLD and SHRD instructions efficiently in C, without using inline assembly.

uint32_t shld_UB_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> 32 - c;
}

seems to work, but invokes undefined behaviour when c == 0, because the second shift's count becomes 32, which equals the width of uint32_t. The actual SHLD instruction with a count of 0 is well defined: it leaves the destination unchanged. (https://www.felixcloutier.com/x86/shld)

uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> (-c & 31);
}

doesn't invoke undefined behaviour, but when c == 0 the right-shift count (-c & 31) is also 0, so b is not shifted out and the result is a | b instead of a.
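The c == 0 failure can be reproduced with a small check. This is only a sketch: the function is copied from above with added parentheses, and check_masked_variant is a hypothetical helper name, not part of any library. It assumes c stays in the range 0..31.

```c
#include <assert.h>
#include <stdint.h>

/* Masked-count variant from above. Unsigned negation is well defined,
   so (-c & 31) is 32 - c for c in 1..31, but 0 (not 32) for c == 0. */
uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return (a << c) | (b >> (-c & 31));
}

/* Hypothetical helper that contrasts the c == 0 case with a normal count. */
void check_masked_variant(void) {
    const uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;
    /* c == 0: b is shifted right by 0, so it is ORed in unchanged. */
    assert(shld_broken_on_0(a, b, 0) == (a | b));
    /* c == 4: the top 4 bits of b enter at the bottom of a, as SHLD does. */
    assert(shld_broken_on_0(a, b, 4) == 0x23456789u);
}
```

Calling check_masked_variant() passes both assertions, which confirms that only the zero count misbehaves.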

uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
    if (c == 0) return a;
    return a << c | b >> 32 - c;
}

does what's intended, but gcc now emits a conditional jump (je). clang, on the other hand, is smart enough to translate it to a single shld instruction.

Is there any way to implement it correctly and efficiently without inline assembly?

And why does gcc try so hard to avoid emitting shld? gcc 11.2 with -O3 translates the shld_safe attempt as follows (Godbolt):

shld_safe:
        mov     eax, edi
        test    edx, edx
        je      .L1
        mov     ecx, 32
        sub     ecx, edx
        shr     esi, cl
        mov     ecx, edx
        sal     eax, cl
        or      eax, esi
.L1:
        ret

while clang produces:

shld_safe:
        mov     ecx, edx
        mov     eax, edi
        shld    eax, esi, cl
        ret


Solution 1:[1]

As far as I have tested with gcc 9.3 (x86-64), it translates the following code to shldq and shrdq.

uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low) << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low) >> (count & 63));
}
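To make the edge cases of the two functions above easier to see, here is a sketch that splits the expression into a named 128-bit intermediate and checks a few values. It assumes a compiler that provides unsigned __int128 (a gcc/clang extension); check_shift_double_64 is a hypothetical helper name. It says nothing about which instructions a given compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* Shift the 128-bit value high:low left and keep the upper 64 bits. */
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
    unsigned __int128 wide = ((unsigned __int128)high << 64) | low;
    return (uint64_t)((wide << (count & 63)) >> 64);
}

/* Shift the 128-bit value high:low right and keep the lower 64 bits. */
uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
    unsigned __int128 wide = ((unsigned __int128)high << 64) | low;
    return (uint64_t)(wide >> (count & 63));
}

/* Hypothetical helper: a count of 0 must return the destination half unchanged. */
void check_shift_double_64(void) {
    const uint64_t low = 0x0123456789ABCDEFull, high = 0xFEDCBA9876543210ull;
    assert(shldq_x64(low, high, 0) == high);
    assert(shrdq_x64(low, high, 0) == low);
}
```

Because the shift happens in a type twice as wide as the operands, a count of 0 needs no special case: nothing is shifted by the full operand width.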

Also, gcc -m32 -O3 translates the following code to shld and shrd. (I have not tested with gcc (i386), though.)

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

(I wrote the above functions after reading the gcc source code, so I'm not sure they are exactly what you are looking for.)
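Note that the parameter order differs from the question: shld_x86 takes (low, high, count) and returns the shifted high word, so the question's a corresponds to high and b to low. The following sketch checks that mapping against shld_safe for every count from 0 to 31. shld_variants_agree is a hypothetical helper written for this comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Reference from the question: special-cases c == 0 to avoid a shift by 32. */
uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
    if (c == 0) return a;
    return (a << c) | (b >> (32 - c));
}

/* Widening version from this answer: shift high:low left, keep the upper half. */
uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
    return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}

/* Hypothetical helper: returns 1 if both versions agree for all counts 0..31
   when the question's a is passed as high and b as low, otherwise 0. */
int shld_variants_agree(uint32_t a, uint32_t b) {
    for (uint32_t c = 0; c < 32; ++c) {
        if (shld_safe(a, b, c) != shld_x86(b, a, c)) return 0;
    }
    return 1;
}

/* Hypothetical helper that runs the comparison on one pair of inputs. */
void check_shld_mapping(void) {
    assert(shld_variants_agree(0x12345678u, 0x9ABCDEF0u));
}
```

So a drop-in replacement for shld_safe(a, b, c) would be shld_x86(b, a, c), at least for counts in the range 0..31.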

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Hironori Bono