Modifying low byte of 64-bit variable like rax/al registers without compiler overhead

I have a performance-critical piece of code in which uint8_t variables (guaranteed not to overflow) are incremented by other uint8_t values and also used in array indexing and other 64-bit address calculations.

I have looked at the disassembly produced by MSVC with full optimizations, and annoyingly there are plenty of excessive movzx and other unnecessary instructions for converting between 8-bit and 64-bit operations, no matter how I write it. If I use an 8-bit variable, the address calculations perform additional zero extensions through temporary registers, and if I use a 64-bit variable, there are similar extra operations for the 8-bit values that are added to it.

If this were written in assembly, there would be no problem, as the value could be accessed simultaneously through e.g. the rax and al registers as needed. Is there some way to access the low byte (especially for performing additions) of a uint64_t variable in C++ so that MSVC is smart enough to compile it to a simple direct al register access (e.g. add al, other_uint8_var) while the full variable lives in the rax register?

I have tried several alternatives: bit masking the high/low parts to emulate a low-byte change, aliasing through a union of 8-bit and 64-bit values, aliasing the 64-bit value through a temporary 8-bit reference variable, etc. All of them just led to worse results, often with the variable moved from its register to a temporary memory location to perform the change.


Simplest example:

#include <stdint.h>
unsigned idx(unsigned *ptr, uint8_t *v)
{
    uint8_t tmp = v[1] + v[2];       // truncate the sum to 8-bit
    return ptr[tmp];                 // before using with a 64-bit pointer
}

All compilers (Godbolt: GCC 11 / clang 14 / MSVC 19.31 / ICC 19.01) do a bad job here, wasting a movzx eax, al that can't even benefit from mov-elimination for zero latency, because they widen within the same register. MSVC 19.31 with -O2 compiles it to:

unsigned int idx(unsigned int *,unsigned char *) PROC                                ; idx, COMDAT
        movzx   eax, BYTE PTR [rdx+2]
        add     al, BYTE PTR [rdx+1]
        movzx   eax, al                       ;; fully redundant, value already truncated to 8-bit and zero-extended to 64
        mov     eax, DWORD PTR [rcx+rax*4]
        ret     0

Clang/LLVM actually does even worse, starting with a mov al, [mem] load that has a false dependency on the old value of RAX (on CPUs other than P6-family and first-generation Sandy Bridge). But it saves one byte of machine-code size.
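For reference, simply deleting the flagged movzx gives the hand-written code one would hope for: bits 63:8 of rax are already zeroed by the first movzx load, and add al discards any carry out of bit 7 exactly as the uint8_t truncation requires:

        movzx   eax, BYTE PTR [rdx+2]
        add     al, BYTE PTR [rdx+1]          ;; carry past bit 7 is dropped, matching the uint8_t wrap
        mov     eax, DWORD PTR [rcx+rax*4]    ;; bits 63:8 of rax still zero from the movzx load
        ret     0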


Further examples of how MSVC compiles:

The following runnable program generates 3 random integers which are summed to form an array index. The interesting part is placed in the test() function, to force the compiler to use the desired argument types and to keep it separate from the otherwise inlined code for easy viewing of the assembly.

#include <iostream>
#include <cstdlib>
#include <cstdint>

static constexpr uint64_t array[30]{ 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 };

struct Test {
    __declspec(noinline) static uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
        uint64_t index = base;
        index += added1;
        index += added2;
        return array[index];
    }
};

int main()
{
    uint64_t base = rand() % 10;
    uint8_t added1 = rand() % 10;
    uint8_t added2 = rand() % 10;

    uint64_t result = Test::test(base, added1, added2);

    std::cout << "array[" << base << "+" << (uint64_t)added1 << "+" << (uint64_t)added2 << "]=" << result << std::endl;
    return 0;
}

The above test function with a uint64_t base index compiles to:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    movzx       edx,dl  
    add         rcx,rdx  
    movzx       eax,r8b  
    add         rax,rcx  
    lea         rcx,[array (07FF74FD63340h)]  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

The compiler has assigned rcx=base, dl=added1, r8b=added2. The uint8_t values are separately zero-extended before summing.

Changing the base index to uint8_t, the function compiles as follows:

uint64_t test(uint8_t base, uint8_t added1, uint8_t added2) {
    uint8_t index = base;
    index += added1;
    index += added2;
    return array[index];
}

uint64_t test(uint8_t base, uint8_t added1, uint8_t added2) {
    add         cl,dl  
    add         cl,r8b  
    movzx       eax,cl  
    lea         rcx,[array (07FF6287C3340h)]  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

So now the compiler is happy to do the math in 8-bit registers, but it needs to separately zero-extend the result for addressing.

What I would like to get is basically the above without the movzx, so that rcx would be used directly as the array offset, since I know all except the lowest byte are already zero. Obviously the compiler cannot do that automatically, since unlike me, it doesn't know the additions can't cause overflow. So what I'm missing is a way to tell it just that.
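In other words, the hoped-for output would be something like this (hand-written, valid only under the no-overflow guarantee):

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    add         cl,dl  
    add         cl,r8b  
    lea         rax,[array]  
    mov         rax,qword ptr [rax+rcx*8]  
    ret  
}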

If I try to cast the target variable to 8-bit (which tends to work for read operations), or use a union to model the variable like the rcx/cl register pair, the compiler goes through a stack variable:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    reinterpret_cast<uint8_t&>(index) += added1;
    reinterpret_cast<uint8_t&>(index) += added2;
    return array[index];
}

OR:

union Register {
    uint64_t u64;
    uint8_t u8;
};

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    Register index;
    index.u64 = base;
    index.u8 += added1;
    index.u8 += added2;
    return array[index.u64];
}

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    mov         qword ptr [rsp+8],rcx  
    add         cl,dl  
    add         cl,r8b  
    mov         byte ptr [index],cl  
    lea         rcx,[array (07FF798B53340h)]  
    mov         rax,qword ptr [index]  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

Attempting to indicate my intentions through some bit masking compiles to:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    index = (index & 0xffffffffffffff00) | (uint8_t)(index + added1);
    index = (index & 0xffffffffffffff00) | (uint8_t)(index + added2);
    return array[index];
}

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    lea         eax,[rdx+rcx]  
    and         rcx,0FFFFFFFFFFFFFF00h  
    movzx       edx,al  
    or          rdx,rcx  
    lea         rcx,[array (07FF70F4B3340h)]  
    lea         eax,[r8+rdx]  
    and         rdx,0FFFFFFFFFFFFFF00h  
    movzx       eax,al  
    or          rax,rdx  
    mov         rax,qword ptr [rcx+rax*8]  
    ret  
}

I believe a similar pattern works on some compilers for overwriting (instead of adding to) the low byte (see the sketch after the next listing), but here the compiler clearly failed to recognize it. Another similar pattern generates this:

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    uint64_t index = base;
    index = (index & 0xffffffffffffff00) | (((index & 0xff) + added1) & 0xff);
    index = (index & 0xffffffffffffff00) | (((index & 0xff) + added2) & 0xff);
    return array[index];
}

uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    movzx       eax,dl  
    add         rax,rcx  
    xor         rax,rcx  
    movzx       edx,al  
    xor         rdx,rcx  
    movzx       eax,r8b  
    add         rax,rdx  
    lea         rcx,[array (07FF6859F3340h)]  
    xor         rax,rdx  
    movzx       eax,al  
    xor         rax,rdx  
    mov         rax,qword ptr [rcx+rax*8]  
    ret
}
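For reference, the overwrite idiom mentioned above looks like the sketch below (the function name is just illustrative). It only replaces the low byte rather than adding to it, and whether a given compiler pattern-matches it into a direct byte-register mov is not guaranteed:

// Overwrite (not add to) the low byte of x; some compilers reportedly
// recognize this pattern, but that is not guaranteed.
uint64_t set_low_byte(uint64_t x, uint8_t b) {
    return (x & 0xffffffffffffff00) | b;   // replace bits 7:0 of x with b
}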


Solution 1:

Why not simply write inline assembler code? For really highly critical code, it may be the only solution... I had to do that for reading from a CompactFlash card on a bare-metal platform, for example: C or C++ code simply wasn't able to match the required timings.

Benchmarking that assembler code may also help you see whether you really gain performance - or not! - and give you a goal and/or an indication of what you can expect to reach.

If you need portability, you can still use conditional compilation, with optimized inline assembler for all known targets, and leave the best C/C++ code you have for every other case - it won't be perfect, but not all CPUs have partial registers anyway, so generic C code can do the trick on those platforms.
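A minimal sketch of that idea, using GCC/Clang extended inline asm since 64-bit MSVC does not support inline asm at all (there the same two instructions would have to live in a separate .asm file). It is meant as a drop-in replacement for the question's Test::test() (minus the MSVC-specific noinline annotation), and it relies entirely on the stated no-overflow guarantee:

static uint64_t test(uint64_t base, uint8_t added1, uint8_t added2) {
    // %b[idx] selects the byte subregister of whatever register holds
    // base (e.g. cl for rcx), so both adds touch only the low byte of
    // the 64-bit index. Any carry out of bit 7 is discarded, which is
    // only correct because the sums are guaranteed not to overflow.
    asm("addb %[a1], %b[idx]\n\t"
        "addb %[a2], %b[idx]"
        : [idx] "+q"(base)
        : [a1] "q"(added1), [a2] "q"(added2)
        : "cc");
    return array[base];   // upper bits of base were never touched
}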

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
