A64 Neon SIMD - 256-bit comparison
I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently.
Equality (=)
For equality, I already got a solution:
bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
bool result;
First, load the two values into SIMD registers.
__asm__("ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
Compare each 64-bit limb of the values with each other. This results in -1 (all bits set) for those limbs that are equal, and 0 (all bits clear) if a bit differs.
"cmeq.2d v0, v0, v2 \n\t"
"cmeq.2d v1, v1, v3 \n\t"
Reduce the result from two vectors to one vector with a bytewise pairwise minimum, so a zero byte (a differing limb) survives if there is any.
"uminp.16b v0, v0, v1 \n\t"
Reduce the result from one vector to a single byte with an unsigned minimum across the vector, keeping a zero byte if there is any.
"uminv.16b b0, v0 \n\t"
Move to ARM register, then compare with 0xFF. This is the result.
"umov %w0, v0.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq "
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "cc");
return result;
}
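For reference, roughly the same sequence can be written with NEON intrinsics from arm_neon.h instead of inline asm. This is only a sketch: it assumes UInt256 is a struct holding uint64_t value[4] (least significant limb first), the function name is made up, and the compiler is free to schedule the operations differently from the hand-written version.

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value[4]; } UInt256;  // assumed layout

static inline bool eq256_intrinsics(const UInt256 *lhs, const UInt256 *rhs) {
    uint64x2_t l_lo = vld1q_u64(&lhs->value[0]);
    uint64x2_t l_hi = vld1q_u64(&lhs->value[2]);
    uint64x2_t r_lo = vld1q_u64(&rhs->value[0]);
    uint64x2_t r_hi = vld1q_u64(&rhs->value[2]);
    // cmeq.2d: all-ones per limb if equal, all-zeros if any bit differs
    uint8x16_t eq_lo = vreinterpretq_u8_u64(vceqq_u64(l_lo, r_lo));
    uint8x16_t eq_hi = vreinterpretq_u8_u64(vceqq_u64(l_hi, r_hi));
    // uminp.16b: pairwise minimum, so any zero byte survives the merge
    uint8x16_t merged = vpminq_u8(eq_lo, eq_hi);
    // uminv.16b: the horizontal minimum is 0xFF only if every limb compared equal
    return vminvq_u8(merged) == 0xFF;
}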
Questions
Is this more efficient than doing the 4 comparisons with plain old ARM registers?
- e.g. is there a source that quotes timings for the different operations? I'm doing this on an iPhone 5s.
Is there a way to optimize this even further? I think that I waste many cycles just to reduce the whole vector to a single scalar boolean.
Less Than comparison (<)
Let's represent the two ints as tuples of 64-bit limbs (little-endian):
- lhs = (l0, l1, l2, l3)
- rhs = (r0, r1, r2, r3)
Then, lhs < rhs, if this evaluates to true:
(l3 < r3) & 1 & 1 & 1 |
(l3 = r3) & (l2 < r2) & 1 & 1 |
(l3 = r3) & (l2 = r2) & (l1 < r1) & 1 |
(l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
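As a plain-C reference for this formula (a sketch only, assuming the limbs are stored little-endian in uint64_t value[4], so index 3 is the most significant limb; the function name is made up):

#include <stdbool.h>
#include <stdint.h>

// Branchless scalar version of the expression above.
static inline bool lt256_scalar(const uint64_t l[4], const uint64_t r[4]) {
    return (l[3] <  r[3]) |
           ((l[3] == r[3]) & (l[2] <  r[2])) |
           ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] <  r[1])) |
           ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] == r[1]) & (l[0] < r[0]));
}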
SIMD instructions can now be used to evaluate multiple operands at a time. Assuming the two 256-bit numbers are stored as the limb pairs (l0, l1), (l2, l3) and (r0, r1), (r2, r3), we can easily get all of the required values (everything except l0 = r0 is needed):
- cmlo.2d => (l0 < r0), (l1 < r1)
- cmlo.2d => (l2 < r2), (l3 < r3)
- cmeq.2d => (l0 = r0), (l1 = r1)
- cmeq.2d => (l2 = r2), (l3 = r3)
Questions
- With these values in four SIMD registers, I now wonder what the best strategy is to apply the & and | operators, and then reduce the result to a single boolean.
Update
I just punched together a working implementation for "less than".
Basically, I replaced the 1s above with a duplicate condition, because A & A == A & 1.
Then, I lay out the three 2x2 squares in my matrix, and bitwise AND them. Now, I reduce with bitwise ORs - first from two vectors to one vector, then to one byte, then copy to ARM register, and test for 0xFF. Same pattern as for equality above.
The question above still stands: I'm not sure whether the code is optimal yet, and I wonder if I missed some general SIMD pattern for doing this kind of thing more efficiently. Also: is NEON worth it for such cases, when the input operands come from memory?
bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
bool result;
__asm__(// (l3 < r3) & (l3 < r3) |
// (l3 = r3) & (l2 < r2) |
// (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
// (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
"ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
// v0: [ l3 = r3 ] [ l2 = r2 ]
// v1: [ l0 < r0 ] [ l1 < r1 ]
// v2: [ l0 = r0 ] [ l1 = r1 ]
// v3: [ l2 < r2 ] [ l3 < r3 ]
// v4: [ l2 = r2 ] [ l3 = r3 ]
"cmeq.2d v4, v1, v3 \n\t"
"cmlo.2d v3, v1, v3 \n\t"
"cmlo.2d v1, v0, v2 \n\t"
"cmeq.2d v2, v0, v2 \n\t"
"ext.16b v0, v4, v4, 8 \n\t"
// v2: [ l1 < r1 ] [ l1 = r1 ]
// v1: [ l1 < r1 ] [ l0 < r0 ]
"trn2.2d v2, v1, v2 \n\t"
"ext.16b v1, v1, v1, 8 \n\t"
// v1: [ l1 < r1 & l1 < r1 ] [ l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v2: [ l3 < r3 ] [ l3 = r3 ]
// v3: [ l3 < r3 ] [ l2 < r2 ]
"ext.16b v2, v3, v0, 8 \n\t"
"ext.16b v3, v3, v3, 8 \n\t"
// v3: [ l3 < r3 & l3 < r3 ] [ l3 = r3 & l2 < r2 ]
"and.16b v3, v2, v3 \n\t"
// v2: [ l3 = r3 ] [ l3 = r3 ]
// v4: [ l2 = r2 ] [ l2 = r2 ]
"ext.16b v2, v4, v0, 8 \n\t"
"ext.16b v4, v0, v4, 8 \n\t"
// v2: [ l3 = r3 & l2 = r2 ] [ l3 = r3 & l2 = r2 ]
"and.16b v2, v2, v4 \n\t"
// v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
// [ l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v1: [ l3 < r3 & l3 < r3 |
// l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
// [ l3 = r3 & l2 < r2 |
// l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"orr.16b v1, v3, v1 \n\t"
// b1: [ l3 < r3 & l3 < r3 |
// l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
// l3 = r3 & l2 < r2 |
// l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"umaxv.16b b1, v1 \n\t"
"umov %w0, v1.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq"
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "v4", "cc");
return result;
}
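For comparison, here is one alternative strategy written with intrinsics: do the four vector compares, then pull the per-limb masks back into general-purpose registers and combine them there instead of shuffling inside the vector unit. This is only a sketch (same assumed UInt256 layout and a made-up function name as in the equality sketch above); whether it beats the ext/trn2 reduction would have to be measured.

#include <arm_neon.h>

static inline bool lt256_intrinsics(const UInt256 *lhs, const UInt256 *rhs) {
    uint64x2_t l_lo = vld1q_u64(&lhs->value[0]);   // l0, l1
    uint64x2_t l_hi = vld1q_u64(&lhs->value[2]);   // l2, l3
    uint64x2_t r_lo = vld1q_u64(&rhs->value[0]);
    uint64x2_t r_hi = vld1q_u64(&rhs->value[2]);

    uint64x2_t lt_lo = vcltq_u64(l_lo, r_lo);      // l0 < r0, l1 < r1
    uint64x2_t lt_hi = vcltq_u64(l_hi, r_hi);      // l2 < r2, l3 < r3
    uint64x2_t eq_lo = vceqq_u64(l_lo, r_lo);      // l0 = r0, l1 = r1
    uint64x2_t eq_hi = vceqq_u64(l_hi, r_hi);      // l2 = r2, l3 = r3

    // The masks are all-ones or all-zeros, so bitwise ops act as logical ops.
    uint64_t lt0 = vgetq_lane_u64(lt_lo, 0), lt1 = vgetq_lane_u64(lt_lo, 1);
    uint64_t lt2 = vgetq_lane_u64(lt_hi, 0), lt3 = vgetq_lane_u64(lt_hi, 1);
    uint64_t eq1 = vgetq_lane_u64(eq_lo, 1);
    uint64_t eq2 = vgetq_lane_u64(eq_hi, 0), eq3 = vgetq_lane_u64(eq_hi, 1);

    return (lt3 | (eq3 & (lt2 | (eq2 & (lt1 | (eq1 & lt0)))))) != 0;
}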
Solution 1:[1]
Benchmarked with XCTest measureMetrics, using a Swift-based test runner. Two 256-bit ints are allocated; then the operation under test is repeated 100 million times on the same two ints, measurement is stopped, and each limb of the two ints is assigned a new random value with arc4random. A second run is performed with Instruments attached, and the CPU time distribution is noted for each instruction as a comment right next to it.
Equality (==)
For equality, SIMD seems to lose when the result is transferred from the SIMD registers back to the ARM register. SIMD is probably only worth it when the result is used in further SIMD calculations, or if longer ints than 256-bit are used (ld1 seems to be faster than ldp).
SIMD
bool result;
__asm__("ld1.2d    { v0, v1 }, %1    \n\t"   //  5.1%
        "ld1.2d    { v2, v3 }, %2    \n\t"   // 26.4%
        "cmeq.2d   v0, v0, v2        \n\t"
        "cmeq.2d   v1, v1, v3        \n\t"
        "uminp.16b v0, v0, v1        \n\t"   //  4.0%
        "uminv.16b b0, v0            \n\t"   // 26.7%
        "umov      %w0, v0.b[0]      \n\t"   // 32.9%
        "cmp       %w0, 0xFF         \n\t"   //  0.0%
        "cset      %w0, eq               "
        : "=r" (result)
        : "m" (*lhs->value), "m" (*rhs->value)
        : "v0", "v1", "v2", "v3", "cc");
return result;                               //  4.9% ("ret")

measured [Time, seconds] average: 11.558, relative standard deviation: 0.065%, values: [11.572626, 11.560558, 11.549322, 11.568718, 11.558530, 11.550490, 11.557086, 11.551803, 11.557529, 11.549782]
Standard
The winner here. The ccmp instruction really shines here :-) It is clear, though, that the problem is memory bound.

bool result;
__asm__("ldp  x8, x9, %1        \n\t"   // 33.4%
        "ldp  x10, x11, %2      \n\t"
        "cmp  x8, x10           \n\t"
        "ccmp x9, x11, 0, eq    \n\t"
        "ldp  x8, x9, %1, 16    \n\t"   // 34.1%
        "ldp  x10, x11, %2, 16  \n\t"
        "ccmp x8, x10, 0, eq    \n\t"   // 32.6%
        "ccmp x9, x11, 0, eq    \n\t"
        "cset %w0, eq           \n\t"
        : "=r" (result)
        : "m" (*lhs->value), "m" (*rhs->value)
        : "x8", "x9", "x10", "x11", "cc");
return result;

measured [Time, seconds] average: 11.146, relative standard deviation: 0.034%, values: [11.149754, 11.142854, 11.146840, 11.149392, 11.141254, 11.148708, 11.142293, 11.150491, 11.139593, 11.145873]
C
LLVM fails to detect that ccmp would be a good instruction to use here, so the generated code is slower than the asm version above.
return lhs->value[0] == rhs->value[0]
     & lhs->value[1] == rhs->value[1]
     & lhs->value[2] == rhs->value[2]
     & lhs->value[3] == rhs->value[3];

Compiled to
ldp  x8, x9, [x0]          // 24.1%
ldp  x10, x11, [x1]        //  0.1%
cmp  x8, x10               //  0.4%
cset w8, eq                //  1.0%
cmp  x9, x11               // 23.7%
cset w9, eq
and  w8, w8, w9            //  0.1%
ldp  x9, x10, [x0, #16]
ldp  x11, x12, [x1, #16]   // 24.8%
cmp  x9, x11
cset w9, eq                //  0.2%
and  w8, w8, w9
cmp  x10, x12              //  0.3%
cset w9, eq                // 25.2%
and  w0, w8, w9
ret                        //  0.1%

measured [Time, seconds] average: 11.531, relative standard deviation: 0.040%, values: [11.525511, 11.529820, 11.541940, 11.531776, 11.533287, 11.526628, 11.531392, 11.526037, 11.531784, 11.533786]
Less Than (<)
(to be determined - will update later)
Solution 2:[2]
Since the simple scalar ccmp implementation was the winner for the equality test, here's an equally simple scalar solution for less-than.
The approach for less-than above was based on lexicographic comparison, starting with the most significant limbs. I didn't see a good way to do that with ccmp. The problem is that in a branchless lexicographic compare, there are three possible states at each step: a previous limb compared less, a previous limb compared greater, all previous limbs compared equal. ccmp can't really keep track of three states. We could do it if ccmp's behavior when its condition is false were "do nothing", as with ARM32 conditionals, instead of "load flags with immediate".
So instead, here's an even more basic approach: do a multi-precision subtract, and check the carry flag at the end.
#include <stdbool.h>
#include <stdint.h>

inline bool lt256(const uint64_t *a, const uint64_t *b) {
const int limbs = 4; // number of 64-bit chunks in a full number
uint64_t a0,a1,b0,b1; // for scratch registers
bool ret;
asm(R"(
ldp %[a0], %[a1], [%[a]]
ldp %[b0], %[b1], [%[b]]
subs xzr, %[a0], %[b0]
sbcs xzr, %[a1], %[b1]
ldp %[a0], %[a1], [%[a], #16]
ldp %[b0], %[b1], [%[b], #16]
sbcs xzr, %[a0], %[b0]
sbcs xzr, %[a1], %[b1]
)"
: "=@cclo" (ret),
[a0] "=&r" (a0), [a1] "=&r" (a1), [b0] "=&r" (b0), [b1] "=&r" (b1)
: [a] "r" (a), [b] "r" (b),
"m" (*(const uint64_t (*)[limbs])a),
"m" (*(const uint64_t (*)[limbs])b)
);
return ret;
}
I chose to use a flag output operand for the result instead of explicitly writing a boolean to a register. (This feature didn't exist in a stable GCC release when the previous answer was written, and is still not supported by Clang for AArch64.) This could save an instruction, and a register, if the result will be branched on.
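For toolchains without flag-output support on AArch64 (clang, per the note above), a hypothetical fallback is to materialize the boolean inside the asm with cset, spending the extra instruction and register:

// Sketch of a cset-based fallback; same assumptions as the function above,
// and the function name is made up.
inline bool lt256_cset(const uint64_t *a, const uint64_t *b) {
    uint64_t a0, a1, b0, b1;
    bool ret;
    asm("ldp  %[a0], %[a1], [%[a]]      \n\t"
        "ldp  %[b0], %[b1], [%[b]]      \n\t"
        "subs xzr, %[a0], %[b0]         \n\t"
        "sbcs xzr, %[a1], %[b1]         \n\t"
        "ldp  %[a0], %[a1], [%[a], #16] \n\t"
        "ldp  %[b0], %[b1], [%[b], #16] \n\t"
        "sbcs xzr, %[a0], %[b0]         \n\t"
        "sbcs xzr, %[a1], %[b1]         \n\t"
        "cset %w[ret], lo               \n\t"   // lo: a borrow occurred, so a < b
        : [ret] "=r" (ret),
          [a0] "=&r" (a0), [a1] "=&r" (a1), [b0] "=&r" (b0), [b1] "=&r" (b1)
        : [a] "r" (a), [b] "r" (b),
          "m" (*(const uint64_t (*)[4])a),
          "m" (*(const uint64_t (*)[4])b)
        : "cc");
    return ret;
}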
I also chose to do the loads within the asm. We could also use eight input operands and have the compiler do the loads, but then we would need eight registers instead of 4-6 as it stands. Worth trying if there is reason to think the limbs are already in general-purpose registers. Alternatively, you could reduce the register usage further by loading one pair of limbs at a time, instead of two, but at the cost of larger and probably slower code.
The zero register provides a convenient way to discard the numerical results of the subtractions, since we don't need them.
Performance should be pretty similar to the ccmp-based eq256, as both of them are essentially four subtracts in a dependency chain. Taking Cortex-A72 as an example, cmp/ccmp and subs/sbcs are all single-uop instructions that can execute on either of two integer pipelines. The optimization guide doesn't say whether the flags are renamed, but if they are, you should be able to write two of these chains in series and have them execute in parallel.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Etan |
| Solution 2 | Peter Cordes |
