Are there ARM64 equivalents for x86-64 SSE2 integer SIMD GCC built-in functions?
I'm trying to run an AMM algorithm (approximate matrix multiplication) on Apple's M1. The algorithm is all about speed and uses the x86 built-in functions listed below. Since running it in an x86 VM slows down several crucial parts of the algorithm, I was wondering if there is another way to run it on ARM64.
I also could not find suitable documentation for the ARM64 built-in functions, which might help with mapping some of the x86-64 instructions.
Used built-in functions:
__builtin_ia32_vec_init_v2si
__builtin_ia32_vec_ext_v2si
__builtin_ia32_packsswb
__builtin_ia32_packssdw
__builtin_ia32_packuswb
__builtin_ia32_punpckhbw
__builtin_ia32_punpckhwd
__builtin_ia32_punpckhdq
__builtin_ia32_punpcklbw
__builtin_ia32_punpcklwd
__builtin_ia32_punpckldq
__builtin_ia32_paddb
__builtin_ia32_paddw
__builtin_ia32_paddd
Solution 1:[1]
Normally you'd use intrinsics instead of the raw GCC builtin functions, but see https://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html. The __builtin_arm_... and __builtin_aarch64_... functions like __builtin_aarch64_saddl2v16qi don't seem to be documented in the GCC manual the way the x86 ones are, just another sign they're not intended for direct use.
See also https://developer.arm.com/documentation/102467/0100/Why-Neon-Intrinsics- re intrinsics and #include <arm_neon.h>. GCC provides a version of that header, with the documented intrinsics API implemented using __builtin_aarch64_... GCC builtins.
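To make the mapping concrete, here is a minimal sketch of how three of the builtins from the question (which operate on 8-byte vectors) might be expressed with arm_neon.h intrinsics. The function names add_bytes8, interleave_lo8 and interleave_hi8 are made up for illustration, and the scalar branch is only there so the file still builds on non-AArch64 targets. Note that vzip1_u8 / vzip2_u8 are AArch64-only; AArch32 code would use vzip_u8 instead.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Eight wrapping 8-bit adds, the job of __builtin_ia32_paddb.
   Unsigned lanes are used so the scalar fallback has well-defined
   wraparound; the resulting bits are the same for signed data. */
void add_bytes8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    assert(a && b && out);
#if defined(__aarch64__)
    vst1_u8(out, vadd_u8(vld1_u8(a), vld1_u8(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}

/* Interleave the low four bytes of a and b: a0 b0 a1 b1 a2 b2 a3 b3,
   like __builtin_ia32_punpcklbw. */
void interleave_lo8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    assert(a && b && out);
#if defined(__aarch64__)
    vst1_u8(out, vzip1_u8(vld1_u8(a), vld1_u8(b)));
#else
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
#endif
}

/* Interleave the high four bytes: a4 b4 a5 b5 a6 b6 a7 b7,
   like __builtin_ia32_punpckhbw. */
void interleave_hi8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    assert(a && b && out);
#if defined(__aarch64__)
    vst1_u8(out, vzip2_u8(vld1_u8(a), vld1_u8(b)));
#else
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[4 + i];
        out[2 * i + 1] = b[4 + i];
    }
#endif
}
```

The 16-bit and 32-bit variants (paddw / paddd, punpcklwd / punpckldq and so on) should follow the same pattern with the _u16 / _u32 forms of the same intrinsics.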
As for portability libraries: AFAIK none of them work from the raw builtins, but SIMDe (https://github.com/simd-everywhere/simde) has portable implementations of immintrin.h Intel intrinsics like _mm_packs_epi16. Most code should be using that API instead of GNU C builtins, unless you're using GNU C native vectors (__attribute__((vector_size(16)))) for portable SIMD without any ISA-specific stuff. But that's not viable when you want to take advantage of special shuffles and stuff.
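For the plain element-wise adds in the question (paddb / paddw), GNU C native vectors are enough. A minimal sketch, assuming GCC or clang; the type names u8x8 / u16x4 and the wrapper function names are hypothetical. Operands go through memcpy so the wrappers take ordinary arrays and don't depend on how the target ABI passes vector types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 8-byte GNU C vectors, matching the operand width of the builtins
   listed in the question. Unsigned lanes give defined wraparound. */
typedef uint8_t  u8x8  __attribute__((vector_size(8)));
typedef uint16_t u16x4 __attribute__((vector_size(8)));

static_assert(sizeof(u8x8) == 8, "u8x8 should be 8 bytes");
static_assert(sizeof(u16x4) == 8, "u16x4 should be 8 bytes");

/* paddb-like: eight wrapping 8-bit adds. */
void vec_add_u8x8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    u8x8 va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;              /* element-wise; the compiler picks the instruction */
    memcpy(out, &vr, sizeof vr);
}

/* paddw-like: four wrapping 16-bit adds. */
void vec_add_u16x4(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    u16x4 va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;
    memcpy(out, &vr, sizeof vr);
}
```

The same source compiles for x86-64 and AArch64 with no ISA-specific headers. The saturating packs and the interleaves are the part that doesn't fit this style well.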
And yes, ARM does have narrowing with saturation with instructions like vqmovn (https://developer.arm.com/documentation/dui0473/m/neon-instructions/vqmovn-and-vqmovun), so SIMDe can efficiently emulate pack instructions. That page documents the AArch32 mnemonics; AArch64 has equivalent instructions (sqxtn / sqxtun), and the same vqmovn_... / vqmovun_... intrinsics from arm_neon.h work on both.
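As a rough sketch of how the pack builtins could map onto those intrinsics: combine the two 4x16-bit inputs into one 8x16-bit vector, then narrow with saturation. The function names pack_s16_to_s8_sat and pack_s16_to_u8_sat are invented for this example, and the scalar branch (with its helper clamp_int) is only a fallback so the code builds where NEON isn't available.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar fallback helper: clamp v to [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
#endif

/* __builtin_ia32_packsswb-like: signed 16-bit -> signed 8-bit with
   saturation. out[0..3] come from lo, out[4..7] from hi. */
void pack_s16_to_s8_sat(const int16_t lo[4], const int16_t hi[4], int8_t out[8])
{
    assert(lo && hi && out);
#if defined(__ARM_NEON)
    vst1_s8(out, vqmovn_s16(vcombine_s16(vld1_s16(lo), vld1_s16(hi))));
#else
    for (int i = 0; i < 4; i++) {
        out[i]     = (int8_t)clamp_int(lo[i], INT8_MIN, INT8_MAX);
        out[i + 4] = (int8_t)clamp_int(hi[i], INT8_MIN, INT8_MAX);
    }
#endif
}

/* __builtin_ia32_packuswb-like: signed 16-bit -> unsigned 8-bit with
   saturation (negative inputs become 0, inputs above 255 become 255). */
void pack_s16_to_u8_sat(const int16_t lo[4], const int16_t hi[4], uint8_t out[8])
{
    assert(lo && hi && out);
#if defined(__ARM_NEON)
    vst1_u8(out, vqmovun_s16(vcombine_s16(vld1_s16(lo), vld1_s16(hi))));
#else
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)clamp_int(lo[i], 0, UINT8_MAX);
        out[i + 4] = (uint8_t)clamp_int(hi[i], 0, UINT8_MAX);
    }
#endif
}
```

packssdw (32-bit to 16-bit) should work the same way with vqmovn_s32 and vcombine_s32. For the 128-bit SSE2 forms you'd narrow each input separately and join the halves with vcombine_s8.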
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Peter Cordes |
