Vectorization without intrinsics for different architectures
I have done my fair share of optimization for SSE/AVX/AVX2/AVX-512, ending up with a modified version of "vectorclass". Now I face porting to Apple's M1. There I only use these "classed intrinsics" when needed (when it's too difficult for LLVM to figure out the vectorization on its own). So I can either use a wrapper that maps AVX to NEON, or, I'm thinking, use structures like this:
```cpp
struct FLOAT4x
{
    float X1, X2, X3, X4;

    // Element-wise add: four independent scalar adds, laid out so the
    // optimizer can merge them into a single 128-bit SIMD add.
    void operator+=(const FLOAT4x& x)
    {
        X1 += x.X1; X2 += x.X2; X3 += x.X3; X4 += x.X4;
    }
};

FLOAT4x vec1, vec2;
vec1 += vec2;
```
Can one rely on LLVM recognizing that operations laid out like this are vectorizable, and actually vectorizing them? Since the SIMD processing is pretty much identical on all platforms, this would greatly simplify development and also make the code less error-prone.
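(For what it's worth, my plan for checking this would be to compile a loop over such structs with Clang's vectorization remarks enabled. The flags below are documented Clang options; the kernel itself is just a made-up example using the FLOAT4x struct defined above:)

```cpp
#include <cstddef>

// FLOAT4x as defined above.
void accumulate(FLOAT4x* __restrict acc,
                const FLOAT4x* __restrict src, std::size_t n)
{
    // Clang reports whether this loop was vectorized when built with:
    //   clang++ -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
    //           -Rpass-analysis=loop-vectorize -c accumulate.cpp
    for (std::size_t i = 0; i < n; ++i)
        acc[i] += src[i];
}
```

Even without an enclosing loop, the four scalar adds inside operator+= can be merged by LLVM's SLP vectorizer; `-Rpass=slp-vectorizer` shows remarks for that pass.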
Solution 1:
Sounds like you want something like SIMDe or a similar library: one that exposes a generic SIMD interface and then maps it to whatever instruction set each target actually has.
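For illustration, a minimal sketch of what that looks like with SIMDe, using the simde_mm_* names and header layout documented by the project (github.com/simd-everywhere/simde). You write SSE-style code once; SIMDe lowers it to SSE on x86, NEON on Arm, or scalar code elsewhere:

```cpp
#include <simde/x86/sse.h>  // portable implementations of the SSE intrinsics

// Adds four floats at a time. On x86 this compiles to an SSE add,
// on Apple M1 and other Arm cores to the NEON equivalent, and to
// plain scalar code on targets with neither.
void add4(const float* a, const float* b, float* out)
{
    simde__m128 va = simde_mm_loadu_ps(a);
    simde__m128 vb = simde_mm_loadu_ps(b);
    simde_mm_storeu_ps(out, simde_mm_add_ps(va, vb));
}
```

Defining SIMDE_ENABLE_NATIVE_ALIASES before the include lets existing `_mm_*` code compile unchanged, which is useful when porting an existing SSE/AVX codebase.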
LLVM's optimizations change with each version - generally improving, but it does mean no guarantees can be made, and since different SIMD instruction sets offer different operations, there won't be a single "best" way to structure code for auto-vectorization in all cases. There are best practices that help, though, e.g. Arm's published guidelines for writing Neon-friendly code (with further links off them): short simple loops, contiguous aligned memory, etc.
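As a concrete illustration of those practices, a sketch of the loop shape that current Clang and GCC vectorize reliably (note that the `__restrict` no-aliasing promise is a common compiler extension, not standard C++):

```cpp
#include <cstddef>

// Short simple loop over contiguous memory: countable trip count,
// no aliasing between the arrays, no branches or calls in the body.
void add_arrays(float* __restrict dst,
                const float* __restrict a,
                const float* __restrict b,
                std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```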
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BenClark |
