Faster way to do _mm256_set1_ps

Is there a faster way to do _mm256_set1_ps in assembly than the C intrinsic? The intrinsic appears to compile down to a sequence of vmovss, vshufps, vmovss, vshufps and vinsertf128, which even the intrinsics guide itself says is inefficient. I am wondering whether there are alternative ways to do this. I realize that if there were one, Intel would probably have implemented it already, but it doesn't hurt to ask.



Solution 1:[1]

While this has been partially addressed for some time, I came across this question while dealing with some similar issues and thought a formal answer might be of interest. I'm aware of two main cases.

  1. The constant for _mm256_set1_ps() is in memory at a known address. As @Peter Cordes mentioned above in the comments, AVX vbroadcastss applies in this case, since the AVX form of the instruction takes a memory source operand.
  2. The constant is already in the low bits of a register. AVX2 vbroadcastss is suitable here, because AVX2 added the register-source form. (Without AVX2, AVX requires vpermilps to set the lower 128 bits followed by vperm2f128 to set the upper 128, I believe.)
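
The two cases above can be sketched with intrinsics roughly as follows. This is a minimal illustration, not the answerer's own variants: the helper names (set1_from_memory, set1_from_register_avx2, set1_from_register_avx and the store_* wrappers) are hypothetical, and the TARGET_AVX / TARGET_AVX2 macros assume GCC or Clang, where the target attribute lets the file build without -mavx2. Which instructions the compiler actually emits still has to be confirmed in the disassembly.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. These attributes enable AVX/AVX2 per function.
   If you compile with -mavx2 (or use MSVC), define both macros as empty. */
#define TARGET_AVX  __attribute__((target("avx")))
#define TARGET_AVX2 __attribute__((target("avx2")))

/* Case 1: the scalar is in memory. Maps to AVX vbroadcastss ymm, m32. */
TARGET_AVX static inline __m256 set1_from_memory(const float *p)
{
    return _mm256_broadcast_ss(p);
}

/* Case 2: the scalar is in the low lane of an xmm register.
   Maps to AVX2 vbroadcastss ymm, xmm. */
TARGET_AVX2 static inline __m256 set1_from_register_avx2(__m128 low)
{
    return _mm256_broadcastss_ps(low);
}

/* Case 2 without AVX2: the two-step sequence described in the list. */
TARGET_AVX static inline __m256 set1_from_register_avx(__m128 low)
{
    __m128 x = _mm_permute_ps(low, 0x00);      /* vpermilps: fill the low 128 bits */
    __m256 y = _mm256_castps128_ps256(x);      /* cast only; upper 128 bits undefined */
    return _mm256_permute2f128_ps(y, y, 0x00); /* vperm2f128: low half into both halves */
}

/* Wrappers with a plain-C interface, so callers built without AVX never
   pass or return a __m256 across a function boundary. */
TARGET_AVX void store_set1_from_memory(const float *p, float out[8])
{
    _mm256_storeu_ps(out, set1_from_memory(p));
}

TARGET_AVX2 void store_set1_from_register_avx2(float x, float out[8])
{
    _mm256_storeu_ps(out, set1_from_register_avx2(_mm_set_ss(x)));
}

TARGET_AVX void store_set1_from_register_avx(float x, float out[8])
{
    _mm256_storeu_ps(out, set1_from_register_avx(_mm_set_ss(x)));
}
```

Callers should check CPU support before using these, for example with __builtin_cpu_supports("avx2") on GCC/Clang. Each store_* wrapper fills all eight floats of out with the same value.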

I've encountered inefficient code generation around this for a variety of reasons and have implemented my own variants of _mm_set1_ps() and _mm256_set1_ps() to encourage more efficient compilation. However, I don't feel I'm in a position to make more specific recommendations than checking the disassembly you're getting.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
