Faster way to do _mm256_set1_ps
Is there a faster way to do _mm256_set1_ps in assembly than the C intrinsic? It appears that the intrinsic compiles down to a sequence of vmovss, vshufps, vmovss, vshufps and vinsertf128, which even the intrinsics guide itself says is inefficient. I am wondering if there are alternative ways to do this. I realize that if there were one, Intel would probably have implemented it already, but it doesn't hurt to ask.
Solution 1:[1]
While this has been partially addressed for some time, I found it as part of dealing with some similar issues and thought a formal answer might be of interest. I'm aware of two main cases.
- The constant for `_mm256_set1_ps()` is in memory at a known address. As @Peter Cordes mentioned above in the comments, AVX `vbroadcastss` applies in this case.
- The constant is already in the low bits of a register. AVX2 `vbroadcastss` is suitable here (AVX requires `vpermilps` to set the lower 128 bits followed by `vperm2f128` to set the upper 128, I believe).
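The two cases can be sketched with intrinsics roughly as follows. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, the per-function `target` attributes assume GCC or Clang (they stand in for compiling with `-mavx` / `-mavx2`), and which instructions you actually get still depends on your compiler and flags, so check the disassembly.

```c
#include <immintrin.h>

/* Case 1: the value is in memory.
 * _mm256_broadcast_ss maps to the AVX vbroadcastss form with a memory operand. */
__attribute__((target("avx")))
static void broadcast_from_memory(const float *src, float out[8])
{
    __m256 v = _mm256_broadcast_ss(src);
    _mm256_storeu_ps(out, v);
}

/* Case 2a: the value is already in the low element of a register, AVX2 available.
 * _mm256_broadcastss_ps maps to the register-source form of vbroadcastss. */
__attribute__((target("avx2")))
static void broadcast_from_register_avx2(float x, float out[8])
{
    __m128 low = _mm_set_ss(x);            /* x in element 0, other elements zero */
    __m256 v = _mm256_broadcastss_ps(low);
    _mm256_storeu_ps(out, v);
}

/* Case 2b: same situation, AVX only.
 * vpermilps fills the low 128 bits, then vperm2f128 copies them to the upper 128. */
__attribute__((target("avx")))
static void broadcast_from_register_avx1(float x, float out[8])
{
    __m128 low = _mm_set_ss(x);
    low = _mm_permute_ps(low, 0x00);               /* replicate element 0 across 128 bits */
    __m256 v = _mm256_castps128_ps256(low);        /* upper half undefined at this point */
    v = _mm256_permute2f128_ps(v, v, 0x00);        /* both halves <- low 128 bits of v */
    _mm256_storeu_ps(out, v);
}
```

Each helper stores to a plain `float[8]` so that callers compiled without AVX enabled do not have to pass `__m256` values across the call boundary. In real code you would more likely keep the result in a register and guard the call with a CPU feature check.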
I've encountered inefficient code generation around this for a variety of reasons and have implemented my own variants of _mm_set1_ps() and _mm256_set1_ps() to encourage more efficient compilation. Don't feel I'm in a position to make more specific recommendations than checking the disassembly you're getting, however.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
