What does MaskStore do behind the scenes?
My main programming language is C#, and lately I've been trying to learn about vector programming and SIMD instructions on Intel x86 AVX2 for self-learning purposes. I came across the intrinsic MaskStore, which maps to the AVX2 instruction:
VPMASKMOVD m256, ymm, ymm
I'm just wondering how this instruction works behind the scenes. Programmatically, in pseudocode, is it something like:
for n in vector.lanes
{
    if (highest bit of mask[n] is set)
    {
        memory[address + n] = source vector[n]
    }
}
Solution 1:[1]
Yes, that's correct.
The asm manual https://www.felixcloutier.com/x86/vmaskmov#vmaskmovpd---256-bit-store documents it with pseudo-code like that. Intel's C/C++ intrinsics guide has less detailed but similar documentation: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,Other&ig_expand=5420,5420,5623,5626,7359,5403,5039&text=maskmov.
Note that it's an AVX1 instruction, not AVX2. Supported since Sandybridge on Intel, AMD since Bulldozer.
It's not very efficient on AMD CPUs, though, per https://uops.info/ testing data. Masked stores aren't easy to emulate, unlike maskload, which can just do a regular load and then mask, only needing special hardware and/or a microcode assist if it has to do fault suppression when a masked-out part of the load would touch an unmapped page.
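That load-then-mask emulation of maskload can be sketched in scalar C (`maskload_epi32` is a hypothetical name; the fault suppression for masked-out lanes on unmapped pages is exactly the part a sketch like this cannot capture):

```c
#include <stdint.h>

/* Scalar model of "regular load, then mask" for the maskload form:
   enabled lanes get the memory value, masked-out lanes are zeroed.
   A real full-width vector load would fault if a masked-out lane
   touched an unmapped page, which is why the hardware or a microcode
   assist must suppress faults for those lanes. */
static void maskload_epi32(int32_t dst[8], const int32_t mem[8],
                           const int32_t mask[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (mask[i] < 0) ? mem[i] : 0;  /* load, then mask */
}
```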
On Intel, masked stores are first-class operations: only 3 uops for memory-destination vmaskmovpd mem, ymm, ymm since Skylake (p0 + p23+p4), down from 4 in Sandybridge/Haswell (p0+p1 + p23+p4).
Probably not a coincidence that Skylake has AVX-512 hardware, even though it's not enabled/exposed in the "client" chips; perhaps internally it works as a compare-into-mask uop and then a native masked-store uop. Without micro-fusion of store-address and store-data, that's 3 uops total. The ALU uop can only run on port 0 on Skylake, the same port required by vpmovq2m k, ymm (unlike vptestmq k, ymm, ymm), so we can infer that it probably uses a uop like vpmovq2m to generate the internal mask-register value from the mask vector operand.
On AMD Zen 2, maskload is single-uop (much improved over Zen 1), but maskstore is still 10 or 19 uops for XMM or YMM vector width, respectively, with 4c or 6c throughput. That's much slower than a load / AND-or-blend / store sequence, if you can safely do a non-atomic RMW (including a load/store of elements you don't modify).
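That load / blend / store alternative can be sketched in scalar C (`blendstore_epi32` is a made-up name for illustration):

```c
#include <stdint.h>

/* Scalar model of the load / blend / store alternative to a masked
   store: read all 8 lanes, merge in the new values for enabled
   lanes, and write all 8 lanes back. Unlike a true masked store,
   this rewrites even the masked-out elements (with their old
   values), which is why it's only safe as a non-atomic RMW. */
static void blendstore_epi32(int32_t *mem, const int32_t mask[8],
                             const int32_t src[8])
{
    int32_t tmp[8];
    for (int i = 0; i < 8; i++)
        tmp[i] = mem[i];                          /* full-width load  */
    for (int i = 0; i < 8; i++)
        tmp[i] = (mask[i] < 0) ? src[i] : tmp[i]; /* blend            */
    for (int i = 0; i < 8; i++)
        mem[i] = tmp[i];                          /* full-width store */
}
```

The full-width store of unchanged lanes is what makes this non-atomic: another thread writing a masked-out element between the load and the store would have its update overwritten.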
Maskstore has some corner cases that can be disastrous, like taking a microcode assist on every instruction if used with an all-false mask on a read-only page.
That can happen if a page of memory hasn't been dirtied yet and is still copy-on-write mapped, so it's actually read-only as far as the HW is concerned (in the page tables). If you then loop over it using load / compare / maskstore to conditionally replace some values, you might never dirty the page if no values need replacing, so multiple instructions take a slow microcode assist.
But OTOH, it can be a bit faster than store/mask/reload on Skylake for the same array-replacement task if you don't hit that bad case.
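As a concrete example of that array-replacement task, here's a scalar sketch of the load / compare / maskstore pattern (clamping negatives to zero is an assumed example operation, and `clamp_negatives` is a made-up name): stores happen only for lanes whose mask is set, so a page with nothing to replace is never written.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of the load / compare / maskstore replacement loop:
   replace each negative element with 0, writing back only the
   elements that actually change. If no element in a copy-on-write
   page is negative, that page is never stored to, so it stays
   read-only in the page tables -- the corner case described above. */
static void clamp_negatives(int32_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t mask = (a[i] < 0) ? -1 : 0;  /* vector compare       */
        if (mask < 0)                        /* masked store: only   */
            a[i] = 0;                        /* enabled lanes write  */
    }
}
```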
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Peter Cordes |
