Using ymm registers as a "memory-like" storage location

Consider the following loop in x86:

; on entry, rdi has the number of iterations
.top:
; some magic happens here to calculate a result in rax
mov [array + rdi * 8], rax ; store result in output array
dec rdi
jnz .top

It's straightforward: something calculates a result in rax (not shown) and then we store the result into an array, in reverse order since rdi counts down.

I would like to transform the above loop so it doesn't make any writes to memory (we can assume the non-shown calculation doesn't write to memory).

As long as the loop count in rdi is limited, I could use the ample space (512 bytes) provided by the 16 ymm registers to save the values instead, but it seems awkward to actually do this, since you can't "index" an arbitrary register.

One approach would be to always shuffle the entire "array" of ymm registers by one element and then insert the element in the newly freed up position.

Something like this:

vpermq   ymm3, ymm3, 10_01_00_11b ; left rotate ymm by qword
vpermq   ymm2, ymm2, 10_01_00_11b ; left rotate ymm by qword
vpermq   ymm1, ymm1, 10_01_00_11b ; left rotate ymm by qword
vpermq   ymm0, ymm0, 10_01_00_11b ; left rotate ymm by qword

vpblendd ymm3, ymm3, ymm2, 3      ; promote the wrapped qword of ymm2 into ymm3
vpblendd ymm2, ymm2, ymm1, 3      ; promote the wrapped qword of ymm1 into ymm2
vpblendd ymm1, ymm1, ymm0, 3      ; promote the wrapped qword of ymm0 into ymm1

pinsrq   xmm0, rax, 0  ; playing with mixed-VEX mode fire (see Peter's answer)

This shows the handling of only four of the 16 registers, so evidently doing all 16 would take a lot of code (32 instructions per inserted element).
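To make the data movement concrete (this checks the shuffle logic, not the performance), here is a scalar C model of the scheme; the `regs` array, `push`, and `rotl_qword` are illustrative names I've made up, with `regs[r]` standing in for ymm*r* and lane 0 as the insertion point:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Capacity here is 4 registers x 4 qwords = 16 elements. */
enum { NREGS = 4, LANES = 4 };
static uint64_t regs[NREGS][LANES];

/* vpermq ymm, ymm, 10_01_00_11b: rotate the four qwords left by one lane */
static void rotl_qword(uint64_t v[LANES]) {
    uint64_t top = v[LANES - 1];
    memmove(&v[1], &v[0], (LANES - 1) * sizeof v[0]);
    v[0] = top;
}

static void push(uint64_t x) {
    for (int r = 0; r < NREGS; r++)
        rotl_qword(regs[r]);           /* the four vpermq instructions      */
    for (int r = NREGS - 1; r > 0; r--)
        regs[r][0] = regs[r - 1][0];   /* vpblendd ymmR, ymmR, ymm(R-1), 3  */
    regs[0][0] = x;                    /* pinsrq xmm0, rax, 0               */
}
```

After pushing 1 through 16, the values sit newest-first: regs[0] = {16,15,14,13} down to regs[3] = {4,3,2,1}, and the oldest element falls off the top of regs[3] on each further push.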

Is there a better way?

Unpredictable branches are undesirable, but we can still consider solutions that use them.



Solution 1:

A few options you could consider:

Unroll

If you unroll your loop (which necessarily has a limited number of iterations, since the storage available in the ymm registers is limited to 64 qwords) you have the chance to use hard-coded logic to insert your result from rax directly in the right place, e.g., with pinsrq or movq, sometimes combined with a shuffle to reach the high lanes. It will probably take only about 1.25 instructions per iteration, much better than 32!
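A scalar sketch of the unrolled idea, with hypothetical names (`next_result` stands in for the unshown rax computation, `q[r]` for ymm*r*); the asm comments show where the 1.25 instructions/iteration figure comes from (five instructions per group of four elements):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t counter;
static uint64_t next_result(void) { return ++counter; }

static uint64_t q[4][4];  /* q[r] models ymm<r> */

/* Fully unrolled: every destination slot is fixed at assembly time, so each
   store can use a hard-coded lane index instead of a generic shift. */
static void fill_all(void) {
    for (int r = 0; r < 4; r++) {
        q[r][0] = next_result();  /* movq    xmmA, rax                */
        q[r][1] = next_result();  /* pinsrq  xmmA, rax, 1             */
        q[r][2] = next_result();  /* movq    xmmB, rax                */
        q[r][3] = next_result();  /* pinsrq  xmmB, rax, 1, then one   */
    }                             /* vinserti128 ymmR, ymmA, xmmB, 1  */
}
```

Note this fills in forward order rather than modeling the queue's aging; the point is only that every slot index is a compile-time constant.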

Vertical Lanes

Your current shuffling solution could be characterized as horizontal rotation through a register, carrying from the high qword of ymm N into the low qword of ymm N+1. That is, adjacent elements within one register are logically adjacent in your scheme. Instead, you could do a vertical rotation, where instead elements in a given qword lane are logically adjacent to the elements in the same lane in registers ymm N-1 and ymm N+1. This avoids the need for any horizontal shuffling, and most of the shifting needs only a single register-register mov. You only need special handling for the first and last registers to wrap your elements around into the next lane.

Something like this:

; shift all lanes "up"
; shift all lanes "up"
vmovdqa ymm15, ymm3
vmovdqa ymm3,  ymm2
vmovdqa ymm2,  ymm1
vmovdqa ymm1,  ymm0

; wrap from the top register back to ymm0, rotating left by one qword
vpermq  ymm0, ymm15, 10_01_00_11b
; store the new element (legacy-SSE pinsrq preserves the upper lane of ymm0,
; at the cost of the mixed-VEX caveat from the question)
pinsrq  xmm0, rax, 0

That's about as simple as you are going to get for a general "shift every element" strategy: a single vmovdqa per used ymm register, plus two additional instructions to do the wraparound and the new-element insertion. As far as vector operations go, register-register moves are about as cheap as it gets, since they can be move-eliminated (zero latency) and up to four can execute per cycle.

This approach does need a temporary register (ymm15 in the example above) and I can't think of an easy way to eliminate that, so you'd be able to use at most 15 registers as part of your "queue".
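Again as a cross-check of the data movement only, here is a scalar C model of the vertical-lane scheme (illustrative names: `q[r]` stands in for ymm*r*, `tmp` for ymm15):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Logically adjacent elements live in the SAME lane of adjacent registers;
   only the wraparound from the top register moves up to the next lane. */
static uint64_t q[4][4];

static void push(uint64_t x) {
    uint64_t tmp[4];
    memcpy(tmp, q[3], sizeof tmp);    /* vmovdqa ymm15, ymm3 */
    memcpy(q[3], q[2], sizeof q[3]);  /* vmovdqa ymm3,  ymm2 */
    memcpy(q[2], q[1], sizeof q[2]);  /* vmovdqa ymm2,  ymm1 */
    memcpy(q[1], q[0], sizeof q[1]);  /* vmovdqa ymm1,  ymm0 */
    q[0][1] = tmp[0];                 /* vpermq ymm0, ymm15, 10_01_00_11b: */
    q[0][2] = tmp[1];                 /* each wrapped element moves up a   */
    q[0][3] = tmp[2];                 /* lane...                           */
    q[0][0] = x;                      /* ...and the insert overwrites the  */
}                                     /* oldest element (tmp[3])           */
```

After pushing 1 through 16, the element pushed t steps ago sits in register t mod 4, lane t/4: q[0] = {16,12,8,4}, q[1] = {15,11,7,3}, and so on down to q[3] = {13,9,5,1}.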

Indirect Jump

You could do a calculated indirect jump based on the iteration count to a short (2-4 instructions) sequence that puts the element into the right place. Basically a vpinsrq and in some cases some additional shuffling to access the high lane.

This type of table could be fully general, i.e., allowing writes to arbitrary indexes in any order, but if you know you are indexing sequentially as above, you could simplify the table using that assumption (i.e., you can handle the high lane by writing into the low elements first and then using a vinserti128 or something like that to move them both into the high lane at the right time).

This table will presumably mispredict the first time through. After that, it may or may not, depending on the pattern and the strength of the indirect branch predictor.
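A minimal C model of the dispatch structure (the handler and table names are made up; an indirect call through a function-pointer table stands in for the computed `jmp` through a label table):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t slots[4];  /* models the qword lanes of one ymm register */

/* Each handler stands in for a short 2-4 instruction jump-table target:
   a vpinsrq with a hard-coded lane, plus the extra shuffling the high
   lane would need (handlers 2 and 3 here). */
static void ins0(uint64_t x) { slots[0] = x; }
static void ins1(uint64_t x) { slots[1] = x; }
static void ins2(uint64_t x) { slots[2] = x; }
static void ins3(uint64_t x) { slots[3] = x; }

static void (*const insert_table[4])(uint64_t) = { ins0, ins1, ins2, ins3 };

/* The loop body computes the slot from the iteration count, then jumps. */
static void store_at(unsigned idx, uint64_t x) {
    insert_table[idx & 3](x);
}
```

The prediction question above is about this indirect transfer: the target cycles through the table entries, which a modern indirect predictor may or may not track.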

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
