Manipulate vector register as float32x4_t C variable in ARM

I'm using inline assembly on ARM for a scientific application. In my assembly code, I have to (see the note at the end) explicitly name the vector registers I want to use. For example, my code contains asm volatile("fadd v12.4S, v12.4S, v7.4S"), which performs a vector floating-point add of vector registers 7 and 12 and stores the result in vector register 12, among other inline assembly instructions.

After the 'critical' assembly part, I want to retrieve the resulting values and operate on them as ARM NEON variables in C. In my case, each vector holds four 32-bit floats, so the type is float32x4_t.

So far I can do something like:

float32_t my_var[4];
asm volatile("st1 {v12.4S}, [%[addr]]\n\t" : : [addr] "r"(my_var) : "memory");
/* From here on I can operate on my_var[0], my_var[1], etc. without writing asm code. */

I.e., I'm using a vector store instruction to write the contents of the vector register into a C array. Subsequent accesses to that array then become loads from memory, which I want to avoid because the value already lives in a register.

I'd like to have something similar to

float32x4_t my_var;
asm volatile("some code that makes sure my_var 'binds' to vector register 12");
/* From here on I could use intrinsics such as vgetq_lane_f32(my_var, 1) to read each lane, again without writing asm code. */

However, I could not find a way to do the second approach. This old question had similar concerns, but it targeted an older ARM ISA (I'm targeting ARMv8) and was about loading from (not storing to) a single scalar (not vector) variable.

Note: I cannot use intrinsics from the beginning (which would make things much easier), because I'm modeling new instructions in a simulator and need to write low-level assembly up to that point.
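
One possible way to get the 'binding' described above is the GNU C local register variable extension combined with extended asm operands. The following is only a minimal sketch, assuming GCC or Clang targeting AArch64; the helper name add_then_lane1 and the variables acc and rhs are hypothetical and not part of the original code. The scalar branch exists only so the sketch also compiles on other hosts.

```c
#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical helper: adds two 4-float vectors with the fadd from the
   question, then reads lane 1 of the result with an intrinsic. */
static float add_then_lane1(const float *a, const float *b)
{
    /* GNU C local register variables: request v12 and v7 for these values
       whenever they appear as operands of an extended asm statement. */
    register float32x4_t acc __asm__("v12") = vld1q_f32(a);
    register float32x4_t rhs __asm__("v7")  = vld1q_f32(b);

    /* "+w": acc is read and written in a SIMD/FP register.
       "w" : rhs is read from a SIMD/FP register. */
    __asm__ volatile("fadd v12.4S, v12.4S, v7.4S" : "+w"(acc) : "w"(rhs));

    /* acc is an ordinary float32x4_t here; no st1 and reload is written. */
    return vgetq_lane_f32(acc, 1);
}
#else
/* Scalar stand-in so the sketch also builds on non-AArch64 hosts. */
static float add_then_lane1(const float *a, const float *b)
{
    return a[1] + b[1];
}
#endif
```

Calling add_then_lane1 with a = {1, 2, 3, 4} and b = {10, 20, 30, 40} returns 22. Note that the register placement is only guaranteed at the asm operands themselves: a value written to v12 by an earlier asm statement that does not list it as an operand is invisible to the compiler, so it seems safer to put the critical instructions in the same asm statement that declares acc as an output.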



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source