gcc optimization, const static object, and restrict

I'm working on an embedded project and I'm trying to add more structure to some of the code, which uses macros to optimize access to registers for USARTs. I'd like to organize preprocessor #define'd register addresses into const structures. If I define the structs as compound literals in a macro and pass them to inline'd functions, gcc has been smart enough to bypass the pointer in the generated assembly and hardcode the structure member values directly in the code. E.g.:

C1:

struct uart {
   volatile uint8_t * ucsra, * ucsrb, *ucsrc, * udr;
   volatile uint16_t * ubrr;
};

#define M_UARTX(X)                  \
    ( (struct uart) {               \
        .ucsra = &UCSR##X##A,       \
        .ucsrb = &UCSR##X##B,       \
        .ucsrc = &UCSR##X##C,       \
        .ubrr  = &UBRR##X,          \
        .udr   = &UDR##X,           \
    } )


void inlined_func(const struct uart * p, other_args...) {
    ...
    (*p->ucsra) = 0;
    (*p->ucsrb) = 0;
    (*p->ucsrc) = 0;
}
...
int main(){
     ...
     inlined_func(&M_UARTX(0), other_parms...);
     ...
}

Here UCSR0A, UCSR0B, &c, are defined as the uart registers as l-values, like

#define UCSR0A (*(volatile uint8_t*)0xFFFF)

gcc was able to eliminate the structure literal entirely, and all assignments like that shown in inlined_func() write directly into the register address, w/o having to read the register's address into a machine register, and w/o indirect addressing:

A1:

movb $0, UCSR0A
movb $0, UCSR0B
movb $0, UCSR0C

This writes the values directly into the USART registers, w/o having to load the addresses into a machine register, and so never needs to generate the struct literal into the object file at all. The struct literal becomes a compile-time structure, with no cost in the generated code for the abstraction.
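
To try this pattern on a host machine, C1 can be reduced to the sketch below. The `fake_*` variables are an assumption standing in for the memory-mapped registers (real code would take the definitions from the device header), and `uart_reset` and `reset_uart0` are hypothetical names standing in for `inlined_func` and its caller.

```c
#include <stdint.h>

/* Assumption: ordinary variables stand in for the memory-mapped registers
 * so the sketch runs on a host; a device header would define the real ones. */
static volatile uint8_t  fake_ucsr0a = 0xAA, fake_ucsr0b = 0xBB,
                         fake_ucsr0c = 0xCC, fake_udr0 = 0xDD;
static volatile uint16_t fake_ubrr0 = 0x1234;

#define UCSR0A fake_ucsr0a
#define UCSR0B fake_ucsr0b
#define UCSR0C fake_ucsr0c
#define UDR0   fake_udr0
#define UBRR0  fake_ubrr0

struct uart {
    volatile uint8_t  *ucsra, *ucsrb, *ucsrc, *udr;
    volatile uint16_t *ubrr;
};

/* Compound literal holding the register addresses of UART number X. */
#define M_UARTX(X)                  \
    ((struct uart) {                \
        .ucsra = &UCSR##X##A,       \
        .ucsrb = &UCSR##X##B,       \
        .ucsrc = &UCSR##X##C,       \
        .ubrr  = &UBRR##X,          \
        .udr   = &UDR##X,           \
    })

/* Hypothetical stand-in for inlined_func(): clears the control registers. */
static inline void uart_reset(const struct uart *p)
{
    *p->ucsra = 0;
    *p->ucsrb = 0;
    *p->ucsrc = 0;
}

/* Hypothetical caller: the literal has automatic storage inside this
 * function, so the optimizer can see every member value at the call site. */
static void reset_uart0(void)
{
    uart_reset(&M_UARTX(0));
}
```

Compiling this with optimization enabled and inspecting the assembly listing is the way to confirm whether the structure is actually eliminated on a given target.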

I wanted to get rid of the use of the macro, and tried using a static constant struct defined in the header:

C2:

#define M_UART0 M_UARTX(0)
#define M_UART1 M_UARTX(1)

static const struct uart * const uart[2] = { &M_UART0, &M_UART1 };
....
int main(){
     ...
     inlined_func(uart[0], other_parms...);
     ...
}

However, gcc cannot remove the struct entirely here:

A2:

movl __compound_literal.0, %eax
movb $0, (%eax)
movl __compound_literal.0+4, %eax
movb $0, (%eax)
movl __compound_literal.0+8, %eax
movb $0, (%eax)

This loads the register addresses into a machine register, and uses indirect addressing to write to the register. Does anyone know of any way I can convince gcc to generate the A1 assembly code for the C2 C code? I've tried various uses of the __restrict modifier, to no avail.

Solution 1:^[1]

After many years of experience with UARTs and USARTs, I have come to these conclusions:

Don't use a `struct` for a 1:1 mapping with UART registers.

Compilers can add padding between struct members without your knowledge, thus messing up the 1:1 correspondence.
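
Whether padding actually appears can be measured rather than assumed. The sketch below uses a hypothetical two-register layout (`struct uart_regs` and `uart_regs_gap` are not from any real device header) and `offsetof` to report where the compiler really places each member.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block: an 8-bit status register followed by a
 * 16-bit baud-rate register. Not taken from any real device header. */
struct uart_regs {
    volatile uint8_t  status;
    volatile uint16_t baud;
};

/* Number of padding bytes the compiler inserted between the two members.
 * Where uint16_t needs 2-byte alignment this is 1; on targets that only
 * need byte alignment it is 0. */
static size_t uart_regs_gap(void)
{
    return offsetof(struct uart_regs, baud)
         - (offsetof(struct uart_regs, status) + sizeof(uint8_t));
}

/* A layout assumption can be made explicit so that a mismatch fails the
 * build instead of silently writing to the wrong address. */
_Static_assert(offsetof(struct uart_regs, status) == 0,
               "status must be the first register");
```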

Writing to UART registers is best done directly or through a function.

Remember to use volatile modifier when defining pointers to the registers.

Very little performance gain with Assembly language

Assembly language should only be used if the UART is accessed through processor ports rather than memory-mapped. The C language has no support for ports. Accessing UART registers through pointers is very efficient (generate an assembly language listing and verify). Sometimes, it may take more time to code in assembly and verify.

Isolate UART functionality into a separate library

This is a good candidate. Besides, once the code has been tested, let it be. Libraries don't have to be (re)compiled all the time.

Solution 2:^[2]

Using structs "across compile domains" is a cardinal sin in my book; that is, using a struct to point at something, anything: file data, memory, etc. The reason is that it will fail. It is not reliable, no matter the compiler. There are many compiler-specific flags and pragmas for this, but the better solution is to just not do it. If you want to point at address plus 8, then point at address plus 8, using a pointer or an array. In this specific case I have had way too many compilers fail to do that as well, so I write assembler PUT32/GET32 and PUT16/GET16 functions to guarantee that the compiler doesn't mess with my register accesses. As with structs, you will get burned one day and have a hell of a time figuring out why your 32-bit register only had 8 bits written to it. The overhead of the jump to the function is worth the peace of mind and the reliability of the code. It also makes your code extremely portable: you can put wrappers in for the put and get functions to cross networks, run your hardware in an HDL simulator and reach into the simulation to read/write registers, etc., with a single chunk of code that doesn't change from simulation to embedded to OS device driver to application-layer function.
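
The accessor idea can be sketched in C. Note that the answer describes writing these helpers in assembler precisely so the compiler cannot change the access width; the version below is only a plain C stand-in using `volatile` accesses, and the `uintptr_t` address parameter and the `set_bits32` helper are assumptions made for illustration.

```c
#include <stdint.h>

/* C stand-ins for the assembler PUT32/GET32 helpers described above.
 * Assumption: registers are addressed by a plain integer address. */
static void PUT32(uintptr_t addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;   /* intended as one 32-bit store */
}

static uint32_t GET32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;    /* intended as one 32-bit load */
}

/* Hypothetical driver code: because every access goes through PUT32/GET32,
 * the two helpers can later be swapped for versions that talk to a
 * simulator or a remote target without touching this function. */
static void set_bits32(uintptr_t addr, uint32_t mask)
{
    PUT32(addr, GET32(addr) | mask);
}
```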

Solution 3:^[3]

Based on the register set, it looks like you are using an 8-bit Atmel AVR microcontroller (or something extremely similar). I'll show you some things I've used for Atmel's 32-bit ARM MCUs, which are a slightly modified version of what they ship in their device packs.

Code Notation

I'm using various macros that I'm not going to include here, but they are defined to do basic operations or paste types (like UL) onto numbers. They are hidden in macros for the cases where something is not allowed (like in assembly). Yes, these are easy to break - it's on the programmer not to shoot themselves in the foot:

#define _PPU(_V) (_V##U)   /* guarded with #if defined(__ASSEMBLY__) */
#define _BV(_V)  (_PPU(1) << _PPU(_V))   /* Variants for U, L, UL, etc */
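
Since the full macro set is not shown, here is a minimal guess at how the two macros above behave in plain C. The shape of the `__ASSEMBLY__` branch is an assumption based on the comment, and `pins_mask` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed shape of the guard mentioned in the comment above: an assembler
 * would not understand the U suffix, so it is only pasted on for C code. */
#if defined(__ASSEMBLY__)
#define _PPU(_V) (_V)
#else
#define _PPU(_V) (_V##U)
#endif

#define _BV(_V)  (_PPU(1) << _PPU(_V))   /* bit value: 1U shifted left by _V */

/* Hypothetical helper: mask for pins 0, 1 and 2. */
static uint32_t pins_mask(void)
{
    return _BV(0) | _BV(1) | _BV(2);
}
```

Here `_BV(0) | _BV(1) | _BV(2)` evaluates to 7 and `_BV(4) | _BV(5) | _BV(6)` to 0x70, which are the immediates that appear in the disassembly listings.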

There are also typedefs for specific length registers. Example:

/* Variants for 8, 16, 32-bit, RO, WO, & RW */
typedef volatile uint32_t rw_reg32_t; 
typedef volatile const uint32_t ro_reg32_t;

The classic #define method

You can define the peripheral address with any register offsets...

#define PORT_REG_ADDR           _PPUL(0x41008000)
#define PORT_ADDR_DIR           (PORT_REG_ADDR + _PPU(0x00))
#define PORT_ADDR_DIRCLR        (PORT_REG_ADDR + _PPU(0x04))
#define PORT_ADDR_DIRSET        (PORT_REG_ADDR + _PPU(0x08))
#define PORT_ADDR_DIRTGL        (PORT_REG_ADDR + _PPU(0x0C))

And de-referenced pointers to the register addresses...

#define PORT_DIR        (*(rw_reg32_t *)PORT_ADDR_DIR)
#define PORT_DIRCLR     (*(rw_reg32_t *)PORT_ADDR_DIRCLR)
#define PORT_DIRSET     (*(rw_reg32_t *)PORT_ADDR_DIRSET)
#define PORT_DIRTGL     (*(rw_reg32_t *)PORT_ADDR_DIRTGL)

And then directly set values in the register:

PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);

Compiling in GCC with some other startup code...

arm-none-eabi-gcc -c -x c -mthumb -mlong-calls -mcpu=cortex-m4 -pipe
-std=c17 -O2 -Wall -Wextra -Wpedantic main.c

 [SIZE]    : Calculating size from ELF file

   text    data     bss     dec     hex
    924       0   49184   50108    c3bc

With disassembly:

00000000 <main>:
#include "defs/hw-v1.0.h"

void main (void) {

        PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
   0:   4b01            ldr     r3, [pc, #4]    ; (8 <main+0x8>)
   2:   2207            movs    r2, #7
   4:   601a            str     r2, [r3, #0]

}
   6:   4770            bx      lr
   8:   41008008        .word   0x41008008

The "new" structured method

You still define a base address as before as well as some numerical constants (like some number of instances), but instead of defining individual register addresses, you create a structure that models the peripheral. Note, I manually include some reserved space at the end for alignment. For some peripherals, there will be reserved chunks between other registers - it all depends on that peripheral memory mapping.

typedef struct PortGroup {
    rw_reg32_t  DIR;
    rw_reg32_t  DIRCLR;
    rw_reg32_t  DIRSET;
    rw_reg32_t  DIRTGL;
    rw_reg32_t  OUT;
    rw_reg32_t  OUTCLR;
    rw_reg32_t  OUTSET;
    rw_reg32_t  OUTTGL;
    ro_reg32_t  IN;
    rw_reg32_t  CTRL;
    wo_reg32_t  WRCONFIG;
    rw_reg32_t  EVCTRL;
    rw_reg8_t   PMUX[PORT_NUM_PMUX];
    rw_reg8_t   PINCFG[PORT_NUM_PINFCG];
    reserved8_t reserved[PORT_GROUP_RESERVED];
} PORT_group_t;

Since the PORT peripheral has four units, and the PortGroup structure is packed to exactly model the memory mapping, I can create a parent structure that contains all of them.
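
That a structure really matches the memory map can be checked by the compiler. The sketch below is a cut-down, hypothetical version (`PORT_group_mini_t`) with only the first four registers, whose offsets 0x00 to 0x0C are taken from the `PORT_ADDR_*` definitions; the typedef is repeated so the fragment stands alone.

```c
#include <stddef.h>
#include <stdint.h>

typedef volatile uint32_t rw_reg32_t;

/* Cut-down model of the peripheral: only the first four registers. */
typedef struct PortGroupMini {
    rw_reg32_t DIR;      /* offset 0x00 */
    rw_reg32_t DIRCLR;   /* offset 0x04 */
    rw_reg32_t DIRSET;   /* offset 0x08 */
    rw_reg32_t DIRTGL;   /* offset 0x0C */
} PORT_group_mini_t;

/* Fail the build if the compiler's layout ever disagrees with the map. */
_Static_assert(offsetof(PORT_group_mini_t, DIR)    == 0x00, "DIR offset");
_Static_assert(offsetof(PORT_group_mini_t, DIRCLR) == 0x04, "DIRCLR offset");
_Static_assert(offsetof(PORT_group_mini_t, DIRSET) == 0x08, "DIRSET offset");
_Static_assert(offsetof(PORT_group_mini_t, DIRTGL) == 0x0C, "DIRTGL offset");
_Static_assert(sizeof(PORT_group_mini_t) == 0x10, "unexpected padding");
```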

typedef struct Port  {
    PORT_group_t    GROUP[PORT_NUM_GROUPS];
} PORT_t;

And the final step is to associate this structure with an address.

#define PORT    ((PORT_t *)PORT_REG_ADDR)

Note, this can still be de-referenced as before - it's a matter of style choice.

#define PORT    (*(PORT_t *)PORT_REG_ADDR)
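
The two styles differ only in whether the member is reached with `->` or with `.`. A small host-side sketch, assuming a static object (`fake_port`) in place of the real peripheral address and using the hypothetical names `PORT_PTR` and `PORT_OBJ` so both macros can coexist:

```c
#include <stdint.h>

typedef volatile uint32_t rw_reg32_t;

typedef struct {
    rw_reg32_t DIR, DIRCLR, DIRSET, DIRTGL;
} PORT_demo_t;

/* Assumption: a static object stands in for the peripheral's address. */
static PORT_demo_t fake_port;
#define PORT_REG_ADDR ((uintptr_t)&fake_port)

#define PORT_PTR  ((PORT_demo_t *)PORT_REG_ADDR)    /* pointer style       */
#define PORT_OBJ  (*(PORT_demo_t *)PORT_REG_ADDR)   /* de-referenced style */

/* Both statements address registers of the same peripheral. */
static void set_both_styles(void)
{
    PORT_PTR->DIRSET = 0x07u;
    PORT_OBJ.DIRCLR  = 0x70u;
}
```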

And now to set the register value as before...

PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);

Compiling (and linking) with the same options, this produces identical size info and disassembly:

Disassembly of section .text.startup.main:

00000000 <main>:
#include "defs/hw-v1.0.h"

void main (void) {

        PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
   0:   4b01            ldr     r3, [pc, #4]    ; (8 <main+0x8>)
   2:   2207            movs    r2, #7
   4:   609a            str     r2, [r3, #8]

}
   6:   4770            bx      lr
   8:   41008000        .word   0x41008000

Reusable Code

The first method is straightforward, but requires a lot of manual definitions and some ugly macros if you have more than one peripheral. What if we had 2 different PORT peripherals at different addresses (similar to a device that has more than one USART)? We can just create multiple structured PORT pointers:

#define PORT0   ((PORT_t *)PORT0_REG_ADDR)
#define PORT1   ((PORT_t *)PORT1_REG_ADDR)

Calling them individually looks like what you'd expect:

PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);

Compiling results in:

 [SIZE]    : Calculating size from ELF file

   text    data     bss     dec     hex
    936       0   49184   50120    c3c8

Disassembly of section .text.startup.main:

00000000 <main>:
#include "defs/hw-v1.0.h"

void main (void) {

        PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
   0:   4903            ldr     r1, [pc, #12]   ; (10 <main+0x10>)

        PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
   2:   4b04            ldr     r3, [pc, #16]   ; (14 <main+0x14>)
        PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
   4:   2007            movs    r0, #7
        PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
   6:   2270            movs    r2, #112        ; 0x70
        PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
   8:   6088            str     r0, [r1, #8]
        PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
   a:   609a            str     r2, [r3, #8]

}
   c:   4770            bx      lr
   e:   bf00            nop
  10:   41008000        .word   0x41008000
  14:   4100a000        .word   0x4100a000

And the final step to make it all reusable...

static PORT_t * const PORT[] = {PORT0, PORT1};

static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
    PORT[unit]->GROUP[group].DIRSET = pins;
}
/* ... */
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));

Compiling this gives the same size and (basically) the same disassembly as before.

Disassembly of section .text.startup.main:

00000000 <main>:

static PORT_t * const PORT[] = {PORT0, PORT1};

static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
        PORT[unit]->GROUP[group].DIRSET = pins;
   0:   4903            ldr     r1, [pc, #12]   ; (10 <main+0x10>)
   2:   4b04            ldr     r3, [pc, #16]   ; (14 <main+0x14>)
   4:   2007            movs    r0, #7
   6:   2270            movs    r2, #112        ; 0x70
   8:   6088            str     r0, [r1, #8]
   a:   609a            str     r2, [r3, #8]
void main (void) {
        PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
        PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
}
   c:   4770            bx      lr
   e:   bf00            nop
  10:   41008000        .word   0x41008000
  14:   4100a000        .word   0x4100a000

I would clean it up a bit more with a module library header, enumerated constants, etc. But this should give someone a starting point. Note, in these examples, I am always calling with a CONSTANT unit and group. I know exactly what I'm writing to, I just want reusable code. More instructions will (probably) be needed if the unit or group cannot be optimized to compile-time constants. Speaking of which, if optimizations are not used, all of this goes out the window. YMMV.
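
The constant versus run-time distinction can be tried on a host with the sketch below. The `fake_port0`/`fake_port1` objects are an assumption replacing the fixed peripheral addresses, the structures are cut down to four registers and two groups, and `set_dir_runtime` is a hypothetical caller.

```c
#include <stdint.h>

typedef volatile uint32_t rw_reg32_t;

typedef struct {
    rw_reg32_t DIR, DIRCLR, DIRSET, DIRTGL;
} PORT_group_demo_t;

typedef struct {
    PORT_group_demo_t GROUP[2];
} PORT_demo_t;

/* Assumption: static objects stand in for the two peripherals so the
 * sketch runs on a host; on hardware these would be fixed addresses. */
static PORT_demo_t fake_port0, fake_port1;

static PORT_demo_t * const PORT[] = { &fake_port0, &fake_port1 };

static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
    PORT[unit]->GROUP[group].DIRSET = pins;
}

/* Hypothetical caller whose unit is only known at run time: if the
 * compiler cannot see a constant here, it typically has to load the
 * pointer from the PORT table before storing. */
static void set_dir_runtime(uint8_t unit, uint32_t pins) {
    PORT_setDir(unit, 0, pins);
}
```

Comparing the assembly listings of a constant call and a call through `set_dir_runtime` from another translation unit should show whether the table lookup survives on a given compiler and optimization level.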

Side Note on Bit Fields

Atmel further breaks a peripheral structure into individual typedef'd registers that have named bitfields in a union with the size of the register. This is the ARM CMSIS way, but it's not great IMO. Contrary to the information in some of the other answers, I know exactly how a compiler will pack this structure; however, I do not know how it will arrange bit fields without using special compiler attributes and flags. I would rather explicitly set and mask defined register bit field constant values. It also violates MISRA (as does some of what I've done here...) if you are worried about that.
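
As a rough illustration of setting and masking explicit constants instead of using bit fields, consider the sketch below. The 2-bit `MODE` field at bits 5:4 and all names are hypothetical, not taken from any Atmel header.

```c
#include <stdint.h>

/* Hypothetical 2-bit MODE field occupying bits 5:4 of a 32-bit register. */
#define CTRL_MODE_POS   4u
#define CTRL_MODE_MASK  (UINT32_C(0x3) << CTRL_MODE_POS)

/* Return reg with the MODE field replaced by mode, other bits untouched. */
static uint32_t ctrl_set_mode(uint32_t reg, uint32_t mode)
{
    return (reg & ~CTRL_MODE_MASK) | ((mode << CTRL_MODE_POS) & CTRL_MODE_MASK);
}

/* Extract the MODE field from a register value. */
static uint32_t ctrl_get_mode(uint32_t reg)
{
    return (reg & CTRL_MODE_MASK) >> CTRL_MODE_POS;
}
```

The position of every bit is spelled out in the constants, so nothing depends on how a particular compiler orders bit fields within a storage unit.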

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Thomas Matthews
Solution 2	old_timer
Solution 3	Kurt E. Clothier