Why does GCC allocate more stack memory than needed?

I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:

long call_proc() {
    long  x1 = 1;
    int   x2 = 2;
    short x3 = 3;
    char  x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3-x4);
}

The book gives the assembly code generated by GCC:

; long call_proc()
call_proc:
    ; Set up arguments to proc
    subq    $32, %rsp           ; Allocate 32-byte stack frame
    movq    $1, 24(%rsp)        ; Store 1 in &x1
    movl    $2, 20(%rsp)        ; Store 2 in &x2
    movw    $3, 18(%rsp)        ; Store 3 in &x3
    movb    $4, 17(%rsp)        ; Store 4 in &x4
    leaq    17(%rsp), %rax      ; Create &x4
    movq    %rax, 8(%rsp)       ; Store &x4 as argument 8
    movl    $4, (%rsp)          ; Store 4 as argument 7
    leaq    18(%rsp), %r9       ; Pass &x3 as argument 6
    movl    $3, %r8d            ; Pass 3 as argument 5
    leaq    20(%rsp), %rcx      ; Pass &x2 as argument 4
    movl    $2, %edx            ; Pass 2 as argument 3
    leaq    24(%rsp), %rsi      ; Pass &x1 as argument 2
    movl    $1, %edi            ; Pass 1 as argument 1
    ; Call proc
    call    proc
    ; Retrieve changes to memory
    movslq  20(%rsp), %rdx      ; Get x2 and convert to long
    addq    24(%rsp), %rdx      ; Compute x1+x2
    movswl  18(%rsp), %eax      ; Get x3 and convert to int
    movsbl  17(%rsp), %ecx      ; Get x4 and convert to int
    subl    %ecx, %eax          ; Compute x3-x4
    cltq                        ; Convert to long
    imulq   %rdx, %rax          ; Compute (x1+x2) * (x3-x4)
    addq    $32, %rsp           ; Deallocate stack frame
    ret                         ; Return

I understand this code: the compiler allocates a 32-byte stack frame. The low 16 bytes (offsets 0 to 15 from %rsp) hold arguments 7 and 8, which are passed to proc on the stack, and the high 16 bytes hold the 4 local variables (offsets 17 to 31, with the byte at offset 16 left unused).

I then compiled this code with GCC 11.2, using the optimization flag -Og, and got the following assembly:

call_proc():
        subq    $24, %rsp
        movq    $1, 8(%rsp)
        movl    $2, 4(%rsp)
        movw    $3, 2(%rsp)
        movb    $4, 1(%rsp)
        leaq    1(%rsp), %rax
        pushq   %rax
        pushq   $4
        leaq    18(%rsp), %r9
        movl    $3, %r8d
        leaq    20(%rsp), %rcx
        movl    $2, %edx
        leaq    24(%rsp), %rsi
        movl    $1, %edi
        call    proc(long, long*, int, int*, short, short*, char, char*)
        movslq  20(%rsp), %rax
        addq    24(%rsp), %rax
        movswl  18(%rsp), %edx
        movsbl  17(%rsp), %ecx
        subl    %ecx, %edx
        movslq  %edx, %rdx
        imulq   %rdx, %rax
        addq    $40, %rsp
        ret

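I noticed that GCC first allocates 24 bytes for the 4 local variables. It then uses two pushq instructions to place arguments 8 and 7 on the stack, which moves %rsp down by another 16 bytes. That is why the locals are addressed at offsets 17 to 24 after the pushes, and why the code ends with addq $40, %rsp to free the whole 24 + 16 = 40 bytes.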

Compared to the code in the book, GCC 11.2 uses 40 bytes of stack in total, 8 more than the book's 32, and it doesn't seem to use the extra 8 bytes. Why does it need the extra space?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
