Debug printf performance when compiled with clang cfi

Setup

I have a simple helloworld program:

// content of main.c
#include <stdio.h>
#include <limits.h>

int main() {
    for (int i = 0; i < INT_MAX; ++i) {
        printf("simply helloworld!\n");
    }

    return 0;
}

I compile a baseline version with clang 13.0.0 using:

clang -flto=thin -fvisibility=hidden -fuse-ld=lld main.c

To experiment with CFI, I compile a second version using:

clang -flto=thin -fsanitize=cfi -fsanitize-cfi-cross-dso -fno-sanitize-cfi-canonical-jump-tables -fsanitize-trap=cfi -fvisibility=hidden -fuse-ld=lld main.c

Expectation

I expect negligible performance overhead, because the program only calls into a shared library, and I expect that library to run the same code in both cases. The disassembly of the main function looks the same in both binaries.

Reality

The baseline version completes in ~27s, while the CFI version completes in ~32s. Using perf stat -e instructions <binary>, I can see that the CFI version executes ~100,000,000,000 more instructions. With perf record followed by perf diff, I can see that the difference is primarily in two functions, _pthread_cleanup_push_defer and _pthread_cleanup_pop_restore, that the CFI version runs. In gdb, I can see that these functions are called from deeper within printf's call stack.

Question

How do I begin to explain the performance difference between these two binaries? What makes the same simple call to printf execute different code in the two binaries?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
