Measuring overhead of page-table locks in C/C++

There are locks at the PMD and PTE levels of the page table in the Linux kernel. Each time a thread/process allocates/maps memory, it must hold one of these locks to update the page table accordingly. Clearly, as the number of threads increases, contention for these locks increases as well. This can degrade memory-mapping throughput, since many threads end up spinning on a lock that only one of them can hold at a time.

What I'd like to measure for a task is the worst-case overhead of these locks on memory-mapping throughput, but I have no idea how to measure it.

I tried calling malloc in an infinite loop while increasing the number of threads running the same loop. I check /proc/{pid}/maps for each set of running threads to count the number of mapped regions. But I'm not sure this is the correct approach, and this method also consumes a lot of memory.

Is there any efficient way to measure the worst-case overhead of these locks?



Solution 1:[1]

A lot of the comments are correct, but I thought I might try my hand at a full response. Firstly, malloc will not give you explicit control over page mappings: as a comment noted, the malloc in the standard library typically requests a large chunk of memory from the kernel on the first allocation and serves later allocations from it, so individual calls to malloc do not map individual pages. Secondly, a new thread shares the same address space, so creating threads by itself adds no new mappings.

I'm going to assume you want to do this from user space, because from kernel space you can do a lot of things that make this exploration somewhat degenerate (for example, you can simply try to map pages to the same location). Instead, you want to allocate anonymous pages using mmap. mmap is an explicit call to create a virtual memory area, so that when a page in that area is accessed for the first time, the kernel can put some blank physical memory at that location. It is that first access which causes the page fault, and it is the fault handler that actually takes the page-table locks (at the PTE and PMD levels).

Ensuring Good Benchmarking Procedure:

  1. If you are just trying to stress the page tables, you may also want to turn off Transparent Huge Pages within that process. (The syscall to look into is prctl with the flag PR_SET_THP_DISABLE.) Run this before spawning any threads.
  2. Pin Threads to cores using cpuset.
  3. You want to explicitly control your region of interest, so pick specific addresses for each thread that all share the same page table. This way you ensure that the threads contend on the same locks as much as possible.
  4. Use a pseudo-random function, seeded differently in each thread, to write to the mapped location.
  5. Compare with a baseline that does exactly the same thing but stresses very different parts of the address space.
  6. Make sure that as little as possible differs between the baseline and the contended workload.
  7. Do not over-subscribe the processor; otherwise the overhead will be dominated by context switches, which are notoriously hard to root out.
  8. Make sure to start capturing timing after the threads are created and stop it before they are destroyed.

What this translates to in each thread:

uintptr_t address = <per-thread address>; // chosen so all threads share a page table
uint64_t total = 0;
for (int i = 0; i < N; i++)
{
    uint64_t *x = (uint64_t *) mmap((void *) address, 4096, PROT_READ | PROT_WRITE,
        MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); // maps one page anonymously
    assert(x != MAP_FAILED); // mmap returns MAP_FAILED, not NULL, on error
    *x ^= pseudo_rand();     // first access faults the page in and takes the locks
    total += *x;             // for fun
    int res = munmap((void *) x, 4096); // deallocates the page (takes similar locks)
    assert(res == 0);
}

The big takeaways are:

  1. Use mmap and explicitly access the allocated location to actually control individual page allocation.
  2. The compactness of the addresses determines which locks are acquired.
  3. Measuring kernel and virtual memory things requires strict discipline in benchmark procedure.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 tewaro