Shared memory memcpy performance issue

I'm doing some performance tuning on a shared-memory-based message queue, and I found a strange phenomenon that I can't explain: when I run the same code for 3 epochs, the average running time improves with each epoch.

Here's the minimal demo code:

inline uint64_t current_time_nanos() {
  timespec tp;  // local, so the function stays thread-safe
  clock_gettime(CLOCK_REALTIME, &tp);
  return tp.tv_nsec + tp.tv_sec * 1000000000LLU;
}

void test() {
  static constexpr size_t TOTAL_SIZE = 16 * 1024 * 1024;
  static constexpr size_t COUNT = TOTAL_SIZE / sizeof(market_data);
  static_assert(TOTAL_SIZE % sizeof(market_data) == 0);

  market_data md;
  market_data *ptr =
      (market_data *)mmap(nullptr, TOTAL_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (MAP_FAILED == ptr) {
    printf("failed to mmap: %s\n", strerror(errno));
  }
  pthread_mutex_t mtx;
  assert(0 == pthread_mutex_init(&mtx, nullptr));
  std::atomic_uint64_t pos{0};
  mlock(ptr, TOTAL_SIZE);

  // Epoch 1
  auto st = current_time_nanos();
  for (int i = 0; i < COUNT; i++) {
    assert(0 == pthread_mutex_lock(&mtx));
    memcpy(&ptr[pos.fetch_add(1, std::memory_order_acq_rel) % COUNT], &md,
           sizeof(market_data));
    assert(0 == pthread_mutex_unlock(&mtx));
  }
  auto ed = current_time_nanos();
  printf("total used: %lu, avg = %f.\n", ed - st, double(ed - st) / COUNT);

  // Epoch 2
  pos = 0;
  st = current_time_nanos();
  for (int i = 0; i < COUNT; i++) {
    assert(0 == pthread_mutex_lock(&mtx));
    memcpy(&ptr[pos.fetch_add(1, std::memory_order_acq_rel) % COUNT], &md,
           sizeof(market_data));
    assert(0 == pthread_mutex_unlock(&mtx));
  }
  ed = current_time_nanos();
  printf("total used: %lu, avg = %f.\n", ed - st, double(ed - st) / COUNT);

  // Epoch 3
  pos = 0;
  st = current_time_nanos();
  for (int i = 0; i < COUNT; i++) {
    assert(0 == pthread_mutex_lock(&mtx));
    memcpy(&ptr[pos.fetch_add(1, std::memory_order_acq_rel) % COUNT], &md,
           sizeof(market_data));
    assert(0 == pthread_mutex_unlock(&mtx));
  }
  ed = current_time_nanos();
  printf("total used: %lu, avg = %f.\n", ed - st, double(ed - st) / COUNT);
}

I've run the code multiple times, and the average execution time consistently improves with each epoch, i.e. epoch 3 always has the best performance.

I wonder why this happens, and how I can do some warmup so that I get the epoch-3 performance without actually doing the memcpy.

sample result:

total used: 2479219, avg = 75.659760.
total used: 2092045, avg = 63.844147.
total used: 1718318, avg = 52.438904.

Here's the detailed info:

  • CPU: Intel Xeon 6348 2.6 GHz (Cascade-Lake)
  • Compiler: g++ 10.2.1 with -O3 enabled
  • I already use mlock to avoid page faults; it helps a lot. I also tried _mm_prefetch, but there was no measurable performance gain (or maybe I'm not using it correctly).


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
