Shared memory memcpy performance issue
I'm doing some performance tuning on a shared-memory-based message queue, and I ran into a phenomenon I can't explain: running the same code for 3 epochs, the average running time improves with each epoch.
Here's the minimal demo code:
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <pthread.h>
#include <sys/mman.h>

// market_data is a trivially copyable struct defined elsewhere in my code.

inline uint64_t current_time_nanos() {
  timespec tp; // not static: a static here is unnecessary and not thread-safe
  clock_gettime(CLOCK_REALTIME, &tp);
  return tp.tv_nsec + tp.tv_sec * 1000000000LLU;
}
void test() {
  static constexpr size_t TOTAL_SIZE = 16 * 1024 * 1024;
  static constexpr size_t COUNT = TOTAL_SIZE / sizeof(market_data);
  static_assert(TOTAL_SIZE % sizeof(market_data) == 0);
  market_data md;
  market_data *ptr =
      (market_data *)mmap(nullptr, TOTAL_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (MAP_FAILED == ptr) {
    printf("failed to mmap: %s\n", strerror(errno));
    return;
  }
  pthread_mutex_t mtx;
  assert(0 == pthread_mutex_init(&mtx, nullptr));
  std::atomic_uint64_t pos{0};
  mlock(ptr, TOTAL_SIZE);
  // Epoch 1
  auto st = current_time_nanos();
  for (size_t i = 0; i < COUNT; i++) {
    assert(0 == pthread_mutex_lock(&mtx));
    memcpy(&ptr[pos.fetch_add(1, std::memory_order_acq_rel) % COUNT], &md,
           sizeof(market_data));
    assert(0 == pthread_mutex_unlock(&mtx));
  }
  auto ed = current_time_nanos();
  printf("total used: %lu, avg = %f.\n", ed - st, double(ed - st) / COUNT);
  // Epoch 2
  pos = 0;
  st = current_time_nanos();
  for (size_t i = 0; i < COUNT; i++) {
    assert(0 == pthread_mutex_lock(&mtx));
    memcpy(&ptr[pos.fetch_add(1, std::memory_order_acq_rel) % COUNT], &md,
           sizeof(market_data));
    assert(0 == pthread_mutex_unlock(&mtx));
  }
  ed = current_time_nanos();
  printf("total used: %lu, avg = %f.\n", ed - st, double(ed - st) / COUNT);
  // Epoch 3
  pos = 0;
  st = current_time_nanos();
  for (size_t i = 0; i < COUNT; i++) {
    assert(0 == pthread_mutex_lock(&mtx));
    memcpy(&ptr[pos.fetch_add(1, std::memory_order_acq_rel) % COUNT], &md,
           sizeof(market_data));
    assert(0 == pthread_mutex_unlock(&mtx));
  }
  ed = current_time_nanos();
  printf("total used: %lu, avg = %f.\n", ed - st, double(ed - st) / COUNT);
}
I've run the code many times, and the average execution time consistently improves with each epoch, i.e. epoch 3 always has the best performance.
Why does this happen, and how can I do some kind of warmup so that I get epoch-3 performance from the start, without actually doing the memcpy?
sample result:
total used: 2479219, avg = 75.659760.
total used: 2092045, avg = 63.844147.
total used: 1718318, avg = 52.438904.
Here's the detailed info:
- CPU: Intel Xeon 6348 @ 2.6 GHz (Cascade Lake)
- Compiler: g++ 10.2.1 with -O3
- I already use mlock to avoid page faults, and it helps a lot. I also tried _mm_prefetch, but it gave no measurable gain (or maybe I'm not using it correctly).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow