Find the optimal combination of setting values for `number of processes` and `OMP_NUM_THREADS` in a particular computing task

The testing environment is Ubuntu 20.04.3 LTS installed on a machine with dual Intel Xeon E5-2699 v4 CPUs and a Supermicro X10DAi motherboard. I am compiling and testing VASP 6.3.0 with the latest Intel oneAPI Base and HPC toolkits.

The test commands are as follows:

VASP_TESTSUITE_EXE_STD="mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds -genv I_MPI_PIN_DOMAIN=omp -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 -genv KMP_STACKSIZE=512m /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_std"
VASP_TESTSUITE_EXE_NCL="mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds -genv I_MPI_PIN_DOMAIN=omp -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 -genv KMP_STACKSIZE=512m /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_ncl"
VASP_TESTSUITE_EXE_GAM="mpirun -np $nranks -genv OMP_NUM_THREADS=$nthrds -genv I_MPI_PIN_DOMAIN=omp -genv KMP_AFFINITY=verbose,granularity=fine,compact,1,0 -genv KMP_STACKSIZE=512m /home/werner/Public/hpc/vasp/vasp.6.3.0/testsuite/../bin/vasp_gam"

I found that the run time of a given job can differ greatly depending on the combination of -np (i.e., the number of MPI processes) and OMP_NUM_THREADS. In my test, the combination of -np 16 and OMP_NUM_THREADS=16 was so slow that I terminated that testing step before it finished. For a summary of the time benchmarks corresponding to these tests, see this file and the discussion here for more detailed information.

So a natural question is: how can I find the optimal combination of the number of processes and OMP_NUM_THREADS for a particular computing task? Is there a rule of thumb?
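
As a first sanity check for any combination, it may help to compare the total number of threads (nranks times nthrds) with what the machine offers. The following is a minimal sketch, not part of the VASP test suite. It assumes 2 sockets x 22 cores with two hardware threads each, i.e. 88 logical CPUs, which matches the 88 core speeds listed by inxi in this post; the helper names `total_threads` and `classify` are hypothetical.

```python
# Assumption: 2 sockets x 22 cores x 2 hardware threads = 88 logical CPUs,
# matching the 88 per-core speeds that inxi reports for this machine.
PHYSICAL_CORES = 2 * 22
LOGICAL_CPUS = PHYSICAL_CORES * 2


def total_threads(nranks: int, nthrds: int) -> int:
    """Threads started in total: each MPI rank runs nthrds OpenMP threads."""
    return nranks * nthrds


def classify(nranks: int, nthrds: int) -> str:
    """Hypothetical helper: label a combination by how it loads the machine."""
    n = total_threads(nranks, nthrds)
    if n > LOGICAL_CPUS:
        return "oversubscribed"
    if n > PHYSICAL_CORES:
        return "uses hyper-threads"
    return "fits physical cores"


if __name__ == "__main__":
    # The combinations mentioned or timed in this post.
    combos = [(16, 16), (4, 2), (8, 2), (12, 2), (16, 2), (6, 4), (8, 5)]
    for nranks, nthrds in combos:
        print(f"nranks={nranks:2d} nthrds={nthrds:2d} "
              f"threads={total_threads(nranks, nthrds):3d} "
              f"-> {classify(nranks, nthrds)}")
```

Under these assumptions, -np 16 with OMP_NUM_THREADS=16 starts 256 threads on 88 logical CPUs, which may explain at least part of why that step was so slow. This check only rules out oversubscribed combinations; it does not say which of the remaining ones is fastest.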

The following is supplementary information in reply to the comments from Victor Eijkhout, Homer512 and Jérôme Richard:

  1. See the related info given by inxi:
werner@X10DAi-00:~$ inxi -Cxxx
CPU:       Topology: 2x 22-Core model: Intel Xeon E5-2699 v4 bits: 64 type: MT MCP SMP arch: Broadwell rev: 1 
           L2 cache: 110.0 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 387287 
           Speed: 1200 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 1200 2: 1202 3: 1202 4: 1202 5: 1200 
           6: 1202 7: 1203 8: 1201 9: 1204 10: 1201 11: 1654 12: 2007 13: 2204 14: 2200 15: 1245 16: 1202 
           17: 1202 18: 1202 19: 1203 20: 1202 21: 1203 22: 1202 23: 1202 24: 1201 25: 1202 26: 1202 27: 1201 
           28: 1202 29: 1202 30: 1202 31: 2066 32: 1202 33: 1202 34: 1202 35: 1203 36: 1202 37: 1202 38: 1202 
           39: 1202 40: 1202 41: 1200 42: 1516 43: 1200 44: 1200 45: 1200 46: 1202 47: 1200 48: 1200 49: 1200 
           50: 1200 51: 1201 52: 1201 53: 1201 54: 1201 55: 1200 56: 1201 57: 1204 58: 1200 59: 1200 60: 1609 
           61: 1871 62: 2200 63: 1251 64: 1201 65: 1201 66: 1201 67: 1200 68: 1203 69: 1200 70: 1201 71: 1201 
           72: 1201 73: 1201 74: 1201 75: 1200 76: 1200 77: 1200 78: 1201 79: 1203 80: 1523 81: 1201 82: 1200 
           83: 1200 84: 1201 85: 1201 86: 1200 87: 1200 88: 1204 
werner@X10DAi-00:~$ inxi -Mxxx
Machine:   Type: Desktop System: Supermicro product: X10DAi v: 123456789 serial: <superuser/root required> 
           Mobo: Supermicro model: X10DAI v: 1.02 serial: <superuser/root required> UEFI: American Megatrends 
           v: 3.2 date: 12/16/2019 
werner@X10DAi-00:~$ inxi -Sxxx
System:    Host: X10DAi-00 Kernel: 5.8.0-43-generic x86_64 bits: 64 compiler: N/A Desktop: GNOME 3.36.9 
           tk: GTK 3.24.20 wm: gnome-shell dm: GDM3 3.36.3 Distro: Ubuntu 20.04.3 LTS (Focal Fossa) 
  2. I reran the test discussed here. The timings and the corresponding combinations of options are as follows:
nranks=4 nthrds=2
real    0m13.666s
user    1m20.643s
sys 0m4.314s

nranks=8 nthrds=2
real    0m11.908s
user    2m9.973s
sys 0m7.549s

nranks=12 nthrds=2
real    0m11.043s
user    2m55.062s
sys 0m11.161s

nranks=16 nthrds=2
real    0m11.087s
user    3m45.074s
sys 0m15.343s


nranks=4 nthrds=2
real    0m13.511s
user    1m19.949s
sys 0m4.185s

nranks=6 nthrds=4
real    0m13.736s
user    3m38.704s
sys 0m12.471s

nranks=8 nthrds=5
real    0m12.378s
user    5m13.113s
sys 0m18.022s
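
To compare these runs more easily, one option is to convert the `time` output to seconds and estimate how many CPUs were busy on average, as (user + sys) / real. The following is a minimal sketch using the numbers above (only the first of the two nranks=4 nthrds=2 runs is included); the helper names `seconds` and `busy_cpus` are hypothetical. Note that user time may include time that OpenMP threads spend spinning while they wait, so a high value does not necessarily mean useful work.

```python
import re

# Timings copied from the runs above: (nranks, nthrds, real, user, sys).
RUNS = [
    (4, 2, "0m13.666s", "1m20.643s", "0m4.314s"),
    (8, 2, "0m11.908s", "2m9.973s", "0m7.549s"),
    (12, 2, "0m11.043s", "2m55.062s", "0m11.161s"),
    (16, 2, "0m11.087s", "3m45.074s", "0m15.343s"),
    (6, 4, "0m13.736s", "3m38.704s", "0m12.471s"),
    (8, 5, "0m12.378s", "5m13.113s", "0m18.022s"),
]


def seconds(stamp: str) -> float:
    """Convert a bash `time` stamp such as '1m20.643s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cpus(real: str, user: str, sys_time: str) -> float:
    """Average number of CPUs kept busy: (user + sys) / real."""
    return (seconds(user) + seconds(sys_time)) / seconds(real)


if __name__ == "__main__":
    for nranks, nthrds, real, user, sys_time in RUNS:
        print(f"nranks={nranks:2d} nthrds={nthrds} "
              f"threads={nranks * nthrds:2d} "
              f"real={seconds(real):7.3f}s "
              f"busy_cpus={busy_cpus(real, user, sys_time):5.1f}")
```

Read this way, going from 4 to 16 ranks with 2 threads each changes the wall-clock time only from about 13.7 s to about 11.1 s, while the user time grows from about 1m20s to about 3m45s.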

It seems that the above results are consistent with the comments given by Homer512:

Typical setups to test are one process per core (1-2 threads) or one per LLC with as many threads as appropriate.
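
The two setups in that comment can be written down for a given machine as follows. This is a minimal sketch, not a recommendation: it assumes one shared last-level cache (LLC) per socket, which should be checked on the actual hardware (for example with lscpu), and the names `Layout` and `candidate_layouts` are hypothetical.

```python
from typing import List, NamedTuple


class Layout(NamedTuple):
    label: str
    nranks: int  # value for mpirun -np
    nthrds: int  # value for OMP_NUM_THREADS


def candidate_layouts(sockets: int, cores_per_socket: int,
                      llcs_per_socket: int = 1) -> List[Layout]:
    """List the rank/thread layouts suggested in the comment.

    Assumption: llcs_per_socket defaults to 1 (one shared L3 per socket).
    """
    cores = sockets * cores_per_socket
    llcs = sockets * llcs_per_socket
    return [
        Layout("one rank per core, 1 thread", cores, 1),
        Layout("one rank per core, 2 threads (hyper-threads)", cores, 2),
        Layout("one rank per LLC, one thread per core", llcs, cores // llcs),
    ]


if __name__ == "__main__":
    # Dual Xeon E5-2699 v4: 2 sockets x 22 cores.
    for layout in candidate_layouts(sockets=2, cores_per_socket=22):
        print(f"{layout.label}: nranks={layout.nranks} nthrds={layout.nthrds}")
```

For this machine the sketch yields three starting points to time: 44 ranks with 1 thread, 44 ranks with 2 threads, and 2 ranks with 22 threads. Which one is fastest still has to be measured for the specific job.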

Regards, HZ



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
