mpi/pmix: set of optimizations for _ucx_progress
- Avoid an extra ucp_worker_progress call
- Avoid extra initialization in the _ucx_progress call
Effect on the ping-pong latency test, in us (~500 ns improvement for messages up to 128 KiB):
size        UCXv6    UCXv7
1            19.2     18.8
2            19.2     18.7
4            19.2     18.7
8            19.6     19.2
16           19.9     19.3
32           19.8     19.4
64           20.1     19.5
128          20.5     20.2
256          21.0     20.4
512          21.1     20.6
1024         20.9     20.5
2048         22.1     21.6
4096         25.3     24.6
8192         28.5     27.7
16384        31.2     30.7
32768        37.0     36.1
65536        48.1     47.8
131072       72.6     72.2
262144     2104.7   2229.3
524288     2722.0   2817.8
1048576    3756.2   3693.1
2097152    6206.3   6148.5
4194304   10281.3  10230.7
Signed-off-by: Artem Polyakov <artpol84@gmail.com>