Commit d64a5f67 authored by Morris Jette

Retry MPI reserved port logic only for non-pack job steps

Ancient versions of OpenMPI and their derivatives (e.g. Cray MPI)
depend on communication ports being assigned to them by Slurm. Such MPI
jobs will experience step launch failure if any component of a
heterogeneous job step is unable to acquire the allocated ports.
Non-heterogeneous job steps will retry step launch using a new set of
communication ports (no change in Slurm behavior).
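
The retry policy described above can be sketched roughly as follows. This is
an illustration only, not Slurm's code: `launch_step`, `reserve_ports`,
`try_launch` and `max_attempts` are hypothetical names, and the sketch assumes
that each component of a heterogeneous step receives its own port reservation.

```python
from typing import Callable, List, Sequence

Ports = Sequence[int]


def launch_step(
    components: List[str],
    reserve_ports: Callable[[], Ports],
    try_launch: Callable[[str, Ports], bool],
    max_attempts: int = 3,
) -> bool:
    """Launch a job step, retrying only when the step is not heterogeneous.

    All names here are hypothetical. try_launch() returns False when a
    component could not acquire the ports it was given.
    """
    heterogeneous = len(components) > 1
    # A heterogeneous step gets a single attempt; a regular step may retry.
    attempts = 1 if heterogeneous else max_attempts
    for _ in range(attempts):
        # Assumption: every component gets a freshly reserved port set.
        # all() stops at the first component that fails to launch.
        if all(try_launch(c, reserve_ports()) for c in components):
            return True
    return False
```

With a single component, a failed port acquisition triggers another attempt
with a new reservation. With several components, the first failure ends the
step launch.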

NOTE: Supporting retries for heterogeneous job steps would require assigning
the same set of ports to all components of the heterogeneous job (not possible
today), plus changes to srun to better synchronize step startup and error
handling.

parent f4bf82c3