Decrease the maximum sleep time between srun job step creation retry
attempts from 60 seconds to 29 seconds. This should eliminate a possible
synchronization problem with gang scheduling that could result in job
step creation requests only occurring while the job is suspended.
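
The synchronization problem can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the srun code: it assumes a 30-second gang scheduling time slice and a job that strictly alternates between running and suspended, and the names TIME_SLICE_SEC, job_suspended_at and retries_while_suspended are hypothetical helpers invented for the illustration.

```c
#include <stdbool.h>

/* Assumption: gang scheduling time slice of 30 seconds. */
#define TIME_SLICE_SEC 30

/* Assumed model: the job runs for one slice, is suspended for the
 * next, and so on, starting in the running state at t = 0. */
static bool job_suspended_at(int t)
{
    return ((t / TIME_SLICE_SEC) % 2) == 1;
}

/* Count how many of `attempts` retries, spaced `interval` seconds
 * apart and beginning at time `start`, land while the job is
 * suspended. */
static int retries_while_suspended(int start, int interval, int attempts)
{
    int hits = 0;
    for (int i = 0; i < attempts; i++) {
        if (job_suspended_at(start + i * interval))
            hits++;
    }
    return hits;
}
```

In this model a 60-second interval spans exactly one running slice plus one suspended slice, so a first attempt made while the job is suspended is followed by retries that all land in a suspended slice. A 29-second interval is not a multiple of the slice length, so successive retries drift across slice boundaries and some of them reach the job while it is running.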