Determine if the CCM prologue needs to be rerun during job recovery
On Cray native Slurm systems, the setup performed before invoking the CCM prologue increments the job's prolog_running value and ORs JOB_CONFIGURING into job_state. After the CCM prologue completes, both changes are reverted. This setup lets the CCM prologue complete before the job launch continues. If slurmctld is shut down or killed while a CCM prologue is executing, the two job fields cannot be reverted because slurmctld is no longer running to do so. They are now cleared during job recovery, in the select/cray plugin's select_p_job_init() procedure. If a recovered job came from a CCM-defined partition and either field is still set as described above, the CCM prologue is run again. The CCM prologue tolerates being called more than once. Both field changes are reverted once the rerun CCM prologue completes. The CCM epilogue is not affected.
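
The recovery decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual select/cray code: struct sketch_job, SKETCH_JOB_CONFIGURING (and its value), ccm_prolog_rerun_needed() and ccm_prolog_done() are hypothetical stand-ins, and the real Slurm job record and flag definitions differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the JOB_CONFIGURING state flag; the value is
 * illustrative only and is not taken from the Slurm headers. */
#define SKETCH_JOB_CONFIGURING 0x4000u

/* Hypothetical, reduced view of a job record (assumption: only the two
 * fields named in this commit, plus a CCM partition marker). */
struct sketch_job {
	uint32_t job_state;      /* base state plus OR'd flag bits */
	uint16_t prolog_running; /* incremented before the CCM prologue */
	bool ccm_partition;      /* job came from a CCM-defined partition */
};

/* During recovery: rerun the CCM prologue only for CCM partition jobs
 * that still carry either leftover marker from an interrupted run. */
static bool ccm_prolog_rerun_needed(const struct sketch_job *job)
{
	if (!job->ccm_partition)
		return false;
	return (job->prolog_running > 0) ||
	       ((job->job_state & SKETCH_JOB_CONFIGURING) != 0);
}

/* After the (re)run prologue completes: undo both setup changes. */
static void ccm_prolog_done(struct sketch_job *job)
{
	if (job->prolog_running > 0)
		job->prolog_running--;
	job->job_state &= ~SKETCH_JOB_CONFIGURING;
}
```

Because the rerun is keyed on the leftover markers rather than on a separate "interrupted" record, a job whose prologue finished cleanly before the shutdown is left alone on recovery.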