slurmstepd: use (SGI) container ID to confirm ALPS reservation
This equips the slurmstepd to use the SGI process aggregate (PAGG) container
ID to confirm the ALPS reservation. The code allows ALPS to return a
temporary error when confirming the reservation once; the job is then
requeued. On the other hand, it is mandatory that the Cray job service
works correctly, hence errors are returned if
 * job container creation fails, or
 * the job is not attached to a container (anticipating the later failure
   in _fork_all_tasks(), where slurm_container_add() would fail for the
   same reason).

The patch relies on the internals of the proctrack/sgi_job plugin in order
to avoid duplicating code. This dependency is made explicit by a
configuration check in a subsequent patch.

With these two pieces in place, the front ends are set to DRAINING if a
system administrator forgets to enable the /etc/init.d/job service, as
shown in the following log entries.

slurmd log:
  [2011-04-26T14:56:04] Launching batch job 134 for UID 21215
  [2011-04-26T14:56:04] [134] no PAGG ID: job service disabled on this host?
  [2011-04-26T14:56:04] [134] could not confirm ALPS resId 253
  [2011-04-26T14:56:04] [134] job_manager exiting abnormally, rc = 4014

slurmctld log:
  [2011-04-26T14:56:03] ALPS RESERVATION #253, JobId 134: BASIL -n 2 -N 1 -d 1 -m 16000
  [2011-04-26T14:56:03] sched: Allocate JobId=134 NodeList=nid000[16-17] #CPUs=24
  [2011-04-26T14:56:04] error: slurmd error 4014 running JobId=134 on \
      front_end=gele2: Slurmd could not set up environment for batch job
  [2011-04-26T14:56:04] update_front_end: set state of gele2 to DRAINING
  [2011-04-26T14:56:04] completing job 134