Commit 35f5133e authored by Morris Jette

slurmstepd: use (SGI) container ID to confirm ALPS reservation

This equips slurmstepd to use the SGI process aggregate (PAGG) container ID
to confirm the ALPS reservation.

The way it is coded tolerates a single temporary error from ALPS when
confirming the reservation; in that case the job is requeued.
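
For illustration only, a minimal sketch of this confirmation step; the names
basil_confirm_sketch() and confirm_alps_reservation() are hypothetical
stand-ins, not the plugin's real interface, and only the control flow
(temporary BASIL error => requeue, anything else => hard failure) is meant
to be representative:

#include <errno.h>
#include <stdint.h>

enum basil_status { BASIL_OK, BASIL_TRANSIENT, BASIL_ERROR };

/* Hypothetical wrapper around the BASIL CONFIRM request, keyed on the
 * ALPS reservation ID and the SGI process aggregate (PAGG) container ID. */
extern enum basil_status basil_confirm_sketch(uint32_t resv_id,
					      uint64_t pagg_id);

/* Returns 0 on success.  A non-zero return makes the job manager abort
 * the launch; for a transient BASIL error the job can then be requeued. */
static int confirm_alps_reservation(uint32_t resv_id, uint64_t pagg_id)
{
	switch (basil_confirm_sketch(resv_id, pagg_id)) {
	case BASIL_OK:
		return 0;
	case BASIL_TRANSIENT:
		return EAGAIN;	/* temporary ALPS error: requeue the job */
	default:
		return EIO;	/* permanent failure, cf. "could not confirm
				 * ALPS resId" in the slurmd log below */
	}
}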

On the other hand, it is mandatory that the Cray job service work correctly;
therefore errors are returned if
 * job container creation fails, or
 * the job is not attached to a container (anticipating a later failure in
   _fork_all_tasks(), where slurm_container_add() would fail for the same
   reason); a sketch of this container check follows below.
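
The container check can be sketched roughly as follows, assuming the SGI job
library's job_getjid() interface that the proctrack/sgi_job plugin relies on
(the plugin resolves it at run time); get_pagg_id() and the local jid_t
typedef exist only for this sketch:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

typedef uint64_t jid_t;			/* PAGG container ID, declared
					 * locally for this sketch */
extern jid_t job_getjid(pid_t pid);	/* from the SGI job library (libjob) */

/* Returns the container ID the calling process is attached to, or 0 if it
 * is not attached to any container, e.g. because the job service is not
 * running ("no PAGG ID: job service disabled on this host?" in the slurmd
 * log below). */
static uint64_t get_pagg_id(void)
{
	jid_t jid = job_getjid(getpid());

	if (jid == (jid_t)-1)
		return 0;
	return (uint64_t)jid;
}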

The patch relies on the internals of the proctrack/sgi_job plugin in order to
avoid duplicating code. This dependency is made explicit by a configuration
check in a subsequent patch.

With these two pieces in place, the frontends are set to DRAINING if a system
administrator forgets to enable the /etc/init.d/job service, as shown in the
following log entries:

slurmd log:
[2011-04-26T14:56:04] Launching batch job 134 for UID 21215
[2011-04-26T14:56:04] [134] no PAGG ID: job service disabled on this host?
[2011-04-26T14:56:04] [134] could not confirm ALPS resId 253
[2011-04-26T14:56:04] [134] job_manager exiting abnormally, rc = 4014

slurmctld log:
[2011-04-26T14:56:03] ALPS RESERVATION #253, JobId 134: BASIL -n 2 -N 1 -d 1 -m 16000
[2011-04-26T14:56:03] sched: Allocate JobId=134 NodeList=nid000[16-17] #CPUs=24
[2011-04-26T14:56:04] error: slurmd error 4014 running JobId=134 on \
			front_end=gele2: Slurmd could not set up environment for batch job
[2011-04-26T14:56:04] update_front_end: set state of gele2 to DRAINING
[2011-04-26T14:56:04] completing job 134
parent ac21f730