- 28 Apr, 2011 1 commit
-
-
Danny Auble authored
-
- 27 Apr, 2011 15 commits
-
-
Morris Jette authored
-
Moe Jette authored
-
Morris Jette authored
assorted changes for Cray system use of proctrack/sgi_job
-
Morris Jette authored
This adds a link test for libjob.so:
 * salloc needs direct support (Makefile.am),
 * however, X_AC_SGI_JOB comes much later in configure.ac.
(An alternative would be, since the libjob interface has practically not changed for 2.6 kernels, to integrate its 19 ioctls into slurm.)
-
Morris Jette authored
This adds detection and use of SGI process aggregate job container IDs for salloc interactive sessions. The preferred and documented way to support this on a Cray system is to enable the provided pam_job.so via /etc/pam.d/common-session. There is a header dependency on job.h, which is provided by the optional cray-libjob-devel package, installed into /opt/cray/job/<version>. However, this package is not always installed and may not be up to date. Hence the patch "cheats" by duplicating the known prototype of job_getjid().
-
Morris Jette authored
To work properly, select/cray requires proctrack/sgi_job. In fact, due to the way the container functions are called by the slurmstepd, it will only work properly with this plugin. I have considered alternatives, such as falling back to using the SID to confirm the allocation. However, trying to accommodate configuration errors in this way creates other problems, such as less stringent error checking.
-
Morris Jette authored
This equips the slurmstepd to use the SGI process aggregate container ID to confirm the ALPS reservation. The way it is coded allows ALPS to return a temporary error in confirming the reservation once; the job will then be requeued. On the other hand, it is mandatory that the Cray job service works correctly, therefore errors are returned if
 * job container creation fails or
 * the job is not attached to a container (anticipating later failure in _fork_all_tasks() when slurm_container_add() will fail for the same reason).

The patch relies on the internals of the proctrack/sgi_job plugin in order to avoid duplicating code. This dependency is made explicit by a configuration check of a subsequent patch. With these two pieces in place, the frontends are set to DRAINING if a system administrator forgets to enable the /etc/init.d/job service, as shown in the following log entries:

slurmd log:
  [2011-04-26T14:56:04] Launching batch job 134 for UID 21215
  [2011-04-26T14:56:04] [134] no PAGG ID: job service disabled on this host?
  [2011-04-26T14:56:04] [134] could not confirm ALPS resId 253
  [2011-04-26T14:56:04] [134] job_manager exiting abnormally, rc = 4014

slurmctld log:
  [2011-04-26T14:56:03] ALPS RESERVATION #253, JobId 134: BASIL -n 2 -N 1 -d 1 -m 16000
  [2011-04-26T14:56:03] sched: Allocate JobId=134 NodeList=nid000[16-17] #CPUs=24
  [2011-04-26T14:56:04] error: slurmd error 4014 running JobId=134 on \
                        front_end=gele2: Slurmd could not set up environment for batch job
  [2011-04-26T14:56:04] update_front_end: set state of gele2 to DRAINING
  [2011-04-26T14:56:04] completing job 134
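The decision logic described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of the patch: the enum names, the function confirm_with_container(), and the treatment of a non-temporary ALPS error are all hypothetical. A container ID of 0 stands for "no container", as in the proctrack plugins.

```c
#include <stdint.h>

/* Hypothetical outcome codes for this sketch; not Slurm's real return values. */
enum confirm_result { CONFIRM_OK, CONFIRM_REQUEUE, CONFIRM_FATAL };

/* Hypothetical view of what ALPS answered to the confirmation request. */
enum alps_status { ALPS_OK, ALPS_TEMPORARY_ERROR, ALPS_PERMANENT_ERROR };

/*
 * Decide what to do with a job given its PAGG container ID and the ALPS answer.
 * A missing container (ID 0) is always an error, because the Cray job service
 * must be working. A temporary ALPS error leads to a requeue.
 */
enum confirm_result confirm_with_container(uint64_t pagg_id, enum alps_status alps)
{
    if (pagg_id == 0)
        return CONFIRM_FATAL;      /* "no PAGG ID: job service disabled?" */

    switch (alps) {
    case ALPS_OK:
        return CONFIRM_OK;
    case ALPS_TEMPORARY_ERROR:
        return CONFIRM_REQUEUE;    /* the job gets another chance */
    default:
        return CONFIRM_FATAL;      /* assumption: any other error is fatal */
    }
}
```

In this sketch the missing-container check comes first, so a host without the job service fails regardless of what ALPS reports, which matches the log excerpt above.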
-
Morris Jette authored
This uses the SGI container process aggregate ID to confirm the job reservation. It falls back to using the alloc_sid in case of failure. This fallback should be considered only as a last resort: session IDs are known not to be unique across multiple login nodes, so the confirmation of ALPS reservations will fail whenever there is a SID collision (the likelihood increases with system size).
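A minimal sketch of the fallback described here, assuming a lookup callback in place of the real libjob call. The names choose_confirm_cookie(), jid_lookup_fn, stub_found() and stub_missing() are hypothetical and exist only for illustration; the actual patch may be structured differently. Only getsid() is a real POSIX call.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a container ID lookup; returns 0 if none is found. */
typedef uint64_t (*jid_lookup_fn)(pid_t pid);

/* Example lookups used to exercise both paths of the sketch. */
uint64_t stub_found(pid_t pid)   { (void)pid; return 42; }
uint64_t stub_missing(pid_t pid) { (void)pid; return 0; }

/*
 * Prefer the container ID; fall back to the session ID only as a last resort,
 * since session IDs are not unique across login nodes.
 * Sets *used_fallback to 1 when the session ID had to be used.
 * Returns 0 if neither identifier is available.
 */
uint64_t choose_confirm_cookie(pid_t pid, jid_lookup_fn lookup, int *used_fallback)
{
    uint64_t jid = lookup ? lookup(pid) : 0;

    if (jid != 0 && jid != (uint64_t)-1) {
        *used_fallback = 0;
        return jid;
    }

    *used_fallback = 1;
    pid_t sid = getsid(pid);
    return (sid < 0) ? 0 : (uint64_t)sid;
}
```

Reporting the fallback through a flag lets the caller log a warning when the weaker identifier is used.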
-
Morris Jette authored
-
Morris Jette authored
This extends the Cray-specific select_jobinfo struct with a confirmation cookie field, which later patches use to store the session SID or PAGG container ID. There is a slight incompatibility with regard to pack/unpack, due to the new confirm_cookie format. Since not many Cray installations exist yet, I suggest not doing an extra bump of the API version.
-
Morris Jette authored
This is just for consistency with other proctrack plugins, which all return 0 to indicate "not found", rather than (uint64_t)-1.
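The convention can be illustrated with a small helper that maps the (uint64_t)-1 sentinel to 0. This is a sketch only; normalize_container_id() and CONTAINER_NOT_FOUND are hypothetical names, not part of the plugin.

```c
#include <stdint.h>

/* By convention in this sketch, 0 means "no container found". */
#define CONTAINER_NOT_FOUND ((uint64_t)0)

/* Map the (uint64_t)-1 sentinel onto 0; pass every other ID through unchanged. */
uint64_t normalize_container_id(uint64_t raw_id)
{
    return (raw_id == (uint64_t)-1) ? CONTAINER_NOT_FOUND : raw_id;
}
```

With a single "not found" value, callers can test the result with a plain comparison against 0 regardless of which proctrack plugin produced it.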
-
Morris Jette authored
The select/cray plugin discovers the topology as part of its initialisation and generates a node ranking. No further topology information is required by the plugin, hence this patch sets the default TopologyPlugin to topology/none.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 26 Apr, 2011 19 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
-
Moe Jette authored
-
Moe Jette authored
-
-
Danny Auble authored
-
http://github.com/chaos/slurm
Danny Auble authored
-
http://github.com/chaos/slurm
Danny Auble authored
-
http://github.com/chaos/slurm
Danny Auble authored
-
Morris Jette authored
recent changes for Cray system support
-
Danny Auble authored
-
http://github.com/chaos/slurm
Danny Auble authored
-
Moe Jette authored
-
Moe Jette authored
to create job allocation.
-
Moe Jette authored
-
Moe Jette authored
--mem-per-cpu options on a heterogeneous cluster. Patch from Bjorn-Helge Mevik, University of Oslo.
-
Danny Auble authored
-
- 25 Apr, 2011 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
have slurm srun command installed on a cray system instead of the wrapper.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
to function as desired.
-