- 12 Apr, 2012 1 commit
-
-
Danny Auble authored
-
- 10 Apr, 2012 3 commits
-
-
Danny Auble authored
and time limit where it was previously set by an admin.
-
Danny Auble authored
-
Danny Auble authored
slurmdbd accounting and running large amounts of jobs (>50 sec). Job information could be corrupted before it had a chance to reach the DBD.
-
- 03 Apr, 2012 1 commit
-
-
Morris Jette authored
Add support for new SchedulerParameters of max_depend_depth defining the maximum number of jobs to test for circular dependencies (i.e. job A waits for job B to start and job B waits for job A to start). Default value is 10 jobs.
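A minimal sketch of a depth-limited circular dependency test, assuming job dependencies are held in a plain dict that maps a job id to the ids it waits on. The function name `has_circular_dependency`, the dict layout, and the breadth-first traversal are illustrative assumptions, not the slurmctld implementation; only the `max_depend_depth` name and its default of 10 come from the commit message.

```python
from collections import deque


def has_circular_dependency(job_id, depends_on, max_depend_depth=10):
    """Return True if job_id is found to wait, directly or indirectly, on itself.

    depends_on maps a job id to the list of job ids it waits for
    (hypothetical layout). At most max_depend_depth jobs are tested, so a
    cycle longer than that limit is deliberately left undetected.
    """
    queue = deque(depends_on.get(job_id, []))
    seen = set()
    tested = 0
    while queue and tested < max_depend_depth:
        current = queue.popleft()
        if current == job_id:
            return True          # walked back to the starting job
        if current in seen:
            continue             # already tested through another path
        seen.add(current)
        tested += 1
        queue.extend(depends_on.get(current, []))
    return False
```

With `{"A": ["B"], "B": ["A"]}` the call `has_circular_dependency("A", ...)` reports the cycle, while a ring of 20 jobs goes unreported at the default limit because the search stops after testing 10 jobs.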
-
- 30 Mar, 2012 1 commit
-
-
Danny Auble authored
-
- 29 Mar, 2012 1 commit
-
-
Morris Jette authored
The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to get the job the maximum node count in the range while other logic was trying to minimize spreading out of the job across multiple switches. As you note, this problem only happens when a range of node counts is specified and both the select/cons_res plugin and the topology/tree plugin are in use, and even then it is not easy to reproduce (you included all of the details below).

Quoting Martin.Perry@Bull.com:
> Certain combinations of topology configuration and srun -N option produce
> spurious job rejection with "Requested node configuration is not
> available" with select/cons_res. The following example illustrates the
> problem.
>
> [sulu] (slurm) etc> cat slurm.conf
> ...
> TopologyPlugin=topology/tree
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> ...
>
> [sulu] (slurm) etc> cat topology.conf
> SwitchName=s1 Nodes=xna[13-26]
> SwitchName=s2 Nodes=xna[41-45]
> SwitchName=s3 Switches=s[1-2]
>
> [sulu] (slurm) etc> sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> ...
> jkob up infinite 4 idle xna[14,19-20,41]
> ...
>
> [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
> srun: Force Terminated job 79
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
>
> The problem does not occur with select/linear, or topology/none, or if -N
> is omitted, or for certain other values for -N (for example, -N 4-4 and -N
> 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
> src/plugins/select/cons_res/job_test.c. The srun man page states that when
> -N is used, "the job will be allocated as many nodes as possible within
> the range specified and without delaying the initiation of the job."
> Consistent with this description, the requested number of nodes in the
> above example is 4 (req_nodes=4). However, the code that selects the
> best-fit topology switches appears to make the selection based on the
> minimum required number of nodes (min_nodes=2). It therefore selects
> switch s1. s1 has only 3 nodes from partition jkob. Since this is fewer
> than req_nodes the job is rejected with the "node configuration" error.
>
> I'm not sure where the code is going wrong. It could be in the
> calculation of the number of needed nodes in function _enough_nodes. Or
> it could be in the code that initializes/updates req_nodes or rem_nodes. I
> don't feel confident that I understand the logic well enough to propose a
> fix without introducing a regression.
>
> Regards,
> Martin
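The mismatch described in the report can be illustrated with a toy model. The sketch below is not the `_eval_nodes_topo` code and does not show the actual fix; `best_fit_switch` and the `avail_nodes` dict are hypothetical names, and the node lists are derived from the topology.conf and sinfo output quoted in the report (idle jkob nodes xna[14,19-20] fall under s1, xna41 under s2, and s3 spans both).

```python
def best_fit_switch(avail_nodes, needed):
    """Pick the switch with the fewest available nodes that still has `needed`.

    avail_nodes maps a switch name to the idle partition nodes reachable
    through it (hypothetical layout). Returns None if no switch is big enough.
    """
    candidates = [
        (len(nodes), name)
        for name, nodes in avail_nodes.items()
        if len(nodes) >= needed
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# Idle nodes of partition jkob, grouped by the switches from the report.
avail_nodes = {
    "s1": ["xna14", "xna19", "xna20"],
    "s2": ["xna41"],
    "s3": ["xna14", "xna19", "xna20", "xna41"],
}

min_nodes, req_nodes = 2, 4  # from "srun -N 2-4"

# Selecting on min_nodes lands on s1, which cannot supply req_nodes.
chosen_by_min = best_fit_switch(avail_nodes, min_nodes)
# Selecting on req_nodes lands on the top-level switch s3 instead.
chosen_by_req = best_fit_switch(avail_nodes, req_nodes)
```

In this model, sizing the switch by `min_nodes` and then demanding `req_nodes` from it reproduces the rejection; using one node count consistently for both steps avoids it.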
-
- 27 Mar, 2012 2 commits
-
-
Morris Jette authored
When the optional max_time is not specified for --switches=count, the site max (SchedulerParameters=max_switch_wait=seconds) is used for the job. Based on patch from Rod Schultz.
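The fallback can be sketched as follows, assuming the option value has the form `count` or `count@max_time`. The function name `switch_wait_seconds` is hypothetical, and treating `max_time` as whole seconds is a simplification of this sketch rather than a statement about how srun parses time values.

```python
def switch_wait_seconds(switches_arg, site_max_switch_wait):
    """Return (switch_count, wait_seconds) for a --switches value.

    switches_arg is "count" or "count@max_time". max_time is read as whole
    seconds here purely for illustration (an assumption of this sketch).
    site_max_switch_wait stands in for SchedulerParameters=max_switch_wait.
    """
    count, sep, max_time = switches_arg.partition("@")
    if sep and max_time:
        wait = int(max_time)
    else:
        # No max_time given: use the site-wide maximum for this job.
        wait = site_max_switch_wait
    return int(count), wait
```

For example, `switch_wait_seconds("2", 300)` yields a wait of 300 seconds taken from the site setting, while `switch_wait_seconds("1@60", 300)` keeps the job's own 60 seconds.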
-
Morris Jette authored
Patch by Bill Brophy, Bull.
-
- 26 Mar, 2012 1 commit
-
-
Morris Jette authored
Patch by Don Lipari, LLNL. https://github.com/chaos/slurm/commit/4de11bf0a8cd18207a60e7d3e1fa7a6fde0da431
-
- 21 Mar, 2012 2 commits
-
-
Morris Jette authored
CRAY: Fix support for configuration with SlurmdTimeout=0 (never mark node that is DOWN in ALPS as DOWN in SLURM).
-
Morris Jette authored
-
- 20 Mar, 2012 1 commit
-
-
Morris Jette authored
Improve support for overlapping advanced reservations. Patch from Bill Brophy, Bull.
-
- 16 Mar, 2012 7 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
already pinged it on startup the unresponding flag would be removed from the frontend node.
-
Danny Auble authored
-
Danny Auble authored
mark front end node down.
-
Danny Auble authored
-
- 14 Mar, 2012 1 commit
-
-
Morris Jette authored
Cray - For srun wrapper when creating a job allocation, set the default job name to the executable file's name. Ignore leading directory names in the path.
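As an illustration of the naming rule only (the wrapper itself is not shown in this log), the default can be derived from the first element of the command line. `default_job_name` and its `argv` parameter are hypothetical names.

```python
import os.path


def default_job_name(argv):
    """Return the default job name for a command line, executable first.

    Leading directory names are dropped, so "/opt/app/bin/solver" becomes
    "solver". Returns None when no command was given.
    """
    if not argv:
        return None
    return os.path.basename(argv[0])
```

A job name given explicitly by the user would still take precedence; this only covers the case where none was supplied.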
-
- 13 Mar, 2012 3 commits
-
-
Morris Jette authored
Permit the srun and salloc commands to be executed in the background on Cray systems.
-
Morris Jette authored
Add new job state reason of "FrontEndDown" which applies only to Cray and IBM BlueGene systems.
-
Danny Auble authored
-
- 12 Mar, 2012 1 commit
-
-
Danny Auble authored
the queue when trying to place a larger than midplane job.
-
- 02 Mar, 2012 1 commit
-
-
Morris Jette authored
In cray/srun wrapper, only include aprun "-q" option when srun "--quiet" option is used.
-
- 29 Feb, 2012 1 commit
-
-
Morris Jette authored
-
- 28 Feb, 2012 1 commit
-
-
Morris Jette authored
-
- 24 Feb, 2012 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-
Morris Jette authored
-
- 23 Feb, 2012 1 commit
-
-
Danny Auble authored
-
- 20 Feb, 2012 1 commit
-
-
jette authored
Patch from Aleksej Saushev.
-
- 06 Feb, 2012 1 commit
-
-
Danny Auble authored
is a convenience function in BSD and glibc that internally calls the equivalent of

    int masterfd = open("/dev/ptmx", flags);
    grantpt(masterfd);
    unlockpt(masterfd);
    int slavefd = open(slave, O_RDWR|O_NOCTTY);

(in pseudocode)

On Linux, with some combinations of glibc/kernel (in this case glibc-2.14/Linux-3.1), the equivalent of grantpt(3) was failing in slurmstepd with EPERM, because the allocated pty was getting root ownership instead of the user running the slurm job.

From the POSIX description of grantpt:

"The grantpt() function shall change the mode and ownership of the slave pseudo-terminal device... The user ID of the slave shall be set to the real UID of the calling process..."

http://pubs.opengroup.org/onlinepubs/007904875/functions/grantpt.html

This means that for POSIX compliance, the real user id of slurmstepd must be the user executing the SLURM job at the time openpty(3) is called. Unfortunately, the real user id of slurmstepd at this point is still root, and only the effective uid is set to the user.

This patch is a work-around that uses the (non-portable) setresuid(2) system call to reset the real and effective uids of the slurmstepd process to the job user, but keep the saved uid of root. Then after the openpty(3) call, the previous credentials are reestablished using the same call.
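The credential juggling described above can be sketched with Python's `os` module, which exposes the same setresuid(2) and openpty(3) calls. This is an illustration of the pattern, not the slurmstepd patch itself (which is C); `openpty_as_user` is a hypothetical name and error handling is kept to a minimum.

```python
import os


def openpty_as_user(job_uid):
    """Open a pty pair while the real and effective uid are both job_uid.

    The saved uid is left untouched (-1 means "do not change"), so a process
    whose saved uid is root can restore its earlier credentials afterwards.
    Returns (master_fd, slave_fd).
    """
    real, effective, _saved = os.getresuid()
    # Real and effective uid become the job user; saved uid is kept.
    os.setresuid(job_uid, job_uid, -1)
    try:
        master_fd, slave_fd = os.openpty()
    finally:
        # Reestablish the previous real and effective uids with the same call.
        os.setresuid(real, effective, -1)
    return master_fd, slave_fd
```

Keeping the saved uid is what makes the restore step possible: a process may set its real uid to any of its current real, effective, or saved uids, so the saved root uid acts as the way back.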
-
- 03 Feb, 2012 1 commit
-
-
Morris Jette authored
Fix for srun running within an existing allocation with --exclude option and --nnodes count small enough to remove more nodes.
> salloc -N 8
salloc: Granted job allocation 1000008
> srun -N 2 -n 2 --exclude=tux3 hostname
srun: error: Unable to create job step: Requested node configuration is not available
Patch from Phil Eckert, LLNL.
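The expected behaviour for the failing command can be sketched as a simple filter over the allocation. This is a toy model, not the step-creation code touched by the patch; `pick_step_nodes` is a hypothetical name, and the node names other than tux3 are assumed for illustration.

```python
def pick_step_nodes(allocated, excluded, nnodes):
    """Choose nnodes nodes for a job step from an existing allocation.

    Excluded nodes are dropped first; the request fails only if fewer than
    nnodes usable nodes remain.
    """
    excluded = set(excluded)
    usable = [node for node in allocated if node not in excluded]
    if len(usable) < nnodes:
        raise ValueError("Requested node configuration is not available")
    return usable[:nnodes]


# An 8-node allocation (node names assumed), as in "salloc -N 8".
allocation = ["tux%d" % i for i in range(1, 9)]
step_nodes = pick_step_nodes(allocation, ["tux3"], 2)
```

With eight allocated nodes and one excluded, seven remain, so a two-node step should be created rather than rejected.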
-
- 02 Feb, 2012 1 commit
-
-
Morris Jette authored
Fix bug in step task distribution when nodes are not configured in numeric order. Patch from Hongjia Cao, NUDT.
-
- 01 Feb, 2012 2 commits
-
-
Morris Jette authored
Fix bug when a requeued batch job is scheduled to run on a different node zero, but attempts job launch on the old node zero, causing fatal error "Invalid host_index -1 for job #"
-
Morris Jette authored
Avoid slurmctld abort due to bad pointer when setting an advanced reservation MAINT flag if it contains no nodes (only licenses).
-
- 31 Jan, 2012 1 commit
-
-
Danny Auble authored
blocks are in an error state.
-