- 27 Oct, 2014 1 commit
-
-
Morris Jette authored
bug 1207
-
- 24 Oct, 2014 3 commits
-
-
Morris Jette authored
Use default_partition by default for the test
-
Morris Jette authored
The test returned a pass when it should have failed, because of a bad variable name.
-
David Singleton authored
We've seen slurmctld crashes due to negative job array indices.
-
- 23 Oct, 2014 4 commits
-
-
Morris Jette authored
The previous patch should work in most cases, but this should work more reliably and the comment is clearer. bug 1196
-
David Bigagli authored
This reverts commit 7e65f924.
-
David Bigagli authored
-
Morris Jette authored
BGQ: Fix race condition when job fails due to hardware failure and is requeued. Previous code could result in slurmctld abort with NULL pointer. bug 1096
-
- 22 Oct, 2014 1 commit
-
-
Gennaro Oliva authored
-
- 21 Oct, 2014 1 commit
-
-
Morris Jette authored
Fix bug that prevented preservation of a job's GRES bitmap on slurmctld restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a job's gres when requeued" and only applies when GRES mapped to specific files). bug 1192
-
- 20 Oct, 2014 4 commits
-
-
Danny Auble authored
-
David Bigagli authored
permission for the batch step.
-
David Bigagli authored
-
jette authored
Otherwise there will be no log file to write to, resulting in an abort. bug 1185
-
- 18 Oct, 2014 1 commit
-
-
Nicolas Joly authored
-
- 17 Oct, 2014 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
David Bigagli authored
-
David Bigagli authored
-
Morris Jette authored
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or restart. Previously licenses for suspended jobs were not counted, so the license count could be exceeded when those jobs were resumed.
-
Danny Auble authored
commit a9dc50d4.
-
- 16 Oct, 2014 5 commits
-
-
Brian Christiansen authored
-
Danny Auble authored
to commit a9dc50d4.
-
Morris Jette authored
-
Morris Jette authored
Refine commit 5f89223f based upon feedback from David Gloe:
* It's not only MPI jobs, but anything that uses PMI. That includes MPI, shmem, etc, so you may want to reword the error message.
* I added the terminated flag because if multiple tasks on a node exit, you would get an error message from each of them. That reduces it to one error message per node. Cray bug 810310 prompted that change.
* Since we're now relying on --kill-on-bad-exit, I think we should update the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in contribs/cray/slurm.conf.template).
bug 1171
-
Morris Jette authored
Treat Cray MPI job calling exit() without mpi_fini() as fatal error for that specific task and let srun handle all timeout logic. Previous logic would cancel the entire job step and srun options for wait time and kill on exit were ignored. The new logic provides users with the following type of response:
    $ srun -n3 -K0 -N3 --wait=60 ./tmp
    Task:0 Cycle:1
    Task:2 Cycle:1
    Task:1 Cycle:1
    Task:0 Cycle:2
    Task:2 Cycle:2
    slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
    srun: error: tux2: task 1: Killed
    Task:0 Cycle:3
    Task:2 Cycle:3
    Task:0 Cycle:4
    ...
bug 1171
-
- 15 Oct, 2014 9 commits
-
-
Morris Jette authored
This fixes a race condition if the slurmctld needed to power up a node shortly after startup. Previously it would execute the ResumeProgram twice for affected nodes.
-
Morris Jette authored
Without this change, a node in the cloud that failed to power up would not have its NoResponding flag cleared, which would prevent its later use. The NoResponding flag is now cleared manually when the node is modified to PowerDown.
-
Morris Jette authored
If a batch job launch to the cloud fails, permit an unlimited number of job requeues. Previously the job would abort on the second launch failure.
-
Nicolas Joly authored
This reverts commit 4d03d0b4. Make sure the correct Author is attributed here.
-
Danny Auble authored
This reverts commit 1891936e.
-
Danny Auble authored
This has apparently been broken from the get go. This fixes bug 1172. test21.22 should be updated to test the dump and load of a file that is generated.
-
Danny Auble authored
since this represents the user could be exaggerating their system.
-
Danny Auble authored
-
Danny Auble authored
using --ntasks-per-node. This is related to bug 1145. What was happening is that all the CPUs were allocated on one socket instead of in a cyclic fashion. While this is allowed, it is strange and resulted in this bug. There appears to be a different bug as to why the tasks were laid out in a block fashion in the first place.
-
- 14 Oct, 2014 5 commits
-
-
Danny Auble authored
with no way to get them out. This fixes bug 1134. It is advised that the pro/epilog call xtprocadmin in the script instead of returning a non-zero exit code.
-
Brian Christiansen authored
The job could have been purged from a short MinJobAge and the trigger would then point to an invalid job. Bug #1144
-
Danny Auble authored
-
Morris Jette authored
Note that PlugStackConfig defaults to plugstack.conf in the same directory as slurm.conf. The added logic tests whether the file actually exists (using stat) and, if it is not found, does not fork/exec slurmstepd to invoke the spank prolog/epilog. This saves about 14 msec on startup and 14 msec on shutdown if no spank plugins are configured. It also eliminates some possible failures (e.g. if fork() fails, or the slurmstepd processes cannot exec()). This logic also caches the PlugStackConfig value and reads it again on reconfigure, but avoids reading the value for each job. bug 982
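The check described above could look roughly like the following. This is only a minimal sketch of the idea, not the code from the commit: the names cached_plugstack, cache_plugstack_path() and spank_helper_needed() are hypothetical, and the fixed-size buffer is an assumption made to keep the example short.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical cache of the PlugStackConfig path. It is filled at startup
 * and again on reconfigure, so it is not re-read for each job. */
static char cached_plugstack[4096] = "";

/* Hypothetical helper: store the configured path (empty string if none). */
static void cache_plugstack_path(const char *path)
{
	snprintf(cached_plugstack, sizeof(cached_plugstack), "%s",
		 path ? path : "");
}

/* Hypothetical helper: return true only if the cached path can be
 * stat()ed. When this returns false the caller can skip the fork/exec
 * of the helper process that would run the spank prolog/epilog. */
static bool spank_helper_needed(void)
{
	struct stat st;

	if (cached_plugstack[0] == '\0')
		return false;
	return stat(cached_plugstack, &st) == 0;
}
```

A caller would invoke cache_plugstack_path() once at startup and on each reconfigure, then test spank_helper_needed() before launching a job's prolog or epilog.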
-
Morris Jette authored
Add "void" argument to a function and rename a local function to have a prefix of "_"
-