- 17 Oct, 2014 16 commits
-
-
Morris Jette authored
Conflicts: META NEWS
-
Morris Jette authored
-
Morris Jette authored
Conflicts: META src/slurmd/slurmd/req.c
-
Morris Jette authored
-
David Bigagli authored
-
David Bigagli authored
-
David Bigagli authored
-
Morris Jette authored
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or restart. Previously licenses for suspended jobs were not counted, so the license count could be exceeded when those jobs were resumed.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
commit a9dc50d4.
-
Morris Jette authored
Add sbcast support for file transfer to resources allocated to a job step rather than a job allocation.
-
Nicolas Joly authored
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
- 16 Oct, 2014 16 commits
-
-
Danny Auble authored
be the SLURM_*_PROTOCOL_VERSION.
-
Morris Jette authored
bug 1178
-
Morris Jette authored
No change in functionality, just eliminated a call to time().
-
Danny Auble authored
-
Morris Jette authored
Ensure that a file needed for the test exists. This problem happens on emulated Cray systems.
-
Morris Jette authored
-
Danny Auble authored
Conflicts: src/plugins/jobcomp/mysql/jobcomp_mysql.c
-
Brian Christiansen authored
-
Danny Auble authored
-
Danny Auble authored
to commit a9dc50d4.
-
Danny Auble authored
Conflicts: src/plugins/task/cray/task_cray.c
-
Morris Jette authored
-
Morris Jette authored
Conflicts: contribs/cray/slurm.conf.template src/plugins/task/cray/task_cray.c src/slurmctld/job_mgr.c
-
Morris Jette authored
Refine commit 5f89223f based upon feedback from David Gloe:
* It's not only MPI jobs, but anything that uses PMI. That includes MPI, shmem, etc, so you may want to reword the error message.
* I added the terminated flag because if multiple tasks on a node exit, you would get an error message from each of them. That reduces it to one error message per node. Cray bug 810310 prompted that change.
* Since we're now relying on --kill-on-bad-exit, I think we should update the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in contribs/cray/slurm.conf.template).
bug 1171
-
Morris Jette authored
Treat Cray MPI job calling exit() without mpi_fini() as fatal error for that specific task and let srun handle all timeout logic. Previous logic would cancel the entire job step and srun options for wait time and kill on exit were ignored. The new logic provides users with the following type of response:
$ srun -n3 -K0 -N3 --wait=60 ./tmp
Task:0 Cycle:1
Task:2 Cycle:1
Task:1 Cycle:1
Task:0 Cycle:2
Task:2 Cycle:2
slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
srun: error: tux2: task 1: Killed
Task:0 Cycle:3
Task:2 Cycle:3
Task:0 Cycle:4
...
bug 1171
-
jette authored
Specifically, if a job is terminated from the suspended state, end_time will be the time at which the job suspension began.
-
- 15 Oct, 2014 8 commits
-
-
Morris Jette authored
If set, powered down nodes in the cloud will be visible.
-
Morris Jette authored
This fixes a race condition if the slurmctld needed to power up a node shortly after startup. Previously it would execute the ResumeProgram twice for affected nodes.
-
Morris Jette authored
Without this change, a node in the cloud that failed to power up would not have its NoResponding flag cleared, which would prevent its later use. The NoResponding flag is now cleared when the node is manually modified to PowerDown.
-
Morris Jette authored
If a batch job launch to the cloud fails, permit an unlimited number of job requeues. Previously the job would abort on the second launch failure.
-
Morris Jette authored
This fixes a race condition if the slurmctld needed to power up a node shortly after startup. Previously it would execute the ResumeProgram twice for affected nodes.
-
Morris Jette authored
Without this change, a node in the cloud that failed to power up would not have its NoResponding flag cleared, which would prevent its later use. The NoResponding flag is now cleared when the node is manually modified to PowerDown.
-
Morris Jette authored
If a batch job launch to the cloud fails, permit an unlimited number of job requeues. Previously the job would abort on the second launch failure.
-
Nicolas Joly authored
to commit fa73701e.
-