- 21 Nov, 2012 2 commits
-
-
Matthieu Hautreux authored
When requesting a particular nodelist for a step, if at least one of the nodes is still in use by a former step (no REQUEST_STEP_COMPLETE received from that node), the current behavior is to return ESLURM_INVALID_TASK_MEMORY, and srun aborts with "Memory required by task is not available". This can be reproduced by launching consecutive steps with the -w parameter set to $SLURM_NODELIST and introducing delays in the spank epilog on the execution nodes. The behavior is changed to only defer the execution of the step, by returning ESLURM_NODES_BUSY, when it is detected that some nodes are blocked because their memory is still in use.
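The distinction this commit draws, between a step that can never fit and one that only has to wait, can be illustrated with a toy model. This is a minimal sketch and not Slurm's code: the `Node` class, the `check_step` function and the per-node memory accounting are hypothetical simplifications, and only the error code names come from the commit message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    total_mem: int   # MB usable by steps on this node (assumed model)
    mem_in_use: int  # MB held by former steps that have not reported completion


def check_step(nodes: List[Node], mem_per_node: int) -> str:
    """Classify a step request against the requested nodes (toy model)."""
    # Enough free memory on every requested node right now: start the step.
    if all(n.total_mem - n.mem_in_use >= mem_per_node for n in nodes):
        return "SLURM_SUCCESS"
    # The request would fit once the former steps release their memory:
    # defer the step instead of rejecting it.
    if all(n.total_mem >= mem_per_node for n in nodes):
        return "ESLURM_NODES_BUSY"
    # The request can never fit on these nodes: a genuine error.
    return "ESLURM_INVALID_TASK_MEMORY"
```

In this model a request for 500 MB on a 1000 MB node with 800 MB still held by a former step is deferred, while a request for 2000 MB on the same node is rejected outright.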
-
Matthieu Hautreux authored
When using consecutive steps, it appears that in some cases the time required by the slurmstepd on the execution nodes to inform the controller of the completion of a step is longer than the time required to request the following step. In that scenario, the controller can reject the step by returning the error code ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE, even though the step could be executed once all the former steps had finished correctly. This can be reproduced by launching consecutive steps and introducing delays in the spank epilog on the execution nodes. The behavior is changed to only defer the execution of the step, by returning ESLURM_NODES_BUSY, when some of the available nodes are not yet idle because of former steps.
-
- 20 Nov, 2012 2 commits
-
-
Danny Auble authored
slurmctld restart.
-
Morris Jette authored
-
- 19 Nov, 2012 3 commits
-
-
Danny Auble authored
allocation.
-
Morris Jette authored
NOTE: If you were setting the environment variable SLURMSTEPD_OOM_ADJ=-17, it should be set to -1000 for Linux kernel 2.6.36 or later.
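The note above reflects the Linux change from the `oom_adj` scale (where -17 disables OOM killing) to the `oom_score_adj` scale (where -1000 does) in kernel 2.6.36. A minimal sketch of choosing the value from a kernel release string follows; the function name `oom_disable_value` is hypothetical and this is not how Slurm itself decides.

```python
import re


def oom_disable_value(kernel_release: str) -> int:
    """Return the 'never OOM-kill' value matching the kernel's scale.

    kernel_release is a string such as platform.release() returns,
    for example "2.6.32-279.el6.x86_64".
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", kernel_release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {kernel_release!r}")
    # A missing patch level is treated as 0.
    version = tuple(int(part) if part else 0 for part in match.groups())
    # Kernels from 2.6.36 on use the oom_score_adj scale (-1000..1000);
    # older kernels use the oom_adj scale (-17..15).
    return -1000 if version >= (2, 6, 36) else -17
```

A launcher script could, for example, export `SLURMSTEPD_OOM_ADJ` with the result of `oom_disable_value(platform.release())`.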
-
Danny Auble authored
-
- 09 Nov, 2012 1 commit
-
-
Danny Auble authored
-
- 07 Nov, 2012 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
specifying the number of tasks and not the number of nodes.
-
- 05 Nov, 2012 2 commits
-
-
Morris Jette authored
On job kill request, send SIGCONT, SIGTERM, wait KillWait and send SIGKILL. Previously just sent SIGKILL to tasks.
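The signal sequence described in this commit can be sketched as follows. This is an illustration only, not the Slurm implementation: the function `terminate_tasks`, its parameters, and the choice to deliver each signal to all tasks before moving to the next one are assumptions; `kill_wait` stands in for the KillWait setting, in seconds.

```python
import os
import signal
import time


def terminate_tasks(pids, kill_wait, send=os.kill, sleep=time.sleep):
    """Send SIGCONT and SIGTERM, wait kill_wait seconds, then send SIGKILL."""

    def deliver(sig):
        for pid in pids:
            try:
                send(pid, sig)
            except ProcessLookupError:
                pass  # the task has already exited

    deliver(signal.SIGCONT)  # wake stopped tasks so they can handle SIGTERM
    deliver(signal.SIGTERM)  # ask tasks to shut down cleanly
    sleep(kill_wait)         # grace period
    deliver(signal.SIGKILL)  # force termination of whatever is left
```

The `send` and `sleep` parameters default to the real system calls and exist only so the ordering can be exercised without signalling real processes.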
-
Morris Jette authored
-
- 02 Nov, 2012 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 26 Oct, 2012 1 commit
-
-
Morris Jette authored
-
- 25 Oct, 2012 3 commits
-
-
Morris Jette authored
Incorrect error codes returned in some cases, especially if the slurmdbd is down
-
Morris Jette authored
-
Morris Jette authored
-
- 24 Oct, 2012 1 commit
-
-
Morris Jette authored
Previously for Linux systems all information was placed on a single line.
-
- 23 Oct, 2012 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
interface instead of through the normal method.
-
Danny Auble authored
-
Danny Auble authored
-
- 22 Oct, 2012 1 commit
-
-
Danny Auble authored
-
- 19 Oct, 2012 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 18 Oct, 2012 9 commits
-
-
Danny Auble authored
for passthrough gets removed on a dynamic system.
-
Danny Auble authored
in error for passthrough.
-
Danny Auble authored
-
Danny Auble authored
user's pending allocation was started with srun and then for some reason the slurmctld was brought down and while it was down the srun was removed.
-
Danny Auble authored
previously it overwrote the poll_thread id
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
This is needed for when a free request is added to a block but there are jobs finishing up so we don't start new jobs on the block since they will fail on start.
-
Danny Auble authored
-
- 17 Oct, 2012 1 commit
-
-
Morris Jette authored
Previously the node count would change from c-node count to midplane count (but still be interpreted as a c-node count).
-