- 04 Dec, 2012 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
jette authored
-
- 03 Dec, 2012 1 commit
-
-
Morris Jette authored
-
- 30 Nov, 2012 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
on them. This should only happen in extreme conditions.
-
Danny Auble authored
-
Danny Auble authored
-
- 29 Nov, 2012 4 commits
-
-
Danny Auble authored
with associations get the deleted associations as well.
-
Danny Auble authored
-
Danny Auble authored
user mark the state canceled instead of completed.
-
Danny Auble authored
so it gets sent again. This isn't a major problem since the start will happen when the job ends, but this does make things cleaner.
-
- 28 Nov, 2012 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
you query against that with -N and -E you will get all jobs during that time instead of only the ones running on -N. Signed-off-by: Danny Auble <da@schedmd.com>
-
- 27 Nov, 2012 6 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
was already in error and isn't deallocating, and underlying hardware goes bad, one could get overlapping blocks in error, making the code assert when a new job request comes in.
-
Morris Jette authored
-
Danny Auble authored
overcommit.
-
- 21 Nov, 2012 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Matthieu Hautreux authored
A dedicated thread (_kill_thr) is launched by slurmstepd at the end of a step in order to destroy the IO thread if it does not manage to terminate correctly by itself within 300 seconds. This patch corrects two bugs in this logic. First, the sleep(300) call is not protected against interruptions, so signals received by slurmstepd can cut the delay to a few seconds and force the IO thread to terminate before the grace time expires. The logic is modified to ensure that the delay is respected by using a loop around the sleep(). Second, to terminate the IO thread, a SIGKILL is delivered to it using pthread_kill. However, sending SIGKILL using pthread_kill is a process-wide operation (see man pthread_kill), so all the slurmstepd threads are killed and slurmstepd is terminated. This logic is modified to use pthread_cancel() instead of pthread_kill(), giving the pthread_join() in _wait_for_io() a chance to act as expected. Without this patch, when _kill_thr is interrupted, slurmstepd is terminated, leaving the step in an incomplete state, as the node may not have been able to send the REQUEST_STEP_COMPLETE to the controller. Consecutive steps can then no longer be executed and stay permanently in the "Job step creation temporarily disabled, retrying" state.
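The two fixes described in this commit message can be illustrated with a minimal sketch. This is not the slurmstepd code: `sleep_full`, `stuck_io_thread`, and `reap_thread` are hypothetical names used only for illustration. It relies on two POSIX behaviors: sleep() returns the unslept seconds when a signal handler interrupts it, and pthread_cancel() acts on a single thread rather than on the whole process.

```c
#include <pthread.h>
#include <unistd.h>

/* Sleep for the whole delay. sleep() returns the number of unslept
 * seconds when a signal handler interrupts it, so loop on the remainder. */
static void sleep_full(unsigned int seconds)
{
    while (seconds > 0)
        seconds = sleep(seconds);
}

/* Hypothetical stand-in for an I/O thread that never finishes by itself. */
static void *stuck_io_thread(void *arg)
{
    (void)arg;
    for (;;)
        pause();            /* pause() is a cancellation point */
    return NULL;
}

/* Wait `grace` seconds, then cancel only the target thread and join it.
 * Returns 1 if the thread ended by cancellation, 0 if it had already
 * returned by itself, -1 if the join failed. */
static int reap_thread(pthread_t tid, unsigned int grace)
{
    void *result = NULL;

    sleep_full(grace);
    pthread_cancel(tid);    /* targets one thread; the process keeps running */
    if (pthread_join(tid, &result) != 0)
        return -1;
    return result == PTHREAD_CANCELED ? 1 : 0;
}
```

A caller would create the worker with pthread_create() and pass its thread ID to `reap_thread()` with the desired grace time. Cancellation only takes effect when the target reaches a cancellation point, which is why the stand-in thread blocks in pause().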
-
Matthieu Hautreux authored
When requesting a particular nodelist for a step, if at least one of the nodes is still used by a former step (no REQUEST_STEP_COMPLETE received from that node), the current behavior is to return ESLURM_INVALID_TASK_MEMORY, and srun aborts with "Memory required by task is not available". This can be reproduced by launching consecutive steps with the -w parameter set to $SLURM_NODELIST and introducing delays in the spank epilog on the execution nodes. The behavior is changed to only defer the execution of the step by returning ESLURM_NODES_BUSY when it is detected that some nodes are blocked because of already used memory.
-
Matthieu Hautreux authored
When using consecutive steps, it appears that in some cases the time required by the slurmstepd on the execution nodes to inform the controller of the completion of the step is longer than the time required to request the following step. In that scenario, the controller can reject the step by returning the error code ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE even if the step could be executed once all the former steps had finished correctly. This can be reproduced by launching consecutive steps and introducing delays in the spank epilog on the execution nodes. The behavior is changed to only defer the execution of the step by returning ESLURM_NODES_BUSY when all the available nodes are not idle considering the former steps.
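The difference between rejecting a step and deferring it, as described in the two commit messages above, can be sketched from the requester's side. Everything here is an assumption made for illustration: the enum values are placeholders and not Slurm's real error numbers, and `request_step` and `fake_controller` are hypothetical helpers, not Slurm functions.

```c
/* Placeholder codes for illustration only; NOT Slurm's real error numbers. */
enum step_rc {
    STEP_OK = 0,
    STEP_NODES_BUSY,         /* stands in for ESLURM_NODES_BUSY */
    STEP_CONFIG_UNAVAILABLE  /* stands in for ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE */
};

typedef enum step_rc (*step_request_fn)(void *ctx);

/* Re-issue the request while the reply is "nodes busy"; any other reply,
 * success or hard error, ends the loop at once. A real client would
 * sleep between attempts. */
static enum step_rc request_step(step_request_fn fn, void *ctx, int max_tries)
{
    enum step_rc rc = STEP_NODES_BUSY;

    for (int i = 0; i < max_tries && rc == STEP_NODES_BUSY; i++)
        rc = fn(ctx);
    return rc;
}

/* Hypothetical controller: busy until the former steps have reported
 * completion, tracked here as a simple countdown. */
static enum step_rc fake_controller(void *ctx)
{
    int *pending = ctx;

    if (*pending > 0) {
        (*pending)--;
        return STEP_NODES_BUSY;
    }
    return STEP_OK;
}
```

With a "busy" reply the requester can simply try again once the former steps have completed, whereas a hard error such as the previous return codes makes it give up immediately.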
-
- 20 Nov, 2012 2 commits
-
-
Danny Auble authored
slurmctld restart.
-
Morris Jette authored
-
- 19 Nov, 2012 3 commits
-
-
Danny Auble authored
allocation.
-
Morris Jette authored
NOTE: If you were setting the environment variable SLURMSTEPD_OOM_ADJ=-17, it should be set to -1000 for Linux kernel 2.6.36 or later.
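The change from -17 to -1000 follows from the two kernel scales involved: the legacy oom_adj value ranges from -17 to 15, while oom_score_adj (available since Linux 2.6.36) ranges from -1000 to 1000. The sketch below shows the correspondence, assuming a linear scaling between the two ranges; `oom_adj_to_score_adj` is a hypothetical helper for illustration and not part of Slurm.

```c
/* Map a legacy oom_adj value (-17..15) onto the oom_score_adj scale
 * (-1000..1000), assuming linear scaling with the end points pinned. */
static int oom_adj_to_score_adj(int oom_adj)
{
    if (oom_adj >= 15)
        return 1000;    /* most likely to be chosen by the OOM killer */
    if (oom_adj <= -17)
        return -1000;   /* exempt from the OOM killer */
    return oom_adj * 1000 / 17;
}
```

Under this mapping the old "never kill" value of -17 corresponds to -1000, which matches the value given in the note.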
-
Danny Auble authored
-
- 09 Nov, 2012 1 commit
-
-
Danny Auble authored
-
- 07 Nov, 2012 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
specifying the number of tasks and not the number of nodes.
-
- 05 Nov, 2012 2 commits
-
-
Morris Jette authored
On a job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, then send SIGKILL. Previously only SIGKILL was sent to tasks.
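The signal sequence described above can be sketched for a single child process as follows. This is an illustration under stated assumptions, not the Slurm implementation: `terminate_task` is a hypothetical helper, `kill_wait` plays the role of the KillWait setting, and the target is assumed to be a direct child so that waitpid() can observe it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Translate a wait status into the terminating signal (0 on normal exit). */
static int terminating_signal(int status)
{
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Resume the task, ask it to terminate, give it up to kill_wait seconds,
 * then force it. Returns the signal that ended the task, 0 if it exited
 * normally, or -1 on error. */
static int terminate_task(pid_t pid, unsigned int kill_wait)
{
    int status = 0;

    /* SIGCONT first, so a stopped task can act on the SIGTERM. */
    if (kill(pid, SIGCONT) < 0 || kill(pid, SIGTERM) < 0)
        return -1;

    for (unsigned int waited = 0; waited < kill_wait; waited++) {
        if (waitpid(pid, &status, WNOHANG) == pid)
            return terminating_signal(status);
        sleep(1);
    }

    /* Grace period expired: SIGKILL cannot be caught or ignored. */
    if (kill(pid, SIGKILL) < 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return terminating_signal(status);
}
```

A task that handles SIGTERM gets the grace period to clean up, while a task that ignores it is still removed once the wait expires.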
-
Morris Jette authored
-
- 02 Nov, 2012 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-