- 15 Jan, 2013 2 commits
-
-
David Bigagli authored
Add PriorityFlags value of "TICKET_BASED".
-
Morris Jette authored
-
- 14 Jan, 2013 12 commits
-
-
jette authored
-
Hongjia Cao authored
On job step launch failure, the function "slurm_step_launch_wait_finish()" is called twice in launch/slurm, which causes srun to abort:
    srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired
    srun: error: Application launch failed: Job credential expired
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
    cn5 cn4 cn7
    srun: error: Timed out waiting for job step to complete
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
    srun: error: Timed out waiting for job step to complete
    srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
    Aborted (core dumped)
The attached patch (for version 2.5.1) fixes the abort. However, the messages "Job step aborted: Waiting up to 2 seconds for job step to finish." and "Timed out waiting for job step to complete" will still be printed twice.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Correction to CPU allocation count logic for cores without hyperthreading.
-
Hongjia Cao authored
When a job launched directly with srun ends abnormally, there is a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH SIGNAL 9 ***) from each node and/or a task-exit message (srun: error: task[0-1]: Terminated) for each node. For large-scale jobs, these messages become tedious and bury the other error messages. The two attached patches (for slurm-2.5.1) introduce two environment variables to control the output of such messages:
    SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node prints the step-killed message.
    SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task exit messages with the same exit code are printed only once.
-
Hongjia Cao authored
When a job launched directly with srun ends abnormally, there is a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH SIGNAL 9 ***) from each node and/or a task-exit message (srun: error: task[0-1]: Terminated) for each node. For large-scale jobs, these messages become tedious and bury the other error messages. The two attached patches (for slurm-2.5.1) introduce two environment variables to control the output of such messages:
    SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node prints the step-killed message.
    SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task exit messages with the same exit code are printed only once.
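The behaviour described for SLURM_SRUN_REDUCE_TASK_EXIT_MSG can be illustrated with a small sketch. This is not the patch itself (the patch is C code inside srun); the function name `reduce_task_exit_messages`, the `(task_label, exit_code)` tuple shape, and the handling of non-numeric values are assumptions made only for illustration.

```python
import os
from itertools import groupby


def reduce_task_exit_messages(messages, env=None):
    """Collapse runs of successive messages that share an exit code.

    `messages` is a list of (task_label, exit_code) tuples; this shape is
    hypothetical and chosen only for the illustration.
    """
    env = os.environ if env is None else env
    raw = env.get("SLURM_SRUN_REDUCE_TASK_EXIT_MSG", "")
    try:
        enabled = int(raw) != 0  # "set and non-zero" enables the reduction
    except ValueError:
        enabled = False  # assumption: unset or non-numeric means disabled
    if not enabled:
        return list(messages)
    # Keep only the first message of each run of equal exit codes.
    return [next(group) for _, group in groupby(messages, key=lambda m: m[1])]


if __name__ == "__main__":
    sample = [("task[0-1]", 15), ("task[2-3]", 15), ("task[4]", 9)]
    env = {"SLURM_SRUN_REDUCE_TASK_EXIT_MSG": "1"}
    for label, code in reduce_task_exit_messages(sample, env=env):
        print(f"srun: error: {label}: exit code {code}")
```

Only successive duplicates are collapsed, so a later message with an exit code seen earlier is still printed if a different code appeared in between.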
-
Morris Jette authored
-
Morris Jette authored
-
Yair Yarom authored
-
Morris Jette authored
-
Morris Jette authored
-
- 11 Jan, 2013 10 commits
-
-
https://github.com/SchedMD/slurm
jette authored
-
jette authored
User root and SlurmUser do not need a valid sbcast credential.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
This can be useful for testing purposes
-
Morris Jette authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 10 Jan, 2013 15 commits
-
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Used to specify the communication protocol to be used for ALPS/BASIL.
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
-
Danny Auble authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- 09 Jan, 2013 1 commit
-
-
Danny Auble authored
-