- 16 Jan, 2013 5 commits
-
-
Morris Jette authored
-
-
Morris Jette authored
-
Morris Jette authored
The function gres_plugin_job_test() was returning a count of cores available to a job, but the select plugins were treating this as a CPU count. This change converts the core count into a CPU count as needed in the select plugins and updates the comments related to gres_plugin_job_test().
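The conversion described above can be sketched as follows. This is a minimal illustration, not the actual Slurm code: the names `cores_to_cpus`, `limit_cpus`, and the `NO_LIMIT` sentinel are hypothetical, and the sketch assumes that every core on the node exposes the same number of hardware threads.

```c
#include <stdint.h>

/* Hypothetical sentinel meaning "GRES places no limit on the core count". */
#define NO_LIMIT UINT32_MAX

/* Convert a core count into a CPU count, assuming each core provides
 * threads_per_core schedulable CPUs (hypothetical helper). */
static uint32_t cores_to_cpus(uint32_t core_cnt, uint32_t threads_per_core)
{
    if (core_cnt == NO_LIMIT)
        return NO_LIMIT;           /* no limit stays no limit */
    if (threads_per_core == 0)
        threads_per_core = 1;      /* treat unknown topology as 1 thread per core */
    return core_cnt * threads_per_core;
}

/* Cap the CPUs a job may use on a node by the GRES-derived limit.
 * The limit arrives in cores and must be converted before comparing. */
static uint32_t limit_cpus(uint32_t avail_cpus, uint32_t gres_cores,
                           uint32_t threads_per_core)
{
    uint32_t gres_cpus = cores_to_cpus(gres_cores, threads_per_core);
    return (gres_cpus < avail_cpus) ? gres_cpus : avail_cpus;
}
```

With 16 CPUs available, a GRES limit of 4 cores, and 2 threads per core, `limit_cpus(16, 4, 2)` yields 8. Comparing the raw core count against the CPU count would have capped the job at 4.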
-
Danny Auble authored
-
- 15 Jan, 2013 1 commit
-
-
Matthieu Hautreux authored
QoS limits enforcement on the controller side is based on a list of used_limits per user. When a user is not yet in the list, which is common when the controller has been restarted and the user has no running jobs, the current logic skips some of the per-user limit checks and lets the submission succeed. However, if one of these limits is zero, the check should fail: a zero-valued limit means that no job may be submitted at all, since any job would necessarily cross the limit. This patch ensures that zero-valued limits are enforced even when the user is not yet present in the per-user used_limits list.
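A minimal sketch of the corrected check, under stated assumptions: the struct `used_limits` below has a single hypothetical field, and `submit_allowed` and `LIMIT_UNSET` are illustrative names rather than Slurm's. The point is that a missing per-user record is treated as zero usage instead of skipping the check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no per-user limit configured". */
#define LIMIT_UNSET UINT32_MAX

/* Hypothetical per-user usage record; NULL means the user is not yet
 * in the list (for example, right after a controller restart). */
struct used_limits {
    uint32_t jobs;   /* jobs the user currently has */
};

/* Return true if one more job may be submitted under the per-user limit. */
static bool submit_allowed(const struct used_limits *used,
                           uint32_t max_jobs_per_user)
{
    /* A missing record counts as zero usage; it does not bypass the check. */
    uint32_t in_use = (used != NULL) ? used->jobs : 0;

    if (max_jobs_per_user == LIMIT_UNSET)
        return true;
    /* With a limit of 0 this is never true, so the submission is rejected. */
    return in_use < max_jobs_per_user;
}
```

Under this sketch, `submit_allowed(NULL, 0)` is false while `submit_allowed(NULL, 1)` is true, which matches the behavior the commit message describes.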
-
- 14 Jan, 2013 6 commits
-
-
jette authored
-
Hongjia Cao authored
On job step launch failure, the function "slurm_step_launch_wait_finish()" will be called twice in launch/slurm, which causes srun to be aborted:
srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
cn5 cn4 cn7
srun: error: Timed out waiting for job step to complete
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
Aborted (core dumped)
The attached patch (version 2.5.1) fixes it. But the message "Job step aborted: Waiting up to 2 seconds for job step to finish. Timed out waiting for job step to complete" will still be printed twice.
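The commit message does not say how the patch avoids the second call. One common way to make a cleanup function safe to call twice is to record that it has already run. The sketch below illustrates that general technique only; `struct step_ctx` and `step_wait_finish` are hypothetical and are not the srun data structures or the actual fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step context; the real launch context differs. */
struct step_ctx {
    bool finished;       /* set once the wait/cleanup has run */
    int  cleanup_calls;  /* counts how often the cleanup body executed */
};

/* Wait for the step and release its resources at most once.
 * A second call, or a call with NULL, is a harmless no-op. */
static void step_wait_finish(struct step_ctx *ctx)
{
    if (ctx == NULL || ctx->finished)
        return;
    ctx->finished = true;
    ctx->cleanup_calls++;   /* stands in for the real wait and teardown */
}
```

Calling `step_wait_finish()` twice on the same context runs the teardown once, so state that was freed on the first call is never touched again.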
-
Morris Jette authored
-
Yair Yarom authored
-
Morris Jette authored
-
Morris Jette authored
-
- 11 Jan, 2013 6 commits
-
-
https://github.com/SchedMD/slurm
jette authored
-
jette authored
User root and SlurmUser don't need a valid sbcast credential
-
Morris Jette authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
- 10 Jan, 2013 7 commits
-
-
jette authored
-
Morris Jette authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-
- 09 Jan, 2013 6 commits
-
-
Danny Auble authored
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
David Bigagli authored
-
Danny Auble authored
-
- 08 Jan, 2013 6 commits
-
-
Danny Auble authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Rod Schultz authored
One of our testers has observed that when a long-running job continues to run after a maintenance reservation comes into effect, sinfo reports the node as being in the allocated state while scontrol shows it to be in the maintenance state. This can happen when a node is not completely allocated (select cons_res, a partition which is not Shared=EXCLUSIVE, jobs allocated without --exclusive, or jobs that are allocated only some of the CPUs on a node).

Execution paths in scontrol leading up to calls to node_state_string (slurm_protocol_defs.c) or node_state_string_compact test for allocated_cpus less than total_cpus on the node and set the node state to MIXED rather than ALLOCATED, while similar paths in sinfo do not. I think this is probably a bug, since the mixed state is defined, and I think it is desirable that both commands return the same result.

The problem can be fixed with two logic changes (in multiple places):
1) node_state_string and node_state_string_compact have to check for mixed as well as allocated before returning the MAINT state. This means that the reported state for the node with the allocated job will be MIXED.
2) sinfo must also check for allocated_cpus less than total_cpus and set the state to MIXED before calling either node_state_string or node_state_string_compact.

The attached patch (against 2.5.1) makes these changes. The attached script is a test case.
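The two logic changes can be sketched roughly as below. This is an illustration under assumptions, not the patch itself: the enum, the `FLAG_MAINT` bit, and the functions `effective_state` and `state_string` are hypothetical stand-ins for Slurm's node state encoding and for node_state_string.

```c
#include <stdint.h>

/* Hypothetical base states and flag; Slurm's real encoding differs. */
enum base_state { STATE_IDLE, STATE_ALLOCATED, STATE_MIXED };
#define FLAG_MAINT 0x1u

/* Change 2: the caller downgrades ALLOCATED to MIXED when only some
 * of the node's CPUs are allocated. */
static enum base_state effective_state(enum base_state base,
                                       uint32_t alloc_cpus,
                                       uint32_t total_cpus)
{
    if (base == STATE_ALLOCATED && alloc_cpus < total_cpus)
        return STATE_MIXED;
    return base;
}

/* Change 1: check for both ALLOCATED and MIXED before reporting MAINT. */
static const char *state_string(enum base_state base, unsigned flags)
{
    if (flags & FLAG_MAINT) {
        if (base == STATE_ALLOCATED)
            return "ALLOCATED";
        if (base == STATE_MIXED)
            return "MIXED";
        return "MAINT";
    }
    switch (base) {
    case STATE_ALLOCATED: return "ALLOCATED";
    case STATE_MIXED:     return "MIXED";
    default:              return "IDLE";
    }
}
```

If both commands route their node records through the same pair of steps, a node in a maintenance reservation with 4 of 16 CPUs allocated is reported as MIXED by each of them.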
-
Morris Jette authored
-
- 03 Jan, 2013 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-