- 17 Jan, 2013 6 commits
-
-
Morris Jette authored
Conflicts: src/sacctmgr/sacctmgr.c src/sreport/sreport.c
-
David Bigagli authored
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
From Matthieu Hautreux: However, after discussing the point with the onsite Bull support team and looking at the slurmstepd code concerning stdout/err/in redirection, we would like to recommend two things for future versions of SLURM: - shutdown(..., SHUT_WR) should be performed when managing the TCP sockets: no shutdown(..., SHUT_WR) is performed on the TCP socket in slurmstepd eio management. Thus, the close() cannot reliably inform the other end of the socket that the transmission is done (no TCP FIN transmitted). As the close is followed by an exit(), the kernel is the only entity that knows the close may not have been taken into account by the other side (which might be our initial problem), so no retry can be performed, leaving the server side of the socket (srun) in a position where it can wait on a read until the end of time. - TCP keepalive addition. No TCP keepalive seems to be configured in SLURM TCP exchanges, potentially leaving the system deadlocked if a remote host disappears while the local host is waiting on a read (a write would result in EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes) could enhance the resilience of SLURM against unexpected packet/connection loss without much impact on the scalability of the solution. The timeout could be made configurable in case it is found too aggressive for particular configurations.
-
Morris Jette authored
Added "KeepAliveTime" configuration parameter. From Matthieu Hautreux: TCP keepalive addition. No TCP keepalive seems to be configured in SLURM TCP exchanges, potentially leaving the system deadlocked if a remote host disappears while the local host is waiting on a read (a write would result in EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes) could enhance the resilience of SLURM against unexpected packet/connection loss without much impact on the scalability of the solution. The timeout could be made configurable in case it is found too aggressive for particular configurations.
-
- 16 Jan, 2013 18 commits
-
-
Morris Jette authored
Conflicts: src/slurmctld/proc_req.c
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Without this change a high-priority batch job may not start at submit time. In addition, a pending job with multiple partitions may be cancelled when the scheduler runs if any of its partitions cannot be used by the job.
-
David Bigagli authored
-
Morris Jette authored
The original work this was based upon has been replaced with new logic.
-
Morris Jette authored
Without this patch, if the first listed partition lacks nodes with required features the job would be rejected.
-
Morris Jette authored
While this will validate the job at submit time, it results in redundant looping when scheduling jobs. Working on an alternate patch now.
-
Danny Auble authored
-
Danny Auble authored
submission.
-
Morris Jette authored
-
Morris Jette authored
-
-
Morris Jette authored
-
Morris Jette authored
The gres_plugin_job_test function was returning a count of cores available to a job, but the select plugins were treating this as a CPU count. This change converts the core count into a CPU count as needed in the select plugin and updates the comments related to the function gres_plugin_job_test().
-
Danny Auble authored
-
- 15 Jan, 2013 8 commits
-
-
Matthieu Hautreux authored
-
jette authored
-
jette authored
Logic now in priority/multifactor plugin with PriorityFlags=TICKET_BASED.
-
Morris Jette authored
-
Morris Jette authored
Conflicts: src/slurmctld/acct_policy.c
-
Matthieu Hautreux authored
QoS limits enforcement on the controller side is based on a list of used_limits per user. When a user is not yet added to the list, which is common when the controller is restarted and the user has no running jobs, the current logic is to not check some of the "per user limits" and let the submission succeed. However, if one of these limits is a zero-valued limit, the check should fail, as it means that no job should be submitted at all; any submission would necessarily exceed the limit. This patch ensures that even when a user is not yet present in the per-user used_limits list, the zero-valued limits are correctly enforced.
-
David Bigagli authored
Add PriorityFlags value of "TICKET_BASED".
-
Morris Jette authored
-
- 14 Jan, 2013 8 commits
-
-
jette authored
-
Hongjia Cao authored
On job step launch failure, the function slurm_step_launch_wait_finish() will be called twice in launch/slurm, which causes srun to abort: srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired srun: error: Application launch failed: Job credential expired srun: Job step aborted: Waiting up to 2 seconds for job step to finish. cn5 cn4 cn7 srun: error: Timed out waiting for job step to complete srun: Job step aborted: Waiting up to 2 seconds for job step to finish. srun: error: Timed out waiting for job step to complete srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed. Aborted (core dumped) The attached patch (version 2.5.1) fixes it, but the message "Job step aborted: Waiting up to 2 seconds for job step to finish. Timed out waiting for job step to complete" will still be printed twice.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Correction to CPU allocation count logic for cores without hyperthreading.
-
Hongjia Cao authored
With jobs launched using srun directly that end abnormally, there will be a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH SIGNAL 9 ***) from each node, and/or a task-exit message (srun: error: task[0-1]: Terminated) for each node. For large-scale jobs these messages become tedious and the other error messages get buried. The attached two patches (for slurm-2.5.1) introduce two environment variables to control the output of such messages: SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will print the step-killed message; SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task-exit messages with the same exit code will be printed only once.
-
Hongjia Cao authored
With jobs launched using srun directly that end abnormally, there will be a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH SIGNAL 9 ***) from each node, and/or a task-exit message (srun: error: task[0-1]: Terminated) for each node. For large-scale jobs these messages become tedious and the other error messages get buried. The attached two patches (for slurm-2.5.1) introduce two environment variables to control the output of such messages: SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will print the step-killed message; SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task-exit messages with the same exit code will be printed only once.
-
Morris Jette authored
-