- 14 Apr, 2017 8 commits
-
-
Dong Ahn authored
As specified in the MPIR debug interface (https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf), the presence of the MPIR_partial_attach_ok symbol should inform the debugger that the initial startup synchronization is implemented in such a way that the tool need not attach to nor continue MPI processes that the user is not interested in controlling. To implement this, SLURM chose to send SIGCONT to those processes that are not attached by the debugger. However, the old code does not reliably detect whether a process is being traced by the debugger, and this has led to various side effects. On some systems (e.g., TOSS2), the old code sends SIGCONT to all of the target processes, including those attached by the debugger. On newer systems (e.g., TOSS3), it does not send SIGCONT to the target processes at all. It seems that one of the reasons for such undefined behavior is the use of CLONE_PTRACE. @grondo found no documentation that indicates CLONE_PTRACE is for the case where the process is being attached by a debugger. More importantly, this code is matching clone(2) flags against proc(5) process flags, which are not the same: task->flags is defined in terms of the PF_* flags from the kernel source include/linux/sched.h. This patch fixes these problems by replacing the old detection logic with logic based on the TracerPid field in /proc/<pid>/status. From proc(5), TracerPid: PID of process tracing this process (0 if not being traced).
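The TracerPid check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the function names parse_tracer_pid() and pid_is_traced() are hypothetical, and the only thing taken from the commit message is the idea of reading the TracerPid field from /proc/<pid>/status.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: scan a stream in /proc/<pid>/status format for the
 * "TracerPid:" line. Returns the tracer's PID (0 if the process is not
 * being traced), or -1 if the field is not present.
 */
long parse_tracer_pid(FILE *fp)
{
    char line[256];
    long tracer;

    while (fgets(line, sizeof(line), fp)) {
        /* Whitespace in the format matches the tab after the colon. */
        if (sscanf(line, "TracerPid: %ld", &tracer) == 1)
            return tracer;
    }
    return -1;
}

/*
 * Hypothetical helper: returns 1 if pid is being traced, 0 if it is not,
 * and -1 if the status file cannot be read or lacks the field.
 */
int pid_is_traced(pid_t pid)
{
    char path[64];
    FILE *fp;
    long tracer;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long) pid);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    tracer = parse_tracer_pid(fp);
    fclose(fp);
    if (tracer < 0)
        return -1;
    return tracer > 0;
}
```

Under this sketch, a caller would send SIGCONT only to processes for which pid_is_traced() returns 0.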
-
Thomas Opfer authored
Improve job scheduling sort: after sorting by priority, we now sort by submit time and then by job id. We used to not consider submit time. This handles the case where the job_ids have rolled over or we are doing federation scheduling. Bug 3524
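The three-level ordering described above can be sketched as a qsort() comparator. The struct job_entry record and the job_entry_cmp() name are hypothetical simplifications, not SLURM's actual data structures. The sort directions (higher priority first, earlier submit time first, lower job id first) are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical, simplified job record used only for this illustration. */
struct job_entry {
    uint32_t priority;
    time_t   submit_time;
    uint32_t job_id;
};

/*
 * Comparator for qsort(): priority first, then submit time, then job id.
 * Assumed directions: higher priority, earlier submit time and lower
 * job id sort first.
 */
int job_entry_cmp(const void *a, const void *b)
{
    const struct job_entry *x = a;
    const struct job_entry *y = b;

    if (x->priority != y->priority)
        return (x->priority > y->priority) ? -1 : 1;
    if (x->submit_time != y->submit_time)
        return (x->submit_time < y->submit_time) ? -1 : 1;
    if (x->job_id != y->job_id)
        return (x->job_id < y->job_id) ? -1 : 1;
    return 0;
}
```

Comparing submit time before job id means a job whose id is numerically small only because the id counter wrapped does not jump ahead of an older job at the same priority.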
-
Morris Jette authored
All problems introduced in the course of changing un/pack logic required for removing pack jobs logic
-
Morris Jette authored
-
Morris Jette authored
bug 926
-
Morris Jette authored
bug 926
-
Morris Jette authored
bug 926
-
Brian Christiansen authored
Display with -Ocluster. Sort with -S[+|-]cluster.
-
- 13 Apr, 2017 32 commits
-
-
Morris Jette authored
bug 926
-
Danny Auble authored
-
Danny Auble authored
least an admin for clearer code.
-
Danny Auble authored
We can use the authorized key otherwise.
-
Danny Auble authored
-
Danny Auble authored
but don't give them admin privileges.
-
Tim Wickberg authored
-
Brian Christiansen authored
Add missing --local option. Add note about -M implying --local.
-
Brian Christiansen authored
-
Brian Christiansen authored
environment variable and indentation.
-
Brian Christiansen authored
They could either be running on another cluster or not active on the cluster.
-
Brian Christiansen authored
-
Brian Christiansen authored
Continuation of 3a9970c0
-
Morris Jette authored
-
Morris Jette authored
If a task in a parallel job fails and it was launched with the --kill-on-bad-exit option, then terminate the remaining tasks using the SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL.
-
Morris Jette authored
Sinfo command with support for federated clusters
-
Brian Christiansen authored
-
Brian Christiansen authored
The job was completing before the test could find that the job was running. The script was running for 10 seconds and wait_for_fed_job can back off and check only every 10 seconds.
-
Brian Christiansen authored
Can't use the revoked state to determine if the job is pending or not since an origin job could be revoked if it doesn't have an active job on itself.
-
Brian Christiansen authored
-
Morris Jette authored
Coverity CID 45347
-
Morris Jette authored
Coverity CID 45345 and 45346
-
Morris Jette authored
Coverity CID 45342
-
Morris Jette authored
Coverity CID 45340
-
Morris Jette authored
Coverity CID 45341
-
Morris Jette authored
Avoid comparison between double and int. Coverity CID 45336
-
Morris Jette authored
Coverity CID 44687
-
Morris Jette authored
This eliminates comparing a double with an integer. Coverity CID 45337
-
Morris Jette authored
Coverity CID 44872
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Add slurm_load_partitions2() function to route RPC to specific cluster. Add slurm_load_node2() and slurm_load_node_single2() functions to route requests to specific cluster in a federation. Modify srun to get node/partition information for each cluster in a federation at the same time using separate pthreads. Add sinfo sort by cluster name (--sort V).
-