Commits · 18e3d6fbc85604253d3a1c39bceb877ba90368dd · Manuel G. Marciani / ces_slurm_simulator

14 Apr, 2017 8 commits

Fix MPIR_partial_attach_ok issues for parallel debuggers. · 18e3d6fb

Dong Ahn authored Apr 14, 2017

As specified in the MPIR debug interface
(https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf),
the presence of the MPIR_partial_attach_ok symbol
should inform the debugger that the initial startup synchronization
is implemented in such a way that the tool need not attach
nor continue MPI processes that the user is not interested in controlling.

To implement this, SLURM chose to send SIGCONT to those processes that are
not attached by the debugger.

However, the old code does not reliably detect the condition
in which a process is traced by the debugger, and this
has led to various side effects.

On some systems (e.g., TOSS2), the old code sends SIGCONT to
all of the target processes including those attached by the debugger.
On newer systems (e.g., TOSS3), it does not send SIGCONT
to the target processes at all.

One of the reasons for this inconsistent behavior appears to be
the use of CLONE_PTRACE.
@grondo found no documentation indicating that
CLONE_PTRACE is set when a process is being attached
by a debugger.
More importantly, the old code matches clone(2) flags
against proc(5) process flags, which are not the same: task->flags
holds the PF_* flags defined in the kernel source include/linux/sched.h.

This patch fixes these problems by replacing
the old detection logic with a check based on the TracerPid field
in /proc/<pid>/status.

From proc(5), TracerPid: PID of process tracing this process (0 if not
being traced).

18e3d6fb
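
A minimal sketch of the TracerPid check described in this commit, written in Python rather than Slurm's C and not taken from the patch itself. The function names `parse_tracer_pid` and `is_traced` are hypothetical; only the `TracerPid` field of `/proc/<pid>/status` comes from the commit message and proc(5).

```python
from typing import Optional


def parse_tracer_pid(status_text: str) -> Optional[int]:
    """Return the TracerPid value found in the text of /proc/<pid>/status.

    Returns None if the field is missing (hypothetical helper, for illustration).
    """
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            # The line has the form "TracerPid:\t<pid>"; 0 means "not traced".
            return int(line.split(":", 1)[1].strip())
    return None


def is_traced(pid: int) -> bool:
    """True if some process is tracing `pid` (Linux only, reads procfs)."""
    with open(f"/proc/{pid}/status") as status_file:
        tracer = parse_tracer_pid(status_file.read())
    # Assumption: a missing field is treated the same as "not traced".
    return bool(tracer)
```

Under this scheme, a launcher would send SIGCONT only to tasks for which `is_traced(pid)` is false, leaving the attached ones under the debugger's control.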

Include submit_time when doing the sort for job scheduling. · 030d9d4b

Thomas Opfer authored Apr 14, 2017

Improve the job scheduling sort: after sorting by priority, we now sort by
submit time and then by job ID. Previously, submit time was not considered.
This handles the case where job IDs have rolled over or federation
scheduling is in use.

Bug 3524

030d9d4b
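
The ordering this commit describes can be sketched as a three-part sort key. This is an illustration in Python, not Slurm's comparator: the `Job` class and `schedule_order` function are hypothetical, and the sort directions (higher priority first, earlier submit time first, lower job ID first) are assumptions, since the commit message names only the fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: int
    priority: int
    submit_time: int  # seconds since the epoch


def schedule_order(jobs: List[Job]) -> List[Job]:
    """Order jobs by priority, then submit time, then job ID.

    Assumed directions: higher priority first, earlier submit first,
    lower job ID first.
    """
    return sorted(jobs, key=lambda j: (-j.priority, j.submit_time, j.job_id))
```

With equal priorities, a job with a large ID that was submitted earlier now sorts ahead of a later job whose ID is small because the counter rolled over.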

Fix problems reported in latest coverity report · a93b6a07

Morris Jette authored Apr 14, 2017

All of these problems were introduced while changing the un/pack logic,
  as required for removing the pack jobs logic.

a93b6a07

Merge branch 'unpack' · 0cba10d4

Morris Jette authored Apr 14, 2017

0cba10d4

Revert commit 133a4249 · 1fc38b96

Morris Jette authored Apr 14, 2017

bug 926

1fc38b96

Revert commit 1b010388 · 41749bf9

Morris Jette authored Apr 13, 2017

bug 926

41749bf9

Revert commits c6e3bc97 and ece58780 · ae3b2e78

Morris Jette authored Apr 13, 2017

bug 926

ae3b2e78

Display job's cluster name in squeue · 53e03c8a

Brian Christiansen authored Apr 13, 2017

Display with -Ocluster
Sort with -S[+|-]cluster

53e03c8a

13 Apr, 2017 32 commits
- Revert commit 8f4604bb · 91bab0fb
  Morris Jette authored Apr 13, 2017
```
bug 926
```
  91bab0fb
- Minor formatting fix. · 3ac2a65d
  Danny Auble authored Apr 13, 2017
  
  3ac2a65d
- Change variables to be operator when we are using it to distinguish we are at · fe244ba0
  Danny Auble authored Apr 13, 2017
```
least an admin for clearer code.
```
  fe244ba0
- There is no reason to check the user id after we have already checked it. · e28f4300
  Danny Auble authored Apr 13, 2017
```
We can use the authorized key instead.
```
  e28f4300
- Get rid of extra "admin" variable. · 28ce131b
  Danny Auble authored Apr 13, 2017
  
  28ce131b
- Make it so acct coord can change things on a job they are a coordinator over, · 7f3b7f1b
  Danny Auble authored Apr 13, 2017
```
but don't give them admin privileges.
```
  7f3b7f1b
- Remove empty source files, and detach from build. · f004de45
  Tim Wickberg authored Apr 13, 2017
  
  f004de45
- Update sinfo man page · 2acc666f
  Brian Christiansen authored Apr 13, 2017
```
Add missing --local option
Add note about -M implying --local.
```
  2acc666f
- Add SINFO_LOCAL env variable. · effa96ce
  Brian Christiansen authored Apr 13, 2017
  
  effa96ce
- Fix sprio man page · 4f674d59
  Brian Christiansen authored Apr 13, 2017
```
environment variable and indentation.
```
  4f674d59
- Don't show revoked jobs in sprio · 18d36575
  Brian Christiansen authored Apr 13, 2017
```
They could either be running on another cluster or not active on the
cluster.
```
  18d36575
- Handle federated sprio when no jobs are returned · 9c202931
  Brian Christiansen authored Apr 13, 2017
  
  9c202931
- scontrol -M<cluster> show jobs implies --local · b8e421b0
  Brian Christiansen authored Apr 13, 2017
```
Continuation of 3a9970c0
```
  b8e421b0
- Correct typo in NEWS · 9acb6350
  Morris Jette authored Apr 13, 2017
  
  9acb6350
- Modify --kill-on-bad-exit clean up logic · 8ed8098b
  Morris Jette authored Apr 13, 2017
```
If a task in a parallel job fails and it was launched with the
--kill-on-bad-exit option, then terminate the remaining tasks using the
SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL.
```
  8ed8098b
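
The signal sequence named in the --kill-on-bad-exit commit can be sketched as follows. This is a Python illustration, not Slurm's implementation: `terminate_task` is a hypothetical name, and the grace period before SIGKILL is an assumption, since the commit message gives only the order of the signals.

```python
import os
import signal
import time

# Order taken from the commit message: resume, ask politely, then force.
ESCALATION = (signal.SIGCONT, signal.SIGTERM, signal.SIGKILL)


def terminate_task(pid, grace_seconds=2.0, send=os.kill, sleep=time.sleep):
    """Send SIGCONT, SIGTERM, then SIGKILL to `pid`.

    `send` and `sleep` are injectable so the sequence can be exercised
    without signalling a real process. The pause before SIGKILL is an
    assumed grace period, not a value from the commit.
    """
    for sig in ESCALATION:
        if sig == signal.SIGKILL:
            sleep(grace_seconds)  # give the task time to handle SIGTERM
        try:
            send(pid, sig)
        except ProcessLookupError:
            return  # the task has already exited
```

SIGCONT comes first so that a stopped task is resumed and can actually run its SIGTERM handler before SIGKILL arrives.
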
- Update NEWS for commit 437ee1f4 · c9c6f2be
  Morris Jette authored Apr 13, 2017
```
Sinfo command with support for federated clusters
```
  c9c6f2be
- Merge branch 'federation' · d336c234
  Brian Christiansen authored Apr 13, 2017
  
  d336c234
- Increase test37.5 script runtime · 1fb55a2c
  Brian Christiansen authored Apr 12, 2017
```
The job was completing before the test could find that the job was
running. The script was running for 10 seconds and wait_for_fed_job can
back off and check only every 10 seconds.
```
  1fb55a2c
- Handle requeue of pending fed job. · 5678a748
  Brian Christiansen authored Apr 12, 2017
```
Can't use the revoked state to determine if the job is pending or not
since an origin job could be revoked if it doesn't have an active job on
itself.
```
  5678a748
- Preserve cluster rpc queue on fed update · 42322197
  Brian Christiansen authored Apr 11, 2017
  
  42322197
- Remove dead code · ee70be04
  Morris Jette authored Apr 11, 2017
```
Coverity CID 45347
```
  ee70be04
- restructure to eliminate dead code · 4ed8b01e
  Morris Jette authored Apr 11, 2017
```
Coverity CID 45345 and 45346
```
  4ed8b01e
- Remove dead code · 143adc7f
  Morris Jette authored Apr 11, 2017
```
Coverity CID 45342
```
  143adc7f
- remove dead code · 5201fd41
  Morris Jette authored Apr 11, 2017
```
Coverity CID 45340
```
  5201fd41
- Remove dead code · 384ef1b4
  Morris Jette authored Apr 11, 2017
```
Coverity CID 45341
```
  384ef1b4
- Replace difftime() with subtraction · 78ec48ec
  Morris Jette authored Apr 11, 2017
```
Avoid comparison between double and int
Coverity CID 45336
```
  78ec48ec
- Add error check for getsockname() · d00bd467
  Morris Jette authored Apr 11, 2017
```
Coverity CID 44687
```
  d00bd467
- Replace difftime() with subtraction · 39ff6710
  Morris Jette authored Apr 11, 2017
```
This eliminates comparing a double with an integer
Coverity CID 45337
```
  39ff6710
- Add check for NULL pointer · 9b4ced81
  Morris Jette authored Apr 11, 2017
```
Coverity CID 44872
```
  9b4ced81
- Plug a couple of memory leaks · 6eec8022
  Morris Jette authored Apr 12, 2017
  
  6eec8022
- Merge branch 'sinfo_fed' · 947ca747
  Morris Jette authored Apr 12, 2017
  
  947ca747
- Modify sinfo to operate in a federation · 437ee1f4
  Morris Jette authored Apr 12, 2017
```
Add slurm_load_partitions2() function to route RPC to specific cluster
Add slurm_load_node2() and slurm_load_node_single2() functions to
  route requests to specific cluster in a federation
Modify srun to get node/partition information for each cluster in a
  federation at the same time using separate pthreads
Add sinfo sort by cluster name (--sort V)
```
  437ee1f4
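
The last commit above mentions fetching node/partition information for every cluster in a federation at the same time using separate pthreads. A rough Python analogue of that pattern is sketched below. `load_all_clusters` and the `load_one` callback are hypothetical stand-ins, not Slurm API functions such as slurm_load_partitions2().

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def load_all_clusters(cluster_names: List[str],
                      load_one: Callable[[str], Any]) -> Dict[str, Any]:
    """Call load_one(name) for each cluster concurrently.

    Returns a mapping from cluster name to that cluster's result.
    `load_one` is a placeholder for whatever RPC fetches the data.
    """
    if not cluster_names:
        return {}
    # One worker per cluster, mirroring "separate pthreads" in the commit.
    with ThreadPoolExecutor(max_workers=len(cluster_names)) as pool:
        results = pool.map(load_one, cluster_names)  # preserves input order
        return dict(zip(cluster_names, results))
```

Running the requests in parallel keeps the total wait close to that of the slowest cluster rather than the sum over all clusters.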