Commits · d41f1a89f8ce731a4405906105a890f6d28be34b · Manuel G. Marciani / ces_slurm_simulator

15 Apr, 2017 7 commits
- When running the "scontrol top" command, make sure that all of the user's · d41f1a89
  Morris Jette authored Apr 13, 2017
```
jobs have a priority that is lower than the selected job. Previous logic
would permit other jobs with equal priority (no jobs with higher priority).

Bug 3650
```
  d41f1a89
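
The difference between the previous and the corrected check can be sketched as below. This is only an illustration of the comparison described in the commit message, not Slurm's C code; the function names and the plain list of integer priorities are hypothetical.

```python
def is_top_previous(selected_priority, other_priorities):
    """Previous behavior as described: equal priorities were permitted."""
    return all(p <= selected_priority for p in other_priorities)


def is_top_strict(selected_priority, other_priorities):
    """Corrected behavior: every other job must have a strictly lower priority."""
    return all(p < selected_priority for p in other_priorities)


# A user's other jobs, one of which ties with the selected job (priority 100).
others = [100, 50]
print(is_top_previous(100, others))
print(is_top_strict(100, others))
```

With a tie present, only the strict comparison reports that the selected job is not yet ahead of all of the user's other jobs.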
- docs - fix spelling of 'display' · 2edc2e7a
  Tim Wickberg authored Apr 13, 2017
  
  2edc2e7a
- Reset backfill timers correctly without skipping over them in certain · 0241c36a
  Dominik Bartkiewicz authored Apr 13, 2017
```
circumstances.
```
  0241c36a
- Add --array-unique to squeue which will display one unique pending job · 3193033c
  Tim Shaw authored Apr 12, 2017
```
array element per line.

Bug 3573
```
  3193033c
- Always free the old dependency string when asked to rebuild. · 96a8867b
  Bill Brophy authored Apr 12, 2017
```
If the depend_list is NULL or has zero elements, the string should
be cleared as well.

Bug 3651.
```
  96a8867b
- Fix segfault when using AdminComment field with job arrays. · 96899988
  Thomas Opfer authored Apr 07, 2017
```
The field needs to have its own copy, otherwise the pointer will
become invalid when xfree()'d by a separate array task.

Bug 3665.
```
  96899988
- Increase --cpu_bind and --mem_bind length limit to 1024 * 128 bytes. · c8e0d472
  Alejandro Sanchez authored Apr 07, 2017
```
So that it is the same max length as in src/common/env.c.

Used for explicitly laying out tasks on large CPU count nodes (e.g., KNL).

Bug 3675.
```
  c8e0d472
14 Apr, 2017 12 commits

Display and sort job step's cluster name · 1190117e
Brian Christiansen authored Apr 14, 2017
```
For use in federated environments.
```
1190117e
Extract cluster_in_federation test in common place · 61c0a84e
Brian Christiansen authored Apr 14, 2017

61c0a84e
squeue: report fed job steps if in federation · 5b6e11b8
Brian Christiansen authored Apr 14, 2017

5b6e11b8
Ensure NULL is returned · 7bc058dc
Brian Christiansen authored Apr 14, 2017
```
in any case.
```
7bc058dc

Fix MPIR_partial_attach_ok issues for parallel debuggers. · 18e3d6fb

Dong Ahn authored Apr 14, 2017

As specified in the MPIR debug interface
(https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf),
the presence of the MPIR_partial_attach_ok symbol
should inform the debugger that the initial startup synchronization
is implemented in such a way that the tool need not attach
nor continue MPI processes that the user is not interested in controlling.

To implement this, SLURM chose to send SIGCONT to those processes that are
not attached by the debugger.

However, the old code does not reliably detect the condition
in which a process is traced by the debugger, and this
has led to various side effects.

On some systems (e.g., TOSS2), the old code sends SIGCONT to
all of the target processes including those attached by the debugger.
On newer systems (e.g., TOSS3), it does not send SIGCONT
to the target processes at all.

It seems that one of the reasons for such undefined behavior
is the use of CLONE_PTRACE.
@grondo found no documentation that indicates
CLONE_PTRACE is for the case where the process is being attached
by a debugger.
More importantly, this code is matching clone(2) flags
to proc(5) process flags, which are not the same, as task->flags is
defined as PF_* flags from kernel source include/linux/sched.h.

This patch fixes these problems by replacing
the old detection logic with one based on the TracerPid field
in /proc/<pid>/status.

From proc(5), TracerPid: PID of process tracing this process (0 if not
being traced).

18e3d6fb
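
The TracerPid check described in this commit can be sketched in a few lines. This is a minimal Python illustration, assuming a Linux /proc filesystem; it is not the C code from the patch, and the function names `parse_tracer_pid` and `is_traced` are hypothetical.

```python
from typing import Optional


def parse_tracer_pid(status_text: str) -> Optional[int]:
    """Return the TracerPid value found in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1].strip())
    return None  # field not present


def is_traced(pid: int) -> bool:
    """True if some process is tracing pid (TracerPid is non-zero). Linux only."""
    with open(f"/proc/{pid}/status") as status_file:
        tracer = parse_tracer_pid(status_file.read())
    return bool(tracer)
```

Under this scheme, a process whose TracerPid is 0 is not attached by the debugger, so it would be one of the processes that receives SIGCONT.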

Include submit_time when doing the sort for job scheduling. · 030d9d4b

Thomas Opfer authored Apr 14, 2017

Improve the job scheduling sort: after sorting by priority, we now sort by
submit time and then by job id.  Previously, submit time was not considered.
This handles the case where the job ids have rolled over or we are doing
federation scheduling.

Bug 3524

030d9d4b
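
The three-level ordering described in this commit can be illustrated with a composite sort key. This is a sketch only: the `Job` class and its field names are hypothetical, and the sort directions (highest priority first, earliest submit time first, lowest job id first) are assumptions that the commit message does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int
    submit_time: int  # seconds since the epoch


def scheduling_order(jobs):
    """Sort by priority (descending), then submit time, then job id (ascending)."""
    return sorted(jobs, key=lambda j: (-j.priority, j.submit_time, j.job_id))


jobs = [
    Job(job_id=5, priority=100, submit_time=2000),    # id rolled over, submitted later
    Job(job_id=900, priority=100, submit_time=1000),  # submitted earlier
    Job(job_id=7, priority=200, submit_time=3000),
]
print([j.job_id for j in scheduling_order(jobs)])
```

In the sample data, job 5 has a smaller id than job 900 only because the ids rolled over. Sorting on submit time before job id keeps the earlier submission ahead.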

Fix problems reported in latest coverity report · a93b6a07

Morris Jette authored Apr 14, 2017

All problems introduced in the course of changing un/pack logic
  required for removing pack jobs logic

a93b6a07

Merge branch 'unpack' · 0cba10d4
Morris Jette authored Apr 14, 2017

0cba10d4
Revert commit 133a4249 · 1fc38b96
Morris Jette authored Apr 14, 2017
```
bug 926
```
1fc38b96
Revert commit 1b010388 · 41749bf9
Morris Jette authored Apr 13, 2017
```
bug 926
```
41749bf9
Revert commits c6e3bc97 and ece58780 · ae3b2e78
Morris Jette authored Apr 13, 2017
```
bug 926
```
ae3b2e78
Display job's cluster name in squeue · 53e03c8a
Brian Christiansen authored Apr 13, 2017
```
Display with -Ocluster
Sort with -S[+|-]cluster
```
53e03c8a

13 Apr, 2017 21 commits
- Revert commit 8f4604bb · 91bab0fb
  Morris Jette authored Apr 13, 2017
```
bug 926
```
  91bab0fb
- Minor formatting fix. · 3ac2a65d
  Danny Auble authored Apr 13, 2017
  
  3ac2a65d
- Change variables to be operator when we are using it to distinguish we are at · fe244ba0
  Danny Auble authored Apr 13, 2017
```
least an admin for clearer code.
```
  fe244ba0
- There is no reason to check the user id after we have already checked it. · e28f4300
  Danny Auble authored Apr 13, 2017
```
We can use the authorized key instead.
```
  e28f4300
- Get rid of extra "admin" variable. · 28ce131b
  Danny Auble authored Apr 13, 2017
  
  28ce131b
- Make it so acct coord can change things on a job they are a coordinator over, · 7f3b7f1b
  Danny Auble authored Apr 13, 2017
```
but don't give them admin privileges.
```
  7f3b7f1b
- Remove empty source files, and detach from build. · f004de45
  Tim Wickberg authored Apr 13, 2017
  
  f004de45
- Update sinfo man page · 2acc666f
  Brian Christiansen authored Apr 13, 2017
```
Add missing --local option
Add note about -M implying --local.
```
  2acc666f
- Add SINFO_LOCAL env variable. · effa96ce
  Brian Christiansen authored Apr 13, 2017
  
  effa96ce
- Fix sprio man page · 4f674d59
  Brian Christiansen authored Apr 13, 2017
```
environment variable and indentation.
```
  4f674d59
- Don't show revoked jobs in sprio · 18d36575
  Brian Christiansen authored Apr 13, 2017
```
They could either be running on another cluster or not active on the
cluster.
```
  18d36575
- Handle federated sprio when no jobs are returned · 9c202931
  Brian Christiansen authored Apr 13, 2017
  
  9c202931
- scontrol -M<cluster> show jobs implies --local · b8e421b0
  Brian Christiansen authored Apr 13, 2017
```
Continuation of 3a9970c0
```
  b8e421b0
- Correct typo in NEWS · 9acb6350
  Morris Jette authored Apr 13, 2017
  
  9acb6350
- Modify --kill-on-bad-exit clean up logic · 8ed8098b
  Morris Jette authored Apr 13, 2017
```
If a task in a parallel job fails and it was launched with the
    --kill-on-bad-exit option, then terminate the remaining tasks using the
    SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL.
```
  8ed8098b
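
The signal escalation described in this commit might look roughly like the sketch below. It is a Python illustration for a Unix system, not the Slurm implementation. The function name `terminate_remaining`, the grace period before SIGKILL, and the choice to deliver each signal to all tasks before moving to the next are all assumptions that the commit message does not state.

```python
import os
import signal
import time


def terminate_remaining(pids, grace_seconds=1.0, send=os.kill, sleep=time.sleep):
    """Send SIGCONT, then SIGTERM, then SIGKILL to every pid in pids."""

    def deliver(sig):
        for pid in pids:
            try:
                send(pid, sig)
            except ProcessLookupError:
                pass  # the task has already exited

    deliver(signal.SIGCONT)  # wake stopped tasks so they can act on SIGTERM
    deliver(signal.SIGTERM)  # ask tasks to shut down cleanly
    sleep(grace_seconds)     # assumed grace period, not taken from the commit
    deliver(signal.SIGKILL)  # force-kill whatever is still running
```

The `send` and `sleep` parameters exist only so the sequence can be exercised without signalling real processes.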
- Update NEWS for commit 437ee1f4 · c9c6f2be
  Morris Jette authored Apr 13, 2017
```
Sinfo command with support for federated clusters
```
  c9c6f2be
- Merge branch 'federation' · d336c234
  Brian Christiansen authored Apr 13, 2017
  
  d336c234
- Increase test37.5 script runtime · 1fb55a2c
  Brian Christiansen authored Apr 12, 2017
```
The job was completing before the test could find that the job was
running. The script was running for 10 seconds and wait_for_fed_job can
back off and check only every 10 seconds.
```
  1fb55a2c
- Handle requeue of pending fed job. · 5678a748
  Brian Christiansen authored Apr 12, 2017
```
Can't use the revoked state to determine if the job is pending or not
since an origin job could be revoked if it doesn't have an active job on
itself.
```
  5678a748
- Preserve cluster rpc queue on fed update · 42322197
  Brian Christiansen authored Apr 11, 2017
  
  42322197
- Remove dead code · ee70be04
  Morris Jette authored Apr 11, 2017
```
Coverity CID 45347
```
  ee70be04