Commits · 0296f3ec55815f190f253bb89a6fc2788be10b0b · Manuel G. Marciani / ces_slurm_simulator

20 Dec, 2017 19 commits

Continuation of last commit · 0296f3ec
Danny Auble authored Dec 20, 2017
```
Bug 4483
```
0296f3ec
mpi/pmix: add the thresholds for the parameters of micro-benchmarks · 38a9fa29
Boris Karasev authored Dec 10, 2017
```
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
```
38a9fa29
Have srun read in modern env vars instead of obsolete ones · 46378a3f
Danny Auble authored Dec 20, 2017
```
Bug 4541
```
46378a3f

slurm.spec - move libpmi libraries to a separate package. · 5fb94be9

Danny Auble authored Dec 20, 2017

The slurm and pmix RPMs both attempt to install a libpmi. While Slurm's
version should be preferred, the best way to handle this is to have both
pmix and slurm packages split this off into a separate RPM that can
conflict with each other, giving the admin a way to resolve the conflict
manually.

Bug 4511.

5fb94be9

Fix reading of variable outside of locks · e3d979c0

Brian Christiansen authored Dec 19, 2017

The cluster name could have been freed before it was strdup'ed.
This happens when a federated persistent connection is established and the
cluster needs to sync up with the remote cluster.

Bug 4503

e3d979c0
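
The race described in e3d979c0 can be illustrated with a minimal sketch, assuming a mutex-protected string. The names `name_lock`, `cluster_name`, `set_cluster_name()` and `copy_cluster_name()` are hypothetical and are not taken from the Slurm source; the point is only that the copy is made while the lock is held, so a concurrent free cannot invalidate the pointer being read.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared state; not Slurm's actual data structures. */
pthread_mutex_t name_lock = PTHREAD_MUTEX_INITIALIZER;
char *cluster_name = NULL;

/* Replace the shared name. The old string is freed while holding the lock. */
void set_cluster_name(const char *new_name)
{
	char *copy = new_name ? strdup(new_name) : NULL;

	/* Sketch only: treat an allocation failure as fatal. */
	assert(!new_name || copy);

	pthread_mutex_lock(&name_lock);
	free(cluster_name);
	cluster_name = copy;
	pthread_mutex_unlock(&name_lock);
}

/*
 * Return a private copy of the name, taken while the lock is held.
 * Reading cluster_name before locking and calling strdup() afterwards
 * would leave a window in which another thread could free the string.
 * The caller owns the returned copy and must free() it.
 */
char *copy_cluster_name(void)
{
	char *copy = NULL;

	pthread_mutex_lock(&name_lock);
	if (cluster_name)
		copy = strdup(cluster_name);
	pthread_mutex_unlock(&name_lock);

	return copy;
}
```

A caller would keep using its own copy even if another thread later replaces or clears the shared name.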

Sync origin job with cancelled sibling job · 852c2bf3

Brian Christiansen authored Dec 05, 2017

Remove the origin job when a remote sibling job was cancelled while the
origin was down.

e.g.
job originated on cluster1
job runs on cluster2
cluster1 goes down
job is cancelled on cluster2
cluster1 comes back up. The origin job should be removed if the sibling
job was cancelled.

852c2bf3

Update some heterogeneous job documentation verbiage · ec54839a
Morris Jette authored Dec 20, 2017
```
Replace some references to "pack job" with "heterogeneous job".
bug 4546
```
ec54839a

Set the user_name field if not provided. · 45be4cc0

Tim Wickberg authored Dec 20, 2017

The user_name is only populated if LaunchParameters=send_gids is set.
Without this field set, the pam_setup call will get NULL instead of the
user_name, leading to batch job launch failures. (The step launch code
already handles this same issue separately.)

Bug 4412.

45be4cc0

Write burst buffer logs to job's "comment" · 2a17e7d7
Morris Jette authored Dec 20, 2017
```
Previously written to "admin_comment" in commit 540d0e5c
bug 4529
```
2a17e7d7
Continuation to commit 5d17592b · d1e19e21
Danny Auble authored Dec 20, 2017

d1e19e21

Fix recording of job state on successful allocation but failed reply message. · 8dd79221

Alejandro Sanchez authored Dec 20, 2017

On a job [pack]allocation RPC request, if the allocation succeeded but
sending the response message back to the client failed (e.g. srun was killed
before it could receive the response), then modify the job_record so that
the job_state is set to FAILED, the exit_code is set as if the job received a
SIGTERM signal, and the state_reason is set to FAIL_LAUNCH. Users querying
the job with sacct can then discern that something went wrong in this scenario,
instead of STATE being shown as COMPLETED and the ExitCode as 0:0.

Bug 4513.

8dd79221

Add preliminary PrologFlags=X11 documentation to the slurm.conf man page. · f4494704
Tim Wickberg authored Dec 19, 2017

f4494704
Document LaunchParameters=send_gids option. · fbef6ea1
Tim Wickberg authored Dec 19, 2017

fbef6ea1

Prevent possible double-xfree on buffer in stepd_completion. · 973ac201

Felip Moll authored Dec 19, 2017

Use FREE_NULL_BUFFER instead, otherwise we could attempt to
free_buffer this a second time if we jump to the rwfail label.

Bug 4484.

973ac201
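
The free-and-reset idea behind 973ac201 can be sketched as follows. This is an illustration under stated assumptions, not the stepd_completion code: `demo_buf_t`, `demo_buf_create()`, `demo_buf_free()`, `DEMO_FREE_NULL_BUFFER` and `demo_completion()` are hypothetical stand-ins, and the macro only mimics the shape one would expect from a macro such as FREE_NULL_BUFFER.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer type; not Slurm's real buffer. */
typedef struct {
	char *data;
	size_t size;
} demo_buf_t;

int free_calls = 0; /* counts real frees, for demonstration only */

demo_buf_t *demo_buf_create(size_t size)
{
	demo_buf_t *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(size);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->size = size;
	return b;
}

void demo_buf_free(demo_buf_t *b)
{
	assert(b != NULL);
	free(b->data);
	free(b);
	free_calls++;
}

/* Free the buffer if set, then reset the pointer (assumed macro shape). */
#define DEMO_FREE_NULL_BUFFER(b)          \
	do {                              \
		if (b)                    \
			demo_buf_free(b); \
		(b) = NULL;               \
	} while (0)

/* Returns 0 on success, -1 when the simulated read fails. */
int demo_completion(int fail_read)
{
	demo_buf_t *buffer = demo_buf_create(64);

	if (!buffer)
		return -1;

	/* Done with the buffer early; the pointer becomes NULL. */
	DEMO_FREE_NULL_BUFFER(buffer);

	if (fail_read)
		goto rwfail;
	return 0;

rwfail:
	/* Safe: buffer is NULL here, so nothing is freed a second time. */
	DEMO_FREE_NULL_BUFFER(buffer);
	return -1;
}
```

With a plain free call in place of the first macro use, the cleanup at the `rwfail` label would release the same memory twice.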

Fix the display for NNodes in sacct output when using --units option. · 763e396e

Felip Moll authored Dec 11, 2017

When printing fields in sacct with user specified units (--units), the
nnodes field showed an incorrect string. This commit reverts a65fa572
and avoids the unit conversion, which does not make sense outside the
context of Blue Gene systems (deprecated) anyway.

Bug 4490.

763e396e

slurm.spec - fix issue with rpm 4.13+ / Fedora 25+. · a9388ec7

Felip Moll authored Dec 19, 2017

Slurm may generate empty manifest files depending on configuration
and library availability. Disable the new empty manifest check to
allow builds to proceed with rpm 4.13+ / Fedora 25+.

Bug 4453.

a9388ec7

Fix comment about select() vs. poll(). select() is limited to FD_SETSIZE. · aff5f2c7
Felip Moll authored Dec 19, 2017
```
(Fixing tim@schedmd.com's mistake on prior commit.)

Bug 4467.
```
aff5f2c7
Add news entry for 3552b25f . · 313a6cca
Morris Jette authored Dec 19, 2017

313a6cca

Add lustre_no_flush option to LaunchParameters for Native Cray systems. · 3552b25f

Morris Jette authored Dec 19, 2017

When set, do not flush the Lustre cache through the alpsc_flush_lustre()
call, and do not drop caches through /proc/sys/vm/drop_caches either.

This avoids a potential source of SIGBUS errors for other jobs sharing
the node.

Bug 4309.

3552b25f

19 Dec, 2017 5 commits
- DBD - When using LogTimeFormat=thread_id fill in the cluster name · 5d17592b
  Danny Auble authored Dec 19, 2017
```
before printing anything for a connection.
```
  5d17592b
- burst_buffer/cray - Append information about errors to job's AdminComments · 540d0e5c
  Morris Jette authored Dec 18, 2017
```
field.

Bug 4529
```
  540d0e5c
- burst_buffer/cray - Do not purge a job record if its stage-out operation · 35d8e193
  Morris Jette authored Dec 18, 2017
```
fails. The description of the failure will be in the job's "Reason" field.

Bug 4529
```
  35d8e193
- burst_buffer/cray - Plug small memory leak on DataWarp create_persistent · 040c6a0a
  Morris Jette authored Dec 18, 2017
```
buffer error.

Bug 4529
```
  040c6a0a
- Docs - add Slurm/PMIx and OpenMPI build notes to the mpi_guide page. · 34209c47
  Alejandro Sanchez authored Dec 19, 2017
```
Bug 4222.
```
  34209c47
18 Dec, 2017 4 commits
- Don't set job_id_sequence to highest jobid in use · 7d83f77d
  Brian Christiansen authored Dec 18, 2017
```
on startup. Just use the checkpointed job_id_sequence. get_next_job_id()
will not use a jobid if it's in use in the system.

Bug 4538
```
  7d83f77d
- Fix for knl_generic plugin load failure · 77956598
  Morris Jette authored Dec 18, 2017
```
node_features/knl_generic - If plugin can not fully load then do not spawn
    a background pthread (which will fail with invalid memory reference).
```
  77956598
- Cosmetic changes · 3cc34e80
  Morris Jette authored Dec 18, 2017
  
  3cc34e80
- Avoid using knl_generic plugin except on real KNL nodes · 412846ed
  Dominik Bartkiewicz authored Dec 18, 2017
```
Add "Force=1" to knl_generic.conf to override (for testing)
bug 4487
```
  412846ed
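
The behaviour described in 7d83f77d (resume from the checkpointed job_id_sequence and let the allocator skip ids that are still in use) can be sketched as below. Everything here is a hypothetical illustration: the tiny id range, the `demo_in_use` table and `demo_get_next_job_id()` are assumptions, and the real get_next_job_id() works on Slurm's own job records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FIRST_JOB_ID 1u
#define DEMO_MAX_JOB_ID 8u /* tiny range so wraparound is easy to see */

/* Hypothetical in-use table, indexed by job id. */
bool demo_in_use[DEMO_MAX_JOB_ID + 1];

/* Restored from the checkpoint, not from the highest id in use. */
uint32_t demo_job_id_sequence = DEMO_FIRST_JOB_ID;

/*
 * Return the next free job id starting at the saved sequence value,
 * skipping ids that are still in use and wrapping at the maximum.
 * Returns 0 if every id in the range is taken.
 */
uint32_t demo_get_next_job_id(void)
{
	assert(demo_job_id_sequence >= DEMO_FIRST_JOB_ID &&
	       demo_job_id_sequence <= DEMO_MAX_JOB_ID);

	for (uint32_t tries = 0; tries < DEMO_MAX_JOB_ID; tries++) {
		uint32_t id = demo_job_id_sequence;

		/* Advance the sequence and wrap around at the maximum. */
		if (++demo_job_id_sequence > DEMO_MAX_JOB_ID)
			demo_job_id_sequence = DEMO_FIRST_JOB_ID;

		if (!demo_in_use[id]) {
			demo_in_use[id] = true;
			return id;
		}
	}
	return 0;
}
```

Because in-use ids are skipped at allocation time, a single long-lived job with a high id does not force the sequence to jump past it on startup.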
16 Dec, 2017 1 commit

Fix for bitstring interpretation on big-endian machine · 5a23c48a

Felip Moll authored Dec 13, 2017

This patch fixes commit b31bb7 for big-endian machines by inverting the
__builtin_clzll and __builtin_ctzll calls depending on the architecture.
Previously, bit_ffs and bit_fls returned the incorrect first and last set
bits of the bitmap on such machines.

bug 4494

5a23c48a
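
Why the two builtins trade places can be shown on a single 64-bit word. This is a minimal sketch, assuming GCC or Clang builtins; the `demo_word_*` helpers are hypothetical and do not reproduce how the Slurm bitstring code lays out its words. When bit index 0 is the least significant bit, counting trailing zeros finds the first set bit; when bit index 0 sits at the most significant position, counting leading zeros does.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned long long) == 8,
	      "sketch assumes a 64-bit unsigned long long");

/* Bit index 0 = least significant bit. Returns -1 if no bit is set. */
int demo_word_ffs(uint64_t word)
{
	if (word == 0)
		return -1; /* the builtins are undefined for 0 */
	return __builtin_ctzll(word);
}

int demo_word_fls(uint64_t word)
{
	if (word == 0)
		return -1;
	return 63 - __builtin_clzll(word);
}

/* Bit index 0 = most significant bit: the two builtins swap roles. */
int demo_word_ffs_msb0(uint64_t word)
{
	if (word == 0)
		return -1;
	return __builtin_clzll(word);
}

int demo_word_fls_msb0(uint64_t word)
{
	if (word == 0)
		return -1;
	return 63 - __builtin_ctzll(word);
}
```

Using the first pair of helpers on a word whose bits are numbered the other way round returns the last set bit where the first was expected, which is the kind of mismatch the commit message describes.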

15 Dec, 2017 6 commits
- BB - Only launch dependent jobs after the burst buffer is staged-out · 0c7b30fe
  Morris Jette authored Dec 15, 2017
```
completely instead of right after the parent job finishes.

Bug 4516
```
  0c7b30fe
- Remove some old FAQ information · c3d7bc5c
  Morris Jette authored Dec 15, 2017
  
  c3d7bc5c
- Avoid changing Slurm internal errno on syslog() failure · 148185fb
  Yair Yarom authored Dec 15, 2017
```
bug 3582
```
  148185fb
- Fix divide by zero case · 4730cbd6
  Brian Christiansen authored Dec 15, 2017
```
when a job requests no tasks and more memory than MaxMemPer{CPU|NODE}.

e.g.
sbatch --wrap="sleep 10"

Bug 4515
```
  4730cbd6
- Initialize job_desc_msg_t's instead of just memset · 6ec0e85b
  Brian Christiansen authored Dec 15, 2017
```
This will give expected results. Found while working on Bug 4515.
```
  6ec0e85b
- Fix problem setting SLURMSTEPD_OOM_ADJ introduced in 8325e9f8. · 2d0df2ed
  Danny Auble authored Dec 15, 2017
```
Bug 4478 comment 25.
```
  2d0df2ed
14 Dec, 2017 1 commit
- If an unrecognized configuration file option is found, set the errno value. · dd80bc57
  Danny Auble authored Dec 14, 2017
```
And print an appropriate fatal error message rather than relying
upon a random errno value.

Bug 4523
```
  dd80bc57
13 Dec, 2017 2 commits
- Make slurmd oom-unkillable by default, using oom.h OOM_SCORE_ADJ_MIN macro. · 8325e9f8
  Alejandro Sanchez authored Dec 13, 2017
```
Bug 4478.
```
  8325e9f8
- Add documentation for pam_slurm_adopt. · c58a84c7
  Marshall Garey authored Dec 12, 2017
```
Based off of Ryan Cox's original contribs/pam_slurm_adopt/README.

Bug 3567.
```
  c58a84c7
12 Dec, 2017 2 commits

Fix spelling in man page. · ff88eccc
Brian Christiansen authored Dec 12, 2017

ff88eccc

Update jobcomp records on duplicate inserts · 003f65c3

Brian Christiansen authored Dec 12, 2017

In the federation case, the origin job is completed in the database when
a sibling job starts the job. The complete message is then sent again to
the database when the job is completed on the sibling cluster but
it is updated with the sibling job's exit code.

The jobcomp plugin didn't handle the multiple updates to the record.
This change allows the existing record to be updated.

Bug 4493

003f65c3