Commits · 5fb94be9598ec48f6155cbb2d57e0bcb8065069e · Manuel G. Marciani / ces_slurm_simulator

20 Dec, 2017 9 commits

slurm.spec - move libpmi libraries to a separate package. · 5fb94be9

Danny Auble authored Dec 20, 2017

The slurm and pmix RPMs both attempt to install a libpmi. While Slurm's
version should be preferred, the best way to handle this is to have both
pmix and slurm packages split this off into a separate RPM that can
conflict with each other, giving the admin a way to resolve the conflict
manually.

Bug 4511.


Fix reading of variable outside of locks · e3d979c0

Brian Christiansen authored Dec 19, 2017

The cluster name could have been freed before it was strdup'ed.
This happens when a federated persistent connection is established and the
cluster needs to sync up with the remote cluster.

Bug 4503


Sync origin job with cancelled sibling job · 852c2bf3

Brian Christiansen authored Dec 05, 2017

Remove the origin job when a remote sibling job was cancelled while the
origin was down.

e.g.
job originated on cluster1
job runs on cluster2
cluster1 goes down
job is cancelled on cluster2
cluster1 comes back up. The origin job should be removed if the sibling
job was cancelled.
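
A minimal sketch of the reconciliation described above, written in Python rather than Slurm's C. The `Job` record, the `sync_origin_jobs` helper, and the state strings are hypothetical names chosen for illustration; the actual federation code is not shown in this log.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    state: str  # e.g. "RUNNING" or "CANCELLED" (illustrative values)


def sync_origin_jobs(origin_jobs, sibling_states):
    """Drop origin jobs whose sibling was cancelled while the origin was down.

    origin_jobs:    dict of job_id -> Job, as restored on the origin cluster
    sibling_states: dict of job_id -> state reported by the sibling cluster
    """
    for job_id, state in sibling_states.items():
        if state == "CANCELLED" and job_id in origin_jobs:
            del origin_jobs[job_id]
    return origin_jobs


# cluster1 (origin) comes back up and learns job 42 was cancelled on cluster2.
origin = {42: Job(42, "RUNNING"), 43: Job(43, "RUNNING")}
remaining = sync_origin_jobs(origin, {42: "CANCELLED", 43: "RUNNING"})
```

Only job 43 remains on the origin after the sync; job 42 is removed because its sibling was cancelled.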


Set the user_name field if not provided. · 45be4cc0

Tim Wickberg authored Dec 20, 2017

The user_name is only populated if LaunchParameters=send_gids is set.
Without this field set, the pam_setup call will get NULL instead of the
user_name, leading to batch job launch failures. (The step launch code
already handles this same issue separately.)

Bug 4412.


Fix for record job state on successful allocation but failed reply message. · 8dd79221

Alejandro Sanchez authored Dec 20, 2017

On a job [pack]allocation RPC request, if the allocation succeeded but
sending the response message back to the client failed (e.g. srun was killed
before it could receive the response), then modify the job_record pointer so
that the job_state is set to FAILED, the exit_code is set as if the job got a
SIGTERM signal, and the state_reason is set to FAIL_LAUNCH. Users querying
the job with sacct can then see that something went wrong in this scenario,
instead of STATE being shown as COMPLETED and the ExitCode as 0:0.

Bug 4513.
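
For illustration, the difference a user would see can be sketched by decoding a wait(2)-style status into the `exitcode:signal` pair that sacct prints. This is a Python sketch under the assumption that the stored exit_code uses the conventional encoding (signal number in the low 7 bits, exit status in the next byte). `sacct_exit_code` is a hypothetical helper, not a Slurm function, and the commit itself does not state the exact string sacct shows.

```python
import signal


def sacct_exit_code(wait_status):
    """Format a wait(2)-style status as an 'exitcode:signal' pair."""
    sig = wait_status & 0x7F           # terminating signal, if any
    code = (wait_status >> 8) & 0xFF   # exit status of a normal exit
    return f"{code}:{sig}"


# Status "as if the job got a SIGTERM signal" (assumed encoding).
failed_launch_status = int(signal.SIGTERM)

print(sacct_exit_code(0))                     # 0:0
print(sacct_exit_code(failed_launch_status))  # 0:15
```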


Prevent possible double-xfree on buffer in stepd_completion. · 973ac201

Felip Moll authored Dec 19, 2017

Use FREE_NULL_BUFFER instead, otherwise we could attempt to
free_buffer this a second time if we jump to the rwfail label.

Bug 4484.


Fix the display for NNodes in sacct output when using --units option. · 763e396e

Felip Moll authored Dec 11, 2017

When printing fields in sacct with user specified units (--units), the
nnodes field showed an incorrect string. This commit reverts a65fa572
and avoids the unit conversion, which does not make sense outside the
context of Blue Gene systems (deprecated) anyway.

Bug 4490.


slurm.spec - fix issue with rpm 4.13+ / Fedora 25+. · a9388ec7

Felip Moll authored Dec 19, 2017

Slurm may generate empty manifest files depending on configuration
and library availability. Disable the new empty manifest check to
allow builds to proceed with rpm 4.13+ / Fedora 25+.

Bug 4453.


Add news entry for 3552b25f. · 313a6cca
Morris Jette authored Dec 19, 2017


19 Dec, 2017 5 commits
- DBD - When using LogTimeFormat=thread_id fill in the cluster name · 5d17592b
  Danny Auble authored Dec 19, 2017
```
before printing anything for a connection.
```
- burst_buffer/cray - Append information about errors to job's AdminComments · 540d0e5c
  Morris Jette authored Dec 18, 2017
```
field.

Bug 4529
```
- burst_buffer/cray - Do not purge a job record if its stage-out operation · 35d8e193
  Morris Jette authored Dec 18, 2017
```
fails. The description of the failure will be in the job's "Reason" field.

Bug 4529
```
- burst_buffer/cray - Plug small memory leak on DataWarp create_persistent · 040c6a0a
  Morris Jette authored Dec 18, 2017
```
buffer error.

Bug 4529
```
- Docs - add Slurm/PMIx and OpenMPI build notes to the mpi_guide page. · 34209c47
  Alejandro Sanchez authored Dec 19, 2017
```
Bug 4222.
```
18 Dec, 2017 2 commits

Don't set job_id_sequence to highest jobid in use · 7d83f77d

Brian Christiansen authored Dec 18, 2017

on startup. Just use the checkpointed job_id_sequence. get_next_job_id()
will not use a jobid if it's in use in the system.

Bug 4538
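
The behaviour described above can be sketched as follows: the sequence counter is taken as checkpointed, and in-use ids are simply skipped when the next id is requested. This is a Python sketch, not the C implementation. The signature, the `in_use` set, and the `first_job_id`/`max_job_id` bounds are assumptions made for illustration.

```python
def get_next_job_id(job_id_sequence, in_use, first_job_id=1, max_job_id=1000):
    """Return (job_id, new_sequence), skipping ids that are still in use.

    The sequence is NOT bumped past the highest id in use; it just steps
    forward from the checkpointed value and wraps at max_job_id.
    """
    span = max_job_id - first_job_id + 1
    candidate = job_id_sequence
    for _ in range(span):
        if candidate < first_job_id or candidate > max_job_id:
            candidate = first_job_id  # wrap around
        if candidate not in in_use:
            return candidate, candidate + 1
        candidate += 1
    raise RuntimeError("no free job id available")


# Checkpointed sequence is 5; job 900 is still in the system.
job_id, next_seq = get_next_job_id(5, in_use={5, 6, 900})
```

Here `job_id` is 7 and `next_seq` is 8: the old job 900 does not push the sequence to 901.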


Fix for knl_generic plugin load failure · 77956598

Morris Jette authored Dec 18, 2017

node_features/knl_generic - If the plugin cannot fully load, do not spawn
    a background pthread (which would fail with an invalid memory reference).


15 Dec, 2017 4 commits
- BB - Only launch dependent jobs after the burst buffer is staged-out · 0c7b30fe
  Morris Jette authored Dec 15, 2017
```
completely instead of right after the parent job finishes.

Bug 4516
```
- Avoid changing Slurm internal errno on syslog() failure · 148185fb
  Yair Yarom authored Dec 15, 2017
```
bug 3582
```
- Fix divide by zero case · 4730cbd6
  Brian Christiansen authored Dec 15, 2017
```
when a job requests no tasks and more memory than MaxMemPer{CPU|NODE}.

e.g.
sbatch --wrap="sleep 10"

Bug 4515
```
- Initialize job_desc_msg_t's instead of just memset · 6ec0e85b
  Brian Christiansen authored Dec 15, 2017
```
This will give expected results. Found while working on Bug 4515.
```
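
The divide-by-zero case in 4730cbd6 above comes from a job that requests no tasks. The commit does not say which expression performed the division, so the Python sketch below only illustrates the guard pattern. `mem_per_task` is a hypothetical helper, and treating both 0 and Slurm's `NO_VAL` sentinel as "no tasks requested" is an assumption.

```python
NO_VAL = 0xFFFFFFFE  # Slurm's "not set" sentinel for 32-bit fields


def mem_per_task(total_mem_mb, num_tasks):
    """Hypothetical helper: split a memory request across tasks.

    A task count of 0 or NO_VAL means the job did not request tasks, so
    fall back to a single task instead of dividing by zero.
    """
    if num_tasks in (0, NO_VAL):
        num_tasks = 1
    return total_mem_mb // num_tasks


print(mem_per_task(4096, 4))  # 1024
print(mem_per_task(4096, 0))  # 4096
```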
14 Dec, 2017 1 commit
- If unrecognized configuration file option found set errno value. · dd80bc57
  Danny Auble authored Dec 14, 2017
```
And print an appropriate fatal error message rather than relying
upon a random errno value.

Bug 4523
```
13 Dec, 2017 1 commit
- Make slurmd oom-unkillable by default, using oom.h OOM_SCORE_ADJ_MIN macro. · 8325e9f8
  Alejandro Sanchez authored Dec 13, 2017
```
Bug 4478.
```
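
On Linux, making a process exempt from the OOM killer amounts to writing the minimum adjustment to its `oom_score_adj` file under `/proc`. The Python sketch below shows that mechanism only; it is not how slurmd does it (slurmd is C). `make_oom_unkillable` is a hypothetical helper, and the `path` parameter is included only so the function can be exercised against an ordinary file.

```python
OOM_SCORE_ADJ_MIN = -1000  # value of the macro in linux/oom.h


def make_oom_unkillable(path="/proc/self/oom_score_adj"):
    """Write the minimum OOM score adjustment for the current process.

    Lowering the score normally needs CAP_SYS_RESOURCE (typically root),
    so a failed write is reported as False instead of raising.
    """
    try:
        with open(path, "w") as f:
            f.write(str(OOM_SCORE_ADJ_MIN))
    except OSError:
        return False
    return True
```

A daemon would call `make_oom_unkillable()` once at startup and log a warning if it returns False.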
12 Dec, 2017 1 commit

Update jobcomp records on duplicate inserts · 003f65c3

Brian Christiansen authored Dec 12, 2017

In the federation case, the origin job is completed in the database when
a sibling job starts the job. The complete message is then sent again to
the database when the job is completed on the sibling cluster but
it is updated with the sibling job's exit code.

The jobcomp plugin didn't handle the multiple updates to the record.
This change allows the existing record to be updated.

Bug 4493
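
The "update on duplicate insert" behaviour can be sketched with an in-memory SQLite table standing in for the job completion store. This is an assumption-laden illustration: the log does not say which jobcomp backend is involved, and the table name, column names, and `record_completion` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobcomp ("
    "jobid INTEGER PRIMARY KEY, state TEXT, exit_code INTEGER)"
)


def record_completion(conn, jobid, state, exit_code):
    """Insert a completion record, overwriting any existing row for jobid."""
    conn.execute(
        "INSERT OR REPLACE INTO jobcomp (jobid, state, exit_code) "
        "VALUES (?, ?, ?)",
        (jobid, state, exit_code),
    )


# First message: origin job marked complete when the sibling starts the job.
record_completion(conn, 1001, "COMPLETED", 0)
# Second message: the sibling job finishes with its real exit code.
record_completion(conn, 1001, "FAILED", 1)

rows = conn.execute("SELECT jobid, state, exit_code FROM jobcomp").fetchall()
```

After both messages, `rows` holds a single record for job 1001 carrying the sibling job's state and exit code, rather than failing on the duplicate key.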


11 Dec, 2017 2 commits

CRAY - Switch to standard pid files on Cray systems. · 5342d979

David Gloe authored Dec 11, 2017

Bug 4500

The pid files in slurm.conf and the systemd service files must match,
or systemd will time out looking for the wrong pid file. Currently,
the Cray slurm.conf template has different pid files for slurmctld and
slurmd than the service files.

There's no reason for us to use these nonstandard pid files, and it
will save us some headaches to switch over.
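
A quick way to spot the mismatch described above is to compare the pid file paths in slurm.conf with the `PIDFile=` setting of each systemd unit. The Python sketch below does this on plain text. `SlurmctldPidFile`, `SlurmdPidFile`, and `PIDFile` are the real option names, but the helper functions and all paths in the example are hypothetical.

```python
import re


def slurm_conf_pidfiles(text):
    """Return {'slurmctld': path, 'slurmd': path} parsed from slurm.conf text."""
    found = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"(?i)(SlurmctldPidFile|SlurmdPidFile)\s*=\s*(\S+)", line)
        if m:
            daemon = m.group(1).lower().replace("pidfile", "")
            found[daemon] = m.group(2)
    return found


def unit_pidfile(text):
    """Return the PIDFile= value from a systemd unit file, or None."""
    for line in text.splitlines():
        m = re.match(r"\s*PIDFile\s*=\s*(\S+)", line)
        if m:
            return m.group(1)
    return None


def pidfile_mismatches(slurm_conf, units):
    """Map daemon -> (slurm.conf path, unit path) wherever the two differ."""
    conf = slurm_conf_pidfiles(slurm_conf)
    result = {}
    for daemon, unit_text in units.items():
        pair = (conf.get(daemon), unit_pidfile(unit_text))
        if pair[0] != pair[1]:
            result[daemon] = pair
    return result


example_conf = (
    "SlurmctldPidFile=/var/spool/slurm/slurmctld.pid  # nonstandard\n"
    "SlurmdPidFile=/var/run/slurmd.pid\n"
)
example_units = {
    "slurmctld": "[Service]\nPIDFile=/var/run/slurmctld.pid\n",
    "slurmd": "[Service]\nPIDFile=/var/run/slurmd.pid\n",
}
mismatches = pidfile_mismatches(example_conf, example_units)
```

In this example only slurmctld is reported, since its slurm.conf path differs from the one systemd waits for.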


Add ability for squeue to sort jobs by submit time · ecd81545
Marcin Stolarek authored Dec 11, 2017
```
bug 4496
```

08 Dec, 2017 2 commits

Fix Slurm to work correctly with HDF5 1.10+. · 006d172a

Danny Auble authored Dec 08, 2017

In 1.10+ they changed hid_t from an int to a long int, which
messes things up as they use the top 32 bits for stuff right off
the bat. This fixes the scenario by handling the number with an int32_t
instead of an int.

Bug 3795


Fix node reboot timeout problem · 41c6ac6a
Morris Jette authored Dec 08, 2017
```
Fix potential node reboot timeout problem for "scontrol reboot" command.
bug 4203
```

07 Dec, 2017 4 commits
- X11 forwarding - fix xauth regex to allow for - in hostnames. · 312b7332
  Tim Wickberg authored Dec 07, 2017
```
The - character is treated as a range if not first or last in the
[] brackets. Moving it in between . and / broke the regex subtly.

Inadvertently broken by a268b644.

Bug 4417.
```
- Fix handling ultra long hostlists in a hostfile. · 4c1c1e40
  Danny Auble authored Dec 07, 2017
```
Bug 4169
```
- Fix for srun --pack-group option that can reuse/corrupt memory · 74f863f4
  Morris Jette authored Dec 07, 2017
```
Found using test38.17
```
- Truncate SlurmctldPort range to FD_SETSIZE ports at most (usually 1024). · 620f92f7
  Felip Moll authored Dec 06, 2017
```
Otherwise poll() cannot monitor these ports properly,
leading to potential network traffic problems.

Bug 4467.
```
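
The character-class pitfall behind the xauth regex fix (312b7332) above is easy to reproduce. The Python sketch below uses two simplified, hypothetical patterns, not the actual expression from the Slurm source. Placing `-` between `.` and `/` creates the range `.`..`/` (0x2E to 0x2F), which does not contain the hyphen itself (0x2D).

```python
import re

# '-' between '.' and '/' is read as a range, so a literal hyphen is excluded.
broken = re.compile(r"^[A-Za-z0-9.-/]+$")
# '-' placed last inside the brackets is a literal hyphen.
fixed = re.compile(r"^[A-Za-z0-9./-]+$")

host = "node-01.example.com/unix"
print(bool(broken.match(host)))  # False
print(bool(fixed.match(host)))   # True
```

Both patterns still accept names without a hyphen, which is why the breakage was subtle.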
06 Dec, 2017 3 commits

Wait to start a batch step · a42d325f

Danny Auble authored Dec 06, 2017

until the prolog and extern step are fully run/launched.

Only matters if running with PrologFlags=[contain|alloc].

patch 2 of 2

Bug 4458


Only say a prolog is done running after the extern step is launched. · 71e89f7b
Danny Auble authored Dec 06, 2017
```
Patch 1 of 2

Bug 4458
```

Replace slurm_playbook.yaml.in with a version that auto-detects the Slurm version. · d54eb247

David Gloe authored Dec 05, 2017

Due to the way Cray builds Slurm, the prefix and bindir paths include the
Slurm version (/opt/slurm/<version>). This means every time we update to a
new Slurm version we must update the Slurm ansible playbook. It also means
that the slurm_playbook.yaml file must be built with Slurm to be used
(it can't simply be copied directly).

The attached patch updates the playbook to determine the version of Slurm
to use from the module file, and hardcodes the sysconfdir setting we give
in our Slurm installation guide. If a customer uses different paths, they
can update the playbook to meet their needs.

Bug 4360.


05 Dec, 2017 6 commits
- Return ESLURM_TRANSITION_STATE_NO_UPDATE instead of EAGAIN · 3978122b
  Dominik Bartkiewicz authored Dec 05, 2017
```
when trying to signal a step that is still running a prolog.

Bug 4446
```
- Fix uid check when requesting a jobid from a pid. · d3141dc9
  Dominik Bartkiewicz authored Dec 05, 2017
```
Bug 4446
```
- Fix uid check for signaling a step with anything but SIGKILL. · 26853610
  Dominik Bartkiewicz authored Dec 05, 2017
```
Bug 4446
```
- mpi/pmix: Fix the job registration for the PMIx v2.1 · b61fe2ac
  Artem Polyakov authored Dec 05, 2017
```
Bug 4131
```
- Make logging prefix for slurmstepd to happen as soon as possible. · 9a66f0c8
  Danny Auble authored Dec 01, 2017
```
Simplify the step prefix process
and move it as early as possible in the step.
```
- Fix to properly remove extern steps from the starting_steps list. · 99b3796b
  Alejandro Sanchez authored Dec 05, 2017
```
Since NO_VAL = SLURM_BATCH_SCRIPT, the else statement would only compare
the job_id and not the step_id, thus when a batch step was removed all
the steps from that job would be removed too. Then when attempting to
remove the extern step in the next iteration, it was already removed
and we were incorrectly erroring out.

Bug 4458.
```
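
The starting_steps problem described above can be reproduced in a few lines. This Python sketch is a reconstruction from the commit message only, not the original C code. The list-of-tuples representation and both `remove_*` helpers are hypothetical, and the numeric values of the constants are assumed from Slurm headers of that era.

```python
NO_VAL = 0xFFFFFFFE
SLURM_BATCH_SCRIPT = 0xFFFFFFFE  # same value as NO_VAL, which caused the bug
SLURM_EXTERN_CONT = 0xFFFFFFFF   # assumed value for the extern step id


def remove_buggy(steps, job_id, step_id):
    """NO_VAL is read as 'any step', so removing a batch step wipes the job."""
    if step_id == NO_VAL:
        return [s for s in steps if s[0] != job_id]
    return [s for s in steps if s != (job_id, step_id)]


def remove_fixed(steps, job_id, step_id):
    """Always compare both job_id and step_id."""
    return [s for s in steps if s != (job_id, step_id)]


starting = [(7, SLURM_BATCH_SCRIPT), (7, SLURM_EXTERN_CONT), (8, 0)]
after_buggy = remove_buggy(starting, 7, SLURM_BATCH_SCRIPT)
after_fixed = remove_fixed(starting, 7, SLURM_BATCH_SCRIPT)
```

With the buggy variant the extern step of job 7 disappears together with the batch step, so a later attempt to remove it finds nothing. The fixed variant leaves it in place.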