Commits · 154800b97c7c7cb79175905d4927968bac8ecdd0 · Manuel G. Marciani / ces_slurm_simulator

16 Nov, 2017 3 commits
- Kill whole pack on PrologSlurmctld failure · 154800b9
  Dominik Bartkiewicz authored Nov 16, 2017
```
If PrologSlurmctld fails for pack job leader then kill all
    components of the job.
bug 4379
```
  154800b9
- Expand pack job prolog/epilog environment · 35374ec5
  Dominik Bartkiewicz authored Nov 16, 2017
```
Add SLURM_PACK_JOB_NODELIST to PrologSlurmctld and EpilogSlurmctld
    environment.
bug 4379
```
  35374ec5
- Fix for heterogeneous job starvation bug · 9e0b976a
  Morris Jette authored Nov 15, 2017
```
bug 4370
```
  9e0b976a
15 Nov, 2017 10 commits

Fix some slurmctld memory leaks · 29833f86
Morris Jette authored Nov 15, 2017

29833f86
Fix typo in log message · 89edbd4b
Morris Jette authored Nov 15, 2017

89edbd4b

Prevent pack job scheduling deadlock · d26179c2

Morris Jette authored Nov 15, 2017

Prevent scheduling deadlock with multiple components of heterogeneous job
    in different partitions (i.e. one heterogeneous job component is higher
    priority in one partition and another component is lower priority in a
    different partition).
bug 4370

d26179c2

Fix issue where heterogeneous jobs were not properly purged in some cases. · 14f70bd4

Alejandro Sanchez authored Nov 15, 2017

The issue could be reproduced by restarting slurmctld after a heterogeneous job
finished, but before MinJobAge time passed. Since the pack_job_list
job_record member wasn't saved/loaded to/from the job_state, the function
_validate_pack_jobs() is responsible for rebuilding the pack_job_list. The
function was skipping the rebuild work for finished jobs, so other code, such
as the thread responsible for purging old jobs, failed when it tried to iterate
over a NULL pack_job_list that was never rebuilt.

Bug 4383.

14f70bd4

lua env handling makes slurmctld crash · 7e3912b3

Felip Moll authored Nov 15, 2017

If run from srun and lua job submit plugin sets environment, slurmctld
crashes.

Bug#4247

7e3912b3

job_submit/lua - make some pack job fields available. · 40f8d6a3

Alejandro Sanchez authored Nov 15, 2017

From within slurm_job_submit():
	job_desc.pack_job_offset

From within slurm_job_modify():
	job_rec.pack_job_id
	job_rec.pack_job_id_set
	job_rec.pack_job_offset

Bug 4372.

40f8d6a3

Changed units of sched_min_interval in doc from sec to usec · 440c5572
Felip Moll authored Nov 15, 2017
```
bug 4339
```
440c5572
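
Because the documented unit is microseconds, a value meant as seconds has to be scaled by 1,000,000 before it goes into `SchedulerParameters`. A minimal sketch of that conversion follows; the helper name `sec_to_usec` and the 2-second target are illustrative assumptions, not values taken from the commit.

```shell
#!/bin/sh
# Hypothetical helper: convert whole seconds to microseconds.
sec_to_usec() {
    echo $(( $1 * 1000000 ))
}

# Assumed target for illustration: at most one scheduling pass every 2 seconds.
# Prints a slurm.conf fragment with the value expressed in usec.
echo "SchedulerParameters=sched_min_interval=$(sec_to_usec 2)"
```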

Sanity check env.c · bf941bf3

Felip Moll authored Nov 15, 2017

Added additional checks to prevent segfaults in some basic situations.
Bug 4247

bf941bf3

sbatch: check for incorrect ':' string in sbatch · ca2db47e
Felip Moll authored Nov 15, 2017
```
bug 4368
```
ca2db47e

Expand prolog/epilog environment · c3036325

Dominik Bartkiewicz authored Nov 15, 2017

Add SLURM_PACK_JOB_ID and SLURM_PACK_JOB_OFFSET to PrologSlurmctld and
  EpilogSlurmctld environment
bug 4379

c3036325
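
As a rough illustration of how a PrologSlurmctld script might use these variables, the sketch below prints a one-line summary of the pack job component it was started for. It also reads SLURM_PACK_JOB_NODELIST, which commit 35374ec5 adds. The function name `describe_pack_component` is hypothetical. The sketch assumes that SLURM_JOB_ID is present in the prolog environment and that the SLURM_PACK_JOB_* variables are unset for ordinary (non-pack) jobs; neither assumption is confirmed by the commit message.

```shell
#!/bin/sh
# Hypothetical PrologSlurmctld helper: summarize the heterogeneous (pack) job
# component this prolog is running for.
describe_pack_component() {
    # Assumption: SLURM_PACK_JOB_ID is unset or empty for non-pack jobs.
    if [ -z "${SLURM_PACK_JOB_ID:-}" ]; then
        echo "job ${SLURM_JOB_ID:-unknown}: not a pack job component"
        return 0
    fi
    # Fall back to placeholders for any variable that is missing.
    echo "job ${SLURM_JOB_ID:-unknown}: pack job ${SLURM_PACK_JOB_ID}, offset ${SLURM_PACK_JOB_OFFSET:-?}, pack nodes ${SLURM_PACK_JOB_NODELIST:-unknown}"
}

describe_pack_component
```

A site could redirect this output to its own log file to see which components of a pack job triggered the prolog.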

14 Nov, 2017 2 commits
- Remove misleading comment · 490eb537
  Morris Jette authored Nov 14, 2017
  
  490eb537
- Fix for srun abort on dead pack job · e13ba85e
  Morris Jette authored Nov 14, 2017
```
Avoid srun abort trying to run on heterogeneous job component that has
    ended.
bug 4366
```
  e13ba85e
13 Nov, 2017 4 commits
- Fix for invalid memory reference · dd0b0a70
  Morris Jette authored Nov 13, 2017
```
bug 4374
```
  dd0b0a70
- Return information about pack job components · 64009e63
  Morris Jette authored Nov 13, 2017
```
Do so even if pack-group 0 is completed, so long as not all components are completed
bug 4374
```
  64009e63
- Cosmetic changes · 32e6bb8e
  Morris Jette authored Nov 13, 2017
  
  32e6bb8e
- Docs - fix slurmdbd.conf reference to MaxQueryTimeRange. · eec6fb97
  Tim Wickberg authored Nov 13, 2017
```
In a prior incarnation of the patch that introduced it,
it was MaxQueryTimeLimit, and that was not updated with
the code base when changed.

Bug 4365.
```
  eec6fb97
10 Nov, 2017 6 commits
- Start NEWS for v17.11.0rc4 · 532c0f6d
  Tim Wickberg authored Nov 09, 2017
  
  532c0f6d
- Update META for v17.11.0rc3. · b8dfd716
  Tim Wickberg authored Nov 09, 2017
```
Update slurm.spec and slurm.spec-legacy as well
```
  b8dfd716
- Merge branch 'slurm-17.02' into slurm-17.11 · d0126e66
  Tim Wickberg authored Nov 09, 2017
  
  d0126e66
- Fix debug3 message to print correct field. · 2896fce8
  Felip Moll authored Nov 09, 2017
```
Bug 4323.
```
  2896fce8
- Show reason field in 'sinfo -R' for nodes in failed state. · 8ae16f28
  Isaac Hartung authored Nov 09, 2017
```
This now matches the sinfo documentation.

Bug 4306.
```
  8ae16f28
- Revert "Address a race condition in the extern step launch." · 9f6eb9bf
  Tim Wickberg authored Nov 09, 2017
```
The race condition this is avoiding has been fixed elsewhere.

This reverts commit 6c21c8bd.
```
  9f6eb9bf
09 Nov, 2017 15 commits

Check return code of acct_gather_conf_init in the stepd · d0c068ca
Danny Auble authored Nov 09, 2017
```
Coverity 178912
```
d0c068ca
Fix regression from commit f4bf82c3 · 735669f8
Danny Auble authored Nov 09, 2017
```
Coverity CID 178913
```
735669f8
Missed one. · 105868eb
Danny Auble authored Nov 09, 2017

105868eb
Continuation of last patch. · 7440c8d7
Danny Auble authored Nov 09, 2017

7440c8d7
doc changed references to old slurm versions · dd9bb165
Felip Moll authored Nov 07, 2017
```
Removed references to versions <=16.05 and adapted to new 17.11
```
dd9bb165

launch/slurm plugin fix global memory re-use · a23c1032

Morris Jette authored Nov 09, 2017

launch/slurm plugin - Avoid using global variable for heterogeneous job
    steps, which could corrupt memory.
bug 4333

a23c1032

Retry MPI reserved port logic only for non-pack job steps · d64a5f67

Morris Jette authored Nov 09, 2017

Ancient versions of OpenMPI and their derivatives (i.e. Cray MPI) are
dependent upon communication ports being assigned to them by Slurm. Such MPI
jobs will experience step launch failure if any component of a
heterogeneous job step is unable to acquire the allocated ports.
Non-heterogeneous job steps will retry step launch using a new set of
communication ports (no change in Slurm behavior).

NOTE: Correcting this would necessitate assigning the same set of ports
to all components of the heterogeneous job (not possible today) plus changes to
srun in order to better synchronize the step startup and error handling.

d64a5f67

If any acct_gather_*_init fails, fatal instead of logging an error and continuing. · f4bf82c3
Dominik Bartkiewicz authored Nov 09, 2017
```
Same logic as was applied for energy in commit fb296c70.

Bug 4336
```
f4bf82c3

Heterogeneous step fix · 4dcd139d

Morris Jette authored Nov 09, 2017

If heterogeneous job step is unable to acquire MPI reserved ports then
    avoid referencing NULL pointer.
bug 4333

4dcd139d

Continuation of · 331912f1

Danny Auble authored Nov 09, 2017

Force tres change on a job to send data to the database.

This should be happening already, but this just makes it always happen.

331912f1

Fix regression in commit · cba14304

Danny Auble authored Nov 09, 2017

This fixes the possibility of going into this loop when we hadn't
set up the tres_req_cnt. The simple case Coverity reported is a job that
has already finished: it goes here and we never set up tres_req_cnt.

Coverity CID 178897

cba14304

Fix regression in · 7fd28ab6

Danny Auble authored Nov 09, 2017

This fixes the possibility of referencing a NULL pointer if the
reservation doesn't exist anymore when testing.

Coverity CID 178898

7fd28ab6

X11 forwarding - fix keepalive messaging support. · 6985ccba
Tim Wickberg authored Nov 09, 2017
```
Bug 3647.
```
6985ccba
Add note to release notes about contribs/slurm.spec-legacy. · c3c58ea5
Tim Wickberg authored Nov 09, 2017
```
Bug 4353.
```
c3c58ea5
slurm.spec - install Slurm's libpmi in an alternate location. · 1ef2d9bf
Doug Jacobsen authored Nov 08, 2017
```
Also collapse a nested %{with cray} block leftover from earlier work.

Bug 4332.
```
1ef2d9bf