Commits · be449407dc66f9edd8d04a11d3041eb1b091954a · Manuel G. Marciani / ces_slurm_simulator

06 Jun, 2018 1 commit

Don't allocate downed cloud nodes · be449407

Brian Christiansen authored Jun 05, 2018

which were marked down due to ResumeTimeout.

If a cloud node was marked down for not responding within ResumeTimeout,
the code inadvertently added the node back to the avail_node_bitmap after
set_node_down_ptr() had cleared it. The scheduler would then attempt to
allocate the node again, causing a loop of hitting ResumeTimeout and
allocating the downed node again.

Bug 5264

be449407
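
The bitmap problem in this commit message can be illustrated with a minimal sketch. It is not the Slurm code: `node_t`, `avail_bitmap`, `mark_node_down()`, `finish_resume()` and `node_is_schedulable()` are hypothetical stand-ins for the real `avail_node_bitmap` and `set_node_down_ptr()`, whose internals are not shown here. The point is only that the "resume finished" path should check the down flag before setting the availability bit again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not Slurm's real bitmap API. */
#define MAX_NODES 64

typedef struct {
    bool down;        /* node was marked DOWN, e.g. after a resume timeout */
    bool powered_up;  /* the resume attempt has finished */
} node_t;

uint64_t avail_bitmap; /* bit i set => node i may be scheduled */

/* Mark a node down and clear its availability bit. */
void mark_node_down(node_t *nodes, int i)
{
    assert(i >= 0 && i < MAX_NODES);
    nodes[i].down = true;
    avail_bitmap &= ~(UINT64_C(1) << i);
}

/* Called when a resume attempt finishes.
 * The guard on `down` is the fix: without it, a node that was just
 * marked down would be re-added and allocated again. */
void finish_resume(node_t *nodes, int i)
{
    assert(i >= 0 && i < MAX_NODES);
    nodes[i].powered_up = true;
    if (!nodes[i].down)
        avail_bitmap |= UINT64_C(1) << i;
}

bool node_is_schedulable(int i)
{
    assert(i >= 0 && i < MAX_NODES);
    return (avail_bitmap >> i) & 1;
}
```

With this guard, a healthy node becomes schedulable once its resume completes, while a node that timed out stays out of the bitmap until something explicitly returns it to service.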

05 Jun, 2018 2 commits
- Add LaunchParameters=send_gids to contribs/cray/slurm.conf.template. · ad27675d
  Tim Wickberg authored Jun 04, 2018
```
Bug 5180.
```
  ad27675d
- Add --without x11 option to rpmbuild in slurm.spec. · 5c5e10f8
  Killian authored Jun 04, 2018
```
Bug 5206.
```
  5c5e10f8
04 Jun, 2018 1 commit

Increase timeout for slurmdbd records · 84bcb04c

Morris Jette authored Jun 04, 2018

I was seeing rare failures on the test due to timing issues.
This increased timeout seems to fix the issue for me.

84bcb04c

02 Jun, 2018 1 commit
- Update PropagateResourceLimits docs · 4aabb3ab
  Michael Hinton authored Jun 01, 2018
```
NONE was not documented.

Bug 5161
```
  4aabb3ab
01 Jun, 2018 1 commit
- Remove test output file · 23737c5b
  Morris Jette authored Jun 01, 2018
```
Avoid left-over test input file
```
  23737c5b
31 May, 2018 4 commits

Correct data in SLUG CFP · b4033682
Morris Jette authored May 31, 2018

b4033682
Fix for bad variable initialization · dda4cac3
Morris Jette authored May 31, 2018

dda4cac3
Make uniform the way we free a resource allocation response message. · 01fc3c0f
Danny Auble authored May 31, 2018
```
No functional change.

Bug 4999.
```
01fc3c0f

Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path. · 17392e76

Alejandro Sanchez authored May 31, 2018

There were two code paths building an allocation response, each calling
its own static _build_alloc_msg() function:

1. src/slurmctld/proc_req.c
2. src/slurmctld/srun_comm.c

These two functions had diverged: each left unfilled some members that
the other filled in. This patch changes the signature of the one in
proc_req.c to make it extern, and srun_comm.c now calls this common
function.

Also added the cpu_freq_[min|max|gov] members to the common function,
since these were the only members missing from the proc_req.c version
(the one in srun_comm.c was missing more, such as all the ntasks_per*,
account, qos and resv_name members).

Bug 4999.

17392e76
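
The refactoring described in 17392e76 follows a common pattern: replace two diverging static helpers with one shared builder so every member is filled in one place. A rough sketch, assuming trimmed-down hypothetical types (`job_info_t`, `alloc_msg_t`) and hypothetical caller names; the real Slurm structures and function signatures differ and carry many more members:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down job record. */
typedef struct {
    uint32_t job_id;
    uint32_t cpu_freq_min, cpu_freq_max, cpu_freq_gov;
    const char *account;
    const char *qos;
    const char *resv_name;
} job_info_t;

/* Hypothetical, trimmed-down allocation response. */
typedef struct {
    uint32_t job_id;
    uint32_t cpu_freq_min, cpu_freq_max, cpu_freq_gov;
    const char *account;
    const char *qos;
    const char *resv_name;
} alloc_msg_t;

/* Single shared builder: every member is filled here, so the two
 * callers below cannot drift apart again. */
void build_alloc_msg(const job_info_t *job, alloc_msg_t *msg)
{
    assert(job && msg);
    memset(msg, 0, sizeof(*msg));
    msg->job_id       = job->job_id;
    msg->cpu_freq_min = job->cpu_freq_min;
    msg->cpu_freq_max = job->cpu_freq_max;
    msg->cpu_freq_gov = job->cpu_freq_gov;
    msg->account      = job->account;
    msg->qos          = job->qos;
    msg->resv_name    = job->resv_name;
}

/* Stand-in for the proc_req.c path. */
void proc_req_reply(const job_info_t *job, alloc_msg_t *out)
{
    build_alloc_msg(job, out);
}

/* Stand-in for the srun_comm.c path. */
void srun_comm_reply(const job_info_t *job, alloc_msg_t *out)
{
    build_alloc_msg(job, out);
}
```

Because both paths delegate to the same function, adding a new member later requires a change in only one place.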

30 May, 2018 24 commits
- Start NEWS for v17.11.8 · b96f0826
  Tim Wickberg authored May 30, 2018
  
  b96f0826
- Update META for v17.11.7. · 931745cf
  Tim Wickberg authored May 30, 2018
```
Update slurm.spec and slurm.spec-legacy as well
```
  931745cf
- Merge branch 'slurm-17.02' into slurm-17.11 · c30525a0
  Tim Wickberg authored May 30, 2018
  
  c30525a0
- Start NEWS for v17.02.12 · ca88df8c
  Tim Wickberg authored May 30, 2018
  
  ca88df8c
- Update META for v17.02.11 tag · 58d12386
  Tim Wickberg authored May 30, 2018
  
  58d12386
- Fix insecure handling of job requested gid. · 033dc0d1
  Marshall Garey authored May 29, 2018
```
Only trust MUNGE signed values, unless the RPC was signed by
SlurmUser or root.

CVE-2018-10995.
```
  033dc0d1
- Merge branch 'slurm-17.02' into slurm-17.11 · 92b92042
  Tim Wickberg authored May 30, 2018
  
  92b92042
- Remove redundant security checks for performance reasons. · 97ecaf03
  Tim Wickberg authored May 29, 2018
```
Already vetted by slurmctld/slurmd, no need to re-check here.
```
  97ecaf03
- Remove client-side validation on Native Cray systems. · ba6b9cfd
  Tim Wickberg authored May 29, 2018
  
  ba6b9cfd
- Validate gid and user_name values provided to slurmd up front. · df545955
  Tim Wickberg authored May 29, 2018
```
Do not defer until later, and do not potentially miss out on proper
validation of the user_name field which can lead to improper authentication
handling.

CVE-2018-10995.
```
  df545955
- Validate gid value in slurmctld, rather than deferring to slurmd/slurmstepd. · 75c3627c
  Tim Wickberg authored May 29, 2018
```
If the auth value (from MUNGE) does not match the requested value,
ensure it is listed as a valid extended gid for that user instead.
```
  75c3627c
- Do not swap in gid 0 in place of NO_VAL. · d57a0714
  Tim Wickberg authored May 29, 2018
  
  d57a0714
- Fix build issue on CentOS7 with HDF5 installed. · 85748280
  Tim Wickberg authored May 30, 2018
  
  85748280
- Fix invalid free() if HWLOC does not return a malloc()'d string. · 83c92a4d
  Dominik Bartkiewicz authored May 30, 2018
```
Bug 5038.
```
  83c92a4d
- Fix persistent typo. · baf8344e
  Tim Wickberg authored May 30, 2018
  
  baf8344e
- SLUG18 - conference dinner is Tuesday, not Monday · af27757d
  Tim Wickberg authored May 30, 2018
  
  af27757d
- Merge branch 'slurm-17.02' into slurm-17.11 · b5837d3f
  Tim Wickberg authored May 30, 2018
  
  b5837d3f
- Add ESLURM_GROUP_ID_MISSING error. · 26996fa8
  Tim Wickberg authored May 29, 2018
```
Value of 2113 is where it fits in with 17.11, so pin it here.
```
  26996fa8
- SLUG info update · f227eec4
  Morris Jette authored May 30, 2018
  
  f227eec4
- Update SLUG18 agenda · 36a518b4
  Morris Jette authored May 30, 2018
  
  36a518b4
- Update SLUG18 info · 952fd271
  Morris Jette authored May 30, 2018
  
  952fd271
- Change pthread_cond_signal to slurm_cond_signal. · 657aba97
  Michael Hinton authored May 29, 2018
  
  657aba97
- Fix race in pmixp_agent_start(). · 7c1dad6e
  Tim Wickberg authored May 29, 2018
```
Caused by pthread_cancel cleanup by commit e5f03971 in 17.11.6.

Bug 5181.
```
  7c1dad6e
- Fix deadlock in slurmstepd during shutdown. · 21ca33f5
  Tim Wickberg authored May 16, 2018
```
The race condition was created in a7c8964e in 17.11.6 when removing
the (unsafe) pthread_cancel code handling thread termination.

Bug 5164
```
  21ca33f5
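
The gid check described in commit 75c3627c above (accept the requested gid only if it matches the authenticated gid or appears in the user's extended group list) could look roughly like the sketch below. `gid_is_permitted()` and its parameters are hypothetical; how Slurm actually obtains the authenticated gid and the extended gid list is not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: returns true if `requested_gid` may be used.
 * `auth_gid` is assumed to come from the authentication credential,
 * `ext_gids` is assumed to be the user's extended group list. */
bool gid_is_permitted(gid_t auth_gid, gid_t requested_gid,
                      const gid_t *ext_gids, size_t n_ext_gids)
{
    assert(ext_gids != NULL || n_ext_gids == 0);

    /* The authenticated primary gid is always acceptable. */
    if (requested_gid == auth_gid)
        return true;

    /* Otherwise the requested gid must be one of the extended gids. */
    for (size_t i = 0; i < n_ext_gids; i++) {
        if (ext_gids[i] == requested_gid)
            return true;
    }
    return false;
}
```

Doing this check once, early, means later stages never see a gid the user is not entitled to.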
24 May, 2018 1 commit

Notify srun and ctld when unkillable stepd exits · 956a808d

Brian Christiansen authored May 16, 2018

Commits f18390e8 and eed76f85 modified the stepd so that, on an
unkillable step timeout, the stepd would simply exit. If the stepd is a
batch step, it replies to the controller with a non-zero exit code,
which drains the node. But if an srun allocation/step hit the unkillable
step code, the stepd would not tell the waiting srun or the controller
that the step was going away, leaving a hanging srun and job.

This patch enables the stepd to notify the waiting sruns and the ctld
that the stepd is done, and drains the node for srun'ed allocations
and/or steps.

Bug 5164

956a808d

21 May, 2018 1 commit
- _post_qos_list() modifies global variables · da1eb7c7
  Dominik Bartkiewicz authored May 21, 2018
```
g_qos_count, g_qos_max_priority; must be called under qos write lock.

Bug 5159.
```
  da1eb7c7
19 May, 2018 2 commits
- Fix warning message in test17.7 · c7ddc591
  Brian Christiansen authored May 18, 2018
```
Display correct path.
```
  c7ddc591
- Fix test17.7 to work with symlinked home dirs · c3b41366
  Bjørn-Helge Mevik authored May 18, 2018
```
Bug 5151
```
  c3b41366
18 May, 2018 2 commits

Update test3.4 for regular user · 9d82faa1

Brian Christiansen authored May 18, 2018

Commits 4454316e and 76706b51 adjusted the priority update logic so
that when a non-authorized user modifies the priority, the change is only
temporary; in most cases the user will never see it.

Bug 5151

9d82faa1

Update limits on GRES docs · 1e1cd45e
Marshall Garey authored May 18, 2018
```
Clarification of c2c06468.

Bug 5150
```
1e1cd45e