  1. 13 Jul, 2018 1 commit
  2. 12 Jul, 2018 3 commits
  3. 09 Jul, 2018 1 commit
  4. 06 Jul, 2018 1 commit
    • Fix leaking freezer cgroups. · 7f9c4f73
      Marshall Garey authored
      Continuation of 923c9b37.
      
      There is a delay in the cgroup system when moving a PID from one cgroup
      to another. It is usually short, but if we don't wait for the PID to
      move before removing the cgroup directories the PID previously belonged
      to, we can leak cgroups. This was previously fixed in the cpuset and
      devices subsystems; this commit applies the same logic to the freezer
      subsystem.
      
      Bug 5082.
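
      A standalone C sketch of the wait-then-remove idea described above:
      poll the old cgroup's cgroup.procs file until the PID has left it, and
      only then remove the directory. The path handling and the
      _pid_in_cgroup()/_remove_cgroup_after_move() helpers are hypothetical
      illustrations, not the functions Slurm itself uses.

      #include <stdbool.h>
      #include <stdio.h>
      #include <sys/types.h>
      #include <unistd.h>

      /* Return true if pid is still listed in <cgroup_dir>/cgroup.procs. */
      static bool _pid_in_cgroup(const char *cgroup_dir, pid_t pid)
      {
          char path[4096];
          long entry;
          bool found = false;
          FILE *fp;

          snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
          fp = fopen(path, "r");
          if (!fp)
              return false;    /* cgroup directory already gone */

          while (fscanf(fp, "%ld", &entry) == 1) {
              if ((pid_t) entry == pid) {
                  found = true;
                  break;
              }
          }
          fclose(fp);
          return found;
      }

      /* Wait (briefly) for the PID to leave its old cgroup, then remove the
       * now-empty directory; removing it too early is what leaks cgroups. */
      static int _remove_cgroup_after_move(const char *old_cgroup_dir, pid_t pid)
      {
          for (int i = 0; i < 100; i++) {      /* up to ~1 second */
              if (!_pid_in_cgroup(old_cgroup_dir, pid))
                  return rmdir(old_cgroup_dir);
              usleep(10000);                   /* 10 ms between checks */
          }
          return -1;    /* PID never moved; leave the directory alone */
      }

      int main(void)
      {
          /* demo values only: a real caller would pass the job step's old
           * freezer cgroup directory and the PID that was moved out of it */
          int rc = _remove_cgroup_after_move("/sys/fs/cgroup/freezer/demo", 12345);
          printf("remove rc = %d\n", rc);
          return 0;
      }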
  5. 04 Jul, 2018 1 commit
  6. 03 Jul, 2018 1 commit
  7. 26 Jun, 2018 4 commits
  8. 25 Jun, 2018 1 commit
  9. 22 Jun, 2018 1 commit
  10. 20 Jun, 2018 1 commit
    • Make job_start_data() multi partition aware on REQUEST_JOB_WILL_RUN. · 35a13703
      Alejandro Sanchez authored
      Previously the function only tested against the first partition in the
      job_record. Now it detects whether the job request spans multiple
      partitions and, if so, loops through all of them until the job will run
      in one of them or the end of the list is reached, returning the error
      code from the last partition if the job won't run in any of them.
      
      Bug 5185
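
      A minimal C sketch of that loop. The part_record_t list and the
      _will_run_in_part() check are hypothetical stand-ins for the slurmctld
      internals; only the control flow is the point: try each requested
      partition, stop at the first one that works, otherwise keep the error
      code from the last one tried.

      #include <stdio.h>

      typedef struct part_record {
          const char *name;
          struct part_record *next;
      } part_record_t;

      /* Hypothetical per-partition test: 0 if the job will run, else an
       * error code. */
      static int _will_run_in_part(const part_record_t *part)
      {
          (void) part;
          return 1;    /* pretend the job does not fit in this partition */
      }

      static int _job_will_run(const part_record_t *part_list)
      {
          int rc = 1;    /* error code from the last partition tried */

          for (const part_record_t *p = part_list; p; p = p->next) {
              rc = _will_run_in_part(p);
              if (rc == 0)
                  break;    /* the job will run here; stop looking */
          }
          return rc;
      }

      int main(void)
      {
          part_record_t p2 = { "debug", NULL };
          part_record_t p1 = { "batch", &p2 };

          printf("will-run rc = %d\n", _job_will_run(&p1));
          return 0;
      }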
  11. 19 Jun, 2018 2 commits
  12. 18 Jun, 2018 1 commit
  13. 15 Jun, 2018 2 commits
  14. 12 Jun, 2018 3 commits
  15. 08 Jun, 2018 2 commits
  16. 06 Jun, 2018 1 commit
    • Don't allocate downed cloud nodes · be449407
      Brian Christiansen authored
      which were marked down due to ResumeTimeout.
      
      If a cloud node was marked down due to not responding by ResumeTimeout,
      the code inadvertently added the node back to the avail_node_bitmap
      after it had been cleared by set_node_down_ptr(). The scheduler would
      then attempt to allocate the node again, causing a loop of hitting
      ResumeTimeout and allocating the downed node once more.
      
      Bug 5264
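
      A simplified sketch of the bitmap bookkeeping involved. The node_state
      array and 64-bit bitmap below are stand-ins for slurmctld's
      avail_node_bitmap handling; the point is that a node marked DOWN must
      not get its bit set again when the available-node bitmap is rebuilt.

      #include <stdint.h>
      #include <stdio.h>

      #define MAX_NODES 64

      enum node_state { NODE_IDLE, NODE_DOWN };

      static uint64_t avail_node_bitmap;            /* bit i => node i allocatable */
      static enum node_state node_state[MAX_NODES];

      static void mark_node_down(int node)
      {
          node_state[node] = NODE_DOWN;
          avail_node_bitmap &= ~(UINT64_C(1) << node);    /* clear its bit */
      }

      /* The bug was effectively re-adding DOWN cloud nodes at this point;
       * checking the node state keeps the downed node out of the bitmap. */
      static void rebuild_avail_bitmap(int node_cnt)
      {
          avail_node_bitmap = 0;
          for (int i = 0; i < node_cnt; i++) {
              if (node_state[i] != NODE_DOWN)
                  avail_node_bitmap |= UINT64_C(1) << i;
          }
      }

      int main(void)
      {
          mark_node_down(3);    /* cloud node hit ResumeTimeout */
          rebuild_avail_bitmap(8);
          printf("node 3 available: %s\n",
                 (avail_node_bitmap & (UINT64_C(1) << 3)) ? "yes" : "no");
          return 0;
      }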
  17. 05 Jun, 2018 1 commit
  18. 31 May, 2018 1 commit
    • Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path. · 17392e76
      Alejandro Sanchez authored
      There were two code paths building an allocation response, each calling
      its own static _build_alloc_msg() function:
      
      1. src/slurmctld/proc_req.c
      2. src/slurmctld/srun_comm.c
      
      These two functions had diverged, and each filled in members that the
      other left unset. This patch changes the signature of the one in
      proc_req.c to make it extern, and srun_comm.c now calls this newly
      common function.
      
      Also added the cpu_freq_[min|max|gov] members to the common function,
      since these were the only members missing from the proc_req.c version
      (the srun_comm.c version was missing more, such as all the ntasks_per*
      fields, account, qos and resv_name).
      
      Bug 4999.
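
      A sketch of the refactor pattern, with a hypothetical alloc_msg_t in
      place of the real allocation-response message: one shared builder fills
      every field and both former call sites use it, so two copies can no
      longer drift apart.

      #include <stdio.h>
      #include <string.h>

      typedef struct {
          int job_id;
          int cpu_freq_min, cpu_freq_max, cpu_freq_gov;
          char account[64];
      } alloc_msg_t;

      /* Previously two static copies of a function like this existed, each
       * missing fields the other filled in; one common version fixes that. */
      static void build_alloc_msg(alloc_msg_t *msg, int job_id,
                                  int freq_min, int freq_max, int freq_gov,
                                  const char *account)
      {
          memset(msg, 0, sizeof(*msg));
          msg->job_id = job_id;
          msg->cpu_freq_min = freq_min;
          msg->cpu_freq_max = freq_max;
          msg->cpu_freq_gov = freq_gov;
          snprintf(msg->account, sizeof(msg->account), "%s",
                   account ? account : "");
      }

      int main(void)
      {
          alloc_msg_t from_proc_req, from_srun_comm;

          /* both "call sites" now go through the same builder */
          build_alloc_msg(&from_proc_req, 101, 1000000, 2000000, 0, "acct_a");
          build_alloc_msg(&from_srun_comm, 102, 0, 0, 0, "acct_b");
          printf("%d %d\n", from_proc_req.job_id, from_srun_comm.job_id);
          return 0;
      }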
  19. 30 May, 2018 7 commits
  20. 24 May, 2018 1 commit
    • Notify srun and ctld when unkillable stepd exits · 956a808d
      Brian Christiansen authored
      Commits f18390e8 and eed76f85 modified the stepd so that if it
      encountered an unkillable step timeout it would simply exit. If the
      stepd is a batch step, it replies back to the controller with a
      non-zero exit code, which drains the node. But if an srun
      allocation/step got into the unkillable step code, the stepd wouldn't
      let the waiting srun or the controller know about the step going away,
      leaving a hanging srun and job.
      
      This patch enables the stepd to notify the waiting sruns and the ctld
      when the stepd exits, and drains the node for srun'ed allocations
      and/or steps.
      
      Bug 5164
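
      A sketch of the ordering this commit enforces, with stub functions
      standing in for the real stepd/ctld messages: the waiting sruns and the
      controller are told about the failure, and the node drain is requested,
      before the stepd exits.

      #include <stdio.h>
      #include <stdlib.h>

      static void notify_waiting_sruns(int step_failed)
      {
          /* stand-in for sending the step-completion status to srun */
          printf("srun notified, failed=%d\n", step_failed);
      }

      static void notify_controller_drain_node(const char *reason)
      {
          /* stand-in for telling slurmctld so it can drain the node */
          printf("ctld notified: %s\n", reason);
      }

      static void handle_unkillable_step_timeout(void)
      {
          notify_waiting_sruns(1);
          notify_controller_drain_node("unkillable step timeout");
          exit(1);    /* exit only after everyone has been told */
      }

      int main(void)
      {
          handle_unkillable_step_timeout();
          return 0;
      }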
  21. 21 May, 2018 1 commit
  22. 17 May, 2018 1 commit
  23. 15 May, 2018 1 commit
    • PMIx - override default paths at configure time if --with-pmix is used. · 635c0232
      Alejandro Sanchez authored
      Previously the default paths continued to be tested even when new ones
      were requested. As a consequence, if any of the new paths was the same
      as one of the default ones (e.g. /usr or /usr/local), the configure
      script incorrectly errored out, reporting that a version of PMIx had
      already been found in a previous path.
      
      Bug 5168.
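
      The change itself lives in the configure script, but the selection
      logic can be sketched in C: user-supplied paths replace the defaults
      instead of being appended to them, so a user path equal to /usr or
      /usr/local is no longer mistaken for a second installation. The path
      lists and select_search_paths() helper are illustrative only.

      #include <stdio.h>

      static const char *default_paths[] = { "/usr", "/usr/local", NULL };

      /* Override, don't append: when the user asked for specific locations,
       * the default locations are ignored entirely. */
      static const char **select_search_paths(const char **user_paths)
      {
          return (user_paths && user_paths[0]) ? user_paths : default_paths;
      }

      int main(void)
      {
          /* a user path matching a default is now searched exactly once */
          const char *user_paths[] = { "/usr", NULL };
          const char **paths = select_search_paths(user_paths);

          for (int i = 0; paths[i]; i++)
              printf("searching %s for PMIx\n", paths[i]);
          return 0;
      }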
  24. 10 May, 2018 1 commit
    • Fix different issues when requesting memory per cpu/node. · bf4cb0b1
      Alejandro Sanchez authored
      The first issue was identified on multi-partition requests.
      job_limits_check() was overriding the original memory requests, so the
      next partition Slurm validated limits against was not using the
      original values. The solution consists of adding three members to the
      job_details struct to preserve the original requests. This issue is
      reported in bug 4895.
      
      The second issue was that memory enforcement behaved differently
      depending on whether the job request was issued against a reservation
      or not.
      
      The third issue had to do with the automatic adjustments Slurm made
      underneath when the memory request exceeded the limit. These
      adjustments included increasing pn_min_cpus (even, incorrectly, beyond
      the number of cpus available on the nodes) or tricks such as increasing
      cpus_per_task while decreasing mem_per_cpu.
      
      The fourth issue was identified when requesting the special case of 0
      memory, which was handled inside the select plugin after the partition
      validations and thus could be used to incorrectly bypass the limits.
      
      Issues 2-4 were identified in bug 4976.
      
      The patch also includes an entire refactor of how and when job memory
      is set to default values (if not requested initially) and how and when
      limits are validated.
      
      Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
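
      A sketch of the "preserve the original request" part of the fix, using
      a hypothetical job_details_t/partition_t pair rather than the real
      slurmctld structs: the working memory value is restored from the saved
      original before each partition's limits are checked, so an adjustment
      made for one partition cannot leak into the next.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef struct {
          uint64_t pn_min_memory;        /* working value, may be adjusted */
          uint64_t orig_pn_min_memory;   /* what the user actually requested */
      } job_details_t;

      typedef struct {
          const char *name;
          uint64_t max_mem_per_node;
      } partition_t;

      static bool check_partition_limits(job_details_t *detail,
                                         const partition_t *part)
      {
          /* restore the original request before every partition check */
          detail->pn_min_memory = detail->orig_pn_min_memory;
          return detail->pn_min_memory <= part->max_mem_per_node;
      }

      int main(void)
      {
          job_details_t detail = { .pn_min_memory = 4096,
                                   .orig_pn_min_memory = 4096 };
          partition_t parts[] = { { "small", 2048 }, { "big", 8192 } };

          for (int i = 0; i < 2; i++)
              printf("%s: %s\n", parts[i].name,
                     check_partition_limits(&detail, &parts[i]) ?
                     "within limits" : "exceeds limit");
          return 0;
      }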