1. 20 Jun, 2018 1 commit
    • Make job_start_data() multi partition aware on REQUEST_JOB_WILL_RUN. · 35a13703
      Alejandro Sanchez authored
      Previously the function only tested against the first partition in the
      job_record. Now it detects whether the job request spans multiple
      partitions and, if so, loops through all of them until the job can run
      in one of them or the end of the list is reached, returning the error
      code from the last partition if the job cannot run in any of them (a
      sketch of the loop follows the entry).
      
      Bug 5185
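
      A minimal sketch of the loop this describes, using hypothetical names
      (the real job_start_data() uses Slurm's internal list API and
      job_record fields, not the types shown here):

        /* Hypothetical sketch of the multi-partition will-run loop. */
        #include <stddef.h>

        #define SLURM_SUCCESS 0

        struct partition;                    /* opaque partition record */
        typedef int (*will_run_fn)(struct partition *);

        static int will_run_any_partition(struct partition **parts,
                                          size_t nparts, will_run_fn test)
        {
            int rc = SLURM_SUCCESS;

            for (size_t i = 0; i < nparts; i++) {
                rc = test(parts[i]);    /* will-run test for this partition */
                if (rc == SLURM_SUCCESS)
                    return rc;          /* job can run here; stop looking */
            }
            return rc;  /* error code from the last partition tried */
        }

      The key detail is that rc is overwritten on every iteration, so a
      failure to start carries the error code of the last partition checked.
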
  2. 19 Jun, 2018 2 commits
  3. 18 Jun, 2018 1 commit
  4. 15 Jun, 2018 2 commits
  5. 12 Jun, 2018 3 commits
  6. 08 Jun, 2018 2 commits
  7. 06 Jun, 2018 1 commit
    • Don't allocate downed cloud nodes · be449407
      Brian Christiansen authored
      If a cloud node was marked down due to not responding by ResumeTimeout,
      the code inadvertently added the node back to the avail_node_bitmap
      after it had been cleared by set_node_down_ptr(). The scheduler would
      then attempt to allocate the node again, causing a loop of hitting
      ResumeTimeout and re-allocating the downed node (see the sketch after
      this entry).
      
      Bug 5264
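
      A simplified sketch of the invariant the fix restores, using plain C
      stand-ins rather than Slurm's bitstring API and node records: once a
      node has been downed for exceeding ResumeTimeout, the cloud/power-up
      path must not mark it allocatable again.

        /* Simplified model, not Slurm's actual avail_node_bitmap code. */
        #include <stdbool.h>

        struct node_state {
            bool down;        /* marked DOWN, e.g. ResumeTimeout exceeded */
            bool powered_up;  /* cloud node finished booting */
        };

        /* avail[i] mirrors avail_node_bitmap: true = scheduler may use node i */
        static void update_avail_bit(bool *avail, int i,
                                     const struct node_state *ns)
        {
            if (ns->down) {
                avail[i] = false;  /* never re-add a downed node */
                return;
            }
            avail[i] = ns->powered_up;
        }

      The bug was equivalent to skipping the down check, which let the
      scheduler pick the node again and repeat the ResumeTimeout cycle.
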
  8. 05 Jun, 2018 1 commit
  9. 31 May, 2018 1 commit
    • Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path. · 17392e76
      Alejandro Sanchez authored
      There were two code paths building an allocation response, each calling
      its own static _build_alloc_msg() function:
      
      1. src/slurmctld/proc_req.c
      2. src/slurmctld/srun_comm.c
      
      The two functions diverged, and each left members unfilled that the
      other filled in. This patch changes the signature of the one in
      proc_req.c to make it extern, and srun_comm.c now calls the newly
      shared function.
      
      Also added the cpu_freq_[min|max|gov] members to the shared function,
      since these were the only members missing from the proc_req.c version
      (the srun_comm.c version was missing more, such as all the ntasks_per*
      fields, account, qos and resv_name). A rough sketch of the
      consolidation follows the entry.
      
      Bug 4999.
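
      A rough sketch of the consolidation with illustrative struct and field
      names (the real code fills a resource_allocation_response_msg_t from a
      job_record): one extern builder fills every member, including
      cpu_freq_[min|max|gov], and both call sites use it.

        /* Illustrative only; not the actual Slurm signature. */
        #include <stdint.h>
        #include <stdlib.h>
        #include <string.h>

        struct alloc_resp {
            uint32_t job_id;
            uint32_t cpu_freq_min, cpu_freq_max, cpu_freq_gov;
            char *account;
            char *qos;
        };

        struct job_src {
            uint32_t job_id;
            uint32_t cpu_freq_min, cpu_freq_max, cpu_freq_gov;
            const char *account;
            const char *qos;
        };

        /* Single shared builder: extern so proc_req.c and srun_comm.c
         * both call the same code instead of divergent static copies. */
        struct alloc_resp *build_alloc_msg(const struct job_src *job)
        {
            struct alloc_resp *resp = calloc(1, sizeof(*resp));

            if (!resp)
                return NULL;
            resp->job_id = job->job_id;
            resp->cpu_freq_min = job->cpu_freq_min; /* previously missing */
            resp->cpu_freq_max = job->cpu_freq_max; /* in one of the copies */
            resp->cpu_freq_gov = job->cpu_freq_gov;
            resp->account = job->account ? strdup(job->account) : NULL;
            resp->qos = job->qos ? strdup(job->qos) : NULL;
            return resp;
        }
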
  10. 30 May, 2018 7 commits
  11. 24 May, 2018 1 commit
    • Notify srun and ctld when unkillable stepd exits · 956a808d
      Brian Christiansen authored
      Commits f18390e8 and eed76f85 modified the stepd so that if it
      encountered an unkillable step timeout it would simply exit. If the
      stepd is a batch step, it replies to the controller with a non-zero
      exit code, which drains the node. But if an srun allocation/step got
      into the unkillable step code, the stepd would not let the waiting srun
      or the controller know the step was going away -- leaving a hanging
      srun and job.
      
      This patch makes the stepd notify the waiting sruns and the ctld when
      the stepd exits, and drains the node for srun'ed allocations and/or
      steps (a sketch of the flow follows the entry).
      
      Bug 5164
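
      A rough sketch of the notification flow, with hypothetical helper
      names standing in for Slurm's internal RPCs: before exiting on an
      unkillable-step timeout, the stepd tells any waiting srun and the
      controller that the step is gone, so nothing is left hanging.

        /* Hypothetical shutdown path; real code uses Slurm's RPC layer. */
        #include <stdio.h>
        #include <stdlib.h>

        /* Stand-ins for Slurm internals (assumed, not actual APIs). */
        static void notify_waiting_sruns(int rc) { (void) rc; }
        static void notify_slurmctld_step_done(int rc) { (void) rc; }

        static void unkillable_step_timeout(void)
        {
            int rc = EXIT_FAILURE;           /* non-zero: drain the node */

            notify_waiting_sruns(rc);        /* srun stops waiting */
            notify_slurmctld_step_done(rc);  /* ctld learns the step exited */
            fprintf(stderr, "stepd: unkillable step, exiting\n");
            exit(rc);   /* previously exited here with no notification
                         * for srun-launched allocations/steps */
        }
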
  12. 21 May, 2018 1 commit
  13. 17 May, 2018 1 commit
  14. 15 May, 2018 1 commit
    • PMIx - override default paths at configure time if --with-pmix is used. · 635c0232
      Alejandro Sanchez authored
      Previously the default paths were still tested even when new ones were
      requested. As a consequence, if any of the requested paths matched one
      of the defaults (e.g. /usr or /usr/local), the configure script
      incorrectly errored out, reporting that a version of PMIx had already
      been found in a previous path.
      
      Bug 5168.
  15. 10 May, 2018 1 commit
    • Fix different issues when requesting memory per cpu/node. · bf4cb0b1
      Alejandro Sanchez authored
      The first issue was identified on multi-partition requests.
      job_limits_check() was overriding the original memory requests, so the
      next partition Slurm validated limits against was not using the
      original values. The solution is to add three members to the
      job_details struct to preserve the original requests (a simplified
      sketch follows the entry). This issue is reported in bug 4895.
      
      The second issue was that memory enforcement behaved differently
      depending on whether the job request was issued against a reservation
      or not.
      
      The third issue had to do with the automatic adjustments Slurm made
      underneath when the memory request exceeded the limit. These
      adjustments included increasing pn_min_cpus (even incorrectly beyond
      the number of cpus available on the nodes) or other tricks that
      increased cpus_per_task and decreased mem_per_cpu.
      
      The fourth issue was identified when requesting the special case of 0
      memory, which was handled inside the select plugin after the partition
      validations and thus could be used to incorrectly bypass the limits.
      
      Issues 2-4 were identified in bug 4976.
      
      The patch also includes a full refactor of how and when job memory is
      set to default values (if not requested initially) and how and when
      limits are validated.
      
      Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
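
      An illustrative sketch of the first fix; the member names below are
      hypothetical (the commit adds three such members to job_details), but
      the idea is to validate each partition against the original request
      rather than a value mutated while checking an earlier one.

        /* Simplified model, not Slurm's job_details or job_limits_check(). */
        #include <stdbool.h>
        #include <stdint.h>

        struct job_details_sketch {
            uint64_t pn_min_memory;      /* working value a limit check may adjust */
            uint64_t orig_pn_min_memory; /* hypothetical: preserved original request */
        };

        struct part_limits {
            uint64_t max_mem_per_node;
        };

        static bool check_partition_mem(struct job_details_sketch *d,
                                        const struct part_limits *p)
        {
            /* Restore the original request before validating this partition. */
            d->pn_min_memory = d->orig_pn_min_memory;
            return d->pn_min_memory <= p->max_mem_per_node;
        }
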
  16. 09 May, 2018 9 commits
  17. 08 May, 2018 3 commits
    • Prevent slurmd from launching steps if the prolog fails · 3b029021
      Brian Christiansen authored
      Bug 5146
    • Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
      Tim Wickberg authored
      Caused by a corrupted protocol_version field value being received by
      the slurmstepd, as we cannot safely write/read a uint16_t across the
      pipe as if it were an int (see the minimal example below).
      
      Regression caused by commit 90b116c2.
      
      Bug 5133.
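
      A minimal, generic example of the underlying pitfall (plain POSIX
      code, not the Slurm fix itself): both ends of the pipe must read and
      write exactly sizeof(uint16_t), otherwise the protocol_version
      received by the stepd can be corrupted.

        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fds[2];
            uint16_t protocol_version = 0x2200;  /* example value */
            uint16_t received = 0;

            if (pipe(fds) != 0)
                return 1;
            /* Writing &protocol_version with sizeof(int) would read past
             * the variable and corrupt the value on the other end. */
            if (write(fds[1], &protocol_version, sizeof(protocol_version)) !=
                (ssize_t) sizeof(protocol_version))
                return 1;
            if (read(fds[0], &received, sizeof(received)) !=
                (ssize_t) sizeof(received))
                return 1;
            printf("received protocol_version = 0x%04x\n", received);
            return 0;
        }
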
    • Fix checkpointing requeued jobs in a bad state · f9f395af
      Brian Christiansen authored
      Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
      in. The issue is that if job_set_alloc_tres gets called while the job
      is in the PENDING|COMPLETING state, the job's tres_alloc_str will be
      freed. If the job then gets checkpointed in this state
      (PENDING|COMPLETING + no tres_alloc_str), on startup the controller
      would crash because it expected the job to have a tres_alloc_str/cnt
      while in the COMPLETING state. This could be triggered by starting the
      controller without the dbd up: when the dbd comes up, the
      assoc_cache_mgr calls _update_job_tres(), which calls
      job_set_alloc_tres. It could also be triggered by adding new TRES. A
      sketch of the failing state check follows the entry.
      
      This most likely started happening in 17.11.5 because of commit
      865b672f, which introduced calling _update_job_tres() on each job
      after the dbd comes up.
      
      Bugs 5137, 4522
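
      A hedged sketch of the failure mode with illustrative flag values
      (Slurm does track COMPLETING as a flag bit on top of the base job
      state, but the names below are not the real macros):

        /* Illustrative state check, not Slurm's job_record or IS_JOB_* macros. */
        #include <stdbool.h>
        #include <stddef.h>

        enum {
            STATE_PENDING         = 0,
            STATE_COMPLETING_FLAG = 0x8000,
        };

        struct job_sketch {
            unsigned int state;    /* base state plus flag bits */
            char *tres_alloc_str;  /* expected non-NULL while COMPLETING */
        };

        /* The crash pattern: a restart that assumes tres_alloc_str is set
         * for any COMPLETING job fails on a requeued PENDING|COMPLETING job
         * whose string was freed while it sat in that transient state. */
        static bool checkpoint_is_loadable(const struct job_sketch *j)
        {
            if ((j->state & STATE_COMPLETING_FLAG) && !j->tres_alloc_str)
                return false;   /* previously an assert/crash at startup */
            return true;
        }
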
  18. 04 May, 2018 1 commit
  19. 03 May, 2018 1 commit