Commits · 23e815c69310ca24550f1dea598369fb92ada709 · Manuel G. Marciani / ces_slurm_simulator

04 Jul, 2018 1 commit

Fix read slurm.conf performance issues · 23e815c6

Felip Moll authored Jul 04, 2018

Cleaned up code that could have caused performance issues when reading the
config when there were nodes with features defined.

Bug 4980

23e815c6

03 Jul, 2018 1 commit
- Added pending RPC stats to sdiag output · 6033f246
  Broderick Gardner authored Jul 02, 2018
```
bug 5337
```
  6033f246
26 Jun, 2018 2 commits
- Prevent reboot of busy KNL node when asking for inactive features. · d8c5379b
  Felip Moll authored Jun 26, 2018
```
Previously, when one asked for an inactive feature and also specified the node
with the -w flag, the node would be rebooted even though it might contain
running jobs.

Bug 4821
```
  d8c5379b
- Reorder proctrack/task plugin load in the slurmstepd to match that of slurmd · 164da888
  Tim Wickberg authored Jun 25, 2018
```
and avoid the race condition that calling task before proctrack can introduce.

Bug 5319
```
  164da888
25 Jun, 2018 1 commit
- Add new job dependency type of "afterburstbuffer". The pending job will be · 3d4baee9
  Morris Jette authored Jun 25, 2018
```
delayed until the first job completes execution and its burst buffer
stage-out is completed.

Bug 4675
```
  3d4baee9
22 Jun, 2018 2 commits
- Define alternate MailProg configuration parameter · 914ef205
  Morris Jette authored Jun 22, 2018
```
If MailProg is not configured and "/bin/mail" (the default) does
not exist, but "/usr/bin/mail" does exist then use "/usr/bin/mail"
as a default value.
```
  914ef205
- Prevent slurmctld from aborting when attempting to set a non-existent qos as def_qos_id · c9682e1a
  Dominik Bartkiewicz authored Jun 22, 2018
```
Bug 5159.
```
  c9682e1a
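
The MailProg fallback described in 914ef205 above amounts to a small decision rule. A minimal Python sketch, assuming a hypothetical helper `pick_mail_prog` and an injectable `exists` predicate (neither is part of Slurm, which implements this in C):

```python
import os

DEFAULT_MAIL_PROG = "/bin/mail"        # default named in the commit message
ALTERNATE_MAIL_PROG = "/usr/bin/mail"  # alternate named in the commit message


def pick_mail_prog(configured=None, exists=os.path.exists):
    """Return the mail program path to use (hypothetical helper, not Slurm code)."""
    if configured:
        # An explicit MailProg setting always wins.
        return configured
    if not exists(DEFAULT_MAIL_PROG) and exists(ALTERNATE_MAIL_PROG):
        # Default is missing but the alternate is present: use the alternate.
        return ALTERNATE_MAIL_PROG
    return DEFAULT_MAIL_PROG
```

Passing `exists` as a parameter is only a convenience for trying the rule without touching the real filesystem.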
20 Jun, 2018 2 commits

Add partition name to will_run_response_msg · b577ab71
Morris Jette authored Jun 20, 2018

b577ab71

Make job_start_data() multi partition aware on REQUEST_JOB_WILL_RUN. · 35a13703

Alejandro Sanchez authored Jun 20, 2018

Previously the function was only testing against the first partition in
the job_record. Now it detects if the job request is multi partition and
if so then loops through all of them until the job will run in any or
until the end of the list, returning the error code from the last one if
the job won't run in any partition.

Bug 5185

35a13703
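
The loop described in 35a13703 can be sketched as follows. This is an illustration in Python rather than the C code in slurmctld; `will_run_any_partition` and the `test_partition` callback are hypothetical names, and `SUCCESS` stands in for the controller's success code.

```python
SUCCESS = 0  # stand-in for the success return code


def will_run_any_partition(partitions, test_partition):
    """Try each partition in order; return (rc, partition).

    test_partition is a hypothetical callback that returns SUCCESS when the
    job would run in the given partition, or a non-zero error code otherwise.
    """
    if not partitions:
        raise ValueError("job must request at least one partition")
    rc = SUCCESS
    for part in partitions:
        rc = test_partition(part)
        if rc == SUCCESS:
            # Stop at the first partition where the job will run.
            return rc, part
    # No partition accepted the job: report the error from the last one.
    return rc, None
```

With a single-partition request the loop runs once, which matches the earlier behavior of testing only the first partition.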

19 Jun, 2018 3 commits

Don't enforce MaxQueryTimeRange with specific jobs · d41cb31a

Isaac Hartung authored Jun 19, 2018

When requesting specific jobids with sacct, the starttime of the request
is 0, which will cause the time range to be outside of the
MaxQueryTimeRange range -- if specified. When requesting specific
jobids, sacct should be able to find the job whenever it started --
unless confined to a smaller range with -S and/or -E.

Bug 5009

d41cb31a
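
The rule in d41cb31a can be pictured as a guard in front of the range check. A minimal Python sketch, assuming a hypothetical `query_allowed` function with times in epoch seconds; the real check is C code and may differ in detail:

```python
def query_allowed(start, end, max_range, job_ids=()):
    """Return True if an accounting query passes the time-range check.

    start and end are epoch seconds; max_range is the configured limit in
    seconds, or None when MaxQueryTimeRange is not set (illustrative model).
    """
    if max_range is None:
        return True  # no limit configured
    if job_ids:
        # Specific jobs were requested: do not enforce the limit, so a job
        # can be found whenever it started.
        return True
    return (end - start) <= max_range
```

A request for explicit job IDs arrives with a start time of 0, so without the `job_ids` guard the computed range would almost always exceed the limit.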

heterogeneous job scheduling fix · 118a73b6

Morris Jette authored Jun 19, 2018

For a heterogeneous job component with required nodes, explicitly exclude
those nodes from all other job components.

118a73b6

Fix send_gids info. in NEWS and RELEASE_NOTES. · 4c49b4bb
Felip Moll authored Jun 19, 2018

4c49b4bb

18 Jun, 2018 1 commit
- MySQL - Prevent deadlock caused by archive logic locking reads. · 4c448fd7
  Danny Auble authored Jun 18, 2018
```
Specifically due to SELECT ... FOR UPDATE ones.

Bug 5086.
```
  4c448fd7
15 Jun, 2018 2 commits
- job_submit/lua - fix access into reservation table. · d512be7b
  Marshall Garey authored Jun 15, 2018
```
Bug 5270.
```
  d512be7b
- Allow job_submit_plugin_modify() to change admin_comment field. · d939cb94
  Tim Wickberg authored Jun 14, 2018
```
Instead of unintentionally rejecting the update from a non-Administrator
if the job_submit plugin modified that field.

Bug 5306.
```
  d939cb94
14 Jun, 2018 1 commit
- reset job time limit for deadline · e5db58a9
  Felip Moll authored Jun 14, 2018
```
sched/backfill: Reset job time limit if needed for deadline scheduling.
bug 5183
```
  e5db58a9
13 Jun, 2018 1 commit

Remove AdminComment += syntax from 'scontrol update job'. · 1edd511f

Tim Wickberg authored Jun 13, 2018

I do not see a use for this syntax, especially given that it appends
an extra comma in between the two halves. Only allow the full string
to change to put this in line with the Comment handling.

Remove special handling of an identical AdminComment as well,
since the end result is unchanged, and this avoids a potentially
expensive xstrcmp call.

Bug 5306.

1edd511f

12 Jun, 2018 3 commits
- MYSQL: Fix issue not handling all fields when loading an archive dump. · 0469a47b
  Danny Auble authored Jun 12, 2018
  
  0469a47b
- NEWS for last commit · d2cb0457
  Danny Auble authored Jun 12, 2018
```
Bug 5286
```
  d2cb0457
- Improve Lua package detection for older RHEL distros. · 1ebf9350
  Tim Wickberg authored Jun 11, 2018
```
RHEL 6 (and related) use lua as the package name, test
if that package exists with a version >= 5.1 if the other
tests have already failed.

Bug 5263.
```
  1ebf9350
10 Jun, 2018 1 commit
- Add new job pending reason of "ReqNodeNotAvail, reserved for maintenance" · 4b9a7589
  Dominik Bartkiewicz authored Jun 10, 2018
```
bug 4987
```
  4b9a7589
08 Jun, 2018 2 commits
- Fix uninitialized variable in ipmi debug profile · c000733a
  Tim Wickberg authored Jun 08, 2018
```
And do not list each individual sensor reading but just the computed
sum of each one grouped by key.

Bug 5274
```
  c000733a
- task/cray - search for "cpuset.mems" file then fall back to "mems" · 099a8ca1
  Morris Jette authored Jun 07, 2018
```
This is in anticipation of an upcoming change to the cgroup hierarchy
on a future CLE release.

Bug 5145.
```
  099a8ca1
07 Jun, 2018 2 commits

Add new ResumeFailProgram slurm.conf option · f1761be2
Brian Christiansen authored Jun 05, 2018
```
If defined, it is called when a node fails to resume by ResumeTimeout.
```
f1761be2

Add ", with requeued tasks" to job array end email · 57104ecc

Isaac Hartung authored Jun 06, 2018

if any task in the array was requeued. This is a hint to use
"sacct --duplicates" to see the whole picture of the array job.

Bug 5105

57104ecc

06 Jun, 2018 3 commits

Add SetExecHost flag for cray burst buffers · f3ace3e5

Morris Jette authored Jun 06, 2018

burst_buffer.conf - Add SetExecHost flag to enable burst buffer access
from the login node for interactive jobs.

f3ace3e5

Alter slurm_mktime() function to set tm_isdst to -1. · d6db076a

Alejandro Sanchez authored Jun 06, 2018

And remove the initialization before all the calls to the function.

This is a non-functional change; the motivation is preventive, so that
whenever slurm_mktime() is used we know tm_isdst is consistently
set to -1.

Bug 5230.

d6db076a
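
The idea behind d6db076a, a wrapper that forces `tm_isdst` to -1 so callers need not initialize it, can be illustrated with Python's `time.mktime`. The name `mktime_auto_dst` is hypothetical and this is not the Slurm implementation, which is C:

```python
import time


def mktime_auto_dst(tm):
    """Convert a 9-field local-time tuple to epoch seconds with tm_isdst = -1.

    A value of -1 asks the C library to work out whether DST applies, so
    whatever the caller put in that field is ignored.
    """
    fields = list(tm)
    fields[8] = -1  # index 8 is tm_isdst in time.struct_time
    return time.mktime(tuple(fields))
```

Two inputs that differ only in their `tm_isdst` field therefore convert to the same timestamp, which is the consistency the commit is after.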

Don't allocate downed cloud nodes · be449407

Brian Christiansen authored Jun 05, 2018

which were marked down due to ResumeTimeout.

If a cloud node was marked down due to not responding by ResumeTimeout,
the code inadvertently added the node back to the avail_node_bitmap --
after being cleared by set_node_down_ptr(). The scheduler would then
attempt to allocate the node again, which would cause a loop of hitting
ResumeTimeout and allocating the downed node again.

Bug 5264

be449407

05 Jun, 2018 1 commit
- Add --without x11 option to rpmbuild in slurm.spec. · 5c5e10f8
  Killian authored Jun 04, 2018
```
Bug 5206.
```
  5c5e10f8
04 Jun, 2018 1 commit
- Add "Links" parameter to gres.conf configuration file. · 4d83d8ed
  Morris Jette authored Jun 04, 2018
  
  4d83d8ed
02 Jun, 2018 1 commit

Fix srun to return highest signal of any task · 622f29f7

Brian Christiansen authored May 31, 2018

srun would not return an exit code if a previous task exited before a
later task exited with a signal.

If multiple tasks exit with a signal, srun returns the highest signal.

Partially reverts commit 04b449e1 -- the setting of local_global_rc
to NO_VAL as srun doesn't need to know whether it's been set or not
anymore. srun always sets the signal if a task exited with a signal.

Bug 5083

622f29f7
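
The aggregation rule in 622f29f7 can be sketched independently of task ordering. The Python below is illustrative only: `srun_return_code` is a hypothetical name, each task is modeled as an `(exit_code, signal)` pair, and the 128 + signal encoding and the "highest exit code" fallback are assumptions based on srun's documented return value rather than on this commit.

```python
def srun_return_code(task_results):
    """Aggregate per-task results into one return value (illustrative only).

    Each result is an (exit_code, signal) pair; signal is 0 when the task
    exited normally.
    """
    max_signal = max((sig for _, sig in task_results), default=0)
    if max_signal:
        # At least one task died from a signal: report the highest one,
        # regardless of the order in which tasks finished.
        return 128 + max_signal  # assumed shell-style encoding
    # No signals: fall back to the highest plain exit code (assumption).
    return max((code for code, _ in task_results), default=0)
```

Because the result is computed over all tasks, a task that exits normally before a later task is signaled no longer hides the signal.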

31 May, 2018 1 commit

Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path. · 17392e76

Alejandro Sanchez authored May 31, 2018

There were two code paths building an allocation response, each calling
its own static _build_alloc_msg() function:

1. src/slurmctld/proc_req.c
2. src/slurmctld/srun_comm.c

These two functions diverged, and each left unfilled some members that
the other filled in. This patch changes the signature of the one in
proc_req.c to make it extern, and srun_comm.c now calls this common
function.

Also added cpu_freq_[min|max|gov] members in the common one since these
were the only members missing in the proc_req.c function (the one in
srun_comm.c had more members missing, like all the ntasks_per*, account,
qos or resv_name).

Bug 4999.

17392e76

30 May, 2018 7 commits
- Start NEWS for v17.11.8 · b96f0826
  Tim Wickberg authored May 30, 2018
  
  b96f0826
- Start NEWS for v17.02.12 · ca88df8c
  Tim Wickberg authored May 30, 2018
  
  ca88df8c
- Fix insecure handling of job requested gid. · 033dc0d1
  Marshall Garey authored May 29, 2018
```
Only trust MUNGE signed values, unless the RPC was signed by
SlurmUser or root.

CVE-2018-10995.
```
  033dc0d1
- Validate gid and user_name values provided to slurmd up front. · df545955
  Tim Wickberg authored May 29, 2018
```
Do not defer until later, and do not potentially miss out on proper
validation of the user_name field which can lead to improper authentication
handling.

CVE-2018-10995.
```
  df545955
- Fix invalid free() if HWLOC does not return a malloc()'d string. · 83c92a4d
  Dominik Bartkiewicz authored May 30, 2018
```
Bug 5038.
```
  83c92a4d
- Fix race in pmixp_agent_start(). · 7c1dad6e
  Tim Wickberg authored May 29, 2018
```
Caused by the pthread_cancel cleanup in commit e5f03971 in 17.11.6.

Bug 5181.
```
  7c1dad6e
- Fix deadlock in slurmstepd during shutdown. · 21ca33f5
  Tim Wickberg authored May 16, 2018
```
The race condition was created in a7c8964e in 17.11.6 when removing
the (unsafe) pthread_cancel code handling thread termination.

Bug 5164
```
  21ca33f5
24 May, 2018 1 commit

Notify srun and ctld when unkillable stepd exits · 956a808d

Brian Christiansen authored May 16, 2018

Commits f18390e8 and eed76f85 modified the stepd so that if it
encountered an unkillable step timeout, it would just exit. If the
stepd is a batch step, it replies to the controller with a non-zero
exit code, which drains the node. But if an srun allocation/step were
to get into the unkillable step code, the steps wouldn't let the
waiting srun or controller know about the step going away -- leaving
a hanging srun and job.

This patch enables the stepd to notify the waiting sruns and the ctld of
the stepd being done and drains the node for srun'ed allocations and/or
steps.

Bug 5164

956a808d