Commits · 59be3d836d9b828484714e147a491b6b0458381b · Manuel G. Marciani / ces_slurm_simulator

10 May, 2018 1 commit

Fix different issues when requesting memory per cpu/node. · bf4cb0b1

Alejandro Sanchez authored May 10, 2018

The first issue was identified on multi-partition requests. job_limits_check()
was overriding the original memory requests, so the next partition that
Slurm validated limits against was not using the original values. The
solution consists of adding three members to the job_details struct to
preserve the original requests. This issue is reported in bug 4895.

The second issue was that memory enforcement behaved differently depending on
whether the job request was issued against a reservation or not.

Third issue had to do with the automatic adjustments Slurm did underneath
when the memory request exceeded the limit. These adjustments included
increasing pn_min_cpus (even incorrectly beyond the number of cpus
available on the nodes) or different tricks increasing cpus_per_task and
decreasing mem_per_cpu.

The fourth issue was identified when requesting the special case of 0 memory,
which was handled inside the select plugin after the partition validations
and thus could be used to incorrectly bypass the limits.

Issues 2-4 were identified in bug 4976.

The patch also includes a full refactor of how and when job memory is
set to default values (if not requested initially) and how and
when limits are validated.

Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>

bf4cb0b1
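
The first issue in commit bf4cb0b1 (per-partition validation overwriting the original request) can be illustrated with a minimal sketch. The struct and function names below are hypothetical stand-ins, not Slurm's actual job_details members or job_limits_check() code. The sketch assumes that a rejected check clamps the working value, and shows how resetting from a preserved original keeps the next partition's check honest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not Slurm's real structures. */
struct job_mem_req {
    uint64_t orig_mem_per_cpu; /* what the user asked for; never modified */
    uint64_t mem_per_cpu;      /* working value, reset for every partition */
};

struct partition_limits {
    uint64_t max_mem_per_cpu;  /* 0 means "no limit" */
};

/* Validate one partition, always starting from the original request. */
static bool check_partition(struct job_mem_req *job,
                            const struct partition_limits *part)
{
    /* Undo any adjustment left behind by a previous partition. */
    job->mem_per_cpu = job->orig_mem_per_cpu;

    if (part->max_mem_per_cpu && job->mem_per_cpu > part->max_mem_per_cpu) {
        /* Simulated side effect: the working value is clamped. */
        job->mem_per_cpu = part->max_mem_per_cpu;
        return false;
    }
    return true;
}

/* Count how many of the requested partitions accept the job. */
static int count_usable_partitions(struct job_mem_req *job,
                                   const struct partition_limits *parts,
                                   int nparts)
{
    int usable = 0;

    for (int i = 0; i < nparts; i++) {
        if (check_partition(job, &parts[i]))
            usable++;
    }
    assert(usable >= 0 && usable <= nparts);
    return usable;
}
```

Without the reset at the top of check_partition(), a request of 1200 checked against limits 1000 and then 1100 would be clamped to 1000 by the first check and wrongly accepted by the second.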

09 May, 2018 9 commits
- Fix for possible slurmctld daemon abort with NULL pointer. · b67d7350
  Morris Jette authored May 09, 2018
```
If running without AccountingStorageEnforce but with the DBD and
it isn't up when starting the slurmctld you could get into a
corner case where you don't have a QOS list in the assoc_mgr.  Thus no
usage to transfer.

Bug 5156
```
  b67d7350
- Start NEWS for v17.11.7 · 8d8bc02b
  Tim Wickberg authored May 09, 2018
  
  8d8bc02b
- Fix typo in NEWS · e8892420
  Felip Moll authored May 09, 2018
  
  e8892420
- select/cons_res - Improve handling of --cores-per-socket. · 6de8c831
  Morris Jette authored May 09, 2018
```
Try to fill up each socket completely before moving into additional
sockets. This will minimize the number of sockets needed, improving
packing especially alongside MaxCPUsPerNode.

Bug 4995.
```
  6de8c831
- Fix misplaced NEWS entry for "select/backfill - fix issue with job resizing". · 2507e149
  Tim Wickberg authored May 09, 2018
```
My mistake on commit 602817c8.

Bug 4922.
```
  2507e149
- select/backfill - fix issue with job resizing · 602817c8
  Felip Moll authored May 09, 2018
```
Without this, gang scheduling would incorrectly kick in for
these jobs since active_resmap has not been updated appropriately.

Bug 4922.
```
  602817c8
- job_submit/lua - return an error if the script uses log.user() within job_modify. · 3f4cde9c
  Tim Wickberg authored May 08, 2018
```
Otherwise this will return the error message back to the next job submitter.

Bug 5106.
```
  3f4cde9c
- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code · f4f42d0f
  Tim Wickberg authored May 08, 2018
```
Bug 5106.
```
  f4f42d0f
- Prevent segfault on 'sprio' if a partition has recently been deleted. · de7eac9a
  Alejandro Sanchez authored May 08, 2018
```
job_ptr->part_ptr is NULL if the partition has been deleted.

Crash only happens with PriorityFlags=CALCULATE_RUNNING enabled.

Bug 5136.
```
  de7eac9a
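
The socket-filling strategy described for commit 6de8c831 above (fill each socket completely before moving to the next) can be sketched as a simple greedy loop. This is an illustration under stated assumptions, not the select/cons_res implementation; pack_sockets() and its parameters are hypothetical names.

```c
#include <assert.h>

/*
 * Greedy packing: take as many idle cores as possible from each socket
 * before moving on to the next one.
 *
 * free_cores[i] is the number of idle cores on socket i and picked[i]
 * receives the number of cores taken from it. Returns the number of
 * sockets used, or -1 if the request cannot be satisfied.
 */
static int pack_sockets(const int *free_cores, int nsockets,
                        int cores_needed, int *picked)
{
    int sockets_used = 0;

    assert(nsockets >= 0 && cores_needed >= 0);

    for (int i = 0; i < nsockets; i++) {
        picked[i] = 0;
        if (cores_needed > 0 && free_cores[i] > 0) {
            int take = free_cores[i] < cores_needed ?
                       free_cores[i] : cores_needed;
            picked[i] = take;
            cores_needed -= take;
            sockets_used++;
        }
    }
    return cores_needed == 0 ? sockets_used : -1;
}
```

With two sockets of four idle cores each, a request for six cores takes four from the first socket and two from the second, rather than three from each.
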
08 May, 2018 3 commits

Prevent slurmd from launching steps if prolog fails · 3b029021
Brian Christiansen authored May 08, 2018
```
Bug 5146
```
3b029021

Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f

Tim Wickberg authored May 08, 2018

Caused by a corrupted protocol_version field value being received
by the slurmstepd, as we cannot safely write/read a uint16_t across
the pipe as if it was an int.

Regression caused by commit 90b116c2.

Bug 5133.

77d65f4f
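
The problem described in commit 77d65f4f comes down to the writer and reader of a pipe disagreeing on the size of the value. On a big-endian platform such as ppc64, the two bytes of a uint16_t do not sit at the same offset as the low-order bytes of an int, so mixing the two sizes yields a corrupted value. The following is a minimal sketch of the safe pattern, where both sides use sizeof(uint16_t). The helper names are hypothetical and this is not the slurmstepd code; short reads and EINTR are deliberately not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Write exactly sizeof(uint16_t) bytes. Returns 0 on success, -1 on error. */
static int send_version(int fd, uint16_t version)
{
    ssize_t n = write(fd, &version, sizeof(version));

    return n == (ssize_t) sizeof(version) ? 0 : -1;
}

/* Read exactly sizeof(uint16_t) bytes into a uint16_t, never into an int. */
static int recv_version(int fd, uint16_t *version)
{
    ssize_t n;

    assert(version != NULL);
    n = read(fd, version, sizeof(*version));
    return n == (ssize_t) sizeof(*version) ? 0 : -1;
}

/* Send a value through a real pipe and read it back. */
static int roundtrip_version(uint16_t in, uint16_t *out)
{
    int fds[2];
    int rc;

    if (pipe(fds) != 0)
        return -1;
    rc = send_version(fds[1], in);
    if (rc == 0)
        rc = recv_version(fds[0], out);
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```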

Fix checkpointing requeued jobs in a bad state · f9f395af

Brian Christiansen authored May 08, 2018

Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
in. The issue is that if job_set_alloc_tres gets called while in the
PENDING|COMPLETING state, the job's tres_alloc_str will be free'd. If
this job then gets checkpointed in this state (PENDING|COMPLETING + no
tres_alloc_str) on startup the controller would crash because it
expected the job to have a tres_alloc_str/cnt when in the COMPLETING
state. This could be triggered if starting the controller without the
dbd up. When the dbd comes up, the assoc_cache_mgr calls
_update_job_tres() which calls job_set_alloc_tres. It could also be
triggered by adding new tres.

This most likely started happening in 17.11.5 because of commit
865b672f which introduced calling _update_job_tres() on each job
after the dbd comes up.

Bugs 5137,4522

f9f395af

04 May, 2018 1 commit

Increase # of tries when sending responses to srun · c1f3eb01

Brian Christiansen authored May 04, 2018

Only when the connection has timed out. If the connection is timing out,
consider increasing TCPTimeout in slurm.conf.

Bug 4574

c1f3eb01

03 May, 2018 5 commits

pmixp: fixed the logging level for the send error message · e8b11443
Boris Karasev authored May 03, 2018
```
Bug 5129.
```
e8b11443

slurmd - fix ABRT on SIGINT after reconfigure with MemSpecLimit set. · ca304478

Alejandro Sanchez authored May 03, 2018

Use setenv() instead of setenvfs(), since setenvfs() memory allocation
is implemented with xmalloc() and fini_setproctitle() (which is called
on reconfigure) free's the memory with free(), leading to a:
"free(): invalid size" malloc_printerr error.

Continuation of dce83a23.

Bug 5095.

ca304478
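
The abort described in commit ca304478 is an allocator mismatch: memory obtained from one allocator (xmalloc()) was later released with another (free()). A minimal sketch of the setenv() approach follows. setenv() copies the name and value into storage managed by libc, so the caller never hands over a buffer that someone else might free with the wrong function. The helper name and the variable name used in the example are hypothetical, not taken from slurmd.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Format an unsigned value and export it with setenv().
 * The stack buffer is safe because setenv() copies the string.
 * Returns 0 on success, -1 on failure.
 */
static int export_uint(const char *name, unsigned long value)
{
    char buf[32];
    int len;

    assert(name != NULL);
    len = snprintf(buf, sizeof(buf), "%lu", value);
    if (len < 0 || (size_t) len >= sizeof(buf))
        return -1;
    return setenv(name, buf, 1); /* 1 = overwrite an existing value */
}
```

For example, export_uint("EXAMPLE_MEM_LIMIT", 2048) makes getenv("EXAMPLE_MEM_LIMIT") return the string "2048".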

Documentation added for GRES types vs limits · c2c06468

Felip Moll authored May 03, 2018

Due to current design the job limits are checked before the allocation is
made when one specifies a generic gres and a specific gres type is configured.
The workaround for now is to define a job submit plugin to control the user
request and successfully apply limits.

Bug 4767

c2c06468

Calculate TRES Billing values at submission · 00643fd4

Brian Christiansen authored Apr 30, 2018

This allows job limits to be enforced at submission -- with QOS
DenyOnLimit flag. Note that the values could be expanded at schedule
time (e.g. request one task but get all cpus on a core). The expanded
values are considered when scheduling.

00643fd4
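
As a rough illustration of commit 00643fd4, the sketch below computes a billing value as a weighted sum of requested TRES and compares it against a limit at submission time. It assumes the default summing behavior of TRESBillingWeights; the struct, the function names, and the numbers are hypothetical, and the real calculation in Slurm has more cases than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pairing of a requested TRES count with its billing weight. */
struct tres_item {
    double count;  /* e.g. CPUs or GB of memory requested */
    double weight; /* billing weight configured for that TRES */
};

/* Billing value as the sum of count * weight over all TRES. */
static double billing_value(const struct tres_item *items, size_t n)
{
    double total = 0.0;

    assert(items != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += items[i].count * items[i].weight;
    return total;
}

/* With a deny-on-limit policy, reject the job as soon as it is submitted. */
static bool reject_at_submit(double billing, double limit)
{
    return billing > limit;
}
```

Because the value is computed from the request, it may be lower than what the job is eventually charged if the allocation is expanded at schedule time, as the commit message notes.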

scontrol create/update part w/ TRESBillingWeights · 17c90380
Isaac Hartung authored Feb 09, 2018
```
Bug 4274
```
17c90380

02 May, 2018 6 commits
- Disable reservation overlap check on job submission. · 8c333be7
  Dominik Bartkiewicz authored May 02, 2018
```
Bug 4960.
```
  8c333be7
- Fix fatal in controller when loading completed trigger. · 972c1cf8
  Dominik Bartkiewicz authored May 02, 2018
```
Bug 4887.
```
  972c1cf8
- Remove unsafe use of pthread_cancel in pmix plugin. · e5f03971
  Tim Wickberg authored May 02, 2018
```
Can lead to deadlock within malloc depending on the exact timing.

Rework thread startup and shutdown code so pthread_cancel is not
needed.

Bug 5119, 5103.
```
  e5f03971
- Don't tear down a BB if a node fails and --no-kill or resize of a job · f8b11ddc
  Tim Wickberg authored May 02, 2018
```
happens.

Bug 5108
```
  f8b11ddc
- Revert "Don't tear down a BB if a node fails and --no-kill or resize of a job" · 02bdc9a9
  Danny Auble authored May 02, 2018
```
This reverts commit de5a4da2.
```
  02bdc9a9
- Don't tear down a BB if a node fails and --no-kill or resize of a job · de5a4da2
  Danny Auble authored May 02, 2018
```
happens.

Bug 5108
```
  de5a4da2
01 May, 2018 1 commit

Fix total TRES Billing on partitions. · 3686dd9c

Danny Auble authored May 01, 2018

Turns out the partition's billing tres was working off the sum of
the node_ptrs which contain the max of all partitions they are in.

This isn't correct since each partition's billing can be different.

Set it correctly here.

3686dd9c

30 Apr, 2018 3 commits

Remove unsafe use of pthread_cancel() in slurmstepd. · a7c8964e

Tim Wickberg authored Apr 30, 2018

These functions are not async-cancel-safe, and cannot safely be cancelled.
This leads to potential deadlock, either between our own locks, or deep
inside glibc when the thread held a malloc arena lock when canceled.

Replace with pthread_cond_signal() on the appropriate cond to
wake threads up at the appropriate time instead.

Bug 5103.

a7c8964e
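
The pattern that commit a7c8964e moves to can be sketched as follows: instead of cancelling a thread that may hold a lock, set a shutdown flag under a mutex, signal the condition variable the thread is waiting on, and join it. This is a generic illustration with hypothetical names, not the slurmstepd code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct worker {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool shutdown;       /* protected by lock */
    bool exited_cleanly; /* set by the thread just before it returns */
    pthread_t tid;
};

/* Sleep on the condition variable until asked to shut down. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;

    assert(w != NULL);
    pthread_mutex_lock(&w->lock);
    while (!w->shutdown) /* loop guards against spurious wakeups */
        pthread_cond_wait(&w->cond, &w->lock);
    w->exited_cleanly = true;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

static int worker_start(struct worker *w)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->cond, NULL);
    w->shutdown = false;
    w->exited_cleanly = false;
    return pthread_create(&w->tid, NULL, worker_main, w);
}

/* Cooperative stop: flag, signal, join. No pthread_cancel() involved. */
static int worker_stop(struct worker *w)
{
    int rc;

    pthread_mutex_lock(&w->lock);
    w->shutdown = true;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);

    rc = pthread_join(w->tid, NULL);
    pthread_mutex_destroy(&w->lock);
    pthread_cond_destroy(&w->cond);
    return rc;
}
```

The thread only ever exits at a point of its own choosing, with no lock held, which is what avoids the deadlocks described in the commit message. Depending on the toolchain, building this may require the -pthread flag.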

Increase duration of extern step sleep command. · 140758ca
Marshall Garey authored Apr 30, 2018
```
Otherwise the extern step will disappear after 11.5 days.

Bug 5000.
```
140758ca
Moved creating gres_detail_str from _pack_job_gres · e2312e03
Dominik Bartkiewicz authored Apr 30, 2018
```
to be sure it is created under the job write lock.

Bug 4901
```
e2312e03

28 Apr, 2018 2 commits
- Recognize cloud nodes that are booted out of band · ece205ed
  Brian Christiansen authored Apr 23, 2018
```
Bug 5053
```
  ece205ed
- Force power_down of cloud node · 8ebf36eb
  Brian Christiansen authored Apr 23, 2018
```
This allows the suspend script to be triggered even if Slurm has the
node(s) in a power_save state.

Bug 5053
```
  8ebf36eb
26 Apr, 2018 1 commit

Explicitly shutdown slurmd when a node reboot is triggered. · 54197d9e

Marshall Garey authored Apr 26, 2018

Just in case reboot_program doesn't actually turn this node off
for some reason, at least stopping slurmd explicitly will keep
the node offline until someone intervenes.

Bug 5019.

54197d9e

25 Apr, 2018 2 commits

Add SlurmctldAddr configuration parameter · 89f3393d

Morris Jette authored Apr 25, 2018

Add configuration parameter SlurmctldAddr for use with a virtual IP to manage
backup slurmctld daemons.

Bug 4768

89f3393d

Add slurmctld primary/backup program logic · 2d1b2d9f

Morris Jette authored Apr 25, 2018

Add configuration parameters SlurmctldPrimaryOnProg and
SlurmctldPrimaryOffProg, which define programs to execute when a slurmctld
daemon becomes the primary server or goes from primary to backup mode.

Bug 4768

2d1b2d9f

23 Apr, 2018 2 commits

Fix build when lz4 is in a non-standard location. · 2f81322e
Morris Jette authored Apr 23, 2018

2f81322e

Fix error code and scheduling problem for --exclusive=[user|mcs]. · 9cf41301

Morris Jette authored Apr 23, 2018

When any of these --exclusive modes couldn't be satisfied, Slurm was
returning an incorrect ESLURM_NODE_NOT_AVAIL, having as a consequence
scheduling problems as described in the bug. The fix makes it so the
error code is properly set to ESLURM_NODES_BUSY, fixing also the
scheduling problems and working over the correct share_node_bitmap.

Continuation of commits from bug 4932:
e2a14b8d
fc4e5ac9

Bug 5047.

9cf41301

19 Apr, 2018 3 commits

Fix incorrect error thrown when cancelling part of a job array. · 8432f9f6
Marshall Garey authored Apr 19, 2018
```
Fix an issue in the bit manipulation logic introduced in commit 892ffa89.

Bug 4997.
```
8432f9f6

Fix 'squeue -o %s' on Cray systems. · d3398004

Tim Wickberg authored Apr 19, 2018

Replace select_p_select_jobinfo_sprint() with the same NO-OP
that the other plugins (except alps and bluegene) have implemented.

Bug 5077.

d3398004

Fix for update job function. · 164d4878

Danny Auble authored Apr 19, 2018

Time limit was incorrectly changed when swapping qos. Now we check
qos/assoc/partition all together and deny the change if not consistent.
Fixed how we check coordinator permissions. Rearranged the function to update
QOS before updating partition in order to enforce AllowQOS and DenyQOS options.
One memory leak and some comments fixed too.

Bug 4685

164d4878

17 Apr, 2018 1 commit

Make UnavailableNodes value in job reason be correct for each job · fc4e5ac9

Morris Jette authored Apr 17, 2018

1. Identifies nodes which are unavailable to a specific job by
   adding a call to filter_by_node_owner() in select_nodes(),
   where the node list is generated.
2. Removes the "unavail_node_str" argument to select_nodes(),
   as it is no longer useful. This string was originally
   generated once at the start of the job scheduling logic for all jobs,
   but since each job can have a different set of
   unavailable nodes (dedicated to user, group, etc.),
   the same string for all jobs can be misleading.

Bug 4932.

fc4e5ac9