Commits · f49dc6cbf434e8a3ae951ee1a5c51d726ce03dfd · Manuel G. Marciani / ces_slurm_simulator

11 Aug, 2018 3 commits
- Improve scheduling when dealing with node_features that could have a · f49dc6cb
  Danny Auble authored Aug 10, 2018
```
boot delay.

Code to determine if we need to reboot or not based on node_features

Bug 5308
```
  f49dc6cb
- Fix invalid read caused by d2a1a96c. · 21d2ab6e
  Danny Auble authored Aug 10, 2018
```
Fixes bad casting caused by Tim on review.

Make one definitive copy of sort_part_tier() in partition_msg.c and use it.

Bug 3844, 5552.
```
  21d2ab6e
- Fix printing of node state "drained + reboot" · f23411bc
  Brian Christiansen authored Aug 10, 2018
```
and other drain + node state flags.

Bug 5505
```
  f23411bc
10 Aug, 2018 4 commits
- Make Users case insensitive in the database based on · dc48ea09
  Danny Auble authored Aug 10, 2018
```
Parameters=PreserveCaseUser in the slurmdbd.conf.
```
  dc48ea09
- Add cancel_reboot scontrol command · 9ad3ed74
  Brian Christiansen authored Aug 10, 2018
```
to cancel pending reboots on nodes.

Bug 5506
```
  9ad3ed74
- Add shutdown_on_reboot SlurmdParameter · ca84c810
  Brian Christiansen authored Aug 09, 2018
```
to control whether the Slurmd will shut itself down or not when a
reboot request is received.

Continuation of 54197d9e

Bug 5019
```
  ca84c810
- Start NEWS for v17.11.10 · d36faa22
  Tim Wickberg authored Aug 09, 2018
  
  d36faa22
09 Aug, 2018 2 commits
- Fix sinfo to print correct node state. · bf569fef
  Brian Christiansen authored Aug 09, 2018
```
Bug 5505
```
  bf569fef
- Fix multi partition jobs priority_array assignment order. · d2a1a96c
  Alejandro Sanchez authored Aug 06, 2018
```
Bug 3844.
```
  d2a1a96c
07 Aug, 2018 6 commits
- burst_buffer/cray - fix datawarp swap default pool overriding jobdw. · 4212426c
  Alejandro Sanchez authored Aug 07, 2018
```
Bug 5528.
```
  4212426c
- Make salloc handle node requests the same as sbatch. · 9c6afbd2
  Danny Auble authored Aug 07, 2018
```
Bug 5260
```
  9c6afbd2
- Avoid node config_list entry fragmentation. · be0abe8a
  Morris Jette authored Aug 07, 2018
```
Only split nodes here if a node_features plugin is in use.

Otherwise node fragmentation will occur if the node config has
CPUs specified but not CoresPerSocket and Sockets.

This could be avoided by filling out the node definition, but
adding this workaround for backwards compatibility.

Bug 5039.
```
  be0abe8a
- Create TMPDIR after task prologs have run. · 0c560d67
  Marshall Garey authored Aug 07, 2018
```
Task prologs could set or modify this, so wait to create the
directory until after they've finished.

Bug 5367.
```
  0c560d67
- Add NEWS entry for the previous commits. · 6a5626c4
  Alejandro Sanchez authored Aug 07, 2018
```
Bug 4373.
```
  6a5626c4
- Fix scontrol -o show assoc output · d3d4e692
  Marshall Garey authored Aug 06, 2018
```
Bug 5481.
```
  d3d4e692
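
The ordering described in commit 0c560d67 above (create TMPDIR only after the task prologs have run) can be sketched roughly as follows. This is a minimal illustration, not Slurm's code: `example_prolog` and `create_tmpdir_after_prologs` are hypothetical names, and the prolog is assumed to change TMPDIR through the process environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical stand-in for a task prolog that redirects TMPDIR. */
void example_prolog(void)
{
	setenv("TMPDIR", "/tmp/example_job_tmpdir", 1);
}

/*
 * Hypothetical helper, not Slurm's real code: run the prologs first, then
 * create whatever directory TMPDIR names afterwards. Only the last path
 * component is created (no "mkdir -p" behaviour).
 * Returns 0 on success or if TMPDIR is unset, -1 if mkdir() fails.
 */
int create_tmpdir_after_prologs(void (*run_prologs)(void))
{
	const char *dir;

	assert(run_prologs != NULL);
	run_prologs();			/* may set or modify TMPDIR */

	dir = getenv("TMPDIR");		/* read the value only now */
	if (dir == NULL || dir[0] == '\0')
		return 0;
	if (mkdir(dir, 0700) < 0 && errno != EEXIST)
		return -1;
	return 0;
}
```

Calling `create_tmpdir_after_prologs(example_prolog)` creates the directory the prolog chose. Creating it before the prolog ran would have created the old location instead.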
06 Aug, 2018 2 commits

Modify slurm_send_only_node_msg() to catch issues with socket. · 06582da8

Tim Wickberg authored Aug 06, 2018

There are subtle issues involved in treating a TCP transmission
as a unidirectional message delivery layer.

The original code path looks like: connect(), write(), close().
But Linux handles the write() and close() asynchronously behind the
scenes, and does not block until that write() has been ACK'd by the
remote end. So the write() and close() may succeed, even with data
still in flight. A communication error - and message loss - would
have been silently ignored, leading to unreliable message transmission.

Worse yet, one side of the connection would believe it sent the message,
while the receive side swears it never saw the packets. This leads to
infrequent and yet seemingly impossible data loss, and a very tough
bug to chase down.

This teardown code tries to force the connection to shut down in an
orderly manner, giving Slurm a chance to catch a connection problem
and the upstream calling path an opportunity to retransmit.

This teardown code is based on an approach described in Section 7.5
of "UNIX Network Programming" Volume 1 (Third Edition), specifically
the subsection regarding SO_LINGER. (And also covers why SO_LINGER is
not sufficient to prevent this issue.)

Bug 5164.

06582da8
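
As a rough illustration of the orderly teardown described in commit 06582da8 above, the sketch below writes a message, half-closes the socket with shutdown(), and then waits for the peer to close its side before reporting success. It is an assumption about the general technique, not a copy of slurm_send_only_node_msg(): the function name `send_only_with_teardown` and the timeout parameter are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not Slurm's actual code. Writes the message,
 * half-closes the connection, then waits for the peer's EOF.
 * timeout_ms applies to each wait, not to the call as a whole.
 * Returns 0 if the peer closed cleanly, -1 on error or timeout.
 * The caller still owns fd and must close() it.
 */
int send_only_with_teardown(int fd, const void *buf, size_t len,
			    int timeout_ms)
{
	const char *p = buf;
	char scratch[64];

	assert(buf != NULL);

	/* Write the whole message, retrying on short writes and EINTR. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Done writing: send FIN but keep the read side open. */
	if (shutdown(fd, SHUT_WR) < 0)
		return -1;

	/* Wait until read() returns 0, i.e. the peer has closed too. */
	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		int rc = poll(&pfd, 1, timeout_ms);
		ssize_t n;

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (rc == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		n = read(fd, scratch, sizeof(scratch));
		if (n == 0)
			return 0;	/* EOF from the peer */
		if (n < 0 && errno != EINTR)
			return -1;
		/* n > 0: unexpected data, discard it and keep waiting. */
	}
}
```

A -1 return gives the calling path a chance to retransmit, which a bare connect(), write(), close() sequence does not. A production version would also have to decide how to handle SIGPIPE, which this sketch ignores.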

Fix job array preemption in backfill scheduling. · 5efab599

Marshall Garey authored Aug 06, 2018

Previously only a single task of a job array could preempt during
backfill scheduling. This allows multiple tasks to preempt and have
resources reserved in backfill.

Bug 5405.

5efab599

04 Aug, 2018 1 commit

Fix 'srun -q' handling. · 1aaf99de

Jason Booth authored Aug 03, 2018

The getopt format string needs to handle an option here,
and the --help output had not been corrected after 99b2c4e8.

Bug 5522.

1aaf99de
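
To make the getopt point in commit 1aaf99de above concrete: in a getopt format string, a letter followed by a colon takes an argument, while a bare letter is a plain flag. The sketch below assumes the fix amounts to `-q` accepting an argument. `parse_q_option` is a hypothetical function and the one-letter format string is not srun's real option table.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical parser, not srun's real option handling. Returns the
 * argument given to -q, or NULL if -q is absent or the command line
 * contains an option this sketch does not know.
 */
const char *parse_q_option(int argc, char **argv)
{
	const char *value = NULL;
	int c;

	assert(argv != NULL);

	optind = 1;	/* start scanning at argv[1] */
	opterr = 0;	/* no automatic error messages */

	/* "q:" means -q requires an argument; "q" alone would not. */
	while ((c = getopt(argc, argv, "q:")) != -1) {
		if (c == 'q')
			value = optarg;
		else
			return NULL;
	}
	return value;
}
```

With a format string of "q" instead of "q:", the word following `-q` would be treated as an ordinary operand rather than as the option's value.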

03 Aug, 2018 1 commit
- Add TimelimitRaw sacct output field · 3b4b808d
  Jason Booth authored Aug 03, 2018
```
for displaying unformatted timelimit numbers.

Bug 5407
```
  3b4b808d
02 Aug, 2018 5 commits
- Add NEWS for v18.08.0rc1 · 96c0a5ff
  Tim Wickberg authored Aug 02, 2018
  
  96c0a5ff
- Add the use of an XML file to help performance when using hwloc. · c03753db
  Thomas Cadeau authored Aug 01, 2018
```
Bug 5094
```
  c03753db
- Consider resuming nodes in backfill · 324404de
  Brian Christiansen authored Jul 24, 2018
  
  324404de
- Add nextstate to scontrol reboot · c093689c
  Brian Christiansen authored Jul 18, 2018
```
to dictate what state the node is in after reboot.
```
  c093689c
- Add ability to specify reason when rebooting nodes · f2b1898c
  Brian Christiansen authored Jul 17, 2018
  
  f2b1898c
31 Jul, 2018 1 commit
- Enable support for hwloc version 2.0.1 · cbe7015e
  Morris Jette authored Jul 31, 2018
```
Bug 5070
```
  cbe7015e
27 Jul, 2018 3 commits
- Remove erroneous unlock in acct_gather_energy/ipmi. · 4e8b77fa
  Danny Auble authored Jul 27, 2018
```
Bug 5468

This is a backport of commit cefc9ec1.
```
  4e8b77fa
- The pmi library now resides in contribs, just like the pmi2 one. · 2d735d94
  Felip Moll authored May 08, 2018
```
Bug 4918
```
  2d735d94
- Fix segfault in slurmctld when a job's node bitmap is NULL during a · fef07a40
  Dominik Bartkiewicz authored Jul 27, 2018
```
scheduling cycle.  Primarily caused by EnforcePartLimits=ALL.

Bug 5452
```
  fef07a40
24 Jul, 2018 1 commit
- Added database InnoDB settings verification to accounting storage plugin init · 0368fb33
  Broderick Gardner authored Jun 20, 2018
```
Bug 5248.
```
  0368fb33
19 Jul, 2018 4 commits

Start NEWS for v17.11.9 · 8b27b9c9

Tim Wickberg authored Jul 19, 2018

8b27b9c9

Add NEWS entry missed on prior commit. · 380abb0b

Tim Wickberg authored Jul 19, 2018

380abb0b

Add Delegate=yes to slurmd.service file to prevent systemd from interfering. · cecb39ff

Tim Wickberg authored Jul 19, 2018

Without Delegate=yes, systemd will "fix" the cgroup hierarchies whenever
'systemctl daemon-reload' is called, which will then remove any
restrictions placed on memory or device access for a given job.

This is a problem especially since 'systemctl daemon-reload' may be called
automatically by rpm/yum or a variety of config file managers, leading to
jobs escaping from slurmd/slurmstepd's control.

This setting should work for systemd versions >= 205.
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

Bug 5292.

cecb39ff
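
As a config fragment, the setting from commit cecb39ff above belongs in the `[Service]` section of the slurmd.service unit. Only the relevant line is shown here. The rest of the unit is omitted and this is not a reproduction of the file Slurm ships.

```ini
[Service]
# Leave the cgroup subtree below slurmd under slurmd's own control.
Delegate=yes
```

After changing a unit file, `systemctl daemon-reload` is needed for systemd to pick up the new setting.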

Fix segfault in hourly rollup · 346ce48b

Felip Moll authored Jul 19, 2018

When a job with time_end=0 and TRES null exists from an association that is
currently inside a reservation, the hourly rollup segfaults.

Bug 5143

346ce48b

18 Jul, 2018 4 commits

Prevent possible divide by zero in _validate_time_limit(). · 993ce884

Dominik Bartkiewicz authored Jul 18, 2018

As reported by Avalon Johnson on slurm-users
https://groups.google.com/forum/#!topic/slurm-users/BsMQ7Uk1PLw
Bug 5287.

993ce884

Fix printing of --hint options for sbatch, salloc · 17e6e23b

Brian Christiansen authored Jul 16, 2018

srun was already fixed in b7053bda (Bug 3294).

Bug 5126

17e6e23b

Add xstrstr() · 40abb764

Brian Christiansen authored Jul 16, 2018

40abb764

add job --gres-flags=disable-binding · aa61233b

Morris Jette authored Jul 17, 2018

Add salloc/sbatch/srun option of --gres-flags=disable-binding to disable
filtering of CPUs with respect to generic resource locality. This option is
currently required to use more CPUs than are bound to a GRES (i.e. if a GPU
is bound to the CPUs on one socket, but resources on more than one socket
are required to run the job). This option may permit a job to be allocated
resources sooner than otherwise possible, but may result in lower job
performance.

Bug 5189

aa61233b

17 Jul, 2018 3 commits

Fix for formatting when printing arrays in squeue · f1991701

Felip Moll authored Jul 17, 2018

When printing arrays in squeue and setting the SLURM_BITSTR_LEN variable to 0
or to NULL, the length of the output defaulted to 64, when the documentation
says it will default to "unlimited". This patch fixes this situation.

Bug 5440

f1991701

Fix incorrect locking in _init_power_save. · 1f8ede44

Marshall Garey authored Jul 16, 2018

The documentation and the code both indicate that the node lock is needed,
but the locks were incorrectly set as job locks.

Bug 5394.

1f8ede44

Fix incorrect locking in _slurm_rpc_resv_delete(). · 45e029c5

Dominik Bartkiewicz authored Jul 16, 2018

Needs the job write lock, as it may change job status not just node
status. Especially after commit 33e352a6.

Bug 5406.

45e029c5