Commits · f23411bc96be8055e3f295270c4a73709ce574b4 · Manuel G. Marciani / ces_slurm_simulator

11 Aug, 2018 1 commit
- Fix printing of node state "drained + reboot" · f23411bc
  Brian Christiansen authored Aug 10, 2018
```
and other drain + node state flags.

Bug 5505
```
  f23411bc
10 Aug, 2018 2 commits
- Start NEWS for v17.11.10 · d36faa22
  Tim Wickberg authored Aug 09, 2018
  
  d36faa22
- Update META for v17.11.9. · 49514f58
  Tim Wickberg authored Aug 09, 2018
```
Update slurm.spec and slurm.spec-legacy as well.
```
  49514f58
09 Aug, 2018 3 commits
- Fix sinfo to print correct node state. · bf569fef
  Brian Christiansen authored Aug 09, 2018
```
Bug 5505
```
  bf569fef
- Continuation of 06582da8. · 70655460
  Tim Wickberg authored Aug 09, 2018
```
Bug 5164.

Without this patch the slurmctld would send a finishing message to srun
which may already be torn down.  This prevents printing a benign
error message.
```
  70655460
- Fix multi partition jobs priority_array assignment order. · d2a1a96c
  Alejandro Sanchez authored Aug 06, 2018
```
Bug 3844.
```
  d2a1a96c
08 Aug, 2018 7 commits
- Prevent tests from running when OverTimeLimit set · 2c28c2c2
  Marshall Garey authored Aug 08, 2018
```
Bug 5526
```
  2c28c2c2
- Add get_over_time_limit global expect function · 6a88fe57
  Marshall Garey authored Aug 08, 2018
```
Bug 5526
```
  6a88fe57
- Update tests to use global function · 32497d34
  Marshall Garey authored Aug 08, 2018
```
Bug 5171
```
  32497d34
- Ensure MaxTime is ULIMITED for tests · b2c188d2
  Marshall Garey authored Aug 08, 2018
```
so that the tests won't fail due to trying to set time limits greater
than MaxTime.

Bug 5171
```
  b2c188d2
- Exit from tests using single function · 57a31370
  Marshall Garey authored Aug 08, 2018
```
Bug 5171
```
  57a31370
- Ensure slurmuser in tests · 527b85b2
  Marshall Garey authored Aug 08, 2018
```
in order to be able to set a partition's MaxTime

Bug 5171
```
  527b85b2
- Add set_partition_maximum_time_limit to globals · 5f7aac84
  Marshall Garey authored Aug 08, 2018
```
To set a partition's MaxTime

Bug 5171
```
  5f7aac84
07 Aug, 2018 3 commits

burst_buffer/cray - fix datawarp swap default pool overriding jobdw. · 4212426c
Alejandro Sanchez authored Aug 07, 2018
```
Bug 5528.
```
4212426c

Avoid node config_list entry fragmentation. · be0abe8a

Morris Jette authored Aug 07, 2018

Only split nodes here if a node_features plugin is in use.

Otherwise node fragmentation will occur if the node config has
CPUs specified but not CoresPerSocket and Sockets.

This could be avoided by filling out the node definition, but
adding this workaround for backwards compatibility.

Bug 5039.

be0abe8a

Create TMPDIR after task prologs have run. · 0c560d67

Marshall Garey authored Aug 07, 2018

Task prologs could set or modify this, so wait to create the
directory until after they've finished.

Bug 5367.

0c560d67
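
The ordering described above can be sketched in C as follows. This is a minimal illustration, not the code from the commit: `make_tmpdir()` is a hypothetical helper, and it assumes the task prologs have already updated the process environment, so `TMPDIR` is read only at the point where the directory is created.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the Slurm source): create the directory
 * named by TMPDIR, reading the variable at call time so that any change
 * made by an earlier prolog is honoured.
 * Returns 0 on success or when TMPDIR is unset or empty, -1 on error.
 */
static int make_tmpdir(void)
{
	const char *dir = getenv("TMPDIR");

	if (!dir || !*dir)
		return 0;	/* nothing to create */

	/* An already existing directory is not treated as an error. */
	if (mkdir(dir, 0700) < 0 && errno != EEXIST)
		return -1;

	return 0;
}
```

Calling such a helper before the prologs run would create whatever path `TMPDIR` held at that moment, which is the behaviour the commit moves away from.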

06 Aug, 2018 3 commits

Change debug messages() in _launch_handler(). · c3c78acd

Tim Wickberg authored Aug 06, 2018

After changes to slurm_send_only_node_msg(), this message is much
more likely to appear on systems with overloaded interconnects since
that connection handling code may end up retransmitting messages
that were actually received (but that the transmit side could not
verify were delivered successfully).

As the error() message stated, this isn't actually an error, and
the code will proceed happily past this point. So drop the debug
level, and remove the surrealist "this is not an error" part.

Bug 5164.

c3c78acd

Modify slurm_send_only_node_msg() to catch issues with socket. · 06582da8

Tim Wickberg authored Aug 06, 2018

There are subtle issues involved in treating a TCP transmission
as a unidirectional message delivery layer.

The original code path looks like: connect(), write(), close().
But Linux handles the write() and close() asynchronously behind the
scenes, and does not block until that write() has been ACK'd by the
remote end. So the write() and close() may succeed, even with data
still in flight. A communication error - and message loss - would
have been silently ignored, leading to unreliable message transmission.

Worse yet, one side of the connection would believe it sent the message,
while the receive side swears it never saw the packets. This leads to
infrequent and yet seemingly impossible data loss, and a very tough
bug to chase down.

This teardown code tries to force the connection to shut down in an
orderly manner, giving Slurm a chance to catch a connection problem
and the upstream calling path an opportunity to retransmit.

This teardown code is based on an approach described in Section 7.5
of "UNIX Network Programming" Volume 1 (Third Edition), specifically
the subsection regarding SO_LINGER. (That subsection also covers why
SO_LINGER is not sufficient to prevent this issue.)

Bug 5164.

06582da8
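
For readers unfamiliar with the technique, the following is a minimal C sketch of an orderly teardown in the style the commit message attributes to "UNIX Network Programming": half-close the socket with `shutdown()`, then `read()` until the peer closes its side. It is an assumption-laden illustration, not the code in `slurm_send_only_node_msg()`; `orderly_close()` is a hypothetical name, and a real implementation would also need a timeout.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's implementation): replace a bare
 * close() with shutdown() + read-until-EOF so that a connection error
 * has a chance to surface before the descriptor is released.
 * Returns 0 if the peer closed its side cleanly, -1 on any error.
 * Note: there is no timeout here; this sketch blocks until the peer acts.
 */
static int orderly_close(int fd)
{
	char buf[64];
	ssize_t n;
	int rc = 0;

	/* Half-close: queued data is sent, followed by FIN. */
	if (shutdown(fd, SHUT_WR) < 0)
		rc = -1;

	/*
	 * read() returns 0 once the peer has closed its end. A reset or
	 * other transport failure shows up as an error instead.
	 */
	while (rc == 0) {
		n = read(fd, buf, sizeof(buf));
		if (n == 0)
			break;		/* peer closed: orderly shutdown */
		if (n < 0 && errno != EINTR)
			rc = -1;	/* error the caller can act on */
		/* n > 0: unexpected data from the peer, discard it */
	}

	if (close(fd) < 0)
		rc = -1;

	return rc;
}
```

A caller that gets -1 back knows delivery is in doubt and can retransmit, which is the opportunity the commit message describes for the upstream calling path.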

Retransmit on all errors. · a572d5d6
Tim Wickberg authored Aug 06, 2018
```
Bug 5164.
```
a572d5d6

04 Aug, 2018 1 commit

Fix 'srun -q' handling. · 1aaf99de

Jason Booth authored Aug 03, 2018

The getopt format string needs to handle an option here,
and the --help output had not been corrected after 99b2c4e8.

Bug 5522.

1aaf99de

31 Jul, 2018 1 commit
- Enable support for hwloc version 2.0.1 · cbe7015e
  Morris Jette authored Jul 31, 2018
```
Bug 5070
```
  cbe7015e
27 Jul, 2018 2 commits
- Remove erroneous unlock in acct_gather_energy/ipmi. · 4e8b77fa
  Danny Auble authored Jul 27, 2018
```
Bug 5468

This is a backport of commit cefc9ec1.
```
  4e8b77fa
- Fix segfault in slurmctld when a job's node bitmap is NULL during a · fef07a40
  Dominik Bartkiewicz authored Jul 27, 2018
```
scheduling cycle.  Primarily caused by EnforcePartLimits=ALL.

Bug 5452
```
  fef07a40
24 Jul, 2018 1 commit
- Fix spelling in man page · 074b0ea0
  Brian Christiansen authored Jul 24, 2018
  
  074b0ea0
19 Jul, 2018 7 commits

Start NEWS for v17.11.9 · 8b27b9c9
Tim Wickberg authored Jul 19, 2018

8b27b9c9
Update META for v17.11.8. · 07ad0727
Tim Wickberg authored Jul 19, 2018
```
Update slurm.spec and slurm.spec-legacy as well.
```
07ad0727
Add NEWS entry missed on prior commit. · 380abb0b
Tim Wickberg authored Jul 19, 2018

380abb0b

Use one macro for all listen() backlog arguments. · b039ba24

Tim Wickberg authored Jul 19, 2018

The lower limit of 1024 may be too short for srun with large-scale
jobs, and lead to problems processing task completion messages in a
timely fashion.

Rather than adjust that, unify the two separate macros into
SLURM_DEFAULT_LISTEN_BACKLOG with the higher 4096 value.

Bug 5164.

b039ba24
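
A minimal C sketch of the idea: every `listen()` call takes its backlog from one macro. The macro name and the value 4096 come from the commit message; `open_listener()` is a hypothetical helper, and the loopback address and ephemeral port are assumptions made only to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Single definition shared by every listen() call site. */
#define SLURM_DEFAULT_LISTEN_BACKLOG 4096

/*
 * Hypothetical helper (not from the Slurm source): open a TCP listener
 * on the loopback address with a kernel-chosen port.
 * Returns the listening descriptor, or -1 on error.
 */
static int open_listener(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, SLURM_DEFAULT_LISTEN_BACKLOG) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

On Linux the kernel may silently cap the requested backlog at `net.core.somaxconn`, so the effective queue length can be lower than the macro's value.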

Add Delegate=yes to slurmd.service file to prevent systemd from interfering. · cecb39ff

Tim Wickberg authored Jul 19, 2018

Without Delegate=yes, systemd will "fix" the cgroup hierarchies whenever
'systemctl daemon-reload' is called, which will then remove any
restrictions placed on memory or device access for a given job.

This is a problem especially since 'systemctl daemon-reload' may be called
automatically by rpm/yum or a variety of config file managers, leading to
jobs escaping from slurmd/slurmstepd's control.

This setting should work for systemd versions >= 205.
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

Bug 5292.

cecb39ff

Merge branch 'slurm-17.02' into slurm-17.11 · 954830f5
Tim Wickberg authored Jul 19, 2018

954830f5

Fix segfault in hourly rollup · 346ce48b

Felip Moll authored Jul 19, 2018

When a job with time_end=0 and TRES null exists from an association that is
currently inside a reservation, the hourly rollup segfaults.

Bug 5143

346ce48b

18 Jul, 2018 5 commits
- Prevent possible divide by zero in _validate_time_limit(). · 993ce884
  Dominik Bartkiewicz authored Jul 18, 2018
```
As reported by Avalon Johnson on slurm-users
https://groups.google.com/forum/#!topic/slurm-users/BsMQ7Uk1PLw
Bug 5287.
```
  993ce884
- Fix grammar in RebootProgram docs · 72b4f3c4
  Brian Christiansen authored Jul 17, 2018
  
  72b4f3c4
- Fix printing of --hint options for sbatch, salloc · 17e6e23b
  Brian Christiansen authored Jul 16, 2018
```
srun was already fixed in b7053bda (Bug 3294).

Bug 5126
```
  17e6e23b
- Add xstrstr() · 40abb764
  Brian Christiansen authored Jul 16, 2018
  
  40abb764
- Docs - Change to using 'show engines' for verifying InnoDB availability. · 79fd5e83
  Broderick Gardner authored Jul 17, 2018
```
'have_innodb' is deprecated.

Bug 5317.
```
  79fd5e83
17 Jul, 2018 4 commits

Fix for formatting when printing arrays in squeue · f1991701

Felip Moll authored Jul 17, 2018

When printing arrays in squeue and setting the SLURM_BITSTR_LEN variable to 0
or to NULL, the length of the output defaulted to 64, when the documentation
says it will default to "unlimited". This patch fixes this situation.

Bug 5440

f1991701

Docs - fix reference to enable_user_top option. · 29cc55b7
Marshall Garey authored Jul 16, 2018
```
Logic was switched around in 17.11, enable_user_top is now the
correct option.

Bug 5165.
```
29cc55b7

Docs - Clarify MPI apps don't work with hetjobs in 17.11. · 3060b62e

Alejandro Sanchez authored Jul 16, 2018

This is not working reliably even when setting
SchedulerParameters=enable_hetero_steps and/or using OpenMPI with Slurm's
mpi/pmi2, as it was previously documented.

Bug 5309.

3060b62e

Fix incorrect locking in _init_power_save. · 1f8ede44

Marshall Garey authored Jul 16, 2018

Both the documentation and the code indicate that the node lock is
needed here, but the job locks were set instead.

Bug 5394.

1f8ede44