Commits · e3f45f55268274a5bbb2a960e5526f13e28a1599 · Manuel G. Marciani / ces_slurm_simulator

23 Jul, 2018 2 commits

Remove slurm_shutdown_msg_conn() in favor of close(). · e3f45f55

Tim Wickberg authored Jul 23, 2018

This also cleans up several locations that could try to repeatedly call
close(). See prior commit for further details on why that is best avoided.

e3f45f55

Do not retry close() syscall on error. · a1e5c264

Tim Wickberg authored Jul 23, 2018

Quoting part of the close() man page:

Retrying the close() after a failure return is the wrong thing to do,
since this may cause a reused file descriptor from another thread to be
closed. This can occur because the Linux kernel always releases the
file descriptor early in the close operation, freeing it for reuse; the
steps that may return an error, such as flushing data to the filesystem
or device, occur only later in the close operation.

a1e5c264
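
The rule quoted from the man page can be sketched in C as below. This is only an illustration and is not taken from the Slurm source: `close_once()` is a hypothetical helper name. It reports a failed close() but never retries it, and it resets the descriptor to -1 so that no later code path closes the same number again.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Slurm): close *fd exactly once.
 * On Linux the descriptor is released even when close() fails, so the
 * error is reported but the call is never retried.
 */
static int close_once(int *fd)
{
	int rc;

	assert(fd != NULL);
	if (*fd < 0)
		return 0;	/* already closed, nothing to do */

	rc = close(*fd);
	if (rc < 0)
		fprintf(stderr, "close(%d) failed: %s\n", *fd, strerror(errno));

	*fd = -1;	/* mark as closed so it cannot be closed a second time */
	return rc;
}
```

A caller would pass the address of its descriptor variable, for example `close_once(&conn_fd)`, where `conn_fd` is an assumed name; a second call on the same variable is then a harmless no-op.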

21 Jul, 2018 4 commits

x11 - cleanup error path and overwrite DISPLAY if X11 forwarding failed. · a51d1600

Tim Wickberg authored Jul 21, 2018

Set DISPLAY to SLURM_X11_SETUP_FAILED to make it clear that the
tunnel setup has failed. This at least gives the user a hint as to
why their X11 apps aren't working, although further refinement should be
done later:

tim@zoidberg:~$ srun --x11 xclock
Error: Can't open display: SLURM_X11_SETUP_FAILED
srun: error: node001: task 0: Exited with exit code 1

a51d1600

Add X11Parameters=local_xauthority option. · 2a58e3e2

Tim Wickberg authored Jul 21, 2018

Creates a local XAUTHORITY file in TmpFS on the node, and deletes
it upon job termination. This avoids file locking contention on
~/.Xauthority in the user's home directory.

Bug 3647.

2a58e3e2

Send tmpfs to slurmstepd as part of pack_slurmd_conf_lite(). · 70e893b8

Tim Wickberg authored Jul 21, 2018

70e893b8

X11 forwarding subsystem - add plumbing to permit a temporary XAUTHORITY file · 3b7d1625

Tim Wickberg authored Jul 21, 2018

Build out sufficient plumbing such that a temporary XAUTHORITY file
can be used that is local to the compute node, thus avoiding lock
contention on ~/.Xauthority on parallel filesystems.

This commit only includes the requisite plumbing to pass this around.

If this is not used, a null string results, and the XAUTHORITY env var
will not be forced into the user environment.

Add support and fix the modified API call in pam_slurm_adopt while here.

Bug 3647.

3b7d1625

20 Jul, 2018 3 commits
- Modify test to eliminate vestigial file · 3ac05af0
  Morris Jette authored Jul 20, 2018
  
  3ac05af0
- cons_tres: enforce job max cpu limit · 7440f1ad
  Morris Jette authored Jul 19, 2018
  
  7440f1ad
- Merge branch 'slurm-17.11' · 4f04764a
  Tim Wickberg authored Jul 19, 2018
  
  4f04764a
19 Jul, 2018 12 commits

Start NEWS for v17.11.9 · 8b27b9c9
Tim Wickberg authored Jul 19, 2018

8b27b9c9
Update META for v17.11.8. · 07ad0727
Tim Wickberg authored Jul 19, 2018
```
Update slurm.spec and slurm.spec-legacy as well.
```
07ad0727
Add NEWS entry missed on prior commit. · 380abb0b
Tim Wickberg authored Jul 19, 2018

380abb0b

Use one macro for all listen() backlog arguments. · b039ba24

Tim Wickberg authored Jul 19, 2018

The lower limit of 1024 may be too short for srun with large-scale
jobs, and lead to problems processing task completion messages in a
timely fashion.

Rather than adjust that, unify the two separate macros into
SLURM_DEFAULT_LISTEN_BACKLOG with the higher 4096 value.

Bug 5164.

b039ba24
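
A minimal sketch of the single-macro approach follows, assuming a plain TCP listener on the loopback interface. Only the macro name and the 4096 value come from the commit message; `open_listener()` is a hypothetical helper and does not reflect how Slurm actually creates its sockets.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One backlog value shared by every listen() call (value from the commit). */
#define SLURM_DEFAULT_LISTEN_BACKLOG 4096

/*
 * Hypothetical helper: open a TCP listening socket on the loopback
 * interface, on a port chosen by the kernel. Returns the descriptor,
 * or -1 on failure.
 */
static int open_listener(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick an ephemeral port */

	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
	    listen(fd, SLURM_DEFAULT_LISTEN_BACKLOG) < 0) {
		close(fd);	/* closed once, never retried */
		return -1;
	}
	return fd;
}
```

Note that on Linux the kernel silently caps the requested backlog at net.core.somaxconn, so the larger value only takes effect if that sysctl is at least as high.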

Add Delegate=yes to slurmd.service file to prevent systemd from interfering. · cecb39ff

Tim Wickberg authored Jul 19, 2018

Without Delegate=yes, systemd will "fix" the cgroup hierarchies whenever
'systemctl daemon-reload' is called, which will then remove any
restrictions placed on memory or device access for a given job.

This is a problem especially since 'systemctl daemon-reload' may be called
automatically by rpm/yum or a variety of config file managers, leading to
jobs escaping from slurmd/slurmstepd's control.

This setting should work for systemd versions >= 205.
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

Bug 5292.

cecb39ff

cons_gres - flesh out more core selection logic · 50e3466a
Morris Jette authored Jul 19, 2018

50e3466a
initialize a variable · 351c9648
Morris Jette authored Jul 19, 2018
```
addresses problem reported by clang
```
351c9648
Merge branch 'slurm-17.11' · 86fd2866
Tim Wickberg authored Jul 19, 2018

86fd2866
Merge branch 'slurm-17.02' into slurm-17.11 · 954830f5
Tim Wickberg authored Jul 19, 2018

954830f5
cons_tres logic fixes · a8103f54
Morris Jette authored Jul 19, 2018
```
bug introduced in commit a7d9313d
```
a8103f54

Fix segfault in hourly rollup · 346ce48b

Felip Moll authored Jul 19, 2018

When a job with time_end=0 and TRES null exists from an association that is
currently inside a reservation, the hourly rollup segfaults.

Bug 5143

346ce48b

Remove unused 'flags' argument from slurm_msg_sendto(). · 0d150ba9
Tim Wickberg authored Jul 17, 2018
```
And from underlying slurm_msg_sendto_timeout call as well.
```
0d150ba9

18 Jul, 2018 12 commits
- cons_tres: flesh out logic for gres to pick job cores · a7d9313d
  Morris Jette authored Jul 17, 2018
```
Add function to clear total_gres at start of scheduling cycle
Modify logic to avoid overflow on gpu counter
```
  a7d9313d
- Prevent possible divide by zero in _validate_time_limit(). · 993ce884
  Dominik Bartkiewicz authored Jul 18, 2018
```
As reported by Avalon Johnson on slurm-users
https://groups.google.com/forum/#!topic/slurm-users/BsMQ7Uk1PLw
Bug 5287.
```
  993ce884
- SchedulerParameters' "whole_pack" option has been renamed to "whole_hetjob" · 3807fef5
  Alejandro Sanchez authored Jul 18, 2018
```
bug 4373, comment #24
```
  3807fef5
- Fix documentation about enforce-binding opt. · 7cc2553b
  Felip Moll authored Jul 18, 2018
```
Removed the sentence which incorrectly stated that when not using the
gres flag enforce-binding option, cpus other than the ones defined in gres.conf
could be used for a gpu.

Bug 5189
```
  7cc2553b
- Fix grammar in RebootProgram docs · 72b4f3c4
  Brian Christiansen authored Jul 17, 2018
  
  72b4f3c4
- Fix printing of --hint options for sbatch, salloc · 17e6e23b
  Brian Christiansen authored Jul 16, 2018
```
srun was already fixed in b7053bda (Bug 3294).

Bug 5126
```
  17e6e23b
- Add xstrstr() · 40abb764
  Brian Christiansen authored Jul 16, 2018
  
  40abb764
- document --gres-flags=disable-binding · 6d349bc5
  Felip Moll authored Jul 17, 2018
```
bug 5189
```
  6d349bc5
- add job --gres-flags=disable-binding · aa61233b
  Morris Jette authored Jul 17, 2018
```
Add salloc/sbatch/srun option of --gres-flags=disable-binding to disable
    filtering of CPUs with respect to generic resource locality. This option is
    currently required to use more CPUs than are bound to a GRES (i.e. if a GPU
    is bound to the CPUs on one socket, but resources on more than one socket
    are required to run the job). This option may permit a job to be allocated
    resources sooner than otherwise possible, but may result in lower job
    performance.
bug 5189
```
  aa61233b
- Merge branch 'slurm-17.11' · 7fddd347
  Tim Wickberg authored Jul 17, 2018
  
  7fddd347
- Docs - Change to using 'show engines' for verifying InnoDB availability. · 79fd5e83
  Broderick Gardner authored Jul 17, 2018
```
'have_innodb' is deprecated.

Bug 5317.
```
  79fd5e83
- Use %zu to print size_t instead of %zd. · 764dfae5
  Broderick Gardner authored Jul 16, 2018
```
Clean up printf formatters and ensure they match the types:
%zu for size_t
%zd for ssize_t

Bug 5417.
```
  764dfae5
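
The %zu / %zd pairing described in commit 764dfae5 can be illustrated with a short C sketch. `format_sizes()` is a hypothetical helper written for this example and is not a function from the Slurm source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical helper: format an unsigned size and a signed result
 * with the conversion specifier that matches each type.
 */
static int format_sizes(char *buf, size_t buflen, size_t len, ssize_t rc)
{
	assert(buf != NULL);
	/* %zu matches the unsigned size_t, %zd matches the signed ssize_t */
	return snprintf(buf, buflen, "len=%zu rc=%zd", len, rc);
}
```

Printing a size_t with %zd would interpret the value as signed, which is why the commit switches those call sites to %zu and keeps %zd for ssize_t values such as the return of read() or write().
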
17 Jul, 2018 7 commits
- Fix for formatting when printing arrays in squeue · f1991701
  Felip Moll authored Jul 17, 2018
```
When printing arrays in squeue and setting the SLURM_BITSTR_LEN variable to 0
or to NULL, the length of the output defaulted to 64, when the documentation
says it will default to "unlimited". This patch fixes this situation.

Bug 5440
```
  f1991701
- Disable ConstrainKmemSpace by default. · 32fabc5e
  Marshall Garey authored Jul 17, 2018
```
Because of a bug in some versions of the Linux kernel, disable
constraining kernel memory space with cgroups by default.

Bug 5223.
```
  32fabc5e
- Remove redundant call to create_mmap_buf. · cbbc6970
  Tim Wickberg authored Jul 17, 2018
  
  cbbc6970
- Remove redundant NULL pointer check · fc4dbed2
  Morris Jette authored Jul 16, 2018
```
Coverity CID 186991
```
  fc4dbed2
- Docs - fix reference to enable_user_top option. · 29cc55b7
  Marshall Garey authored Jul 16, 2018
```
Logic was switched around in 17.11; enable_user_top is now the
correct option.

Bug 5165.
```
  29cc55b7
- Merge branch 'slurm-17.11' · f4e8ec05
  Tim Wickberg authored Jul 16, 2018
  
  f4e8ec05
- Docs - Clarify MPI apps don't work with hetjobs in 17.11. · 3060b62e
  Alejandro Sanchez authored Jul 16, 2018
```
This is not working reliably even when setting
SchedulerParameters=enable_hetero_steps and/or using OpenMPI with Slurm's
mpi/pmi2, as it was previously documented.

Bug 5309.
```
  3060b62e