Commits · c9e99e7c26101aadc452fd917ccc4b319edf683d · Manuel G. Marciani / ces_slurm_simulator

27 Jul, 2018 9 commits

Change error("sched: ...") to sched_error(). · c9e99e7c

Tim Wickberg authored Jul 27, 2018

Collapse error messages onto one line where they weren't previously,
and fix up argument indentation as well while here.

find . -name '*\.[ch]' -exec sed -i s/error\(\[\ \]*\"sched:\ /sched_error\(\"/ {} \;

c9e99e7c

Add sched_error() through sched_debug3() functions. · d42cbab9

Tim Wickberg authored Jul 27, 2018

These calls will replace all these:
	error("sched: message...")
with:
	sched_error("message...")

This allows the _log_msg function to stop calling xstrncmp() on every
log message while holding log_lock when SchedLogLevel > 0.
This behavior has effectively removed all concurrency from slurmctld when
SchedLogLevel is enabled, and prevented larger scale systems from being able
to start up - see bug 3746 for an example of this.

d42cbab9
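
As a rough illustration of the idea in this commit, the sketch below passes a boolean from the caller instead of comparing each message against the "sched: " prefix while the lock is held. All names ending in `_sketch`, the `sched_msgs_seen` counter, and the output format are assumptions made for this example; Slurm's real log.c is considerably more involved.

```c
#include <pthread.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; not Slurm's actual logging code. */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static int sched_msgs_seen = 0;	/* counts messages flagged as scheduler messages */

/* The caller states up front whether this is a scheduler message, so no
 * string comparison on the format is needed while log_lock is held. */
static void log_msg_sketch(bool sched, const char *fmt, va_list args)
{
	pthread_mutex_lock(&log_lock);
	if (sched) {
		sched_msgs_seen++;
		fputs("sched: ", stderr);
	}
	vfprintf(stderr, fmt, args);
	fputc('\n', stderr);
	pthread_mutex_unlock(&log_lock);
}

/* Ordinary error message: sched flag is false. */
static void error_sketch(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	log_msg_sketch(false, fmt, ap);
	va_end(ap);
}

/* Scheduler error message: sched flag is true, no prefix in the format. */
static void sched_error_sketch(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	log_msg_sketch(true, fmt, ap);
	va_end(ap);
}
```

A caller would write `sched_error_sketch("job %d", 42)` where it previously wrote an error call with the prefix embedded in the format string.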

Add additional 'sched' argument to _log_msg. · 6c649339

Tim Wickberg authored Jul 27, 2018

Update all calling locations, and as this is a static function
change the name to _log_msg from log_msg while here.

6c649339

Promote the only debug4("sched: "...) message to debug3. · 575e5b1a
Tim Wickberg authored Jul 27, 2018

575e5b1a
Merge branch 'slurm-17.11' · 143b6058
Tim Wickberg authored Jul 27, 2018

143b6058
Fix segfault in slurmctld when a job's node bitmap is NULL during a · fef07a40
Dominik Bartkiewicz authored Jul 27, 2018
```
scheduling cycle.  Primarily caused by EnforcePartLimits=ALL.

Bug 5452
```
fef07a40
Change to bool options instead of bitpacked values. · ac82934e
Tim Wickberg authored Jun 25, 2018

ac82934e
sacctmgr - add user_case_norm boolean to control case normalization of usernames. · afa5c666
Tim Wickberg authored Jul 26, 2018
```
This requires all calls to slurm_addto_char_list() to change to
slurm_addto_char_list_with_case().
```
afa5c666

Add extra parameter to slurm_addto_char_list(). · 2a8e51e4

Tim Wickberg authored Jul 26, 2018

Rather than replace all existing references, keep the existing function
name and signature, and break off the body into
slurm_addto_char_list_with_case() with the lower_case_normalization
option set to true.

Locations that need to disable this case normalization will thus have
access to this same function, just by the longer name with the extra
boolean.

2a8e51e4
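
The pattern described here (keep the old name and signature, move the body into a longer-named variant with an extra boolean) might look roughly like the following. `copy_name()` and `copy_name_with_case()` are hypothetical miniatures invented for this sketch; they are not slurm_addto_char_list() and do not share its signature.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the pattern; not Slurm's list code. */

/* Longer-named variant carrying the extra boolean. Copies src into dst
 * (at most dst_len - 1 characters), lower-casing it when asked to. */
static void copy_name_with_case(char *dst, size_t dst_len, const char *src,
				bool lower_case_normalization)
{
	size_t i;

	if (dst_len == 0)
		return;
	for (i = 0; i + 1 < dst_len && src[i] != '\0'; i++) {
		dst[i] = lower_case_normalization ?
			 (char) tolower((unsigned char) src[i]) : src[i];
	}
	dst[i] = '\0';
}

/* Original name and signature stay as they were; existing callers keep
 * their behaviour because normalization is switched on here. */
static void copy_name(char *dst, size_t dst_len, const char *src)
{
	copy_name_with_case(dst, dst_len, src, true);
}
```

Only the call sites that need case preserved have to change, which keeps the diff small compared with touching every caller.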

25 Jul, 2018 8 commits

Fix incorrect expression with fdopen() call in _spank_stack_load(). · c409a53f

Tim Wickberg authored Jul 25, 2018

fdopen returns NULL on failure. It is impossible for (FILE *) to be less
than zero, so this check would have always succeeded even in the event
of a failure.

CID 187087.

c409a53f

Remove stray commented out include for a file that no longer exists there. · 5781daa2
Tim Wickberg authored Jul 25, 2018

5781daa2
Avoid leak of error_msg on error in _check_database_variables. · 8156817e
Tim Wickberg authored Jul 25, 2018
```
CID 187179.
```
8156817e
Do not leak fd on fstat() error in create_mmap_buf(). · 2d92a70d
Tim Wickberg authored Jul 25, 2018
```
CID 187085.
```
2d92a70d
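
The class of leak fixed here can be sketched as follows. `file_size_sketch()` is a hypothetical function for illustration only; it is not create_mmap_buf() and skips the mmap step entirely.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper. Returns the file size, or -1 on error. Every exit
 * path after a successful open() closes fd. */
static off_t file_size_sketch(const char *path)
{
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	if (fstat(fd, &st) < 0) {
		close(fd);	/* without this, the error path leaks fd */
		return -1;
	}

	close(fd);
	return st.st_size;
}
```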

Print clustername when sacctmgr show runaway does · d647d3f7

Marshall Garey authored Jul 25, 2018

not find any jobs. The actual syntax for that sacctmgr command is

sacctmgr show runawayjobs [clustername]

but sometimes people mistakenly do

sacctmgr show runaway jobs

in which case it would look for runaway jobs on a cluster named "jobs"
and print the message "Runaway Jobs: No runaway jobs found" because the
clustername was wrong. This patch changes that message to say "Runaway
Jobs: No runaway jobs found on cluster jobs" so that people know they
made a mistake in syntax.

d647d3f7

assoc_mgr_get_user_assocs() - Change check to be xassert instead. · 5c1e56de
Dominik Bartkiewicz authored Jul 25, 2018
```
Bug 5098
```
5c1e56de
Remove slurm.epilog.clean. · ec932007
Tim Wickberg authored Jul 25, 2018
```
pam_slurm_adopt is way better, please use that instead.
```
ec932007

Remove assoc_mgr locks from _restore_job_dependencies. · 5ebb92da

Tim Wickberg authored Jul 24, 2018

This fixes the previously mentioned double-locking issue.

These locks were originally introduced in dd6cdddb. Commit b298df5d
then moved the tres setup over to _load_job_state instead, and these are no
longer required here.

Bug 5469.

5ebb92da

24 Jul, 2018 7 commits

Fix spelling in man page · 074b0ea0
Brian Christiansen authored Jul 24, 2018

074b0ea0

Prevent double-locking in _restore_job_dependencies(). · 36f404db

Tim Wickberg authored Jul 24, 2018

assoc_mgr_clear_used_info() already manages its own locks,
so wait to lock until after that's been called.

Bug 5469.

36f404db

Revert "Handle locking for assoc_mgr_clear_used_info() upstream." · b6339c5a

Tim Wickberg authored Jul 24, 2018

Leads to deadlock in 'scontrol reconfigure':

0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
1  0x00007f61b1880801 in __GI_abort () at abort.c:79
2  0x00007f61b1fe2bd7 in __xassert_failed (expr=expr@entry=0x7f61b1ff2989
"_store_locks(locks)", file=file@entry=0x7f61b1ff2e80
"../../../../slurm/src/common/assoc_mgr.c",
    line=line@entry=2114, func=func@entry=0x7f61b1ff4bf8 <__func__.17903>
"assoc_mgr_lock") at ../../../../slurm/src/common/xassert.c:57
3  0x00007f61b1ead017 in assoc_mgr_lock (locks=locks@entry=0x7f61ad6da610) at
../../../../slurm/src/common/assoc_mgr.c:2114
4  0x00005632675e4509 in _adjust_limit_usage (type=type@entry=0,
job_ptr=job_ptr@entry=0x7f6190015000) at
../../../../slurm/src/slurmctld/acct_policy.c:719
5  0x00005632675e4b49 in acct_policy_add_job_submit
(job_ptr=job_ptr@entry=0x7f6190015000) at
../../../../slurm/src/slurmctld/acct_policy.c:2515
6  0x00005632676767c7 in _restore_job_dependencies () at
../../../../slurm/src/slurmctld/read_config.c:2612
7  read_slurm_conf (recover=recover@entry=1, reconfig=reconfig@entry=true) at
../../../../slurm/src/slurmctld/read_config.c:1310
8  0x0000563267668d0f in _slurm_rpc_reconfigure_controller
(msg=msg@entry=0x7f61ad6dade0) at
../../../../slurm/src/slurmctld/proc_req.c:3645
9  0x000056326766f646 in slurmctld_req (msg=0x7f61ad6dade0,
arg=0x7f619c000dc0) at ../../../../slurm/src/slurmctld/proc_req.c:425
10 0x00005632675efbc9 in _service_connection (arg=<optimized out>) at
../../../../slurm/src/slurmctld/controller.c:1285
11 0x00007f61b1c386db in start_thread (arg=0x7f61ad6db700) at
pthread_create.c:463
12 0x00007f61b196188f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Bug 5469.

This reverts commit 4b7ad3b6.

b6339c5a

Start lock annotation in assoc_mgr.c. · 0b4ed292
Tim Wickberg authored Jul 23, 2018

0b4ed292
Add verify_assoc_lock() for lock annotation. · 62e7331c
Tim Wickberg authored Jul 23, 2018

62e7331c
Add lock annotation infrastructure to assoc_mgr.c. · 112e3e20
Tim Wickberg authored Jul 23, 2018

112e3e20
Added database InnoDB settings verification to accounting storage plugin init · 0368fb33
Broderick Gardner authored Jun 20, 2018
```
Bug 5248.
```
0368fb33

23 Jul, 2018 5 commits

Fix leak of xauthority variable. · 25272951
Tim Wickberg authored Jul 23, 2018
```
CID 187110.
```
25272951

Handle locking for assoc_mgr_clear_used_info() upstream. · 4b7ad3b6

Tim Wickberg authored Jul 23, 2018

Avoid a double-locking issue by moving the locks previously
required by assoc_mgr_clear_used_info() up in the calling path
to _restore_job_dependencies().

Add commented out lock annotations to use later.

4b7ad3b6

Remove slurm_shutdown_msg_engine() in favor of close(). · 2cf8ceef
Tim Wickberg authored Jul 23, 2018
```
Nothing ever checks this return code anyway.
```
2cf8ceef

Remove slurm_shutdown_msg_conn() in favor of close(). · e3f45f55

Tim Wickberg authored Jul 23, 2018

This also cleans up several locations that could try to repeatedly call
close(). See prior commit for further details on why that is best avoided.

e3f45f55

Do not retry close() syscall on error. · a1e5c264

Tim Wickberg authored Jul 23, 2018

Quoting part of the close() man page:

Retrying the close() after a failure return is the wrong thing to do,
since this may cause a reused file descriptor from another thread to be
closed. This can occur because the Linux kernel always releases the
file descriptor early in the close operation, freeing it for reuse; the
steps that may return an error, such as flushing data to the filesystem
or device, occur only later in the close operation.

a1e5c264
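
One way to follow the man page advice quoted above is to call close() exactly once and invalidate the descriptor variable regardless of the result. `close_once()` is a hypothetical helper for this sketch, not a function from the Slurm tree.

```c
#include <unistd.h>

/* Hypothetical helper. Calls close() exactly once and marks the descriptor
 * invalid whatever the result, so later cleanup paths cannot close it again. */
static int close_once(int *fd)
{
	int rc = 0;

	if (*fd >= 0) {
		rc = close(*fd);	/* no retry loop, not even on EINTR */
		*fd = -1;
	}
	return rc;
}
```

Resetting the variable to -1 also guards against the repeated close() calls that the following commit cleans up, since a second call becomes a no-op.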

21 Jul, 2018 4 commits

x11 - cleanup error path and overwrite DISPLAY if X11 forwarding failed. · a51d1600

Tim Wickberg authored Jul 21, 2018

Set DISPLAY to SLURM_X11_SETUP_FAILED to make it clear that the
tunnel setup has failed. This at least gives the user a hint as to
why their X11 apps aren't working, although further refinement should be
done later:

tim@zoidberg:~$ srun --x11 xclock
Error: Can't open display: SLURM_X11_SETUP_FAILED
srun: error: node001: task 0: Exited with exit code 1

a51d1600
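
The effect on the user's environment can be approximated with setenv(). This sketch changes the calling process's own environment for simplicity; how Slurm applies the value to the job environment is not shown in the commit message, and `mark_x11_failed_sketch()` is a hypothetical name.

```c
#include <stdlib.h>

/* Hypothetical helper. Overwrites DISPLAY with the marker value named in the
 * commit message. Returns 0 on success, -1 on error (as setenv() does). */
static int mark_x11_failed_sketch(void)
{
	/* Third argument 1 means: overwrite any DISPLAY already set. */
	return setenv("DISPLAY", "SLURM_X11_SETUP_FAILED", 1);
}
```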

Add X11Parameters=local_xauthority option. · 2a58e3e2

Tim Wickberg authored Jul 21, 2018

Creates a local XAUTHORITY file in TmpFS on the node, and deletes
it upon job termination. This avoids file locking contention on
~/.Xauthority in the user's home directory.

Bug 3647.

2a58e3e2

Send tmpfs to slurmstepd as part of pack_slurmd_conf_lite(). · 70e893b8
Tim Wickberg authored Jul 21, 2018

70e893b8

X11 forwarding subsystem - add plumbing to permit a temporary XAUTHORITY file · 3b7d1625

Tim Wickberg authored Jul 21, 2018

Build out sufficient plumbing such that a temporary XAUTHORITY file
can be used that is local to the compute node, thus avoiding lock
contention on ~/.Xauthority on parallel filesystems.

This commit only includes the requisite plumbing to pass this around.

If this is not used, a null string results, and the XAUTHORITY env var
will not be forced into the user environment.

Add support and fix the modified API call in pam_slurm_adopt while here.

Bug 3647.

3b7d1625

20 Jul, 2018 3 commits
- Modify test to eliminate vestigial file · 3ac05af0
  Morris Jette authored Jul 20, 2018
  
  3ac05af0
- cons_tres: enforce job max cpu limit · 7440f1ad
  Morris Jette authored Jul 19, 2018
  
  7440f1ad
- Merge branch 'slurm-17.11' · 4f04764a
  Tim Wickberg authored Jul 19, 2018
  
  4f04764a
19 Jul, 2018 4 commits
- Start NEWS for v17.11.9 · 8b27b9c9
  Tim Wickberg authored Jul 19, 2018
  
  8b27b9c9
- Update META for v17.11.8. · 07ad0727
  Tim Wickberg authored Jul 19, 2018
```
Update slurm.spec and slurm.spec-legacy as well.
```
  07ad0727
- Add NEWS entry missed on prior commit. · 380abb0b
  Tim Wickberg authored Jul 19, 2018
  
  380abb0b
- Use one macro for all listen() backlog arguments. · b039ba24
  Tim Wickberg authored Jul 19, 2018
```
The lower limit of 1024 may be too short for srun with large-scale
jobs, and lead to problems processing task completion messages in a
timely fashion.

Rather than adjust that, unify the two separate macros into
SLURM_DEFAULT_LISTEN_BACKLOG with the higher 4096 value.

Bug 5164.
```
  b039ba24
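
A minimal sketch of using a single backlog macro at a listen() call site. The macro name and the 4096 value come from the commit message above; `open_listener_sketch()` and its loopback/ephemeral-port setup are assumptions made for this example.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Name and value as given in the commit message. */
#define SLURM_DEFAULT_LISTEN_BACKLOG 4096

/* Hypothetical helper. Opens a TCP listener on an ephemeral loopback port
 * and returns its descriptor, or -1 on error. */
static int open_listener_sketch(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a port */

	/* Every listen() call uses the same macro for its backlog. */
	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
	    listen(fd, SLURM_DEFAULT_LISTEN_BACKLOG) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

On Linux the kernel silently caps the requested backlog at net.core.somaxconn, so the effective queue length may be lower than the macro value.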