- 27 Jul, 2018 9 commits
-
-
Tim Wickberg authored
Collapse error messages onto one line where they weren't previously, and fix up argument indentation while here.
find . -name '*\.[ch]' -exec sed -i s/error\(\[\ \]*\"sched:\ /sched_error\(\"/ {} \;
-
Tim Wickberg authored
These calls will replace all instances of:
error("sched: message...")
with:
sched_error("message...")
This allows the _log_msg function to stop calling xstrncmp() on every log message while holding log_lock when SchedLogLevel > 0. That behavior has effectively removed all concurrency from slurmctld when SchedLogLevel is enabled, and prevented larger-scale systems from being able to start up. See bug 3746 for an example of this.
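The idea of choosing the log destination by entry point, rather than by inspecting each message, can be sketched roughly as follows. This is an illustration only and not the Slurm implementation: error_sketch(), sched_error_sketch(), and format_msg() are hypothetical names, and it is an assumption here that the scheduler variant tags the message itself.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical shared formatter. The caller states whether this is a
 * scheduler message, so no string comparison is needed per message. */
int format_msg(char *buf, size_t len, bool sched, const char *fmt, va_list ap)
{
	int off = snprintf(buf, len, "%s", sched ? "sched: " : "");

	if (off < 0 || (size_t) off >= len)
		return off;
	return off + vsnprintf(buf + off, len - (size_t) off, fmt, ap);
}

/* Hypothetical general-purpose entry point. */
int error_sketch(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int rc;

	va_start(ap, fmt);
	rc = format_msg(buf, len, false, fmt, ap);
	va_end(ap);
	return rc;
}

/* Hypothetical scheduler entry point: the routing decision is made at
 * the call site instead of by scanning the format string for a prefix. */
int sched_error_sketch(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int rc;

	va_start(ap, fmt);
	rc = format_msg(buf, len, true, fmt, ap);
	va_end(ap);
	return rc;
}
```

With this shape, whatever lock protects the log only needs to cover the write itself, since nothing has to examine the message text to decide where it goes.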
-
Tim Wickberg authored
Update all calling locations and, as this is a static function, change the name from log_msg to _log_msg while here.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Dominik Bartkiewicz authored
scheduling cycle. Primarily caused by EnforcePartLimits=ALL. Bug 5452
-
Tim Wickberg authored
-
Tim Wickberg authored
This requires all calls to slurm_addto_char_list() to change to slurm_addto_char_list_with_case().
-
Tim Wickberg authored
Rather than replace all existing references, keep the existing function name and signature, and break off the body into slurm_addto_char_list_with_case() with the lower_case_normalization option set to true. Locations that need to disable this case normalization will thus have access to this same function, just by the longer name with the extra boolean.
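The wrapper pattern described here can be sketched with a much simpler stand-in. The functions copy_name() and copy_name_with_case() below are hypothetical and are not the real slurm_addto_char_list() signatures. Only the shape is the point: the old name keeps its signature and forwards to the longer-named function with normalization enabled.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical split-out body: copy src into dst, lower-casing each
 * character only when lower_case_normalization is true.
 * Returns the number of characters copied. */
size_t copy_name_with_case(char *dst, size_t dst_len, const char *src,
			   bool lower_case_normalization)
{
	size_t i = 0;

	if (!dst || !src || dst_len == 0)
		return 0;
	for (; src[i] && (i + 1 < dst_len); i++) {
		unsigned char c = (unsigned char) src[i];
		dst[i] = lower_case_normalization ? (char) tolower(c) : (char) c;
	}
	dst[i] = '\0';
	return i;
}

/* Existing name and signature are preserved, so existing callers do
 * not change; they implicitly get normalization enabled. */
size_t copy_name(char *dst, size_t dst_len, const char *src)
{
	return copy_name_with_case(dst, dst_len, src, true);
}
```

Callers that must preserve case call the longer name directly and pass false for the extra boolean.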
-
- 25 Jul, 2018 8 commits
-
-
Tim Wickberg authored
fdopen returns NULL on failure. It is impossible for (FILE *) to be less than zero, so this check would have always succeeded even in the event of a failure. CID 187087.
-
Tim Wickberg authored
-
Tim Wickberg authored
CID 187179.
-
Tim Wickberg authored
CID 187085.
-
Marshall Garey authored
not find any jobs. The actual syntax for that sacctmgr command is
sacctmgr show runawayjobs [clustername]
but sometimes people mistakenly run
sacctmgr show runaway jobs
in which case it would look for runaway jobs on a cluster named "jobs" and print the message "Runaway Jobs: No runaway jobs found" because the clustername was wrong. This patch changes that message to say "Runaway Jobs: No runaway jobs found on cluster jobs" so that people know they made a mistake in syntax.
-
Dominik Bartkiewicz authored
Bug 5098
-
Tim Wickberg authored
pam_slurm_adopt is way better, please use that instead.
-
Tim Wickberg authored
This fixes the previously mentioned double-locking issue. These locks were originally introduced in dd6cdddb. Commit b298df5d then moved the tres setup over to _load_job_state instead, and these are no longer required here. Bug 5469.
-
- 24 Jul, 2018 7 commits
-
-
Brian Christiansen authored
-
Tim Wickberg authored
assoc_mgr_clear_used_info() already manages its own locks, so wait to lock until after that's been called. Bug 5469.
-
Tim Wickberg authored
Leads to deadlock in 'scontrol reconfigure':
0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
1 0x00007f61b1880801 in __GI_abort () at abort.c:79
2 0x00007f61b1fe2bd7 in __xassert_failed (expr=expr@entry=0x7f61b1ff2989 "_store_locks(locks)", file=file@entry=0x7f61b1ff2e80 "../../../../slurm/src/common/assoc_mgr.c", line=line@entry=2114, func=func@entry=0x7f61b1ff4bf8 <__func__.17903> "assoc_mgr_lock") at ../../../../slurm/src/common/xassert.c:57
3 0x00007f61b1ead017 in assoc_mgr_lock (locks=locks@entry=0x7f61ad6da610) at ../../../../slurm/src/common/assoc_mgr.c:2114
4 0x00005632675e4509 in _adjust_limit_usage (type=type@entry=0, job_ptr=job_ptr@entry=0x7f6190015000) at ../../../../slurm/src/slurmctld/acct_policy.c:719
5 0x00005632675e4b49 in acct_policy_add_job_submit (job_ptr=job_ptr@entry=0x7f6190015000) at ../../../../slurm/src/slurmctld/acct_policy.c:2515
6 0x00005632676767c7 in _restore_job_dependencies () at ../../../../slurm/src/slurmctld/read_config.c:2612
7 read_slurm_conf (recover=recover@entry=1, reconfig=reconfig@entry=true) at ../../../../slurm/src/slurmctld/read_config.c:1310
8 0x0000563267668d0f in _slurm_rpc_reconfigure_controller (msg=msg@entry=0x7f61ad6dade0) at ../../../../slurm/src/slurmctld/proc_req.c:3645
9 0x000056326766f646 in slurmctld_req (msg=0x7f61ad6dade0, arg=0x7f619c000dc0) at ../../../../slurm/src/slurmctld/proc_req.c:425
10 0x00005632675efbc9 in _service_connection (arg=<optimized out>) at ../../../../slurm/src/slurmctld/controller.c:1285
11 0x00007f61b1c386db in start_thread (arg=0x7f61ad6db700) at pthread_create.c:463
12 0x00007f61b196188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Bug 5469. This reverts commit 4b7ad3b6.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Broderick Gardner authored
Bug 5248.
-
- 23 Jul, 2018 5 commits
-
-
Tim Wickberg authored
CID 187110.
-
Tim Wickberg authored
Avoid a double-locking issue by moving the locks previously required by assoc_mgr_clear_used_info() up in the calling path to _restore_job_dependencies(). Add commented-out lock annotations to use later.
-
Tim Wickberg authored
Nothing ever checks this return code anyways.
-
Tim Wickberg authored
This also cleans up several locations that could try to repeatedly call close(). See prior commit for further details on why that is best avoided.
-
Tim Wickberg authored
Quoting part of the close() man page: Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.
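One way to follow that advice is to close each descriptor exactly once and invalidate the variable immediately, whatever close() returns. The helper below is a hypothetical sketch and is not taken from the Slurm tree.

```c
#include <unistd.h>

/* Hypothetical helper: close *fd at most once. The descriptor is marked
 * invalid even when close() fails, so no caller can retry and hit a
 * descriptor number that another thread has since reused. */
int close_once(int *fd)
{
	int rc = 0;

	if (fd && (*fd >= 0)) {
		rc = close(*fd);	/* deliberately not retried on error */
		*fd = -1;
	}
	return rc;
}
```

A second call on the same variable is then a harmless no-op rather than a close() of whatever now owns that descriptor number.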
-
- 21 Jul, 2018 4 commits
-
-
Tim Wickberg authored
Set DISPLAY to SLURM_X11_SETUP_FAILED to make it clear that the tunnel setup has failed. This at least gives the user a hint as to why their X11 apps aren't working, although further refinement should be done later:
tim@zoidberg:~$ srun --x11 xclock
Error: Can't open display: SLURM_X11_SETUP_FAILED
srun: error: node001: task 0: Exited with exit code 1
-
Tim Wickberg authored
Creates a local XAUTHORITY file in TmpFS on the node, and deletes it upon job termination. This avoids file locking contention on ~/.Xauthority in the user's home directory. Bug 3647.
-
Tim Wickberg authored
-
Tim Wickberg authored
Build out sufficient plumbing such that a temporary XAUTHORITY file can be used that is local to the compute node, thus avoiding lock contention on ~/.Xauthority on parallel filesystems. This commit only includes the requisite plumbing to pass this around. If this is not used, a null string results, and the XAUTHORITY env var will not be forced into the user environment. Add support and fix the modified API call in pam_slurm_adopt while here. Bug 3647.
-
- 20 Jul, 2018 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Tim Wickberg authored
-
- 19 Jul, 2018 4 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec and slurm.spec-legacy as well.
-
Tim Wickberg authored
-
Tim Wickberg authored
The lower limit of 1024 may be too short for srun with large-scale jobs, and lead to problems processing task completion messages in a timely fashion. Rather than adjust that, unify the two separate macros into SLURM_DEFAULT_LISTEN_BACKLOG with the higher 4096 value. Bug 5164.
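A minimal sketch of how a single shared backlog macro would be used when opening a listening socket. The function open_listener() is hypothetical. Only the macro name and the 4096 value come from the commit message. Note that the kernel may still cap the effective backlog (on Linux, via net.core.somaxconn).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Single definition shared by every listener. */
#define SLURM_DEFAULT_LISTEN_BACKLOG 4096

/* Hypothetical helper: listen on an ephemeral loopback port.
 * Returns the listening descriptor, or -1 on failure. */
int open_listener(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	if ((bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) ||
	    (listen(fd, SLURM_DEFAULT_LISTEN_BACKLOG) < 0)) {
		close(fd);
		return -1;
	}
	return fd;
}
```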
-