- 19 Jan, 2017 3 commits
-
-
Morris Jette authored
If a job is allocated nodes that are powered down, reset the job start time when the nodes are ready and do not charge the job for power-up time. bug 3411
-
Morris Jette authored
Add logic to monitor Uncorrectable Memory Errors (UME) and notify active jobs in case they run for a while afterwards. This copies logic from knl_generic to knl_cray. There may be a different UME monitoring system for Cray systems in the future. The original knl_generic development is in commit 56ff27da bug 3341
-
Morris Jette authored
bug 3390
-
- 18 Jan, 2017 2 commits
-
-
Aaron Knister authored
Ensure eio objects get explicitly shut down when eio_handle_mainloop exits. Currently whether that happens depends on the order in which eio_handle_mainloop and eio_signal_shutdown get called relative to each other when stepd is instructed to shut down. Use SHUT_RDWR instead of SHUT_RD on the socket. Using only SHUT_RD can cause srun to receive ECONNRESET if there is outstanding data that has been sent to stepd that the task has not read. bug 3166
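The SHUT_RDWR change described above can be illustrated with a minimal sketch. It assumes a connected stream socket; `shutdown_both` is a hypothetical helper name and is not taken from the eio code.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper, not the eio code: shut down both directions of a
 * connected socket so the peer sees an orderly end of stream. */
int shutdown_both(int fd)
{
    assert(fd >= 0);
    /* SHUT_RDWR disables further receives and further sends on fd. */
    return shutdown(fd, SHUT_RDWR);
}
```

After this call the peer reads end-of-file on its side instead of waiting for more data.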
-
Morris Jette authored
bug 3099
-
- 17 Jan, 2017 5 commits
-
-
Danny Auble authored
This reverts commit e92b49d3.
-
Dominik Bartkiewicz authored
instead of also in the backfill scheduler.
-
Morris Jette authored
Avoid allocating resources to a job in the event that its run time plus boot time (if needed) extend into an advanced reservation. Note that if resources have not yet been selected (e.g. determining availability of licenses or burst buffers with respect to advanced reservations) then assume a reboot will be required. bug 3360
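The check described above amounts to simple time arithmetic. The sketch below is only an illustration: every name in it is hypothetical, and treating a job that ends exactly at the reservation start as acceptable is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* All names are hypothetical; this is not the slurmctld code. */
bool job_ends_before_resv(time_t start, time_t time_limit_sec,
                          time_t boot_time_sec, bool resources_selected,
                          bool reboot_needed, time_t resv_start)
{
    assert(time_limit_sec >= 0 && boot_time_sec >= 0);
    /* If resources are not selected yet, assume a reboot will be needed. */
    bool charge_boot = !resources_selected || reboot_needed;
    time_t end = start + time_limit_sec + (charge_boot ? boot_time_sec : 0);
    /* Assumption: ending exactly at the reservation start is allowed. */
    return end <= resv_start;
}
```

With a 3600 second time limit and a 600 second boot time, a reservation starting 4000 seconds from now is only safe when no reboot is needed.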
-
Josh Samuelson authored
Bug 3405.
-
Josh Samuelson authored
acct_policy_job_runnable_pre_select() calls assoc_mgr_set_qos_tres_cnt() without tres READ_LOCK. Note that existing code does not modify the tres structures, so this cannot currently lead to a race condition. Bug 3406.
-
- 15 Jan, 2017 1 commit
-
-
Michael Robbert authored
job_submit/cnode was previously removed by commit 63bc71ed. Bug 3403.
-
- 14 Jan, 2017 1 commit
-
-
Morris Jette authored
Add BootTime configuration parameter to knl.conf file to optimize resource allocations with respect to required node reboots. Add node_features_p_boot_time() to node_features plugin to optimize scheduling with respect to node reboots. bug 3360
-
- 13 Jan, 2017 3 commits
-
-
Tim Wickberg authored
pack_bit_fmt can exceed 0xfffe characters on large systems and thus truncate leading to: "error: Credential signature check: Credential data size mismatch" Add warning to the pack_bit_fmt macro to highlight issue. Bug 3257.
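A length check of the kind this warning points toward might look like the sketch below. The 0xfffe limit comes from the commit message; `PACK_STR_MAX` and `checked_pack_len` are hypothetical names, not part of the Slurm pack API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Limit taken from the commit message above. */
#define PACK_STR_MAX 0xfffe

/* Hypothetical check: return the length to pack, or -1 if the string
 * would not fit and would therefore be truncated. */
long checked_pack_len(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len > PACK_STR_MAX)
        return -1;
    return (long)len;
}
```

A caller could log a warning or pick a different encoding when the function returns -1, rather than silently sending truncated data.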
-
Morris Jette authored
scancel modified to note that no jobs satisfy the filter options when the --verbose option is used along with one or more job filters (e.g. "--qos="). bug 3072
-
Alejandro Sanchez authored
scancel would treat a non-numeric argument as the name of jobs to be cancelled (a non-documented feature). Cancelling jobs by name now requires the "--jobname=" command line argument. bug 3072
-
- 12 Jan, 2017 3 commits
-
-
Dominik Bartkiewicz authored
Replace version borrowed from Linux 2.4.9 with newer version in 4.9.
-
Morris Jette authored
burst_buffer/cray - Avoid "pre_run" operation if not using buffer (i.e. just creating or deleting a persistent burst buffer). bug 3391
-
Morris Jette authored
Previous job state information was "PENDING" rather than "REQUEUED" for each job requeued due to a burst buffer error. bug 3388
-
- 11 Jan, 2017 4 commits
-
-
Danny Auble authored
scheduling a Datawarp job. The assoc_mgr lock needs to happen before the bb_state.bb_mutex. One place this could cause deadlock is from src/slurmctld/controller.c _accounting_cluster_ready() which calls clusteracct_storage_g_cluster_tres which in turn calls bb_g_job_set_tres_cnt which calls bb_p_job_set_tres_cnt which will lock the bb_mutex after the assoc_mgr is already locked. Bug 3389
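The lock-ordering rule stated above can be sketched with two plain pthread mutexes. The variables and the function are hypothetical stand-ins for the assoc_mgr lock and bb_state.bb_mutex, not the real Slurm objects.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the assoc_mgr lock and bb_state.bb_mutex. */
pthread_mutex_t assoc_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t bb_lock = PTHREAD_MUTEX_INITIALIZER;
int tres_cnt = 0;

/* Every code path takes assoc_lock first, then bb_lock, and releases in
 * the reverse order, so two threads can never each hold one lock while
 * waiting for the other. */
int set_tres_cnt(int value)
{
    int rc = pthread_mutex_lock(&assoc_lock);
    assert(rc == 0);
    rc = pthread_mutex_lock(&bb_lock);
    assert(rc == 0);
    tres_cnt = value;
    pthread_mutex_unlock(&bb_lock);
    pthread_mutex_unlock(&assoc_lock);
    return value;
}
```

A deadlock becomes possible as soon as one path takes the two locks in the opposite order, which is the situation the commit message describes.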
-
Danny Auble authored
Bug 3331
-
Dominik Bartkiewicz authored
Cache results of bit_set_count() calls. Bug 3393.
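One way to cache a population count is to store it next to the bitmap and invalidate it on every modification. The sketch below is a generic illustration with hypothetical types and names; it does not use the Slurm bitstring API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CB_WORDS 4  /* 256 bits; size chosen only for the example */

typedef struct {
    uint64_t bits[CB_WORDS];
    int cached_count;   /* -1 means "not computed yet" */
} cached_bitmap_t;

void cb_init(cached_bitmap_t *b)
{
    memset(b->bits, 0, sizeof(b->bits));
    b->cached_count = -1;
}

void cb_set(cached_bitmap_t *b, int pos)
{
    assert(pos >= 0 && pos < CB_WORDS * 64);
    b->bits[pos / 64] |= (uint64_t)1 << (pos % 64);
    b->cached_count = -1;   /* any change invalidates the cache */
}

int cb_count(cached_bitmap_t *b)
{
    if (b->cached_count < 0) {
        int n = 0;
        for (int i = 0; i < CB_WORDS; i++) {
            /* Clear the lowest set bit until the word is empty. */
            for (uint64_t w = b->bits[i]; w != 0; w &= w - 1)
                n++;
        }
        b->cached_count = n;
    }
    return b->cached_count;
}
```

Repeated calls to `cb_count` between modifications return the stored value without rescanning the bitmap.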
-
Morris Jette authored
The old logic would result in test16.4 failing some of the time. The failure was caused by the sattach command attaching to a job step before the original srun command received a RESPONSE_LAUNCH_TASKS message. That message would then be sent to the salloc command. Since srun never got the message, it would hang. This change does not mark the job step as RUNNING until after the original srun gets sent the RESPONSE_LAUNCH_TASKS message and sattach requests are blocked until that time.
-
- 09 Jan, 2017 6 commits
-
-
Morris Jette authored
backfill scheduler: Stop trying to determine expected start time for a job after 2 seconds of wall time. This can happen if there are many running jobs and a pending job cannot be started soon. bug 3373
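A wall-time cutoff like the one described above can be sketched as a loop that checks elapsed time before each unit of work. The function name, its parameters, and the placeholder work are hypothetical; the backfill scheduler's actual loop is not shown in the source.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in: process up to n_jobs items, but give up once
 * budget_sec seconds of wall time have elapsed. Returns the number of
 * items processed. */
int scan_with_budget(int n_jobs, double budget_sec)
{
    assert(n_jobs >= 0);
    time_t start = time(NULL);
    int done = 0;
    for (int i = 0; i < n_jobs; i++) {
        /* Stop once the wall-clock budget is used up. */
        if (difftime(time(NULL), start) >= budget_sec)
            break;
        done++;   /* placeholder for the real per-job work */
    }
    return done;
}
```

Calling `scan_with_budget(n, 2.0)` mirrors the 2 second limit from the commit message. `time()` has one-second resolution, so a finer-grained clock would be needed for a tighter bound.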
-
Tim Wickberg authored
This reverts commit 17549a03.
-
Dominik Bartkiewicz authored
Bug 3364.
-
Tim Shaw authored
Configuring slurm with munge manually installed in /usr/local, with the library in /usr/local/lib but an empty /usr/local/lib64 directory will cause the munge plugins to look for libmunge.so in the wrong place. The munge.spec file has historically provided libmunge.so as part of munge-devel, which Slurm depends on already.
-
Morris Jette authored
Add SchedulerParameters configuration parameter of "default_gbytes", which treats a numeric-only value (no suffix) for memory and tmp disk space as being in units of gigabytes. Mostly for compatibility with LSF.
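The effect of such an option on value parsing can be sketched as follows. `parse_mem_mb` is a hypothetical function, not Slurm's parser; it assumes that a bare number otherwise means megabytes and accepts only M, G, and T suffixes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical parser: return the size in megabytes, or -1 on error.
 * A value without a suffix is megabytes unless default_gbytes is set,
 * in which case it is read as gigabytes. */
long long parse_mem_mb(const char *s, int default_gbytes)
{
    assert(s != NULL);
    char *end;
    long long val = strtoll(s, &end, 10);
    long long mult;

    if (end == s || val < 0)
        return -1;
    switch (*end) {
    case '\0':
        return default_gbytes ? val * 1024 : val;
    case 'M': case 'm': mult = 1; break;
    case 'G': case 'g': mult = 1024; break;
    case 'T': case 't': mult = 1024LL * 1024; break;
    default:
        return -1;
    }
    if (end[1] != '\0')
        return -1;   /* trailing characters after the suffix */
    return val * mult;
}
```

An explicit suffix always wins, so "512M" stays 512 MB even when `default_gbytes` is enabled.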
-
Morris Jette authored
Move BatchScript to end of each job's information when using "scontrol -dd show job" to make it more readable.
-
- 06 Jan, 2017 1 commit
-
-
Tim Wickberg authored
Can cause random assertion failures and core dumps due to differences between definition in slurm.h and bitstring.h. Inadvertently introduced in 8967a4e7.
-
- 05 Jan, 2017 2 commits
-
-
Alejandro Sanchez authored
17.02 API has been changed so that node Port parameter is now packed and unpacked on REQUEST_NODE_INFO RPC. Some client requests such as 'scontrol write config', 'scontrol show node' will display the port if different to SlurmdPort. Port parameter is also available now to 'sinfo' if explicitly requested through '-O port' and to the 'sview' full node info. Always send SlurmdPort in RPC even when in multiple-slurmd mode. Bug 3240.
-
Doug Jacobsen authored
Bug 3376.
-
- 04 Jan, 2017 5 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Fix security issue caused by insecure file path handling triggered by the failure of a Prolog script. To exploit this a user needs to anticipate or cause the Prolog to fail for their job. (This commit is slightly different from the fix to the 15.08 branch.) CVE-2016-10030.
-
Tim Wickberg authored
-
Tim Wickberg authored
Fix security issue caused by insecure file path handling triggered by the failure of a Prolog script. To exploit this a user needs to anticipate or cause the Prolog to fail for their job. CVE-2016-10030.
-
- 03 Jan, 2017 3 commits
-
-
Dominik Bartkiewicz authored
-
Dominik Bartkiewicz authored
Prevent "stray" jobs from using resources when the srun/salloc will never launch the actual compute tasks. Bug 3344.
-
Dominik Bartkiewicz authored
PluginDir is allowed to be a PATH-style list of directories; remove incorrect test of the variable as if it were a single directory and comment that the check for that is elsewhere. Bug 3361.
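Handling a PATH-style value means splitting it before testing any single directory. The sketch below assumes ':' as the separator; `split_dir_list` and `DIR_MAX` are hypothetical names, not Slurm functions.

```c
#include <assert.h>
#include <string.h>

#define DIR_MAX 256

/* Copy each ':'-separated entry of list into out[]; empty entries are
 * skipped. Returns the number of entries stored, or -1 if an entry is
 * too long or there are more than max entries. */
int split_dir_list(const char *list, char out[][DIR_MAX], int max)
{
    assert(list != NULL && out != NULL);
    int n = 0;
    const char *p = list;

    while (*p) {
        size_t len = strcspn(p, ":");
        if (len > 0) {
            if (n >= max || len >= DIR_MAX)
                return -1;
            memcpy(out[n], p, len);
            out[n][len] = '\0';
            n++;
        }
        p += len;
        if (*p == ':')
            p++;
    }
    return n;
}
```

Each resulting entry can then be checked individually, which a test that treats the whole string as one directory cannot do.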
-
- 29 Dec, 2016 1 commit
-
-
Morris Jette authored
Add SchedulerParameters option of spec_cores_first to select specialized cores from the lowest rather than highest number cores and sockets. bug 3349
-