- 01 Feb, 2017 1 commit
-
-
Chansup Byun authored
-
- 31 Jan, 2017 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Alejandro Sanchez authored
-
- 30 Jan, 2017 3 commits
-
-
Morris Jette authored
Properly set SLURM_JOB_GPUS environment variable for Prolog. bug 3437
-
Morris Jette authored
Clear a job's "BeginTime" reason in a more timely fashion and/or prevent such jobs from being stuck in a PENDING state. There are multiple ways of clearing the reason, especially on a lightly loaded system, but the state can persist indefinitely on a heavily loaded system. bug 3368
-
Morris Jette authored
Fix logic for getting the expected start time of an existing job ID with an explicit begin time that is in the past. The previous logic compared that (past) begin time, rather than the current time, against the advanced reservations that would compete with the job.
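The corrected comparison can be sketched roughly as follows. This is an illustration only, not the Slurm source; `effective_start` and its parameters are hypothetical names, and the sketch assumes the fix amounts to never using a begin time earlier than the current time.

```c
#include <time.h>

/* Hypothetical helper (not Slurm code): choose the time that should be
 * compared against competing advanced reservations. A begin time that
 * is already in the past is replaced by the current time. */
time_t effective_start(time_t begin_time, time_t now)
{
    return (begin_time > now) ? begin_time : now;
}
```

With a begin time of 100 and a current time of 200, the sketch returns 200, so reservations are checked from the present rather than from the stale begin time.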
-
- 29 Jan, 2017 1 commit
-
-
Morris Jette authored
CRAY systems only: TaskPlugins must list task/cgroup before task/cray in order for the cgroup files to be created before task/cray runs. Without this change, the task/cray plugin frequently produces errors about the "mems" file being missing. The errors don't seem consistent, so this probably involves a race condition. Note that NERSC uses this order today and I changed read_config.c to produce a fatal error if the order is reversed.
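The ordering rule described above could be checked along these lines. This is a minimal sketch under stated assumptions: `plugin_order_ok` is a hypothetical function, not the check that was added to read_config.c, and it uses naive substring matching on a comma-separated plugin list.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical check (not the read_config.c code): returns true when
 * `first` appears before `second` in the plugin list, or when either
 * plugin is absent, in which case the ordering rule does not apply. */
bool plugin_order_ok(const char *list, const char *first, const char *second)
{
    assert(list && first && second);
    const char *a = strstr(list, first);
    const char *b = strstr(list, second);
    if (!a || !b)
        return true;
    return a < b;
}
```

Under these assumptions, the list "task/cgroup,task/cray" passes, while "task/cray,task/cgroup" fails and would correspond to the fatal error case.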
-
- 27 Jan, 2017 2 commits
-
-
Danny Auble authored
Turns out this never worked. Previously, if the protocol_version that was read in didn't match, the rpc_version given to unpack things was just 0. This change sets the rpc_version to what was stored.
-
Morris Jette authored
Revert logic originally added for bug 3166. Revisit as time permits. bug 3166
-
- 26 Jan, 2017 1 commit
-
-
Alejandro Sanchez authored
Bug 3431
-
- 25 Jan, 2017 4 commits
-
-
Morris Jette authored
burst_buffer/cray - Fix race condition that could cause multiple batch job launch requests resulting in downed nodes. bug 3366
-
Dominik Bartkiewicz authored
-
Danny Auble authored
This reverts commit b9bff82f.
-
Danny Auble authored
-
- 23 Jan, 2017 2 commits
-
-
Danny Auble authored
Bug 1599
-
Morris Jette authored
slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld daemon is running or node boot in progress. bug 3366
-
- 20 Jan, 2017 3 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
If a lower-version client tried to communicate with a higher-version controller, the dbd would return the controller's version and the client would use that version to talk to the controller. When the controller responded, the client wouldn't know how to unpack the higher-version message.
-
Danny Auble authored
Bug 2508
-
- 19 Jan, 2017 5 commits
-
-
Danny Auble authored
-
Morris Jette authored
If a job is allocated nodes that are powered down, reset the job start time when the nodes are ready and do not charge the job for power-up time. bug 3411
-
Danny Auble authored
-
Morris Jette authored
Add logic to monitor Uncorrectable Memory Errors (UME) and notify active jobs in case they run for a while afterwards. This copies logic from knl_generic to knl_cray. There may be a different UME monitoring system for Cray systems in the future. The original knl_generic development is in commit 56ff27da. bug 3341
-
Morris Jette authored
bug 3390
-
- 18 Jan, 2017 4 commits
-
-
Aaron Knister authored
Ensure eio objects get explicitly shut down when eio_handle_mainloop exits. Currently, whether that happens depends on the order in which eio_handle_mainloop and eio_signal_shutdown are called relative to each other when stepd is instructed to shut down. For the socket, use SHUT_RDWR instead of SHUT_RD. Using only SHUT_RD can cause srun to receive ECONNRESET if there is outstanding data that has been sent to stepd and that the task has not read. bug 3166
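The SHUT_RDWR call mentioned above is the standard POSIX `shutdown()` interface. A minimal sketch, assuming a connected stream socket; `shutdown_both` is a hypothetical wrapper and not part of the eio code:

```c
#include <sys/socket.h>

/* Hypothetical wrapper (not the eio implementation): close both the
 * read and write directions of a connected socket. Returns 0 on
 * success and -1 on error, as shutdown() does. */
int shutdown_both(int fd)
{
    return shutdown(fd, SHUT_RDWR);
}
```

After this call on one end of a connected stream socket pair, a read on the peer end sees end-of-file once any pending data has been consumed.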
-
Danny Auble authored
Bug 3398
-
Danny Auble authored
-
Morris Jette authored
bug 3099
-
- 17 Jan, 2017 5 commits
-
-
Danny Auble authored
This reverts commit e92b49d3.
-
Dominik Bartkiewicz authored
instead of also in the backfill scheduler.
-
Morris Jette authored
Avoid allocating resources to a job if its run time plus boot time (if needed) would extend into an advanced reservation. Note that if resources have not yet been selected (e.g. when determining availability of licenses or burst buffers with respect to advanced reservations), a reboot is assumed to be required. bug 3360
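The test described above can be sketched as follows. All names here are hypothetical and the real scheduler logic is more involved; the sketch only assumes that boot time is added whenever a reboot is needed or the nodes are not yet known.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical sketch (not Slurm code): would a job starting at `start`
 * still be running when a reservation begins at `resv_start`?
 * Boot time is counted if a reboot is needed, or if nodes have not
 * yet been selected, in which case a reboot is assumed. */
bool runs_into_reservation(time_t start, time_t time_limit, time_t boot_time,
                           bool nodes_selected, bool reboot_needed,
                           time_t resv_start)
{
    bool count_boot = !nodes_selected || reboot_needed;
    time_t end = start + time_limit + (count_boot ? boot_time : 0);
    return end > resv_start;
}
```

For example, a job with a 100-second limit and a 30-second boot time fits before a reservation starting at 120 only when its nodes are selected and no reboot is needed.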
-
Josh Samuelson authored
Bug 3405.
-
Josh Samuelson authored
acct_policy_job_runnable_pre_select() calls assoc_mgr_set_qos_tres_cnt() without tres READ_LOCK. Note that existing code does not modify the tres structures, so this cannot currently lead to a race condition. Bug 3406.
-
- 15 Jan, 2017 1 commit
-
-
Michael Robbert authored
job_submit/cnode was previously removed by commit 63bc71ed. Bug 3403.
-
- 14 Jan, 2017 1 commit
-
-
Morris Jette authored
Add BootTime configuration parameter to knl.conf file to optimize resource allocations with respect to required node reboots. Add node_features_p_boot_time() to node_features plugin to optimize scheduling with respect to node reboots. bug 3360
-
- 13 Jan, 2017 3 commits
-
-
Tim Wickberg authored
pack_bit_fmt can exceed 0xfffe characters on large systems and thus truncate, leading to: "error: Credential signature check: Credential data size mismatch". Add a warning to the pack_bit_fmt macro to highlight the issue. Bug 3257.
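The length limit in question can be illustrated with a small guard. This is a sketch under assumptions: `MAX_PACKED_STR_LEN` and `packed_str_would_truncate` are hypothetical names, the 0xfffe limit is taken from the message above, and the actual macro and warning in Slurm may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Limit quoted in the commit message; the macro name is hypothetical. */
#define MAX_PACKED_STR_LEN 0xfffe

/* Hypothetical guard (not the Slurm macro): reports whether a formatted
 * bitmap string is longer than the limit and would be truncated. */
bool packed_str_would_truncate(const char *s)
{
    assert(s != NULL);
    return strlen(s) > MAX_PACKED_STR_LEN;
}
```

A caller could log a warning when the guard returns true, instead of discovering the truncation later as a credential size mismatch.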
-
Morris Jette authored
scancel modified to note that no jobs satisfy the filter options when the --verbose option is used along with one or more job filters (e.g. "--qos="). bug 3072
-
Alejandro Sanchez authored
scancel would treat a non-numeric argument as the name of jobs to be cancelled (an undocumented feature). Cancelling jobs by name now requires the "--jobname=" command line argument. bug 3072
-