- 08 Dec, 2015 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 07 Dec, 2015 11 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
Conflicts: src/plugins/jobcomp/elasticsearch/jobcomp_elasticsearch.c
-
Morris Jette authored
Avoid logging every time that some interaction with the elasticsearch server happens. That would generate too many log messages. Fix alignment problem in the code without changing logic.
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
Added a limit of MAX_JOBS (100000) jobs enqueued and waiting to be indexed.
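As a rough illustration of the limit described above, here is a minimal Python sketch of a bounded queue. It is not the plugin's C code: the class name `PendingIndexQueue` and its methods are hypothetical, and the commit message does not say what happens once the limit is reached, so rejecting the new record is an assumption.

```python
from collections import deque

MAX_JOBS = 100000  # limit named in the commit message


class PendingIndexQueue:
    """Hypothetical queue of finished jobs waiting to be indexed."""

    def __init__(self, max_jobs=MAX_JOBS):
        self.max_jobs = max_jobs
        self._jobs = deque()

    def enqueue(self, job_record):
        """Return True if the record was queued, False if the queue is full.

        Assumption: a full queue rejects the new record. The commit
        message does not state the real behavior at the limit.
        """
        if len(self._jobs) >= self.max_jobs:
            return False
        self._jobs.append(job_record)
        return True

    def __len__(self):
        return len(self._jobs)
```

Capping the queue keeps memory use bounded if the server is unreachable for a long time.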
-
Alejandro Sanchez authored
-
Morris Jette authored
When slurmctld is reconfigured, it makes sure that every pending batch job has a script and, if not, marks the job as failed. If a script is present with no job, the file/directory is purged. This rewrite of the logic takes advantage of hash tables to improve the performance of the algorithm. It improves on the logic in commit 862f3c8e and will work with the version 15.08 code base as well. Time for slurmctld reconfig with 110,000 jobs/files: 37.8 sec with original code, 2.2 sec with commit 862f3c8e, 0.8 sec with latest code. Bug 2220.
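The idea of replacing nested loops with hash lookups can be sketched in a few lines of Python. This is only an illustration under assumptions, not the slurmctld implementation: the function name `reconcile` and its inputs (plain job IDs) are hypothetical, and Python sets stand in for the hash tables the commit mentions.

```python
def reconcile(pending_batch_job_ids, script_dir_job_ids):
    """Compare pending batch jobs against script directories found on disk.

    Returns (jobs_to_fail, files_to_purge):
      jobs_to_fail   - pending jobs that have no script directory
      files_to_purge - script directories that have no matching job
    """
    # Hash-based sets give average O(1) membership tests, so the whole
    # comparison is roughly linear instead of jobs x files.
    jobs = set(pending_batch_job_ids)
    files = set(script_dir_job_ids)
    jobs_to_fail = sorted(jobs - files)
    files_to_purge = sorted(files - jobs)
    return jobs_to_fail, files_to_purge
```

For example, `reconcile([1, 2, 3], [2, 3, 4])` reports job 1 as lacking a script and directory 4 as orphaned.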
-
Tim Wickberg authored
Usernames are comma separated, not colon delimited. Bug #2222. While here fix a few spelling mistakes.
-
- 06 Dec, 2015 9 commits
-
-
jette authored
-
Sourav Chakraborty authored
Fixed a typo that was causing slurm_cred_copy to segfault
-
jette authored
This adds some job memory limits to avoid tests failing with a configuration having SelectTypeParameters=CR_Memory and NO default memory size.
-
jette authored
-
jette authored
-
jette authored
Conflicts: src/plugins/jobcomp/elasticsearch/jobcomp_elasticsearch.c
-
Tim Wickberg authored
If a job is configured to be sent a signal when approaching its timeout and the job is requeued after the signal is sent, then resend the signal at the appropriate time after requeue. Bug 1415.
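One way to picture the behavior described above is a per-job "warning already sent" flag that is cleared on requeue. The Python sketch below is an assumption about the shape of the logic, not Slurm's code: the `Job` class, its fields, and `poll_warning` are all hypothetical names.

```python
class Job:
    """Hypothetical job record with a pre-timeout warning signal."""

    def __init__(self, time_limit, warn_time):
        self.time_limit = time_limit  # seconds the job may run
        self.warn_time = warn_time    # seconds before timeout to signal
        self.start = None             # None while the job is not running
        self.warn_sent = False

    def launch(self, now):
        self.start = now

    def requeue(self):
        # Clearing the flag is what lets the signal be sent again
        # during the next run of the job.
        self.start = None
        self.warn_sent = False

    def poll_warning(self, now):
        """Return True exactly once per run, when the warning is due."""
        if self.start is None or self.warn_sent:
            return False
        if now >= self.start + self.time_limit - self.warn_time:
            self.warn_sent = True
            return True
        return False
```

Without the reset in `requeue`, the flag from the first run would suppress the warning in every later run.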
-
jette authored
If the backfill scheduler cannot start at the appointed time, say because there are too many pending RPCs, then only sleep 1 second and try again rather than the full bf_interval (30 seconds by default).
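The effect of the short retry can be seen in a small simulation. This Python sketch is an illustration only: `backfill_start_times` and its parameters are hypothetical, time is modeled in whole seconds, and measuring the next interval from the actual start time is an assumption the commit message does not confirm.

```python
def backfill_start_times(busy_at, bf_interval=30, retry_delay=1, horizon=100):
    """Return the seconds at which a backfill cycle actually starts.

    busy_at: set of seconds at which the controller is too busy
             (for example, too many pending RPCs) to start a cycle.
    """
    starts = []
    t = bf_interval
    while t <= horizon:
        if t in busy_at:
            # Too busy: retry after a short delay, not a full interval.
            t += retry_delay
        else:
            starts.append(t)
            t += bf_interval
    return starts
```

With the controller busy at seconds 30 and 31, the first cycle starts at second 32 rather than being postponed to second 60.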
-
jette authored
The same logic was in both the backfill plugin and slurmctld/job_scheduler.c. No change in logic.
-
- 05 Dec, 2015 3 commits
-
-
Brian Christiansen authored
Bug 2130
-
Brian Christiansen authored
Adopted processes didn't have access to the job's devices. Bug 2130
-
Brian Christiansen authored
-
- 04 Dec, 2015 6 commits
-
-
Morris Jette authored
When slurmctld processes a SIGHUP or "scontrol reconfig", it makes sure that every pending batch job has a directory containing script and environment files, cancelling any job that lacks the files and purging any files without a job. When there are a large number of jobs (in any state) and a large number of files, the nested loops to do this can run for a very long time (minutes). This change streamlines the processing for job arrays, which now avoid creating their own script and environment files by using those of their job array master record.
-
Danny Auble authored
Full revert of c2fbf88f. Commit 13b64c35 had caught part of this, but this will revert it completely. The code just wasn't needed in modern Slurm. It appears the patch came from an older version of Slurm that didn't handle this correctly.
-
David Bigagli authored
This reverts commit 29f25688. Conflicts: NEWS. Looks like this isn't needed; commit c2fbf88f doesn't appear to be needed and is what is causing this issue. c2fbf88f was added from an older version of Slurm where this was already handled correctly in commit 815e5a44.
-
Danny Auble authored
Get messages like "slurmctld: error: _handle_assoc_tres_run_secs: job 33355: assoc 1 TRES node grp_used_tres_run_secs underflow, tried to remove 21 seconds when only 0 remained." which are bad.
-
Morris Jette authored
-
Alejandro Sanchez authored
If a store fails, do not retry for 30 seconds
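A hold-off like the one described above can be sketched as a small gate object that remembers the time of the last failure. This Python sketch is an assumption about the mechanism, not the plugin's code: `IndexRetryGate` and its methods are hypothetical, and the caller supplies the current time in seconds.

```python
RETRY_WAIT = 30  # seconds to wait after a failed store, per the commit message


class IndexRetryGate:
    """Hypothetical gate that blocks store attempts after a failure."""

    def __init__(self, wait=RETRY_WAIT):
        self.wait = wait
        self._last_failure = None

    def may_attempt(self, now):
        """Return True if a store may be attempted at time `now`."""
        if self._last_failure is None:
            return True
        return now - self._last_failure >= self.wait

    def record_failure(self, now):
        self._last_failure = now

    def record_success(self):
        self._last_failure = None
```

Holding off after a failure avoids hammering a server that is down or overloaded.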
-
- 03 Dec, 2015 5 commits
-
-
Morris Jette authored
Cray job NHC delayed until after burst buffer released and epilog completes on all allocated nodes. bugs 2099 and 2192
-
Brian Christiansen authored
-
Morris Jette authored
Add new buffer state of BB_STATE_POST_RUN indicating the post_run operation has been started. Add new bb_p_job_test_post_run function to test status of post_run operation. Execute post_run operation before the stage_out operation to optimize use of compute nodes (rather than burst buffer space).
-
Morris Jette authored
Release a job's allocated licenses only after epilog runs on all nodes rather than at start of termination process. bug 2192
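The ordering described above, with licenses returned only once the epilog has finished on every node, can be sketched as follows. This is an illustrative Python model under assumptions, not slurmctld's implementation: `TerminatingJob`, `epilog_complete`, and the `pool` dictionary of free license counts are hypothetical.

```python
class TerminatingJob:
    """Hypothetical record of a job whose epilog is still running."""

    def __init__(self, nodes, licenses):
        self.pending_epilog = set(nodes)  # nodes still running the epilog
        self.licenses = dict(licenses)    # license name -> count held
        self.licenses_released = False

    def epilog_complete(self, node, pool):
        """Note epilog completion on `node`; release licenses after the last one.

        pool: dict of license name -> count currently free.
        """
        self.pending_epilog.discard(node)
        if not self.pending_epilog and not self.licenses_released:
            for name, count in self.licenses.items():
                pool[name] = pool.get(name, 0) + count
            # Guard against releasing twice on a duplicate completion.
            self.licenses_released = True
```

Until the final node reports in, the pool is unchanged, so another job cannot take the licenses while the first job is still cleaning up.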
-
Morris Jette authored
sched/backfill - Delay backfill scheduler for completing jobs only if CompleteWait configuration parameter is set (make code match documentation).
-