- 16 Mar, 2016 7 commits
-
-
Morris Jette authored
Previous gang scheduling logic maintained information about resources originally allocated to the job and made scheduling decisions on that basis. Bug 2494.
-
Morris Jette authored
Update gang scheduling table when job manually suspended or resumed. Prior logic could mess up job suspend/resume sequencing. Bug 2494.
-
Danny Auble authored
time. https://bugs.schedmd.com/show_bug.cgi?id=2547 The code just wasn't fully baked and was probably written before a lot of the other supporting code was done, i.e. assoc_mgr_set_assoc|qos_tres_cnt were added specifically for this kind of thing. Many of the usage structures were never realloc'd, and the tres_cnt local to each QOS and association wasn't updated. So all in all pretty bad code - bad Danny. This makes sure all of this is set up correctly and no memory corruption happens.
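In outline, the fix is the usual grow-and-zero pattern. A minimal sketch with hypothetical names (not the actual assoc_mgr structures): when the system-wide TRES count grows, each usage array must be realloc'd and its cached count updated, otherwise later indexing runs past the old allocation.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical per-association usage record. */
    typedef struct {
        int      tres_cnt;    /* cached TRES count for this record */
        double  *usage_raw;   /* one slot per tracked TRES */
    } assoc_usage_t;

    /* Grow the usage array when new TRES are added system-wide.
     * Skipping the realloc, or leaving tres_cnt stale, is exactly
     * the kind of corruption the commit describes. */
    static int assoc_usage_grow(assoc_usage_t *u, int new_cnt)
    {
        double *tmp;
        if (new_cnt <= u->tres_cnt)
            return 0;                     /* nothing to do */
        tmp = realloc(u->usage_raw, new_cnt * sizeof(double));
        if (!tmp)
            return -1;
        /* zero only the newly added slots */
        memset(tmp + u->tres_cnt, 0,
               (new_cnt - u->tres_cnt) * sizeof(double));
        u->usage_raw = tmp;
        u->tres_cnt  = new_cnt;           /* keep the cached count in sync */
        return 0;
    }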
-
Morris Jette authored
Generate burst buffer use completion email immediately after teardown completes rather than at job purge time (likely minutes later). Bug 2539.
-
Morris Jette authored
Change burst buffer use completion message from "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
-
Brian Christiansen authored
-
Brian Christiansen authored
Bug 2396.
-
- 15 Mar, 2016 2 commits
-
-
Alejandro Sanchez authored
-
Tim Wickberg authored
Bug 2543.
-
- 14 Mar, 2016 2 commits
-
-
Danny Auble authored
on only one port like TopologyParam=NoInAddrAny does for everything else.
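For context, TopologyParam=NoInAddrAny means binding to a specific local address rather than the INADDR_ANY wildcard. A minimal sketch of the difference (hypothetical helper, not srun's actual code):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Bind a listening socket to one given IPv4 address
     * (NoInAddrAny behavior) rather than the wildcard. */
    static int bind_one_addr(in_addr_t addr, uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sin;
        if (fd < 0)
            return -1;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_port        = htons(port);
        sin.sin_addr.s_addr = addr;   /* e.g. this node's address,
                                       * not htonl(INADDR_ANY) */
        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }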
-
Tim Wickberg authored
There's no /proc on *BSD, and BSD handles OOM in a completely different way.
-
- 12 Mar, 2016 1 commit
-
-
Morris Jette authored
-
- 11 Mar, 2016 3 commits
-
-
Morris Jette authored
-
Tim Wickberg authored
Return [0-100:2] formatting, rather than [0,2,4,6,8,...], when using a step function. Was inadvertently broken in 14.11 with commit 5ffdca92. Bug 2535.
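A minimal sketch of the stride detection this implies (illustrative only, not the actual hostlist code):

    #include <stdio.h>

    /* Print a sorted, strictly increasing run of numbers compactly:
     * "[lo-hi:step]" for a constant stride > 1, "[lo-hi]" for a
     * stride of 1, otherwise a comma-separated list. */
    static void print_range(const int *v, int n)
    {
        int i, step;
        if (n == 1) { printf("[%d]\n", v[0]); return; }
        step = v[1] - v[0];
        for (i = 2; i < n; i++)
            if (v[i] - v[i - 1] != step)
                step = 0;                 /* stride is not constant */
        if (step == 1)
            printf("[%d-%d]\n", v[0], v[n - 1]);
        else if (step > 1)
            printf("[%d-%d:%d]\n", v[0], v[n - 1], step);
        else {
            printf("[%d", v[0]);
            for (i = 1; i < n; i++)
                printf(",%d", v[i]);
            printf("]\n");
        }
    }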
-
Morris Jette authored
Need higher count for KNL processor.
-
- 10 Mar, 2016 5 commits
-
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job as non-usable on job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set; after the job node health check (NHC) completes, the value is set to cleaned. If the job is requeued, then on its second (or subsequent) termination the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs "error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen", and returns without launching the NHC. Since termination of the job NHC triggers the release of job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384.
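Reduced to a sketch with hypothetical names (the real logic lives in the select/cray plugin), the fix amounts to re-running NHC for a requeued job instead of treating the stale flag as an error:

    /* Hypothetical flag values and job record. */
    enum clean_state { CLEAN_NONE, CLEANING, CLEANED };

    struct job {
        int job_id;
        int requeued;                 /* job was requeued and ran again */
        enum clean_state clean_flag;
    };

    /* Called at job termination to launch the node health check. */
    static void job_fini(struct job *j)
    {
        if (j->clean_flag == CLEANED && !j->requeued)
            return;                   /* NHC already ran; nothing to do */
        /* A requeued job terminating again must run NHC again rather
         * than treating the stale CLEANED flag as a fatal error. */
        j->clean_flag = CLEANING;
        /* ... spawn NHC; on completion set j->clean_flag = CLEANED
         * and release the job's CPUs, memory, and GRES ... */
    }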
-
David Gloe authored
An error in slurmconfgen_smw.py caused it to parse the nic as the nid. On some systems those values differ, causing the generated slurm.conf file to be incorrect. Bug 2532.
-
Bill Brophy authored
route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
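The usual shape of such a fix, sketched with hypothetical names (not the route plugin's actual code): funnel all callers through a one-time initializer so no thread can observe half-built state.

    #include <pthread.h>

    static pthread_once_t init_once = PTHREAD_ONCE_INIT;
    static int *split_table;              /* hypothetical shared state */
    static int  table_storage[16];

    /* Build the shared table exactly once, before any reader. */
    static void do_init(void)
    {
        for (int i = 0; i < 16; i++)
            table_storage[i] = i;
        split_table = table_storage;      /* publish fully built state */
    }

    /* Every entry point funnels through pthread_once(), so two
     * threads arriving together cannot race on initialization. */
    static int *get_split_table(void)
    {
        pthread_once(&init_once, do_init);
        return split_table;
    }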
-
Tim Wickberg authored
Was incorrectly displaying "(null)" even when loaded successfully.
-
Morris Jette authored
-
- 07 Mar, 2016 1 commit
-
-
Dominik Bartkiewicz authored
Added new job dependency type of "aftercorr", which will start a task of a job array after the corresponding task of another job array completes. Bug 2460.
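For example (job IDs hypothetical): submit the first array with sbatch --array=1-10 pre.sh (say it becomes job 1001), then submit sbatch --array=1-10 --dependency=aftercorr:1001 post.sh; task N of the second array starts only once task N of job 1001 completes.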
-
- 05 Mar, 2016 2 commits
-
-
Danny Auble authored
would only track gres/gpu; now it will track both gres/gpu and gres/gpu:tesla as separate GRES if configured like AccountingStorageTRES=gres/gpu,gres/gpu:tesla
-
Danny Auble authored
-
- 04 Mar, 2016 3 commits
-
-
Danny Auble authored
-
Brian Christiansen authored
Continuation of 31225a82
-
Brian Christiansen authored
Bug 2430.
-
- 03 Mar, 2016 5 commits
-
-
Thomas Hamel authored
We want to introduce a new behavior in the way slurmd uses the HealthCheckProgram. The idea is to avoid a race condition between the first HealthCheckProgram run and the node accepting jobs. The slurmd daemon will initialize and then loop on HealthCheckProgram execution before registering with slurmctld. It will stay in this loop until the HealthCheckProgram returns successfully (the node remains DOWN in the meantime). On our clusters we are using NHC as the HealthCheckProgram. NHC drains the node if it fails and removes the drain if it succeeds, a behavior which fits well with our purpose. This permits us to start slurmd at boot without setting up a complex boot sequence in the init system; slurmd just waits for the node to be ready before registering. The HealthCheckProgram is not run during slurmd startup if HealthCheckInterval is 0.
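A minimal sketch of the described startup sequence (hypothetical helpers and health-check path, not slurmd's actual code), including the HealthCheckInterval=0 escape hatch:

    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical stand-ins for slurmd internals. */
    static int run_health_check(void)
    {
        return system("/usr/sbin/nhc");   /* path is illustrative */
    }
    static void register_with_ctld(void) { /* ... RPC to slurmctld ... */ }

    static void startup(int health_check_interval)
    {
        if (health_check_interval > 0) {
            /* Loop until HealthCheckProgram succeeds; the node stays
             * DOWN (unregistered) until then. */
            while (run_health_check() != 0)
                sleep(health_check_interval);
        }
        register_with_ctld();
    }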
-
Danny Auble authored
-
Brian Christiansen authored
Bug 2507.
-
Morris Jette authored
Step GRES value changed from type "int" to "int64_t" to support larger values. Previous logic could fail for step allocation values over 32 bits. Other GRES values are 64-bit.
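To see why 32 bits is not enough, consider a GRES counted in small units (illustrative values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* e.g. a GRES counted in bytes: 8 TiB */
        int64_t cnt = 8LL * 1024 * 1024 * 1024 * 1024;
        printf("needs 64 bits: %s\n",
               cnt > INT32_MAX ? "yes" : "no");   /* prints "yes" */
        return 0;
    }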
-
Danny Auble authored
slurmstepd to close potential open ones. It was pointed out that slurmd, using acct_gather_energy/ipmi, links to freeipmi, which could open /dev/ipmi0 as root without the close-on-exec flag set while launching a step, leaving it open in the user's app. What this does is set the flag on the first 256 file descriptors to mitigate the concern. Reported by Maksym Planeta. Bug 2506.
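The mitigation itself is short; roughly (a sketch of the technique, not the exact slurmstepd code):

    #include <fcntl.h>

    /* Mark fds 3..255 close-on-exec so descriptors a linked library
     * (e.g. freeipmi's /dev/ipmi0) opened without FD_CLOEXEC are not
     * inherited by the user's application. fds 0-2 stay open. */
    static void set_cloexec_range(void)
    {
        for (int fd = 3; fd < 256; fd++)
            (void) fcntl(fd, F_SETFD, FD_CLOEXEC);
    }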
-
- 02 Mar, 2016 4 commits
-
-
Gary B Skouson authored
Previous logic tested whatever the job's partition pointer indicated rather than the partition we are trying to run the job in. This bug was introduced in Slurm version 15.08.5 (Nov 16, 2015, commit 94f0e948). Bug 2499.
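The shape of the bug, reduced to a sketch with hypothetical names: the test must consult the partition being evaluated, not the partition pointer cached in the job record.

    struct part { int max_time; };
    struct job  { struct part *part_ptr; int time_limit; };

    /* Buggy: consults the job's cached partition. */
    static int ok_buggy(struct job *j, struct part *candidate)
    {
        (void) candidate;                 /* ignored: that's the bug */
        return j->time_limit <= j->part_ptr->max_time;
    }

    /* Fixed: consults the partition we are trying to run in. */
    static int ok_fixed(struct job *j, struct part *candidate)
    {
        return j->time_limit <= candidate->max_time;
    }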
-
Tim Wickberg authored
-
Thomas Cadeau authored
-
Morris Jette authored
Check that PowerSave mode is configured when using the node_features/knl_cray plugin; it is required to reconfigure and reboot nodes. Fatal error if not configured.
-
- 01 Mar, 2016 1 commit
-
-
Tim Wickberg authored
src/common/mapping.h was the one place outside of slurm/*.h that used this; just remove it from there. Replace the macro with #ifdef __cplusplus in slurm/*.h in case anyone is linking C++ against libslurm.
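The replacement is the standard C/C++ linkage guard; sketched for a hypothetical header:

    /* slurm/example.h -- hypothetical header layout */
    #ifdef __cplusplus
    extern "C" {
    #endif

    int slurm_example_call(void);   /* C linkage even from C++ */

    #ifdef __cplusplus
    }
    #endif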
-