- 10 Mar, 2016 4 commits
-
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job unusable after job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set; after the job's node health check (NHC) completes, the value is set to cleaned. If the job is requeued, then on its second (or subsequent) termination the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs "error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen", and returns without launching the NHC. Since termination of the job NHC triggers the release of job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384
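The failure mode described above can be modelled in a few lines. This is a minimal sketch, not the select/cray plugin code: the `Job` class and the functions `job_fini`, `nhc_complete`, `requeue`, and `run_twice` are hypothetical names, and clearing the flags on requeue is only one assumed way the fix could work.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    cleaning: bool = False
    cleaned: bool = False
    resources_held: bool = True


def job_fini(job, log):
    """Model of the termination hook: launch the NHC unless 'cleaned' is set."""
    if job.cleaned:
        log.append(f"error: Cleaned flag already set for job {job.job_id}")
        return False  # NHC is never launched, so resources stay allocated
    job.cleaning = True
    return True


def nhc_complete(job):
    """NHC finished: mark the job cleaned and release its resources."""
    job.cleaning = False
    job.cleaned = True
    job.resources_held = False


def requeue(job, reset_flags):
    """Requeue the job; resetting the flags here is an assumed fix."""
    job.resources_held = True
    if reset_flags:
        job.cleaning = False
        job.cleaned = False


def run_twice(reset_flags):
    """Terminate a job, requeue it, terminate it again."""
    log = []
    job = Job(1283858)
    if job_fini(job, log):
        nhc_complete(job)
    requeue(job, reset_flags)
    if job_fini(job, log):
        nhc_complete(job)
    return job.resources_held, log
```

With `reset_flags=False` the second termination logs the error and the resources remain held; with `reset_flags=True` the NHC runs again and they are released.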
-
David Gloe authored
An error in slurmconfgen_smw.py caused it to parse the nic as the nid. On some systems those values differ, causing the generated slurm.conf file to be incorrect. Bug 2532.
-
Bill Brophy authored
route_p_split_hostlist was not thread-safe, and would cause one of several segfaults depending on where in the initialization code each thread was. Bug 2495.
-
Tim Wickberg authored
Was incorrectly displaying "(null)" even when loaded successfully.
-
- 07 Mar, 2016 1 commit
-
-
Dominik Bartkiewicz authored
Added new job dependency type of "aftercorr" which will start a task of a job array after the corresponding task of another job array completes. bug 2460
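As a rough illustration of the "aftercorr" semantics described above, the sketch below selects which tasks of a dependent job array may start given the tasks of the other array that have completed. The function name `runnable_tasks` and its arguments are hypothetical and do not correspond to any slurmctld function.

```python
def runnable_tasks(dependent_tasks, completed_upstream_tasks):
    """Return the dependent array's task IDs whose corresponding
    task in the upstream array has completed."""
    done = set(completed_upstream_tasks)
    return [task for task in dependent_tasks if task in done]
```

For example, if tasks 0 and 2 of the upstream array have completed, only tasks 0 and 2 of a four-task dependent array are eligible to start.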
-
- 05 Mar, 2016 2 commits
-
-
Danny Auble authored
would only track gres/gpu, now it will track both gres/gpu and gres/gpu:tesla as separate gres if configured like AccountingStorageTRES=gres/gpu,gres/gpu:tesla
-
Danny Auble authored
-
- 04 Mar, 2016 3 commits
-
-
Danny Auble authored
-
Brian Christiansen authored
Continuation of 31225a82
-
Brian Christiansen authored
Bug 2430
-
- 03 Mar, 2016 5 commits
-
-
Thomas Hamel authored
We want to introduce a new behavior in the way slurmd uses the HealthCheckProgram. The idea is to avoid a race condition between the first HealthCheckProgram run and the node accepting jobs. The slurmd daemon will initialize and then loop on HealthCheckProgram execution before registering with slurmctld. It will stay in this loop until the HealthCheckProgram returns successfully (the node is still DOWN during that time). On our clusters we use NHC as the HealthCheckProgram. NHC drains the node if it fails and removes the drain if it is successful, so this behavior fits our purpose well. It lets us start slurmd at boot without setting up a complex boot sequence in the init system: slurmd just waits for the node to be ready before registering. The HealthCheckProgram is not run during slurmd startup if HealthCheckInterval is 0.
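The startup loop described above could look roughly like the following. This is a sketch under stated assumptions, not slurmd's implementation: `wait_until_healthy` is a hypothetical name, the `retry_delay` between attempts is an assumption (the commit message does not say how often the program is retried), and the `run` and `sleep` parameters exist only so the loop can be exercised without running a real program.

```python
import subprocess
import time


def wait_until_healthy(program, health_check_interval, retry_delay=1.0,
                       run=subprocess.call, sleep=time.sleep):
    """Block until the health check program exits with status 0.

    Returns the number of attempts made, or 0 if the check is skipped
    because health_check_interval is 0.
    """
    if health_check_interval == 0:
        return 0  # health check disabled: register immediately
    attempts = 0
    while True:
        attempts += 1
        if run([program]) == 0:
            return attempts  # node is ready, caller may now register
        sleep(retry_delay)  # assumed pause before the next attempt
```

The caller would register with the controller only after this function returns, which is what keeps the node from accepting jobs before its first successful health check.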
-
Danny Auble authored
-
Brian Christiansen authored
Bug 2507
-
Morris Jette authored
Step GRES value changed from type "int" to "int64_t" to support larger values. Previous logic could fail for step allocation values that do not fit in 32 bits. Other GRES values are already 64-bit.
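The effect of storing a large count in a 32-bit field can be shown with `ctypes`, which wraps values without overflow checking. This sketch assumes "int" is 32 bits wide, as on common platforms; the helper names are hypothetical.

```python
import ctypes


def store_as_int(value):
    """Store a value in a 32-bit signed integer (assumed width of C 'int')."""
    return ctypes.c_int32(value).value


def store_as_int64(value):
    """Store a value in a 64-bit signed integer (int64_t)."""
    return ctypes.c_int64(value).value
```

A count of 5 * 1024**3 comes back as 1024**3 from `store_as_int`, because the high bits are lost, while `store_as_int64` returns it unchanged.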
-
Danny Auble authored
slurmstepd to close potential open ones. It was pointed out that a slurmd using acct_gather_energy/ipmi links to freeipmi, which, running as root, could open /dev/ipmi0 without the close-on-exec flag set while launching a step, leaving it open in the user's application. This change sets the flag on the first 256 file descriptors to mitigate the concern. Reported by Maksym Planeta. Bug 2506
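The mitigation can be sketched with the standard `fcntl` module. This is an illustration of the technique rather than the slurmstepd code: the function name is hypothetical, and starting at descriptor 3 (leaving stdin, stdout, and stderr untouched) is an assumption not stated in the commit message.

```python
import fcntl


def set_cloexec_on_low_fds(limit=256, first=3):
    """Set FD_CLOEXEC on every open descriptor in [first, limit).

    Returns the descriptors whose flag was changed.
    """
    changed = []
    for fd in range(first, limit):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # descriptor is not open
        if not flags & fcntl.FD_CLOEXEC:
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
            changed.append(fd)
    return changed
```

Any descriptor a library opened without the flag is then closed automatically when the user's program is exec'd.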
-
- 02 Mar, 2016 4 commits
-
-
Gary B Skouson authored
Previous logic tested whatever the job's partition pointer indicated rather than the partition in which we are trying to run the job. This bug was introduced in Slurm version 15.08.5 (Nov 16, 2015), commit 94f0e948. Bug 2499
-
Tim Wickberg authored
-
Thomas Cadeau authored
-
Morris Jette authored
Check that PowerSave mode is configured for the node_features/knl_cray plugin. It is required to reconfigure and reboot nodes. Fatal error if not configured.
-
- 01 Mar, 2016 4 commits
-
-
Tim Wickberg authored
src/common/mapping.h was the one place outside of slurm/*h that used this, just remove it from there. Replace macro with #ifdef __cplusplus in slurm/*h in case anyone is linking C++ against libslurm.
-
Tim Wickberg authored
Macro hasn't been used consistently for three+ years, and is protecting against compilation by non-ANSI C compilers which has not been a concern for quite some time. Cleanup formatting of function declarations while here. No change to logic.
-
Tim Wickberg authored
-
Morris Jette authored
Ensure that a job is completely launched before trying to suspend it. Previous logic would start suspend logic early in the life of the slurmstepd process, after its listening socket was open but before the tasks were launched. This defers the suspend logic until after all prologs and setup complete and the tasks are launched. This is important in the case of gang scheduling, in which newly launched jobs can be immediately suspended. Bug 2494
-
- 29 Feb, 2016 1 commit
-
-
Tim Wickberg authored
Default value is 1. Weight is uint32_t so this check was always succeeding.
-
- 27 Feb, 2016 1 commit
-
-
Morris Jette authored
-
- 26 Feb, 2016 5 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Tim Wickberg authored
Add note to slurm.conf man page about setting "--cpu_bind=no" as part of SallocDefaultCommand if a TaskPlugin is in use.
-
Morris Jette authored
Revert the call to getaddrinfo (introduced in Slurm 16.05.0pre1), restoring gethostbyaddr, because getaddrinfo was failing on some systems. Specifically, test7.2 was failing on some systems with getaddrinfo() returning an error of "System error: Resource temporarily unavailable". Partial reversion of commit 89621f65
-
- 25 Feb, 2016 2 commits
-
-
Danny Auble authored
was also given.
-
Morris Jette authored
Split partition's "Priority" field into "PriorityTier" (used to order partitions for scheduling and preemption) plus "PriorityJobFactor" (used by priority/multifactor plugin in calculating job priority, which is used to order jobs within a partition for scheduling). bug 2479
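The division of labour between the two new fields can be sketched as a two-level sort. This is a simplified model, not the scheduler: the class and function names are hypothetical, and adding PriorityJobFactor directly to the job's other priority terms is an assumption made for illustration (the priority/multifactor plugin combines weighted factors in its own way).

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    priority_tier: int        # orders partitions against each other
    priority_job_factor: int  # contributes to each job's priority


@dataclass
class Job:
    job_id: int
    partition: Partition
    other_factors: int  # stand-in for age, fairshare, and so on


def job_priority(job):
    """Assumed combination: other factors plus the partition's job factor."""
    return job.other_factors + job.partition.priority_job_factor


def scheduling_order(jobs):
    """Order by partition tier first, then by job priority within a tier."""
    ranked = sorted(
        jobs,
        key=lambda j: (-j.partition.priority_tier, -job_priority(j)),
    )
    return [j.job_id for j in ranked]
```

In this model a partition with a high PriorityTier is always considered first, even if a lower-tier partition gives its jobs a large PriorityJobFactor.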
-
- 24 Feb, 2016 5 commits
-
-
Danny Auble authored
a partition.
-
Danny Auble authored
This also reverts most of commit fa331e30 as well as commit bd9fa830 which would try to set the pn_min_cpus every time a job was updated. If a job didn't request node counts then they were hosed. This commit takes away the magic which was screwing things up. Now the person gets what they asked for without magic changing things. Bug 2302 Bug 2742 Bug 2478
-
Danny Auble authored
erroneously.
-
Danny Auble authored
-
Danny Auble authored
-
- 23 Feb, 2016 1 commit
-
-
Danny Auble authored
This whole process could probably be done better by keeping track of old values and new values and only calling one function instead of a pre and post function, but that can probably wait for future generations of the code as it works now and is probably adequate for the time being. Bug 2352
-
- 22 Feb, 2016 1 commit
-
-
Morris Jette authored
-
- 19 Feb, 2016 1 commit
-
-
Morris Jette authored
BurstBuffer/cray - Defer job cancellation or time limit while "pre-run" operation in progress to avoid inconsistent state due to multiple calls to job termination functions. bug 2454
-