- 02 Mar, 2016 2 commits
-
-
Gary B Skouson authored
Previous logic tested whatever the job's partition pointer indicated rather than the partition we are trying to run the job in. This bug was introduced in Slurm version 15.08.5 (Nov 16, 2015) by commit 94f0e948. Bug 2499.
-
Thomas Cadeau authored
-
- 01 Mar, 2016 2 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
Ensure that a job is completely launched before trying to suspend it. Previous logic would start suspend logic early in the life of the slurmstepd process, after its listening socket was open but before the tasks were launched. This defers the suspend logic until after all prologs and setup complete and the tasks are launched. This is important in the case of gang scheduling, in which newly launched jobs can be immediately suspended. Bug 2494.
-
- 26 Feb, 2016 2 commits
-
-
Danny Auble authored
-
Tim Wickberg authored
Add note to slurm.conf man page about setting "--cpu_bind=no" as part of SallocDefaultCommand if a TaskPlugin is in use.
-
- 25 Feb, 2016 1 commit
-
-
Danny Auble authored
was also given.
-
- 24 Feb, 2016 5 commits
-
-
Danny Auble authored
a partition.
-
Danny Auble authored
This also reverts most of commit fa331e30 as well as commit bd9fa830 which would try to set the pn_min_cpus every time a job was updated. If a job didn't request node counts then they were hosed. This commit takes away the magic which was screwing things up. Now the person gets what they asked for without magic changing things. Bug 2302 Bug 2742 Bug 2478
-
Danny Auble authored
erroneously.
-
Danny Auble authored
-
Danny Auble authored
-
- 23 Feb, 2016 1 commit
-
-
Danny Auble authored
This whole process could probably be done better by keeping track of old values and new values and only calling one function instead of a pre and post function, but that can probably wait for future generations of the code as it works now and is probably adequate for the time being. Bug 2352
-
- 19 Feb, 2016 2 commits
-
-
Morris Jette authored
BurstBuffer/cray - Defer job cancellation or time limit handling while a "pre-run" operation is in progress, to avoid an inconsistent state due to multiple calls to job termination functions. Bug 2454.
-
Morris Jette authored
-
- 18 Feb, 2016 5 commits
-
-
Danny Auble authored
a new account and making it a default all at once. Bug 2428
-
Alejandro Sanchez authored
Match acct_gather_energy/rapl plugin. Bug 2397.
-
Tim Wickberg authored
Control whether the scheduler will continue to try to run jobs in a partition if a higher priority job is stuck due to an association limit. Can cause starvation for larger jobs, but will improve throughput and utilization for systems that have extensively divvyed up their resources through association/QOS limits. Bug 2388 and 2452.
-
Danny Auble authored
Bug 2453
-
Tim Wickberg authored
have been selected set the time limit appropriately if the job didn't request one. If the partition has no DefaultTime setting, and no time_limit was given for the job, job_ptr->time_limit == NO_VAL. With AccountingStorageEnforce=safe this will prevent jobs from ever starting if the association has any limit set for CPUMins. (NO_VAL * cpus is a very large number, but if no time_limit is given anywhere that is what they get :)) Bug 2388.
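The arithmetic behind the "very large number" can be sketched as follows. This is an illustration only: `cpu_minutes` and `effective_time_limit` are hypothetical helpers, not functions from the Slurm sources, the `NO_VAL` value is copied here on the assumption that it matches Slurm's public header, and falling back to a partition-supplied limit is an assumed reading of "set the time limit appropriately".

```c
#include <stdint.h>

/* Local copy of Slurm's "value not set" marker (assumed to match slurm.h). */
#define NO_VAL 0xfffffffeU

/*
 * Hypothetical helper: CPU-minutes a job would need if it ran for its
 * whole time limit. The product is computed in 64 bits so that an unset
 * limit shows up as a huge value instead of wrapping around.
 */
uint64_t cpu_minutes(uint32_t time_limit_min, uint32_t cpus)
{
	return (uint64_t)time_limit_min * cpus;
}

/*
 * Hypothetical helper: choose the limit to charge against. If the job
 * never received a time limit, substitute a limit taken from the
 * partition (an assumption made for this sketch).
 */
uint32_t effective_time_limit(uint32_t job_limit_min, uint32_t part_limit_min)
{
	return (job_limit_min == NO_VAL) ? part_limit_min : job_limit_min;
}
```

With an unset limit and 16 CPUs, `cpu_minutes(NO_VAL, 16)` is roughly 6.9e10 CPU-minutes, which would exceed any realistic CPUMins limit on an association; substituting a real limit first keeps the comparison meaningful.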
-
- 17 Feb, 2016 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
a parsing option. Dynamically set column widths based on the largest number in each column.
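One way to size a column from its largest number is sketched below. The commit message does not say how the real code does it, so `digit_count` and `column_width` are hypothetical names and the minimum-width parameter is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of decimal digits needed to print v (at least 1, for zero). */
int digit_count(uint64_t v)
{
	int n = 1;

	while (v >= 10) {
		v /= 10;
		n++;
	}
	return n;
}

/*
 * Hypothetical helper: width of a numeric column, i.e. the digit count
 * of its widest value, but never narrower than min_width (for example
 * the length of the column header).
 */
int column_width(const uint64_t *values, size_t count, int min_width)
{
	int width = min_width;

	for (size_t i = 0; i < count; i++) {
		int d = digit_count(values[i]);

		if (d > width)
			width = d;
	}
	return width;
}
```

The result can be passed to printf() through the `*` width field, e.g. `printf("%*" PRIu64, width, value)` with `<inttypes.h>` included.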
-
- 16 Feb, 2016 2 commits
-
-
Danny Auble authored
accounting.
-
Danny Auble authored
period. This would only hit you if you rerolled a 15.08 prior to this commit.
-
- 12 Feb, 2016 1 commit
-
-
Tim Wickberg authored
Compare the saved cluster name to slurmctld's. If there is a mismatch, prevent slurmctld from starting, to avoid corruption from separate clusters attempting to share a single state directory. Bug 2433.
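The check described above might look roughly like this. It is a sketch, not the actual slurmctld code: `cluster_name_matches` is a hypothetical function, and treating a missing saved name as "first start, nothing to compare" is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical startup check: returns true when the controller may use
 * the state directory, false when the name saved there belongs to a
 * different cluster.
 */
bool cluster_name_matches(const char *saved_name, const char *configured_name)
{
	/* Assumption: no saved name means the state directory is new. */
	if (saved_name == NULL)
		return true;

	/* A saved name with no configured name cannot be verified. */
	if (configured_name == NULL)
		return false;

	return strcmp(saved_name, configured_name) == 0;
}
```

A caller would refuse to start (rather than merely log a warning) when this returns false, since continuing is what risks two clusters overwriting each other's state.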
-
- 10 Feb, 2016 3 commits
-
-
Brian Christiansen authored
Bug 2427 Reproduced with: sacctmgr show user withassoc user= user=<username>
-
Morris Jette authored
Add new PowerParameters options of get_timeout and set_timeout. The default set_timeout was increased from 5 seconds to 30 seconds. Also re-read current power caps periodically or after any failed "set" operation. bug 2332
-
Tim Wickberg authored
The build should now work on non-glibc distributions thanks to better POSIX conformance.
-
- 09 Feb, 2016 2 commits
-
-
Danny Auble authored
structure.
-
Alejandro Sanchez authored
-
- 08 Feb, 2016 1 commit
-
-
Danny Auble authored
-
- 04 Feb, 2016 1 commit
-
-
Brian Christiansen authored
Bug 2406 Continuations of a0bb9c1e and 4b0fc9e8
-
- 03 Feb, 2016 2 commits
-
-
Yu Watanabe authored
-
Danny Auble authored
-
- 02 Feb, 2016 3 commits
-
-
Tim Wickberg authored
Use PRIu64 instead of %ld for uint64_t types (libsh5util_old), %zu instead of PRIu64 for size_t.
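The specifier choice can be illustrated with a short sketch. `format_counts` is a hypothetical function written for this example and does not appear in libsh5util_old.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats a uint64_t and a size_t with portable
 * conversion specifiers. Returns what snprintf() returns, i.e. the
 * number of characters the full string needs (excluding the NUL).
 */
int format_counts(char *buf, size_t buf_len, uint64_t bytes, size_t items)
{
	/*
	 * PRIu64 expands to the correct conversion for uint64_t on every
	 * platform, whereas "%ld" is only right where long is 64 bits wide
	 * (and is signed in any case). "%zu" is the C99 conversion for
	 * size_t, which is not guaranteed to be the same type as uint64_t.
	 */
	return snprintf(buf, buf_len, "bytes=%" PRIu64 " items=%zu",
			bytes, items);
}
```

On a 32-bit build, passing a uint64_t to `%ld` reads the wrong number of bytes from the argument list, which is undefined behavior; the macros above avoid that without any casts.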
-
Tim Wickberg authored
-
Didier GAZEN authored
Support AuthInfo in slurmdbd.conf that is different from the value in slurm.conf.

There is a possible bug in the slurm_get_auth_info function (src/common/slurm_protocol_api.c) that can cause the slurmdbd daemon to look for the AuthInfo parameter in slurm.conf instead of slurmdbd.conf when the auth/munge authentication method is used (AuthType=auth/munge).

Here is the slurmdbd log revealing the problem (debug5() print statements were added to the sources):

    slurmdbd: slurmdbd version 15.08.7 started
    slurmdbd: debug2: running rollup at Tue Feb 02 14:20:14 2016
    slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3601)
    slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1732)
    slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
    slurmdbd: debug: Reading slurm.conf file: /usr/local/slurm-15-08-7-1/etc/slurm.conf
    slurmdbd: error: s_p_parse_file: unable to status file /usr/local/slurm-15-08-7-1/etc/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
    ...

Then, 60 seconds later, the auth_info value returned by slurm_get_auth_info is NULL:

    slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=(null)

and slurmdbd continues without crashing, but I am not sure it is in a safe state.

When applying this patch:

    diff --git a/src/common/slurm_protocol_api.c b/src/common/slurm_protocol_api.c
    index c5db879..be1dab6 100644
    --- a/src/common/slurm_protocol_api.c
    +++ b/src/common/slurm_protocol_api.c
    @@ -1703,9 +1703,13 @@ extern char *slurm_get_auth_info(void)
     	char *auth_info;
     	slurm_ctl_conf_t *conf;
     
    -	conf = slurm_conf_lock();
    -	auth_info = xstrdup(conf->authinfo);
    -	slurm_conf_unlock();
    +	if (slurmdbd_conf) {
    +		auth_info = xstrdup(slurmdbd_conf->auth_info);
    +	} else {
    +		conf = slurm_conf_lock();
    +		auth_info = xstrdup(conf->authinfo);
    +		slurm_conf_unlock();
    +	}
     
     	return auth_info;
     }

the auth_info value is now valid and consistent with the slurmdbd.conf setting:

    slurmdbd: slurmdbd version 15.08.7 started
    slurmdbd: debug2: running rollup at Tue Feb 02 14:47:37 2016
    slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3600)
    slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1731)
    slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
    slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=socket=/var/run/munge/munge_dbd.socket.2
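The selection logic of the patch can be modeled in isolation as below. This is a simplified stand-in, not the Slurm code: the two structs and `pick_auth_info` are hypothetical, and the locking and string duplication done by the real function are left out.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the two configuration objects. */
struct dbd_conf_model {
	const char *auth_info;	/* AuthInfo from slurmdbd.conf */
};

struct ctl_conf_model {
	const char *authinfo;	/* AuthInfo from slurm.conf */
};

/*
 * Mirrors the decision made in the patch: when a slurmdbd configuration
 * is loaded (i.e. the caller is the slurmdbd daemon), use its AuthInfo;
 * otherwise fall back to the value from slurm.conf.
 */
const char *pick_auth_info(const struct dbd_conf_model *dbd,
			   const struct ctl_conf_model *ctl)
{
	if (dbd)
		return dbd->auth_info;
	return ctl ? ctl->authinfo : NULL;
}
```

The point of the ordering is that the daemon never needs to open slurm.conf at all, which is what avoids the 60-second retry seen in the first log.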
-
- 01 Feb, 2016 1 commit
-
-
David Gloe authored
contribs/cray/csm/slurmconfgen_smw.py - avoid erroneously including repurposed compute nodes in the list of nodes to start slurmd.
-
- 29 Jan, 2016 1 commit
-
-
Morris Jette authored
When the slurmctld is in background mode, it will issue double free calls on the incoming message buffers, likely leading to an abort.
-
- 28 Jan, 2016 1 commit
-
-
Morris Jette authored
Do not automatically relocate an advanced reservation for individual cores that spans multiple nodes when nodes in that reservation go down (e.g. a 1 core reservation on node "tux1" will be moved if node "tux1" goes down, but a reservation containing 2 cores on node "tux1" and 3 cores on "tux2" will not be moved if node "tux1" goes down). Advanced reservations for whole nodes will be moved by default for down nodes. Bug 2326.
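The rule can be written as a small predicate. This is a sketch of the behavior as described in the message; `resv_moves_on_node_down` and its parameters are hypothetical and do not correspond to identifiers in the Slurm sources.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: should a reservation be relocated
 * automatically when one of its nodes goes down?
 *
 *   core_based - true if the reservation is for individual cores,
 *                false if it is for whole nodes
 *   node_cnt   - number of nodes the reservation spans
 */
bool resv_moves_on_node_down(bool core_based, int node_cnt)
{
	/* Whole-node reservations are moved by default. */
	if (!core_based)
		return true;

	/* Core reservations are moved only if confined to a single node. */
	return node_cnt == 1;
}
```

For the examples in the message, the 1 core reservation on "tux1" gives `resv_moves_on_node_down(true, 1)`, which is true, while the reservation spanning "tux1" and "tux2" gives `resv_moves_on_node_down(true, 2)`, which is false.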
-