- 02 Feb, 2016 1 commit
-
-
Didier GAZEN authored
Support AuthInfo in slurmdbd.conf that is different from the value in slurm.conf.

There is a possible bug in the slurm_get_auth_info function (src/common/slurm_protocol_api.c) that can cause the slurmdbd daemon to look for the AuthInfo parameter in slurm.conf instead of slurmdbd.conf when the auth/munge authentication method is used (AuthType=auth/munge).

Here is the slurmdbd log revealing the problem (debug5() print statements were added to the sources):

    slurmdbd: slurmdbd version 15.08.7 started
    slurmdbd: debug2: running rollup at Tue Feb 02 14:20:14 2016
    slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3601)
    slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1732)
    slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
    slurmdbd: debug: Reading slurm.conf file: /usr/local/slurm-15-08-7-1/etc/slurm.conf
    slurmdbd: error: s_p_parse_file: unable to status file /usr/local/slurm-15-08-7-1/etc/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
    ...

Then 60 seconds later, the auth_info value returned by slurm_get_auth_info is NULL:

    slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=(null)

and slurmdbd continues without crashing, but I am not sure it is in a safe state.

When applying this patch:

    diff --git a/src/common/slurm_protocol_api.c b/src/common/slurm_protocol_api.c
    index c5db879..be1dab6 100644
    --- a/src/common/slurm_protocol_api.c
    +++ b/src/common/slurm_protocol_api.c
    @@ -1703,9 +1703,13 @@ extern char *slurm_get_auth_info(void)
     	char *auth_info;
     	slurm_ctl_conf_t *conf;
     
    -	conf = slurm_conf_lock();
    -	auth_info = xstrdup(conf->authinfo);
    -	slurm_conf_unlock();
    +	if (slurmdbd_conf) {
    +		auth_info = xstrdup(slurmdbd_conf->auth_info);
    +	} else {
    +		conf = slurm_conf_lock();
    +		auth_info = xstrdup(conf->authinfo);
    +		slurm_conf_unlock();
    +	}
     
     	return auth_info;
     }

the auth_info value is now valid and consistent with the slurmdbd.conf setting:

    slurmdbd: slurmdbd version 15.08.7 started
    slurmdbd: debug2: running rollup at Tue Feb 02 14:47:37 2016
    slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3600)
    slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
    slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1731)
    slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
    slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=socket=/var/run/munge/munge_dbd.socket.2
-
- 01 Feb, 2016 1 commit
-
-
David Gloe authored
contribs/cray/csm/slurmconfgen_smw.py - avoid erroneously including repurposed compute nodes in the list of nodes to start slurmd.
-
- 29 Jan, 2016 2 commits
-
-
Morris Jette authored
When the slurmctld is in background mode, it will issue double free calls on the incoming message buffers, likely leading to an abort.
-
Morris Jette authored
If an invalid trigger message is received by slurmctld, it could result in a non-zero array counter and a NULL element array. If the element array is NULL, then clear the counter to avoid xfree calls of bad pointers.
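A minimal sketch of the defensive check described above. The structure `trigger_msg_t` and its fields `record_count` and `records` are hypothetical names chosen for illustration and are not Slurm's actual definitions; plain `free()` stands in for `xfree`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an unpacked trigger message. */
typedef struct {
	uint32_t record_count;  /* number of entries the message claims to hold */
	char **records;         /* element array; may be NULL for a bad message */
} trigger_msg_t;

/* Make the counter agree with the array before any cleanup code runs. */
static void trigger_msg_sanitize(trigger_msg_t *msg)
{
	if (msg && !msg->records)
		msg->record_count = 0;
}

/* Free every element, then the array itself. */
static void trigger_msg_free(trigger_msg_t *msg)
{
	if (!msg)
		return;
	trigger_msg_sanitize(msg);  /* without this, the loop would read records[i] through NULL */
	for (uint32_t i = 0; i < msg->record_count; i++)
		free(msg->records[i]);
	free(msg->records);
	msg->records = NULL;
	msg->record_count = 0;
}
```

Clearing the counter once, at the point where the message is validated, keeps every later loop over the array safe without adding a NULL check to each one.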
-
- 28 Jan, 2016 5 commits
-
-
Morris Jette authored
Do not automatically relocate an advanced reservation for individual cores that spans multiple nodes when nodes in that reservation go down (e.g. a 1 core reservation on node "tux1" will be moved if node "tux1" goes down, but a reservation containing 2 cores on node "tux1" and 3 cores on "tux2" will not be moved if node "tux1" goes down). Advanced reservations for whole nodes will be moved by default for down nodes. bug 2326
-
Tim Wickberg authored
avoid attempting to execve() a directory with a name that happens to match that of the desired command. bug 2392.
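One common way to guard against this is to check the file type before treating a path as a command. The sketch below shows that general technique with standard POSIX calls; the helper name `is_executable_file` is hypothetical and this is not the code from the commit.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return true only if path names a regular file the caller may execute. */
static bool is_executable_file(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false;   /* missing or inaccessible */
	if (!S_ISREG(st.st_mode))
		return false;   /* directories, devices, sockets, ... */
	return access(path, X_OK) == 0;
}
```

The `S_ISREG` test matters because directories normally carry the execute (search) bit, so an `access(path, X_OK)` check alone would accept a directory that shares the command's name.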
-
Morris Jette authored
Allow an existing reservation with running jobs to be modified without Flags=IGNORE_JOBS. bug 2389
-
Morris Jette authored
burst_buffer/cray - Increase size of intermediate variable used to store buffer byte size read from DW instance from 32 to 64 bits to avoid overflow and reporting invalid buffer sizes. bug 2378
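To illustrate why the width of the intermediate variable matters, here is a small sketch with hypothetical helper names (not taken from the burst_buffer/cray plugin). It stores a byte count in a 32-bit and in a 64-bit temporary and returns what survives.

```c
#include <stdint.h>

/* Store the size in a 32-bit temporary: values of 4 GiB and above wrap modulo 2^32. */
static uint64_t store_bytes_32(uint64_t bytes_read)
{
	uint32_t tmp = (uint32_t)bytes_read;
	return tmp;
}

/* Store the size in a 64-bit temporary: the value is preserved. */
static uint64_t store_bytes_64(uint64_t bytes_read)
{
	uint64_t tmp = bytes_read;
	return tmp;
}
```

With a 5 GiB buffer (5368709120 bytes), the 32-bit version reports 1073741824 bytes (1 GiB), while the 64-bit version reports the full size.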
-
Danny Auble authored
-
- 27 Jan, 2016 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
gres types without a File.
-
Danny Auble authored
-
Danny Auble authored
to debug3 when trying to find the correct association. A continuation of commit 87d9370f.
-
Alejandro Sanchez authored
-
- 26 Jan, 2016 1 commit
-
-
Morris Jette authored
Both the original logic and the modified logic failed to lock the job data structure prior to decrementing the "prolog_running" counter.
-
- 25 Jan, 2016 4 commits
-
-
Danny Auble authored
in the forward message logic.
-
Morris Jette authored
Previously, under some conditions, the boot completion was ignored and the job remained pending.
-
Tim Wickberg authored
-
Sergey Meirovich authored
-
- 22 Jan, 2016 1 commit
-
-
Danny Auble authored
-
- 21 Jan, 2016 11 commits
-
-
Danny Auble authored
Bug 2364
-
Danny Auble authored
Commit fa331e30 fixes this. The logic was bad to begin with... uint32_t new_cpus = detail_ptr->num_tasks / detail_ptr->cpus_per_task; The / should have been * this whole time. This was the reason we found this in the first place.
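The difference between the two operators can be seen in a small sketch. It assumes the intent is the total CPU count needed by all tasks; `job_details_t` is a hypothetical stand-in holding only the two fields named in the message.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant job detail fields. */
typedef struct {
	uint32_t num_tasks;
	uint16_t cpus_per_task;
} job_details_t;

/* The expression as originally written: divides tasks by CPUs per task. */
static uint32_t new_cpus_divide(const job_details_t *d)
{
	return d->num_tasks / d->cpus_per_task;
}

/* The corrected expression: each task needs cpus_per_task CPUs. */
static uint32_t new_cpus_multiply(const job_details_t *d)
{
	return d->num_tasks * d->cpus_per_task;
}
```

For 4 tasks with 2 CPUs per task, the division yields 2 while the multiplication yields 8. The two only agree when `cpus_per_task` is 1, which may explain how the mistake went unnoticed.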
-
Morris Jette authored
bug 2369
-
Gennaro Oliva authored
-
Morris Jette authored
If scancel is operating on a large number of jobs and RPC responses from the slurmctld daemon are slow, then introduce a delay in sending the cancel job requests from scancel in order to reduce load on slurmctld. bug 2256
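The throttling idea can be sketched as a function that picks a delay from the job count and the last observed response time. Every threshold and name below is an illustrative assumption; the message does not state the values or the exact policy scancel uses.

```c
#include <stdint.h>

/* All values are assumptions for this sketch, not scancel's actual settings. */
#define MANY_JOBS       100      /* what counts as a "large number of jobs" */
#define SLOW_RPC_USEC   100000   /* responses slower than 0.1 s count as slow */
#define MAX_DELAY_USEC  1000000  /* never wait more than 1 s between requests */

/* Return how long to sleep (in microseconds) before sending the next request. */
static uint32_t cancel_delay_usec(uint32_t job_count, uint32_t last_rpc_usec)
{
	if (job_count < MANY_JOBS || last_rpc_usec < SLOW_RPC_USEC)
		return 0;  /* small batch or responsive controller: no throttling */

	/* Back off in proportion to the observed response time, with a cap. */
	return (last_rpc_usec > MAX_DELAY_USEC) ? MAX_DELAY_USEC : last_rpc_usec;
}
```

A caller would pass the result to a sleep call such as `usleep()` between requests, so a responsive controller is never slowed down.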
-
Morris Jette authored
If a job launch is delayed, the test was failing due to bad parsing. These lines were being interpreted as a counter followed by node names of "queued" and "has":
srun: job 1332712 queued and waiting for resources
srun: job 1332712 has been allocated resources
-
Morris Jette authored
-
Morris Jette authored
bug 2366
-
Danny Auble authored
-
Morris Jette authored
Backfill scheduling properly synchronized with Cray Node Health Check. Prior logic could result in highest priority job getting improperly postponed. bug 2350
-
Danny Auble authored
-
- 20 Jan, 2016 8 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
This corrects logic from commit e5a61746 that could result in use of a NULL pointer.
-
Morris Jette authored
It was previously triggered by executing "scontrol reconfig" on a front-end system while there was a job in completing state.
-
Morris Jette authored
Properly account for memory, CPUs and GRES when slurmctld is reconfigured while there is a suspended job. Previous logic would add the CPUs, but not memory or GPUs. This would result in underflow/overflow errors in select cons_res plugin. bug 2353
-
Morris Jette authored
The counter is really intended to reflect the count of running or suspended jobs rather than running jobs alone. Previous logic would report an underflow for the "job_cnt_run" variable if:
1. job submitted
2. job suspended
3. scontrol reconfig
4. job cancelled
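The underflow sequence can be reproduced in a small model. The state names and helper functions below are hypothetical and exist only for this sketch; they assume the counter is rebuilt from the job list on reconfigure and decremented when a job ends.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical job states for this sketch only. */
typedef enum { SK_PENDING, SK_RUNNING, SK_SUSPENDED, SK_DONE } sk_state_t;

/* Rebuild the counter after a reconfigure.
 * count_suspended == 0 models the old behaviour (running jobs only). */
static uint32_t rebuild_job_cnt(const sk_state_t *jobs, size_t n, int count_suspended)
{
	uint32_t cnt = 0;

	for (size_t i = 0; i < n; i++) {
		if (jobs[i] == SK_RUNNING ||
		    (count_suspended && jobs[i] == SK_SUSPENDED))
			cnt++;
	}
	return cnt;
}

/* Decrement when a job ends. Returns -1 on underflow instead of wrapping. */
static int job_cnt_dec(uint32_t *cnt)
{
	if (*cnt == 0)
		return -1;
	(*cnt)--;
	return 0;
}
```

With a single suspended job, rebuilding without suspended jobs gives 0, so cancelling that job triggers the underflow; counting suspended jobs gives 1 and the decrement succeeds.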
-