Commits · 31ac01ce9fe25156bda8177f7dd7875ff8c35f9d · Manuel G. Marciani / ces_slurm_simulator

02 Feb, 2016 2 commits

Morris Jette authored Feb 02, 2016

This fixes a bug introduced in commit a801d264
Whole node resource allocations with REPLACE option were not working.
Detected by test3.14 failure.

31ac01ce

Fix support for AuthInfo in slurmdbd.conf · fa4222ec

Didier GAZEN authored Feb 02, 2016

Support AuthInfo in slurmdbd.conf that is different from the value in
    slurm.conf.
There is a possible bug in the slurm_get_auth_info function (src/common/slurm_protocol_api.c) that can cause the slurmdbd daemon to look for the AuthInfo parameter in slurm.conf instead of slurmdbd.conf when the auth/munge authentication method is used (AuthType=auth/munge).

Here is the slurmdbd log revealing the problem (debug5() print statements were added to the sources):

slurmdbd: slurmdbd version 15.08.7 started
slurmdbd: debug2: running rollup at Tue Feb 02 14:20:14 2016
slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3601)
slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1732)
slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
slurmdbd: debug:  Reading slurm.conf file: /usr/local/slurm-15-08-7-1/etc/slurm.conf
slurmdbd: error: s_p_parse_file: unable to status file /usr/local/slurm-15-08-7-1/etc/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
...

Then 60 seconds later, the auth_info value returned by slurm_get_auth_info is NULL:

slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=(null)

and slurmdbd continues without crashing, but I am not sure it is in a safe state.

After applying this patch:

diff --git a/src/common/slurm_protocol_api.c b/src/common/slurm_protocol_api.c
index c5db879..be1dab6 100644
--- a/src/common/slurm_protocol_api.c
+++ b/src/common/slurm_protocol_api.c
@@ -1703,9 +1703,13 @@ extern char *slurm_get_auth_info(void)
        char *auth_info;
        slurm_ctl_conf_t *conf;

-       conf = slurm_conf_lock();
-       auth_info = xstrdup(conf->authinfo);
-       slurm_conf_unlock();
+       if (slurmdbd_conf) {
+                auth_info = xstrdup(slurmdbd_conf->auth_info);
+        } else {
+               conf = slurm_conf_lock();
+               auth_info = xstrdup(conf->authinfo);
+               slurm_conf_unlock();
+       }

        return auth_info;
 }

the auth_info value is now valid and consistent with the slurmdbd.conf setting:

slurmdbd: slurmdbd version 15.08.7 started
slurmdbd: debug2: running rollup at Tue Feb 02 14:47:37 2016
slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3600)
slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1731)
slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=socket=/var/run/munge/munge_dbd.socket.2

fa4222ec

01 Feb, 2016 1 commit

update contributed cray config generator · 9f45e264

David Gloe authored Feb 01, 2016

contribs/cray/csm/slurmconfgen_smw.py - avoid erroneously including
repurposed compute nodes in the list of nodes to start slurmd.

9f45e264

29 Jan, 2016 2 commits

Fix double free of memory in slurmctld background mode · b1d88c7b

Morris Jette authored Jan 29, 2016

When the slurmctld is in background mode, it will issue double
  free calls on the incoming message buffers, likely leading
  to an abort.

b1d88c7b

Fix for possibly invalid xfree · 4b5ecf40

Morris Jette authored Jan 29, 2016

If an invalid trigger message is received by slurmctld, it could
  result in a non-zero array counter and a NULL element array.
  If the element array is NULL, then clear the counter to avoid
  xfree calls of bad pointers.

4b5ecf40
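
The guard described in 4b5ecf40 can be illustrated with a minimal sketch. The `demo_trigger_msg_t` type and the `demo_*` functions are hypothetical stand-ins, not Slurm's actual structures, and plain `free()` is used in place of `xfree()`. The idea is only that a non-zero counter paired with a NULL array is reset before any loop dereferences the array.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a trigger message; not Slurm's real type. */
typedef struct {
	uint32_t record_count;	/* number of entries the message claims */
	char **records;		/* entry array; NULL if the message was invalid */
} demo_trigger_msg_t;

/* If the array is missing, clear the counter so no loop walks a NULL array. */
static void demo_sanitize(demo_trigger_msg_t *msg)
{
	if (msg && !msg->records)
		msg->record_count = 0;
}

/* Free every entry, then the array itself. */
static void demo_free(demo_trigger_msg_t *msg)
{
	if (!msg)
		return;
	demo_sanitize(msg);
	for (uint32_t i = 0; i < msg->record_count; i++)
		free(msg->records[i]);
	free(msg->records);
	msg->records = NULL;
	msg->record_count = 0;
}
```

Without the call to `demo_sanitize()`, a message claiming three records but carrying a NULL array would index into the NULL pointer inside the loop.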

28 Jan, 2016 5 commits

Don't relocate multi-node core reservations · a801d264

Morris Jette authored Jan 28, 2016

Do not automatically relocate an advanced reservation for individual cores
that spans multiple nodes when nodes in that reservation go down (e.g.
a 1 core reservation on node "tux1" will be moved if node "tux1" goes
down, but a reservation containing 2 cores on node "tux1" and 3 cores on
"tux2" will not be moved if node "tux1" goes down). Advanced reservations for
whole nodes will be moved by default for down nodes.
bug 2326

a801d264

srun - check that found file is not a directory · 15c4bcf1

Tim Wickberg authored Jan 28, 2016

Avoid attempting to execve() a directory with a name that
happens to match that of the desired command. bug 2392.

15c4bcf1
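
A check of the kind 15c4bcf1 describes might look like the sketch below. The function name `demo_is_runnable` is hypothetical, and this is not the code srun uses; it only shows why a permission test alone is not enough, since `access(path, X_OK)` also succeeds for a searchable directory.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return true only if path exists, is not a directory, and is executable. */
static bool demo_is_runnable(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false;		/* missing or unreadable path */
	if (S_ISDIR(st.st_mode))
		return false;		/* a directory must never reach execve() */
	return access(path, X_OK) == 0;
}
```

A PATH search loop would call such a helper on each candidate and continue to the next PATH entry when it returns false.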

Ignore a reservation's jobs when changing · b77666b5

Morris Jette authored Jan 28, 2016

Allow an existing reservation with running jobs to be modified without
    Flags=IGNORE_JOBS.
bug 2389

b77666b5

burst_buffer/cray - avoid overflow · 214b3abe

Morris Jette authored Jan 27, 2016

burst_buffer/cray - Increase size of intermediate variable used to store
    buffer byte size read from DW instance from 32 to 64-bits to avoid overflow
    and reporting invalid buffer sizes.
bug 2378

214b3abe
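
The overflow that 214b3abe avoids can be reproduced in a few lines. How the plugin actually reads the size from the DW instance is not shown in the commit message, so the GiB-to-bytes conversion and the `demo_*` names below are assumptions chosen only to show a 32-bit intermediate wrapping where a 64-bit one does not.

```c
#include <stdint.h>

/* 64-bit intermediate: the product is computed in uint64_t throughout. */
static uint64_t demo_gib_to_bytes(uint32_t gib)
{
	return (uint64_t)gib * 1024u * 1024u * 1024u;
}

/* 32-bit intermediate: unsigned arithmetic wraps modulo 2^32,
 * so any size of 4 GiB or more is reported incorrectly. */
static uint32_t demo_gib_to_bytes_32(uint32_t gib)
{
	return gib * 1024u * 1024u * 1024u;
}
```

For a 4 GiB buffer the 64-bit version yields 4294967296 bytes, while the 32-bit version wraps to 0.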

GRES - Fix minor typecast issues. · 6f94bb7f

Danny Auble authored Jan 27, 2016

6f94bb7f

27 Jan, 2016 6 commits
- Remove unneeded (verbose) debug message · 3c25beb4
  Danny Auble authored Jan 27, 2016
  
  3c25beb4
- Sanity Check Patch to set up variables for RAPL if in a race for it. · 1a1e74c3
  Danny Auble authored Jan 27, 2016
  
  1a1e74c3
- Move debug messages to debug3 notifying a gres_bit_alloc was NULL for · a2559f00
  Danny Auble authored Jan 27, 2016
```
gres types without a File.
```
  a2559f00
- Fix incorrect logic when querying assoc_mgr information. · 43ed2140
  Danny Auble authored Jan 27, 2016
  
  43ed2140
- Move debug messages like "not the right user" from association manager · 59ef2158
  Danny Auble authored Jan 27, 2016
```
to debug3 when trying to find the correct association.

a continuation to commit 87d9370f
```
  59ef2158
- Increase debug level to debug3 as function just iterates over associations · 87d9370f
  Alejandro Sanchez authored Jan 27, 2016
  
  87d9370f
26 Jan, 2016 1 commit

Add lock for newly changed boot logic · 4d897176

Morris Jette authored Jan 25, 2016

Both the original logic and modified logic failed to lock the job
data structure prior to decrementing "prolog_running" counter.

4d897176
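
The locking pattern that 4d897176 refers to can be sketched as follows. Slurm guards job records with its own slurmctld locks; the per-job `pthread_mutex_t`, the `demo_job_t` type and `demo_prolog_done()` here are simplified, hypothetical stand-ins. Only the counter name `prolog_running` comes from the commit message.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical job record holding only what the sketch needs. */
typedef struct {
	pthread_mutex_t lock;		/* stand-in for the job write lock */
	uint32_t prolog_running;	/* number of prologs still in flight */
} demo_job_t;

/* Decrement the counter while holding the lock, and never below zero. */
static void demo_prolog_done(demo_job_t *job)
{
	pthread_mutex_lock(&job->lock);
	if (job->prolog_running > 0)
		job->prolog_running--;
	pthread_mutex_unlock(&job->lock);
}
```

Taking the lock around the read-modify-write keeps two threads that finish at the same time from losing one of the decrements.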

25 Jan, 2016 4 commits
- Comment code to note why certain RPCs are handled differently than others · baded5e8
  Danny Auble authored Jan 25, 2016
```
in the forward message logic.
```
  baded5e8
- Launch job requesting reboot after the boot completes · 608421da
  Morris Jette authored Jan 25, 2016
```
Previously under some conditions that boot completion was ignored
and the job kept pending.
```
  608421da
- whitespace fix - tabs not spaces · 1743225c
  Tim Wickberg authored Jan 25, 2016
  
  1743225c
- Fix use of uninitialized variable in Perl API's slurm_job_step_get_pids · 03b254ef
  Sergey Meirovich authored Jan 25, 2016
  
  03b254ef
22 Jan, 2016 1 commit
- Remove unneeded line from NEWS · 2b2fd513
  Danny Auble authored Jan 22, 2016
  
  2b2fd513
21 Jan, 2016 11 commits
- MySQL - Fix querying jobs with reservations when the IDs have rolled. · 201d691a
  Danny Auble authored Jan 21, 2016
```
Bug 2364
```
  201d691a
- Remove redundant logic when updating a job's task count. · ce3d6c8d
  Danny Auble authored Jan 21, 2016
```
Commit fa331e30 fixes this.  The logic was bad to begin with...

uint32_t new_cpus = detail_ptr->num_tasks
	/ detail_ptr->cpus_per_task;

The / should have been * this whole time.  This was the reason we found
this in the first place.
```
  ce3d6c8d
- Fix typo in slurm.conf man page · b23f2314
  Morris Jette authored Jan 21, 2016
```
bug 2369
```
  b23f2314
- Spelling fixes · 356a25c4
  Gennaro Oliva authored Jan 21, 2016
  
  356a25c4
- Add scancel delay if slow responses · d8b12b01
  Morris Jette authored Jan 21, 2016
```
If scancel is operating on large number of jobs and RPC responses from
    slurmctld daemon are slow then introduce a delay in sending the cancel job
    requests from scancel in order to reduce load on slurmctld.
bug 2256
```
  d8b12b01
- Fix test for delayed job launch · 456d498d
  Morris Jette authored Jan 21, 2016
```
If a job launch is delayed, the test was failing due to bad parsing.
These lines were being interpreted as a counter followed by node
names of "queued" and "has":
  srun: job 1332712 queued and waiting for resources
  srun: job 1332712 has been allocated resources
```
  456d498d
- Fix typo of "childern" · 9cc9985e
  Morris Jette authored Jan 21, 2016
  
  9cc9985e
- Modify error message when MaxJobCount reached · f11ec171
  Morris Jette authored Jan 21, 2016
```
bug 2366
```
  f11ec171
- Make it so daemons also support TopologyParam=NoInAddrAny · 4b9cf731
  Danny Auble authored Jan 20, 2016
  
  4b9cf731
- Backfill sync with Cray NHC · 79a21bd6
  Morris Jette authored Jan 20, 2016
```
Backfill scheduling properly synchronized with Cray Node Health Check.
    Prior logic could result in highest priority job getting improperly
    postponed.
bug 2350
```
  79a21bd6
- add lines in news for next tag 15.08.8 · d461b724
  Danny Auble authored Jan 20, 2016
  
  d461b724
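
The arithmetic slip noted in commit ce3d6c8d above (dividing `num_tasks` by `cpus_per_task` where a multiplication was intended) is easy to check with a small worked example. The helper `demo_cpus_for_tasks()` is hypothetical; only the two field names come from the commit message.

```c
#include <stdint.h>

/* CPUs needed = tasks * CPUs per task (the commit notes "/" should be "*"). */
static uint32_t demo_cpus_for_tasks(uint32_t num_tasks, uint16_t cpus_per_task)
{
	return num_tasks * (uint32_t)cpus_per_task;
}
```

With 8 tasks and 2 CPUs per task the job needs 16 CPUs, whereas the division would have produced 4.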
20 Jan, 2016 7 commits
- New tag for 15.08.7 · be18995e
  Danny Auble authored Jan 20, 2016
  
  be18995e
- Add missing job states from the qstat wrapper. · 72355b73
  Danny Auble authored Jan 20, 2016
  
  72355b73
- Strip flags from a job state in qstat wrapper before evaluating. · 24732902
  Danny Auble authored Jan 20, 2016
  
  24732902
- Expand memory leak test description · fa63067f
  Morris Jette authored Jan 20, 2016
  
  fa63067f
- Correct job_cnt_run NULL pointer · 207adf8e
  Morris Jette authored Jan 20, 2016
```
This corrects logic from commit e5a61746
that could result in use of NULL pointer
```
  207adf8e
- Prevent job_cnt_run · f76586bf
  Morris Jette authored Jan 19, 2016
```
It was previously triggered by executing "scontrol reconfig" on a
  front-end system while there was a job in completing state.
```
  f76586bf
- Properly track resources for suspended jobs on reconfig · 21c52d2f
  Morris Jette authored Jan 19, 2016
```
Properly account for memory, CPUs and GRES when slurmctld is reconfigured
    while there is a suspended job. Previous logic would add the CPUs, but not
    memory or GPUs. This would result in underflow/overflow errors in select
    cons_res plugin.
bug 2353
```
  21c52d2f