Commits · b717bf5cbc222e773768e15f41d1a27c1492d239 · Manuel G. Marciani / ces_slurm_simulator

02 Feb, 2016 3 commits

Fix build for sh5util on ppc64 by replacing printf formatters · b717bf5c
Tim Wickberg authored Feb 02, 2016
```
Use PRIu64 instead of %ld for uint64_t types (libsh5util_old),
%zu instead of PRIu64 for size_t.
```
b717bf5c
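
The formatter change described in this commit can be illustrated with a minimal sketch. The helper name `format_sizes` and its arguments are hypothetical and are not taken from sh5util; the only point is that `PRIu64` pairs with `uint64_t` and `%zu` pairs with `size_t`, regardless of how wide `long` is on the target platform.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a uint64_t value and a size_t value with
 * conversion specifiers that match their types on every platform. */
int format_sizes(char *buf, size_t buflen, uint64_t bytes, size_t count)
{
	assert(buf != NULL && buflen > 0);
	/* PRIu64 expands to the correct length modifier for uint64_t;
	 * %zu is the standard conversion for size_t. */
	return snprintf(buf, buflen, "bytes=%" PRIu64 " count=%zu",
			bytes, count);
}
```

Calling `format_sizes(buf, sizeof(buf), some_u64, some_size)` fills `buf` without relying on `%ld`, which is only correct where `long` happens to match the argument type.
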
update NEWS to mention FreeBSD fixes · e9982fa4
Tim Wickberg authored Feb 02, 2016

e9982fa4

Fix support for AuthInfo in slurmdbd.conf · fa4222ec

Didier GAZEN authored Feb 02, 2016

Support AuthInfo in slurmdbd.conf that is different from the value in
    slurm.conf.
There is a possible bug in the slurm_get_auth_info function (src/common/slurm_protocol_api.c) that can cause the slurmdbd daemon to look for the AuthInfo parameter in slurm.conf instead of slurmdbd.conf when the auth/munge authentication method is used (AuthType=auth/munge).

Here is the slurmdbd log revealing the problem (debug5() print statements were added to the sources):

slurmdbd: slurmdbd version 15.08.7 started
slurmdbd: debug2: running rollup at Tue Feb 02 14:20:14 2016
slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3601)
slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1732)
slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
slurmdbd: debug:  Reading slurm.conf file: /usr/local/slurm-15-08-7-1/etc/slurm.conf
slurmdbd: error: s_p_parse_file: unable to status file /usr/local/slurm-15-08-7-1/etc/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
...

Then 60 seconds later, the auth_info value returned by slurm_get_auth_info is NULL:

slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=(null)

and slurmdbd continues without crashing, but I am not sure it is in a safe state.

With this patch applied:

diff --git a/src/common/slurm_protocol_api.c b/src/common/slurm_protocol_api.c
index c5db879..be1dab6 100644
--- a/src/common/slurm_protocol_api.c
+++ b/src/common/slurm_protocol_api.c
@@ -1703,9 +1703,13 @@ extern char *slurm_get_auth_info(void)
        char *auth_info;
        slurm_ctl_conf_t *conf;

-       conf = slurm_conf_lock();
-       auth_info = xstrdup(conf->authinfo);
-       slurm_conf_unlock();
+       if (slurmdbd_conf) {
+                auth_info = xstrdup(slurmdbd_conf->auth_info);
+        } else {
+               conf = slurm_conf_lock();
+               auth_info = xstrdup(conf->authinfo);
+               slurm_conf_unlock();
+       }

        return auth_info;
 }

the auth_info value is now valid and consistent with the slurmdbd.conf setting:

slurmdbd: slurmdbd version 15.08.7 started
slurmdbd: debug2: running rollup at Tue Feb 02 14:47:37 2016
slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3600)
slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1731)
slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=socket=/var/run/munge/munge_dbd.socket.2

fa4222ec
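
The selection logic in the patch above can be sketched in isolation. This is a simplified model, not the Slurm source: `struct dbd_conf`, `struct ctld_conf`, `dup_or_null` and `get_auth_info` are hypothetical stand-ins, plain `malloc` replaces `xstrdup`, and the configuration locking done by `slurm_conf_lock()` is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the two configuration structures. */
struct dbd_conf  { const char *auth_info; };
struct ctld_conf { const char *authinfo; };

/* Duplicate a string, passing NULL through unchanged. */
static char *dup_or_null(const char *s)
{
	char *copy;

	if (!s)
		return NULL;
	copy = malloc(strlen(s) + 1);
	if (copy)
		strcpy(copy, s);
	return copy;
}

/* Prefer the slurmdbd-side value whenever a dbd configuration is
 * loaded; otherwise fall back to the slurm.conf-side value.
 * The caller owns and frees the returned string. */
char *get_auth_info(const struct dbd_conf *dbd, const struct ctld_conf *ctld)
{
	if (dbd)
		return dup_or_null(dbd->auth_info);
	assert(ctld != NULL);
	return dup_or_null(ctld->authinfo);
}
```

The design point is that the daemon never touches the second configuration source when its own is present, which is what avoids the 60-second retry on a missing slurm.conf seen in the first log.
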

01 Feb, 2016 1 commit

update contributed cray config generator · 9f45e264

David Gloe authored Feb 01, 2016

contribs/cray/csm/slurmconfgen_smw.py - avoid erroneously including
repurposed compute nodes in the list of nodes to start slurmd.

9f45e264

29 Jan, 2016 1 commit

Fix double free of memory in slurmctld background mode · b1d88c7b

Morris Jette authored Jan 29, 2016

When the slurmctld is in background mode, it will issue double
  free calls on the incoming message buffers, likely leading
  to an abort.

b1d88c7b
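
As a general illustration of the failure mode, and not the change made in this commit, the sketch below shows one common defensive pattern: clear the pointer when the buffer is released, so that a second release becomes a harmless no-op. `struct msg` and `msg_release` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message wrapper that owns a heap buffer. */
struct msg { char *buffer; };

/* Release the buffer and clear the pointer. Calling this twice is
 * safe because free(NULL) does nothing. */
void msg_release(struct msg *m)
{
	assert(m != NULL);
	free(m->buffer);
	m->buffer = NULL;
}
```

Without the `m->buffer = NULL` line, a second call would pass an already-freed pointer to `free()`, which is undefined behavior and typically ends in an abort from the allocator.
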

28 Jan, 2016 5 commits

Don't relocate multi-node core reservations · a801d264

Morris Jette authored Jan 28, 2016

Do not automatically relocate an advanced reservation for individual cores
that spans multiple nodes when nodes in that reservation go down (e.g.
a 1 core reservation on node "tux1" will be moved if node "tux1" goes
down, but a reservation containing 2 cores on node "tux1" and 3 cores on
"tux2" will not be moved if node "tux1" goes down). Advanced reservations for
whole nodes will be moved by default for down nodes.
bug 2326

a801d264

srun - check that found file is not a directory · 15c4bcf1

Tim Wickberg authored Jan 28, 2016

avoid attempting to execve() a directory with a name that
happens to match that of the desired command. bug 2392.

15c4bcf1
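
A check of this kind can be sketched with `stat()`. The helper `is_executable_file` is a hypothetical name and this is not the srun code; it only shows why an execute-permission test alone is not enough, since directories normally carry the execute (search) bit too.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: true only if path names a regular file that the
 * caller may execute, so a directory with the same name is skipped. */
bool is_executable_file(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return false;		/* missing or not accessible */
	if (!S_ISREG(st.st_mode))
		return false;		/* directory, device, socket, ... */
	return access(path, X_OK) == 0;
}
```

A PATH search loop would call this on each candidate and continue to the next directory entry when it returns false, instead of handing the first name match to `execve()`.
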

Ignore a reservation's jobs when changing · b77666b5

Morris Jette authored Jan 28, 2016

Allow an existing reservation with running jobs to be modified without
    Flags=IGNORE_JOBS.
bug 2389

b77666b5

burst_buffer/cray - avoid overflow · 214b3abe

Morris Jette authored Jan 27, 2016

burst_buffer/cray - Increase size of intermediate variable used to store
    buffer byte size read from DW instance from 32 to 64 bits to avoid overflow
    and reporting invalid buffer sizes.
bug 2378

214b3abe
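
The effect of a 32-bit intermediate on a byte count can be shown with a small sketch. The functions and the GiB unit are illustrative assumptions and do not come from the burst_buffer/cray plugin.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a size in GiB to bytes. Widening to 64 bits before the
 * multiplication keeps results of 4 GiB and above intact. */
uint64_t gib_to_bytes(uint32_t gib)
{
	return (uint64_t)gib * 1024u * 1024u * 1024u;
}

/* For contrast: the same value forced through a 32-bit variable wraps
 * modulo 2^32 and reports a wrong size. */
uint32_t gib_to_bytes_truncated(uint32_t gib)
{
	return (uint32_t)gib_to_bytes(gib);
}
```

For a 5 GiB buffer the 64-bit version yields 5368709120 bytes, while the truncated version yields 1073741824, which is the kind of invalid size the commit message describes.
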

GRES - Fix minor typecast issues. · 6f94bb7f
Danny Auble authored Jan 27, 2016

6f94bb7f

27 Jan, 2016 5 commits
- Sanity Check Patch to setup variables for RAPL if in a race for it. · 1a1e74c3
  Danny Auble authored Jan 27, 2016
  
  1a1e74c3
- Move debug messages to debug3 notifying a gres_bit_alloc was NULL for · a2559f00
  Danny Auble authored Jan 27, 2016
```
gres types without a File.
```
  a2559f00
- Fix incorrect logic when querying assoc_mgr information. · 43ed2140
  Danny Auble authored Jan 27, 2016
  
  43ed2140
- Move debug messages like "not the right user" from association manager · 59ef2158
  Danny Auble authored Jan 27, 2016
```
to debug3 when trying to find the correct association.

a continuation to commit 87d9370f
```
  59ef2158
- Increase debug level to debug3 as function just iterates over associations · 87d9370f
  Alejandro Sanchez authored Jan 27, 2016
  
  87d9370f
25 Jan, 2016 2 commits
- Launch job requesting reboot after the boot completes · 608421da
  Morris Jette authored Jan 25, 2016
```
Previously, under some conditions, the boot completion was ignored
and the job remained pending.
```
  608421da
- Fix use of uninitialized variable in Perl API's slurm_job_step_get_pids · 03b254ef
  Sergey Meirovich authored Jan 25, 2016
  
  03b254ef
22 Jan, 2016 1 commit
- Remove unneeded line from NEWS · 2b2fd513
  Danny Auble authored Jan 22, 2016
  
  2b2fd513
21 Jan, 2016 7 commits
- MySQL - Fix querying jobs with reservations when the id's have rolled. · 201d691a
  Danny Auble authored Jan 21, 2016
```
Bug 2364
```
  201d691a
- Remove redundant logic when updating a job's task count. · ce3d6c8d
  Danny Auble authored Jan 21, 2016
```
Commit fa331e30 fixes this.  The logic was bad to begin with...

uint32_t new_cpus = detail_ptr->num_tasks
	/ detail_ptr->cpus_per_task;

The / should have been * this whole time.  This was the reason we found
this in the first place.
```
  ce3d6c8d
- Add scancel delay if slow responses · d8b12b01
  Morris Jette authored Jan 21, 2016
```
If scancel is operating on a large number of jobs and RPC responses from
    slurmctld daemon are slow then introduce a delay in sending the cancel job
    requests from scancel in order to reduce load on slurmctld.
bug 2256
```
  d8b12b01
- Fix typo of "childern" · 9cc9985e
  Morris Jette authored Jan 21, 2016
  
  9cc9985e
- Make it so daemons also support TopologyParam=NoInAddrAny · 4b9cf731
  Danny Auble authored Jan 20, 2016
  
  4b9cf731
- Backfill sync with Cray NHC · 79a21bd6
  Morris Jette authored Jan 20, 2016
```
Backfill scheduling properly synchronized with Cray Node Health Check.
    Prior logic could result in highest priority job getting improperly
    postponed.
bug 2350
```
  79a21bd6
- add lines in news for next tag 15.08.8 · d461b724
  Danny Auble authored Jan 20, 2016
  
  d461b724
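
The arithmetic noted in the ce3d6c8d entry above (the CPU count computed from the task count and CPUs per task) can be checked with a minimal sketch. The function name `cpus_for_tasks` and the parameter types are assumptions chosen for illustration; overflow handling is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total CPUs needed is tasks multiplied by CPUs
 * per task. Dividing instead shrinks the request. */
uint32_t cpus_for_tasks(uint32_t num_tasks, uint16_t cpus_per_task)
{
	assert(cpus_per_task > 0);
	return num_tasks * cpus_per_task;
}
```

With 4 tasks and 2 CPUs per task the product is 8 CPUs, whereas the division quoted in the commit message would give 2.
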
20 Jan, 2016 3 commits
- Add missing job states from the qstat wrapper. · 72355b73
  Danny Auble authored Jan 20, 2016
  
  72355b73
- Strip flags from a job state in qstat wrapper before evaluating. · 24732902
  Danny Auble authored Jan 20, 2016
  
  24732902
- Properly track resources for suspended jobs on reconfig · 21c52d2f
  Morris Jette authored Jan 19, 2016
```
Properly account for memory, CPUs and GRES when slurmctld is reconfigured
    while there is a suspended job. Previous logic would add the CPUs, but not
    memory or GPUs. This would result in underflow/overflow errors in select
    cons_res plugin.
bug 2353
```
  21c52d2f
17 Jan, 2016 1 commit

jette authored Jan 16, 2016

Fix backfill scheduling bug which could postpone the scheduling of jobs due
    to avoidance of nodes in COMPLETING state.
bug 2350

1a4b5983

15 Jan, 2016 3 commits
- Prevent decrementing of TRESRunMins when AccountingStorageEnforce=limits is not set. · 47c22b2e
  Brian Christiansen authored Jan 15, 2016
```
Bug 2255
```
  47c22b2e
- Fix memory leak in slurmctld job array logic · 38ca2e67
  Morris Jette authored Jan 15, 2016
  
  38ca2e67
- Fix nodes from being overallocated when allocation straddles multiple nodes. · fb604c7e
  Brian Christiansen authored Jan 14, 2016
```
Bug 2343
```
  fb604c7e
14 Jan, 2016 2 commits

fix AuthInfo with alternate munge socket location · f3d54f99

Morris Jette authored Jan 14, 2016

Fix for configuration of "AuthType=munge" and "AuthInfo=socket=..." with
    alternate munge socket path.
bug 2348

f3d54f99

Avoid slurmstepd abort if malloc fails for accounting · d5400aa5

Morris Jette authored Jan 13, 2016

If a node is out of memory, then the malloc performed by slurmstepd
  periodically may fail, killing the slurmstepd and orphaning its
  processes.
bug 2341

d5400aa5

13 Jan, 2016 2 commits

backfill scheduling with group limits fix · 3ee1632f

Morris Jette authored Jan 13, 2016

Backfill scheduling fix: If a job can't be started due to a "group" resource
limit, rather than reserve resources for it when the next job ends, don't
reserve any resources for it. The problem with the original logic is that
if a lot of resources are reserved for such pending jobs, then jobs further
down the queue may be deferred when they really can and should be started. An
ideal solution would track all of the TRES resources through time as jobs
start and end, but we don't have that logic in the backfill scheduler and
don't want that extra overhead in the backfill scheduler.
bugs 2326 and 2282

3ee1632f

Add more partition info to "scontrol write config" · f428705b
Alejandro Sanchez authored Jan 13, 2016
```
bug 2303
```
f428705b

12 Jan, 2016 4 commits

Cleanup slurm_sprint_partition_info / slurm_sprint_reservation_info · 581e811c

Tim Wickberg authored Jan 12, 2016

Handle unexpectedly large lines for hostlists. (Bug 2333.)

While here rework to avoid extraneous xstrcat calls by using
xstrfmtcat instead of snprintf + xstrcat. Collapse line end into
own string for readability.

No performance or functional change, aside from removing possible
line truncation (which will silence additional Coverity warnings).

Removes a double xfree() in slurm_sprint_reservation_info().

581e811c

Compress reservation host names · 30a8150c

Morris Jette authored Jan 12, 2016

When a reservation is created or updated, compress user provided node names
    using hostlist functions (e.g. translate user input of "Nodes=tux1,tux2"
    into "Nodes=tux[1-2]").
bug 2333

30a8150c
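
The compression described above can be illustrated with a deliberately simplified sketch. It is not the Slurm hostlist code: `compress_hosts` is a hypothetical function that assumes one shared prefix and a sorted, duplicate-free list of numeric suffixes, and it always emits brackets (a single host comes out as `tux[5]` here).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified illustration: write "prefix[a-b,c,...]" into buf.
 * Returns 0 on success, -1 if buf is too small. */
int compress_hosts(const char *prefix, const int *ids, size_t n,
		   char *buf, size_t buflen)
{
	size_t used, i = 0;
	int len;

	assert(prefix && ids && buf && n > 0);
	len = snprintf(buf, buflen, "%s[", prefix);
	if (len < 0 || (size_t)len >= buflen)
		return -1;
	used = (size_t)len;

	while (i < n) {
		size_t j = i;

		/* Extend the run while the next id is consecutive. */
		while (j + 1 < n && ids[j + 1] == ids[j] + 1)
			j++;
		if (j > i)
			len = snprintf(buf + used, buflen - used, "%s%d-%d",
				       i ? "," : "", ids[i], ids[j]);
		else
			len = snprintf(buf + used, buflen - used, "%s%d",
				       i ? "," : "", ids[i]);
		if (len < 0 || (size_t)len >= buflen - used)
			return -1;
		used += (size_t)len;
		i = j + 1;
	}
	if (used + 2 > buflen)	/* room for ']' and the terminator */
		return -1;
	buf[used++] = ']';
	buf[used] = '\0';
	return 0;
}
```

Given the prefix `tux` and the suffixes 1 and 2, this produces `tux[1-2]`, matching the example in the commit message.
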

`qstat -u` should return zero even with no records printed · addb3574
Tim Wickberg authored Jan 12, 2016
```
Match behavior of other PBS-like resource managers. Bug 2330.
```
addb3574
Initial commit to add TRES to creating a reservation · 7eff526c
Alejandro Sanchez authored Jan 11, 2016

7eff526c