Commits · efd9d35eb3ab2c75665beff05a74294b27d94f75 · Manuel G. Marciani / ces_slurm_simulator

02 Mar, 2016 2 commits

Backfill scheduler to validate correct job partition · efd9d35e

Gary B Skouson authored Mar 02, 2016

Previous logic tested whatever the job's partition pointer indicated
rather than the partition we are trying to run the job in. This bug
was introduced in Slurm version 15.08.5 (Nov 16, 2015), commit 94f0e948.
bug 2499

efd9d35e
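
The difference between the old and new checks can be illustrated with a minimal sketch. The structs and function names below are hypothetical simplifications for illustration, not Slurm's actual records or backfill code; the only point is that the check must look at the candidate partition rather than at the job's own pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Slurm's partition and job records. */
struct part_record {
    const char *name;
    bool up;                        /* partition accepts job starts */
};

struct job_record {
    struct part_record *part_ptr;   /* partition the job currently points at */
};

/* Buggy pattern: ignores the candidate and tests the job's own pointer. */
bool can_run_buggy(const struct job_record *job,
                   const struct part_record *candidate)
{
    (void) candidate;
    return job->part_ptr != NULL && job->part_ptr->up;
}

/* Fixed pattern: validate the partition we are about to schedule into. */
bool can_run_fixed(const struct job_record *job,
                   const struct part_record *candidate)
{
    (void) job;
    return candidate != NULL && candidate->up;
}
```

For a job submitted to several partitions, the two functions disagree whenever the job's pointer refers to a different partition than the one being evaluated.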

Remove a duplicate xmalloc · 2d5066e7
Thomas Cadeau authored Feb 26, 2016

2d5066e7

01 Mar, 2016 2 commits

Update NEWS as well. · a058ff4a
Tim Wickberg authored Mar 01, 2016

a058ff4a

Defer suspend until launch completes · 52fe3de1

Morris Jette authored Feb 29, 2016

Ensure that a job is completely launched before trying to suspend it.
Previous logic would start suspend logic early in the life of the
slurmstepd process, after its listening socket was open but before
the tasks were launched. This defers the suspend logic until after
all prologs and setup complete and the tasks are launched. This is
important in the case of gang scheduling, in which newly launched
jobs can be immediately suspended.
bug 2494

52fe3de1
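
The deferral described above might be sketched as a small state machine. This is an assumption-laden illustration, not slurmstepd's code: the struct and both functions are hypothetical, and the real implementation may block the suspend request rather than record it as pending.

```c
#include <stdbool.h>

/* Hypothetical, simplified step state; not slurmstepd's real data structures. */
struct step_state {
    bool launched;          /* prologs and setup done, tasks launched */
    bool suspend_pending;   /* a suspend arrived before launch completed */
    bool suspended;
};

/* Handle a suspend request: act now only if the launch has completed. */
void request_suspend(struct step_state *s)
{
    if (s->launched)
        s->suspended = true;
    else
        s->suspend_pending = true;   /* defer until mark_launched() */
}

/* Called once all tasks are launched; applies any deferred suspend. */
void mark_launched(struct step_state *s)
{
    s->launched = true;
    if (s->suspend_pending) {
        s->suspend_pending = false;
        s->suspended = true;
    }
}
```

With gang scheduling, a suspend can arrive right after the step starts; in this sketch it takes effect only once `mark_launched()` runs.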

26 Feb, 2016 2 commits
- Set correct reason when a QOS' MaxTresMins is violated. · 745568f2
  Danny Auble authored Feb 26, 2016
  
  745568f2
- Add note to slurm.conf man page about SallocDefaultCommand and TaskPlugins. · b5b349b0
  Tim Wickberg authored Feb 25, 2016
```
Add note to slurm.conf man page about setting "--cpu_bind=no" as part
of SallocDefaultCommand if a TaskPlugin is in use.
```
  b5b349b0
25 Feb, 2016 1 commit
- Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS= · fcae2193
  Danny Auble authored Feb 24, 2016
```
was also given.
```
  fcae2193
24 Feb, 2016 5 commits
- Make it so scontrol update part qos= will take away a partition QOS from · 3a7470ae
  Danny Auble authored Feb 24, 2016
```
a partition.
```
  3a7470ae
- Make it possible to change CPUsPerTask with scontrol. · de28c13a
  Danny Auble authored Feb 24, 2016
```
This also reverts most of commit fa331e30 as well as commit bd9fa830
which would try to set the pn_min_cpus every time a job was updated.
If a job didn't request node counts then they were hosed.

This commit takes away the magic which was screwing things up.  Now the
person gets what they asked for without magic changing things.

Bug 2302
Bug 2742
Bug 2478
```
  de28c13a
- Fix issue where when updating a job the pn_min_cpus was updated · bd9fa830
  Danny Auble authored Feb 24, 2016
```
erroneously.
```
  bd9fa830
- BGQ - Tighter locks around structures when nodes/cables change state. · c5925f41
  Danny Auble authored Feb 23, 2016
  
  c5925f41
- BGQ - Remove redeclaration of job_read_lock. · fd3dedda
  Danny Auble authored Feb 23, 2016
  
  fd3dedda
23 Feb, 2016 1 commit

Fix issue with resizing jobs and limits not being tracked correctly. · 92ac0dcd

Danny Auble authored Feb 22, 2016

This whole process could probably be done better by keeping track of
old values and new values and only calling one function instead of a
pre and post function, but that can probably wait for future generations
of the code as it works now and is probably adequate for the time being.

Bug 2352

92ac0dcd

19 Feb, 2016 2 commits

BurstBuffer/cray pre-run race condition fix · e8959ae9

Morris Jette authored Feb 19, 2016

BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
    operation in progress to avoid inconsistent state due to multiple calls
    to job termination functions.
bug 2454

e8959ae9

Update NEWS for start of v15.08.9 · 4aa2e3c2
Morris Jette authored Feb 18, 2016

4aa2e3c2

18 Feb, 2016 5 commits

MYSQL - Avoid having multiple default accounts when a user is added to · eafc3f71
Danny Auble authored Feb 18, 2016
```
a new account and making it a default all at once.

Bug 2428
```
eafc3f71
Prevent segfault in acct_gather_energy_cray if data requested before set. · c6771b1e
Alejandro Sanchez authored Feb 18, 2016
```
Match acct_gather_energy/rapl plugin. Bug 2397.
```
c6771b1e

Add new assoc_limit_continue option to SchedulerParameters · 0615d809

Tim Wickberg authored Feb 18, 2016

Control whether the scheduler will continue to try to run jobs in
a partition if a higher priority job is stuck due to an association
limit.

Can cause starvation for larger jobs, but will improve throughput and
utilization for systems that have extensively divvyed up their resources
through association/QOS limits.

Bug 2388 and 2452.

0615d809
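
The trade-off described in this commit can be sketched as a loop over one partition's queue. The `struct job` and `schedule_partition()` below are hypothetical and greatly simplified; they only illustrate "stop at the stuck job" versus "skip it and continue", not the scheduler's real logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified job entry; not Slurm's real job_record. */
struct job {
    int id;
    bool blocked_by_assoc_limit;  /* job is held by an association/QOS limit */
    bool started;                 /* set by the sketch when the job is started */
};

/*
 * Walk one partition's queue, already sorted by descending priority.
 * Returns the number of jobs started.
 */
size_t schedule_partition(struct job *queue, size_t n, bool assoc_limit_continue)
{
    size_t started = 0;

    for (size_t i = 0; i < n; i++) {
        if (queue[i].blocked_by_assoc_limit) {
            if (assoc_limit_continue)
                continue;   /* skip the stuck job, keep trying lower-priority jobs */
            break;          /* stop: nothing behind the stuck job may start */
        }
        queue[i].started = true;
        started++;
    }
    return started;
}
```

With the flag set, lower-priority jobs can overtake the stuck job, which is where both the throughput gain and the starvation risk for larger jobs come from.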

Fix inadequate locks when updating a partition's TRES. · b223e52c
Danny Auble authored Feb 18, 2016
```
Bug 2453
```
b223e52c

When determining if a job can fit into a TRES time limit after resources · 244b34e0

Tim Wickberg authored Feb 17, 2016

have been selected set the time limit appropriately if the job didn't
request one.

If the partition has no DefaultTime setting, and no time_limit was given
for the job, job_ptr->time_limit == NO_VAL.

With AccountingStorageEnforce=safe this will prevent jobs from ever
starting if the association has any limit set for CPUMins.
(NO_VAL * cpus is a very large number, but if no time_limit is given
anywhere that is what they get :))

Bug 2388.

244b34e0
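
The arithmetic behind the problem can be checked with a short sketch. `NO_VAL` is the 32-bit sentinel from slurm.h; the two helper functions are hypothetical, and the fallback value is an assumption, since the commit message does not say which limit is substituted.

```c
#include <stdint.h>

/* NO_VAL as defined in slurm.h (32-bit "value not set" sentinel). */
#define NO_VAL (0xfffffffe)

/* CPU-minutes a job would be charged: time limit (minutes) times CPU count. */
uint64_t cpu_mins_needed(uint32_t time_limit, uint32_t cpus)
{
    return (uint64_t) time_limit * cpus;
}

/* Hypothetical helper: fall back to a default when no limit was requested. */
uint32_t effective_time_limit(uint32_t time_limit, uint32_t fallback)
{
    return (time_limit == NO_VAL) ? fallback : time_limit;
}
```

For a 16-CPU job with no time limit, `cpu_mins_needed(NO_VAL, 16)` is 68719476704 CPU-minutes, which exceeds any realistic CPUMins limit, so the job would never be allowed to start.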

17 Feb, 2016 2 commits
- Make sure count of boards is restored when slurmctld has option -R. · 0ac68394
  Danny Auble authored Feb 17, 2016
  
  0ac68394
- Fix sreport's truncation of columns with large TRES and not using · 39095f0d
  Danny Auble authored Feb 16, 2016
```
a parsing option.

Dynamically set columns widths based on largest number in column.
```
  39095f0d
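
Sizing a column from its largest value can be sketched as follows. This is a generic illustration with hypothetical function names, not sreport's actual printing code.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Number of decimal digits needed to print v. */
int digits_u64(uint64_t v)
{
    int n = 1;

    while (v >= 10) {
        v /= 10;
        n++;
    }
    return n;
}

/* Width wide enough for the largest value in the column. */
int column_width(const uint64_t *vals, size_t n)
{
    int width = 1;

    for (size_t i = 0; i < n; i++) {
        int d = digits_u64(vals[i]);
        if (d > width)
            width = d;
    }
    return width;
}

/* Print the column right-aligned using the computed width. */
void print_column(const uint64_t *vals, size_t n)
{
    int width = column_width(vals, n);

    for (size_t i = 0; i < n; i++)
        printf("%*" PRIu64 "\n", width, vals[i]);
}
```

Because the width is derived from the data, large TRES values are never truncated, at the cost of one extra pass over the column before printing.
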
16 Feb, 2016 2 commits
- If FastSchedule=0 is used make sure TRES are set up correctly in · 120b3e6f
  Danny Auble authored Feb 16, 2016
```
accounting.
```
  120b3e6f
- MYSQL - Fix issue when rerolling monthly data to work off correct time · 3ff436c0
  Danny Auble authored Feb 11, 2016
```
period.  This would only hit you if you rerolled a 15.08 prior to this
commit.
```
  3ff436c0
12 Feb, 2016 1 commit

Store ClusterName under state directory, use to avoid corruption at startup. · 277b270f

Tim Wickberg authored Feb 11, 2016

Compare the saved clustername to slurmctld's; if there's a mismatch, prevent
slurmctld from starting to avoid corruption from separate clusters attempting
to share a single state directory.

Bug 2433.

277b270f
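
A startup check of this kind might look like the sketch below. The function name, the file path passed in, and the one-line file format are all assumptions made for illustration; the commit message does not describe how slurmctld stores the name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: `path` is a file under the state directory holding the
 * cluster name. Returns 0 if startup may proceed, -1 on mismatch or I/O error.
 */
int check_cluster_name(const char *path, const char *cluster_name)
{
    char saved[256];
    FILE *fp = fopen(path, "r");

    if (fp == NULL) {
        /* First start: record the name for future comparisons. */
        fp = fopen(path, "w");
        if (fp == NULL)
            return -1;
        fprintf(fp, "%s\n", cluster_name);
        return (fclose(fp) == 0) ? 0 : -1;
    }

    if (fgets(saved, sizeof(saved), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    saved[strcspn(saved, "\n")] = '\0';   /* strip trailing newline */

    /* A mismatch means another cluster owns this state directory. */
    return (strcmp(saved, cluster_name) == 0) ? 0 : -1;
}
```

The first call records the name; later calls with a different name fail, which is the point at which the daemon would refuse to start.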

10 Feb, 2016 3 commits

Fix slurmdbd segfault when listing users with blank user condition. · aecd9432
Brian Christiansen authored Feb 10, 2016
```
Bug 2427

Reproduced with:
sacctmgr show user withassoc user= user=<username>
```
aecd9432

power/cray enhancements · 2cbc16af

Morris Jette authored Feb 10, 2016

Add new PowerParameters options of get_timeout and set_timeout. The default
    set_timeout was increased from 5 seconds to 30 seconds. Also re-read current
    power caps periodically or after any failed "set" operation.
bug 2332

2cbc16af

Forgot to update NEWS. · 3bf84c90

Tim Wickberg authored Feb 10, 2016

Build should work on non-glibc distributions through better POSIX conformance.

3bf84c90

09 Feb, 2016 2 commits
- Add function to remove assoc_mgr_info_request_t members without freeing · 166da802
  Danny Auble authored Feb 08, 2016
```
structure.
```
  166da802
- Flesh out filters for scontrol show assoc_mgr · ee37357f
  Alejandro Sanchez authored Feb 08, 2016
  
  ee37357f
08 Feb, 2016 1 commit
- Remove unneeded perl files from the .spec file. · 656b3501
  Danny Auble authored Feb 08, 2016
  
  656b3501
04 Feb, 2016 1 commit
- Fix for jobs requesting cpus-per-task (eg. -c3) that exceed the number of cpus on a core. · 51981a1f
  Brian Christiansen authored Feb 04, 2016
```
Bug 2406

Continuations of a0bb9c1e and 4b0fc9e8
```
  51981a1f
03 Feb, 2016 2 commits
- Fix perl api for newer perl versions. · 0b66cc22
  Yu Watanabe authored Feb 02, 2016
  
  0b66cc22
- Fix cosmetic printing of NO_VALs in scontrol show assoc_mgr. · 20b2406f
  Danny Auble authored Feb 02, 2016
  
  20b2406f
02 Feb, 2016 3 commits

Fix build for sh5util on ppc64 by replacing printf formatters · b717bf5c
Tim Wickberg authored Feb 02, 2016
```
Use PRIu64 instead of %ld for uint64_t types (libsh5util_old),
%zu instead of PRIu64 for size_t.
```
b717bf5c
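
As a brief illustration of the formatter change, the sketch below prints a `uint64_t` with `PRIu64` and a `size_t` with `%zu`. The function name and the message format are hypothetical; only the two conversion specifiers come from the commit.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format a uint64_t and a size_t portably into buf; returns snprintf's result. */
int format_sizes(char *buf, size_t len, uint64_t bytes, size_t count)
{
    /* PRIu64 expands to the right conversion for uint64_t on every platform;
     * %zu is the C99 conversion for size_t. */
    return snprintf(buf, len, "bytes=%" PRIu64 " count=%zu", bytes, count);
}
```

Using `%ld` for a `uint64_t` is only correct where `long` happens to be 64 bits wide and ignores signedness, which is why fixed-width types need the `<inttypes.h>` macros.
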
update NEWS to mention FreeBSD fixes · e9982fa4
Tim Wickberg authored Feb 02, 2016

e9982fa4

Fix support for AuthInfo in slurmdbd.conf · fa4222ec

Didier GAZEN authored Feb 02, 2016

Support AuthInfo in slurmdbd.conf that is different from the value in
    slurm.conf.
There is a possible bug in the slurm_get_auth_info function (src/common/slurm_protocol_api.c) that can cause the slurmdbd daemon to look for the AuthInfo parameter in slurm.conf instead of slurmdbd.conf when the auth/munge authentication method is used (AuthType=auth/munge).

Here is the slurmdbd log revealing the problem (debug5() print statements were added to the sources):

slurmdbd: slurmdbd version 15.08.7 started
slurmdbd: debug2: running rollup at Tue Feb 02 14:20:14 2016
slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3601)
slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1732)
slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
slurmdbd: debug:  Reading slurm.conf file: /usr/local/slurm-15-08-7-1/etc/slurm.conf
slurmdbd: error: s_p_parse_file: unable to status file /usr/local/slurm-15-08-7-1/etc/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
...

Then 60 seconds later, the auth_info value returned by slurm_get_auth_info is NULL:

slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=(null)

and slurmdbd continues without crashing, but I am not sure it is in a safe state.

When applying this patch :

diff --git a/src/common/slurm_protocol_api.c b/src/common/slurm_protocol_api.c
index c5db879..be1dab6 100644
--- a/src/common/slurm_protocol_api.c
+++ b/src/common/slurm_protocol_api.c
@@ -1703,9 +1703,13 @@ extern char *slurm_get_auth_info(void)
        char *auth_info;
        slurm_ctl_conf_t *conf;

-       conf = slurm_conf_lock();
-       auth_info = xstrdup(conf->authinfo);
-       slurm_conf_unlock();
+       if (slurmdbd_conf) {
+                auth_info = xstrdup(slurmdbd_conf->auth_info);
+        } else {
+               conf = slurm_conf_lock();
+               auth_info = xstrdup(conf->authinfo);
+               slurm_conf_unlock();
+       }

        return auth_info;
 }

the auth_info value is now valid and consistent with the slurmdbd.conf setting:

slurmdbd: slurmdbd version 15.08.7 started
slurmdbd: debug2: running rollup at Tue Feb 02 14:47:37 2016
slurmdbd: debug5: in ../../../src/slurmdbd/slurmdbd.c, _send_slurmctld_register_req (line 690)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_send_node_msg (line 3600)
slurmdbd: debug5: in ../../../../../src/plugins/auth/munge/auth_munge.c, slurm_auth_create (line 217)
slurmdbd: debug5: in ../../../src/common/slurm_protocol_api.c, slurm_get_auth_ttl (line 1731)
slurmdbd: debug5: Entering ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info
slurmdbd: debug5: Leaving ../../../src/common/slurm_protocol_api.c, slurm_get_auth_info, auth_info=socket=/var/run/munge/munge_dbd.socket.2

fa4222ec

01 Feb, 2016 1 commit

update contributed cray config generator · 9f45e264

David Gloe authored Feb 01, 2016

contribs/cray/csm/slurmconfgen_smw.py - avoid erroneously including
repurposed compute nodes in the list of nodes to start slurmd.

9f45e264

29 Jan, 2016 1 commit

Fix double free of memory in slurmctld background mode · b1d88c7b

Morris Jette authored Jan 29, 2016

When the slurmctld is in background mode, it will issue double
  free calls on the incoming message buffers, likely leading
  to an abort.

b1d88c7b

28 Jan, 2016 1 commit

Don't relocate multi-node core reservations · a801d264

Morris Jette authored Jan 28, 2016

Do not automatically relocate an advanced reservation for individual cores
that spans multiple nodes when nodes in that reservation go down (e.g.
a 1 core reservation on node "tux1" will be moved if node "tux1" goes
down, but a reservation containing 2 cores on node "tux1" and 3 cores on
"tux2" will not be moved if node "tux1" goes down). Advanced reservations for
whole nodes will be moved by default for down nodes.
bug 2326

a801d264
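
The relocation rule described in this commit can be condensed into a small decision function. The struct and function are hypothetical simplifications, not Slurm's reservation record; they encode only the three cases the message mentions.

```c
#include <stdbool.h>

/* Hypothetical, simplified reservation descriptor; not Slurm's resv record. */
struct resv {
    bool whole_nodes;   /* reservation is for whole nodes rather than cores */
    int  node_cnt;      /* number of nodes the reservation touches */
};

/* Decide whether to relocate the reservation when one of its nodes goes down. */
bool should_relocate(const struct resv *r)
{
    if (r->whole_nodes)
        return true;            /* whole-node reservations move by default */
    return r->node_cnt == 1;    /* core reservations move only if on one node */
}
```

In the commit's example, the 1-core reservation on "tux1" gives `{false, 1}` and is moved, while the reservation split across "tux1" and "tux2" gives `{false, 2}` and stays in place.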