Commits · 87d9370fcbdb87f460fa5191d690eb0a834fb599 · Manuel G. Marciani / ces_slurm_simulator

27 Jan, 2016 1 commit
- Increase debug level to debug3 as function just iterates over associations · 87d9370f
  Alejandro Sanchez authored Jan 27, 2016
  
  87d9370f
26 Jan, 2016 1 commit

Add lock for newly changed boot logic · 4d897176

Morris Jette authored Jan 25, 2016

Both the original logic and modified logic failed to lock the job
data structure prior to decrementing "prolog_running" counter.

4d897176
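
The locking pattern described in 4d897176 can be sketched as follows. This is a minimal illustration, not Slurm's code: `demo_job`, `demo_job_lock` and `demo_prolog_done` are hypothetical names, and a plain pthread mutex stands in for slurmctld's job lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a job record; not Slurm's structure. */
struct demo_job {
	uint32_t prolog_running;	/* prologs still running for this job */
};

/* Hypothetical stand-in for the job data structure lock. */
static pthread_mutex_t demo_job_lock = PTHREAD_MUTEX_INITIALIZER;

/* Decrement the counter only while holding the lock; return the new value. */
static uint32_t demo_prolog_done(struct demo_job *job)
{
	uint32_t remaining;

	assert(job != NULL);
	pthread_mutex_lock(&demo_job_lock);
	if (job->prolog_running > 0)	/* never wrap below zero */
		job->prolog_running--;
	remaining = job->prolog_running;
	pthread_mutex_unlock(&demo_job_lock);
	return remaining;
}
```

Reading the new value inside the critical section means the caller sees a value consistent with its own decrement, even if other threads update the counter concurrently.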

25 Jan, 2016 4 commits
- Comment code to note why certain RPCs are handled differently than others · baded5e8
  Danny Auble authored Jan 25, 2016
```
in the forward message logic.
```
  baded5e8
- Launch job requesting reboot after the boot completes · 608421da
  Morris Jette authored Jan 25, 2016
```
Previously under some conditions that boot completion was ignored
and the job kept pending.
```
  608421da
- whitespace fix - tabs not spaces · 1743225c
  Tim Wickberg authored Jan 25, 2016
  
  1743225c
- Fix use of uninitialized variable in Perl API's slurm_job_step_get_pids · 03b254ef
  Sergey Meirovich authored Jan 25, 2016
  
  03b254ef
22 Jan, 2016 1 commit
- Remove unneeded line from NEWS · 2b2fd513
  Danny Auble authored Jan 22, 2016
  
  2b2fd513
21 Jan, 2016 11 commits
- MySQL - Fix querying jobs with reservations when the IDs have rolled. · 201d691a
  Danny Auble authored Jan 21, 2016
```
Bug 2364
```
  201d691a
- Remove redundant logic when updating a job's task count. · ce3d6c8d
  Danny Auble authored Jan 21, 2016
```
Commit fa331e30 fixes this.  The logic was bad to begin with...

uint32_t new_cpus = detail_ptr->num_tasks
	/ detail_ptr->cpus_per_task;

The / should have been * this whole time.  This was the reason we found
this in the first place.
```
  ce3d6c8d
- Fix typo in slurm.conf man page · b23f2314
  Morris Jette authored Jan 21, 2016
```
bug 2369
```
  b23f2314
- Spelling fixes · 356a25c4
  Gennaro Oliva authored Jan 21, 2016
  
  356a25c4
- Add scancel delay if slow responses · d8b12b01
  Morris Jette authored Jan 21, 2016
```
If scancel is operating on large number of jobs and RPC responses from
    slurmctld daemon are slow then introduce a delay in sending the cancel job
    requests from scancel in order to reduce load on slurmctld.
bug 2256
```
  d8b12b01
- Fix test for delayed job launch · 456d498d
  Morris Jette authored Jan 21, 2016
```
If a job launch is delayed, the test was failing due to bad parsing.
These lines were being interpreted as a counter followed by node
names of "queued" and "has":
  srun: job 1332712 queued and waiting for resources
  srun: job 1332712 has been allocated resources
```
  456d498d
- Fix typo of "childern" · 9cc9985e
  Morris Jette authored Jan 21, 2016
  
  9cc9985e
- Modify error message when MaxJobCount reached · f11ec171
  Morris Jette authored Jan 21, 2016
```
bug 2366
```
  f11ec171
- Make it so daemons also support TopologyParam=NoInAddrAny · 4b9cf731
  Danny Auble authored Jan 20, 2016
  
  4b9cf731
- Backfill sync with Cray NHC · 79a21bd6
  Morris Jette authored Jan 20, 2016
```
Backfill scheduling properly synchronized with Cray Node Health Check.
    Prior logic could result in highest priority job getting improperly
    postponed.
bug 2350
```
  79a21bd6
- add lines in news for next tag 15.08.8 · d461b724
  Danny Auble authored Jan 20, 2016
  
  d461b724
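
The arithmetic discussed in ce3d6c8d above (multiplying rather than dividing) can be illustrated with a small sketch. The helper name `demo_cpus_for_tasks` is hypothetical; only the `num_tasks` and `cpus_per_task` names come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPUs needed by a job = number of tasks * CPUs per task.
 * Dividing instead (the bug noted in the commit) would turn
 * 8 tasks at 4 CPUs each into 2 CPUs rather than 32.
 */
static uint32_t demo_cpus_for_tasks(uint32_t num_tasks, uint16_t cpus_per_task)
{
	assert(cpus_per_task > 0);
	return num_tasks * cpus_per_task;
}
```

For example, `demo_cpus_for_tasks(8, 4)` yields 32. The sketch ignores overflow for very large task counts.
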
20 Jan, 2016 8 commits

New tag for 15.08.7 · be18995e
Danny Auble authored Jan 20, 2016

be18995e
Add missing job states from the qstat wrapper. · 72355b73
Danny Auble authored Jan 20, 2016

72355b73
Strip flags from a job state in qstat wrapper before evaluating. · 24732902
Danny Auble authored Jan 20, 2016

24732902
Expand memory leak test description · fa63067f
Morris Jette authored Jan 20, 2016

fa63067f

Correct job_cnt_run NULL pointer · 207adf8e

Morris Jette authored Jan 20, 2016

This corrects logic from commit e5a61746
that could result in use of NULL pointer

207adf8e

Prevent job_cnt_run · f76586bf

Morris Jette authored Jan 19, 2016

It was previously triggered by executing "scontrol reconfig" on a
  front-end system while there was a job in completing state.

f76586bf

Properly track resources for suspended jobs on reconfig · 21c52d2f

Morris Jette authored Jan 19, 2016

Properly account for memory, CPUs and GRES when slurmctld is reconfigured
    while there is a suspended job. Previous logic would add the CPUs, but not
    memory or GPUs. This would result in underflow/overflow errors in select
    cons_res plugin.
bug 2353

21c52d2f

Correct handling of front-end running job count · e5a61746

Morris Jette authored Jan 19, 2016

The counter is really intended to reflect the count of running or
  suspended jobs rather than running jobs alone. Previous logic
  would report an underflow for the "job_cnt_run" variable if
  1. job submitted
  2. job suspended
  3. scontrol reconfig
  4. job cancelled

e5a61746
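
The four-step sequence above can be modelled with a short sketch. All names here are hypothetical, and the code only assumes what the message states: the counter should cover running and suspended jobs, so a rebuild on reconfigure must count both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_state { DEMO_PENDING, DEMO_RUNNING, DEMO_SUSPENDED, DEMO_CANCELLED };

/* A job counts against the front end while running or suspended. */
static bool demo_holds_slot(enum demo_state s)
{
	return s == DEMO_RUNNING || s == DEMO_SUSPENDED;
}

/* Rebuild the counter from the job list, as a reconfigure would. */
static uint32_t demo_rebuild_count(const enum demo_state *jobs, size_t n)
{
	uint32_t cnt = 0;

	assert(jobs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (demo_holds_slot(jobs[i]))
			cnt++;
	}
	return cnt;
}

/* Cancel a job and release its slot if it held one; return the new count. */
static uint32_t demo_cancel(enum demo_state *job, uint32_t cnt)
{
	assert(job != NULL);
	if (demo_holds_slot(*job)) {
		assert(cnt > 0);	/* a zero here is the reported underflow */
		cnt--;
	}
	*job = DEMO_CANCELLED;
	return cnt;
}
```

If the rebuild counted only `DEMO_RUNNING` jobs, a suspended job would be missed at step 3 and the decrement at step 4 would hit a counter that is already zero.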

19 Jan, 2016 3 commits

Improve select/cons_res logging · 82f61b0d

Morris Jette authored Jan 19, 2016

Log the length of bitmaps in addition to the bits set.
Also increase the string length used for logging.

82f61b0d

Fix for socket allocations and specialized cores · a260397a

Morris Jette authored Jan 19, 2016

Previous logic would prevent allocation of sockets to a job unless the
entire socket was available. If there were any specialized cores, the
socket was treated as being not available and unusable. For example,
if a node had 2 sockets, then a job requesting 2 specialized cores
would reserve one core on each of the two sockets and render the job
not runnable.

a260397a
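
A rough sketch of the difference, using one bit per core on a socket. The function names are hypothetical and the "new" test is an assumption about the intent of the fix (ignore specialized cores when deciding whether a socket is free), not a copy of the select plugin logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old behaviour as described: any specialized or busy core blocks the socket. */
static bool demo_socket_usable_old(uint32_t all_cores, uint32_t spec_cores,
				   uint32_t busy_cores)
{
	return ((spec_cores | busy_cores) & all_cores) == 0;
}

/*
 * Assumed intent of the fix: a socket is usable when every core that is
 * not specialized is idle, and at least one such core exists.
 */
static bool demo_socket_usable(uint32_t all_cores, uint32_t spec_cores,
			       uint32_t busy_cores)
{
	uint32_t usable;

	assert((spec_cores & ~all_cores) == 0);	/* spec cores belong to socket */
	usable = all_cores & ~spec_cores;
	return usable != 0 && (busy_cores & usable) == 0;
}
```

With a 4-core socket (`0xF`) and one specialized core (`0x1`), the old test rejects the idle socket while the assumed new test accepts it.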

Remove redundant sinfo logic · 5e08b4d1

Morris Jette authored Jan 19, 2016

There was logic in sinfo's print state function that determined
if the state was MIXED. This logic was duplicated logic from the
_query_server() function in sinfo.c and has been removed. Also
note the logic was already gone from the "short state" print
function (I noticed the discrepancy in the print functions,
but discovered they both printed the correct state information).

5e08b4d1

17 Jan, 2016 1 commit

jette authored Jan 16, 2016

Fix backfill scheduling bug which could postpone the scheduling of jobs due
    to avoidance of nodes in COMPLETING state.
bug 2350

1a4b5983

16 Jan, 2016 2 commits
- Add job partition to allocation log message · a49ff936
  Morris Jette authored Jan 15, 2016
  
  a49ff936
- Minor streamlining of logging · 803eebd1
  Morris Jette authored Jan 15, 2016
```
No need to look up the Reason string for a job, we just set the value.
```
  803eebd1
15 Jan, 2016 4 commits
- slurmdbd.conf spelling fixes. · 192cf0fc
  Brian Christiansen authored Jan 15, 2016
  
  192cf0fc
- Prevent decrementing of TRESRunMins when AccountingStorageEnforce=limits is not set. · 47c22b2e
  Brian Christiansen authored Jan 15, 2016
```
Bug 2255
```
  47c22b2e
- Fix memory leak in slurmctld job array logic · 38ca2e67
  Morris Jette authored Jan 15, 2016
  
  38ca2e67
- Prevent nodes from being overallocated when allocation straddles multiple nodes. · fb604c7e
  Brian Christiansen authored Jan 14, 2016
```
Bug 2343
```
  fb604c7e
14 Jan, 2016 3 commits
- Merge branch 'slurm-14.11' into slurm-15.08 · c17396d7
  Morris Jette authored Jan 14, 2016
  
  c17396d7
- fix AuthInfo with alternate munge socket location · f3d54f99
  Morris Jette authored Jan 14, 2016
```
Fix for configuration of "AuthType=munge" and "AuthInfo=socket=..." with
    alternate munge socket path.
bug 2348
```
  f3d54f99
- Avoid slurmstepd abort if malloc fails for accounting · d5400aa5
  Morris Jette authored Jan 13, 2016
```
If a node is out of memory, then the malloc performed by slurmstepd
  periodically may fail, killing the slurmstepd and orphaning its
  processes.
bug 2341
```
  d5400aa5
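
The approach in d5400aa5 above (survive a failed allocation instead of aborting) might look roughly like this. `demo_sample` and `demo_collect` are hypothetical names, not slurmstepd's accounting code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical accounting sample; not a Slurm type. */
struct demo_sample {
	unsigned long rss_kb;
};

/*
 * Allocate room for n samples. Returns 0 on success and -1 when memory is
 * unavailable, so the caller can skip this sampling round and retry later
 * rather than abort and orphan the job's processes.
 */
static int demo_collect(size_t n, struct demo_sample **out)
{
	struct demo_sample *buf;

	assert(out != NULL);
	*out = NULL;
	if (n == 0)
		return 0;	/* nothing to sample */
	buf = calloc(n, sizeof(*buf));
	if (buf == NULL)
		return -1;	/* out of memory: report, do not abort */
	*out = buf;
	return 0;
}
```

The caller owns the returned buffer and releases it with `free()`.
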
13 Jan, 2016 1 commit

backfill scheduling with group limits fix · 3ee1632f

Morris Jette authored Jan 13, 2016

Backfill scheduling fix: If a job can't be started due to a "group" resource
limit, rather than reserve resources for it when the next job ends, don't
reserve any resources for it. The problem with the original logic is that
if a lot of resources are reserved for such pending jobs, then jobs further
down the queue may be deferred when they really can and should be started. An
ideal solution would track all of the TRES resources through time as jobs
start and end, but we don't have that logic in the backfill scheduler and
don't want that extra overhead in the backfill scheduler.
bugs 2326 and 2282

3ee1632f
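
The effect of the change can be shown with a deliberately simplified sketch. It ignores the time dimension entirely and treats a reservation as nodes removed from the pool, which is an assumption made for illustration; `demo_bf_job` and `demo_backfill_pass` are hypothetical names, not the backfill plugin's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue entry. */
struct demo_bf_job {
	uint32_t nodes;		/* nodes requested */
	bool group_limited;	/* blocked by a "group" resource limit */
};

/*
 * Walk the queue in priority order and return how many jobs start now.
 * reserve_for_group_limited = true models the original logic, false the fix.
 */
static uint32_t demo_backfill_pass(const struct demo_bf_job *q, size_t n,
				   uint32_t free_nodes,
				   bool reserve_for_group_limited)
{
	uint32_t started = 0;

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (q[i].group_limited) {
			if (reserve_for_group_limited) {
				/* original logic: hold nodes for a job that cannot run */
				uint32_t hold = q[i].nodes < free_nodes ?
						q[i].nodes : free_nodes;
				free_nodes -= hold;
			}
			continue;	/* fix: skip it, reserve nothing */
		}
		if (q[i].nodes <= free_nodes) {
			free_nodes -= q[i].nodes;
			started++;
		}
	}
	return started;
}
```

With 4 free nodes and a queue of one group-limited 4-node job followed by two 2-node jobs, the original logic starts nothing while the fixed logic starts both small jobs.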