- 25 Nov, 2015 1 commit
David Bigagli authored
-
- 16 Nov, 2015 1 commit
David Bigagli authored
the Steps keyword.
-
- 13 Nov, 2015 1 commit
David Bigagli authored
-
- 04 Nov, 2015 1 commit
Brian Christiansen authored
Bug 2095
-
- 22 Oct, 2015 1 commit
Morris Jette authored
-
- 19 Oct, 2015 1 commit
David Bigagli authored
-
- 07 Oct, 2015 2 commits
Danny Auble authored
-
Danny Auble authored
database but the start record hadn't made it yet.
-
- 06 Oct, 2015 3 commits
Thomas Cadeau authored
bug 2011
-
Danny Auble authored
','.
-
Morris Jette authored
bug 1999
-
- 03 Oct, 2015 1 commit
Morris Jette authored
Don't requeue RPCs going out from slurmctld to DOWN nodes (doing so can generate repeating communication errors). bug 2002
-
- 02 Oct, 2015 1 commit
Morris Jette authored
This will only happen if a PING RPC for the node is already queued when the decision is made to power it down, and the ping then fails to get a response (since the node is already down). bug 1995
-
- 30 Sep, 2015 2 commits
Morris Jette authored
If a job's CPUs/task ratio is increased due to the configured MaxMemPerCPU, then increase its allocated CPU count in order to enforce CPU limits. Previous logic would increase/set cpus_per_task as needed if a job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT increase the min_cpus or max_cpus variables. This resulted in allocating the wrong CPU count.
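The adjustment described above is essentially integer arithmetic on the job request. A minimal, self-contained sketch of that reasoning (not Slurm's actual code; the struct and field names here are hypothetical) might look like:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical, simplified job request; not Slurm's real job_record. */
struct job_req {
	uint32_t mem_per_cpu;   /* requested --mem-per-cpu (MB) */
	uint32_t cpus_per_task;
	uint32_t num_tasks;
	uint32_t min_cpus;      /* total CPUs the allocation will enforce */
	uint32_t max_cpus;
};

/* If the requested memory per CPU exceeds the configured MaxMemPerCPU,
 * spread the memory over more CPUs per task AND scale the job's total
 * CPU counts to match, so CPU limits are enforced on the real size. */
static void apply_max_mem_per_cpu(struct job_req *j, uint32_t max_mem_per_cpu)
{
	if (max_mem_per_cpu == 0 || j->mem_per_cpu <= max_mem_per_cpu)
		return;

	/* Round up: how many CPUs are needed to hold the requested memory? */
	uint32_t factor = (j->mem_per_cpu + max_mem_per_cpu - 1) / max_mem_per_cpu;

	j->cpus_per_task *= factor;
	j->mem_per_cpu    = max_mem_per_cpu;

	/* The point of the fix: also grow the aggregate counts, otherwise
	 * the allocation is sized from the old CPU numbers. */
	j->min_cpus *= factor;
	if (j->max_cpus != UINT32_MAX)	/* leave "unlimited" alone */
		j->max_cpus *= factor;
}

int main(void)
{
	struct job_req j = { .mem_per_cpu = 8192, .cpus_per_task = 1,
			     .num_tasks = 4, .min_cpus = 4, .max_cpus = 4 };

	apply_max_mem_per_cpu(&j, 2048);	/* MaxMemPerCPU = 2048 MB */
	printf("cpus_per_task=%u min_cpus=%u max_cpus=%u\n",
	       j.cpus_per_task, j.min_cpus, j.max_cpus);	/* 4 16 16 */
	return 0;
}
```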
-
Morris Jette authored
Requeue/hold a batch job launch request if the job is already running. This is possible if a node went to the DOWN state but its jobs remained active. In addition, if a prolog/epilog fails, DRAIN the node rather than setting it DOWN, which could kill jobs that might otherwise continue to run. bug 1985
-
- 29 Sep, 2015 2 commits
Brian Christiansen authored
Bug 1938
-
Brian Christiansen authored
Bug 1984
-
- 28 Sep, 2015 1 commit
Morris Jette authored
When nodes have been allocated to a job and then released by the job while resizing, this patch prevents the nodes from continuing to appear allocated and unavailable to other jobs. Requires exclusive node allocation to trigger. This prevents the previously reported failure, but a proper fix will be quite complex and delayed to the next major release of Slurm (v 16.05). bug 1851
-
- 23 Sep, 2015 1 commit
Danny Auble authored
The 2 came from the nodelist being "None assigned", which would be treated as 2 hosts when passed to hostlist.
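The miscount is easy to reproduce with any naive tokenizer: a human-readable placeholder like "None assigned" contains a space, so splitting it as if it were a host list yields two "hosts". A small illustration in generic C (this is not Slurm's hostlist implementation):

```c
#include <stdio.h>
#include <string.h>

/* Count whitespace/comma separated tokens, the way a naive host-list
 * parser might. Not Slurm's hostlist code; just an illustration. */
static int count_tokens(const char *s)
{
	char buf[256];
	int n = 0;

	snprintf(buf, sizeof(buf), "%s", s);
	for (char *tok = strtok(buf, " ,"); tok; tok = strtok(NULL, " ,"))
		n++;
	return n;
}

int main(void)
{
	/* A placeholder string, not a real node list, still "counts" as 2. */
	printf("%d\n", count_tokens("None assigned"));	/* prints 2 */
	printf("%d\n", count_tokens("node01,node02"));	/* prints 2 */
	return 0;
}
```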
-
- 22 Sep, 2015 1 commit
Danny Auble authored
Correct counting for job array limits; the job count limit could underflow when the master job record is cancelled. bug 1952
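An underflow like this typically comes from decrementing an unsigned job counter more times than it was incremented. A defensive decrement, sketched generically (the counter and names are hypothetical, not Slurm's accounting structures):

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical per-user/association usage counter. */
struct usage {
	uint32_t used_submit_jobs;
};

/* Guard against decrementing below zero; with unsigned arithmetic an
 * extra decrement would wrap to ~4 billion and wedge the limit. */
static void job_count_decr(struct usage *u)
{
	if (u->used_submit_jobs > 0)
		u->used_submit_jobs--;
	else
		fprintf(stderr, "warning: job count underflow avoided\n");
}

int main(void)
{
	struct usage u = { .used_submit_jobs = 1 };

	job_count_decr(&u);	/* normal completion of the last job */
	job_count_decr(&u);	/* duplicate decrement, e.g. the array's
				 * master record cancelled after its tasks
				 * were already counted */
	printf("used_submit_jobs=%u\n", u.used_submit_jobs);	/* stays 0 */
	return 0;
}
```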
-
- 21 Sep, 2015 1 commit
Nathan Yee authored
Only 1 job was accounted for (against MaxSubmitJobs) when an array was submitted.
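The fix is about how many jobs an array submission charges against the submit limit: an array of N tasks should count as N jobs, not 1. A toy check, assuming a hypothetical limit structure (not Slurm's internal code):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

/* Hypothetical association limit state. */
struct assoc {
	uint32_t max_submit_jobs;   /* e.g. the MaxSubmitJobs limit */
	uint32_t used_submit_jobs;
};

/* An array submission must be charged once per task, not once per
 * submission, otherwise the limit is trivially bypassed. */
static bool submit_allowed(struct assoc *a, uint32_t array_task_cnt)
{
	uint32_t jobs = array_task_cnt ? array_task_cnt : 1;

	if (a->used_submit_jobs + jobs > a->max_submit_jobs)
		return false;
	a->used_submit_jobs += jobs;
	return true;
}

int main(void)
{
	struct assoc a = { .max_submit_jobs = 10, .used_submit_jobs = 0 };

	printf("array of 8: %s\n", submit_allowed(&a, 8) ? "ok" : "denied");
	printf("array of 8: %s\n", submit_allowed(&a, 8) ? "ok" : "denied");
	printf("single job: %s\n", submit_allowed(&a, 0) ? "ok" : "denied");
	return 0;
}
```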
-
- 17 Sep, 2015 1 commit
David Bigagli authored
-
- 11 Sep, 2015 2 commits
Morris Jette authored
This prevents a step from being launched if the job is killed while the prolog is running. Reproducing the original failure requires use of srun to trigger the prolog and using scancel while that prolog is running. bug 1755
-
Brian Christiansen authored
And add missing documentation. Bug 1921
-
- 10 Sep, 2015 4 commits
Morris Jette authored
GRES were not being properly tracked for multiple simultaneous steps, so a step which could have run later could be rejected as never being able to run. Replacement for commit dd842d79, which was reverted in commit 6f73812875c. bug 1925
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
and you use all the GRES up, instead of reporting that the configuration isn't available, hold the requesting step until the GRES becomes available.
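The behavior change distinguishes "more GRES than the node is configured with" (a hard error) from "more GRES than is currently free" (wait). A rough decision sketch, with made-up names rather than Slurm's gres plugin API:

```c
#include <stdio.h>
#include <stdint.h>

enum step_rc { STEP_RUN, STEP_HOLD, STEP_REJECT };

/* Decide what to do with a step's GRES request:
 *  - more than the node is configured with -> reject (can never run)
 *  - more than is currently free           -> hold until GRES frees up
 *  - otherwise                             -> run now
 * Hypothetical counts; not Slurm's gres data structures. */
static enum step_rc gres_step_test(uint32_t requested, uint32_t configured,
				   uint32_t in_use)
{
	if (requested > configured)
		return STEP_REJECT;
	if (requested > configured - in_use)
		return STEP_HOLD;
	return STEP_RUN;
}

int main(void)
{
	static const char *name[] = { "run", "hold", "reject" };

	/* Node configured with 4 GPUs, 4 already used by an earlier step. */
	printf("%s\n", name[gres_step_test(2, 4, 4)]);	/* hold, not reject */
	printf("%s\n", name[gres_step_test(6, 4, 0)]);	/* reject */
	printf("%s\n", name[gres_step_test(2, 4, 1)]);	/* run */
	return 0;
}
```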
-
- 09 Sep, 2015 1 commit
Morris Jette authored
Don't truncate task ID information in "squeue --array/-r" output. Task ID info in sview also expanded to 64 characters (from ~16 chars).
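The underlying issue is a fixed-size display field that is too small for a long pending-task-ID expression. A generic illustration of how widening the buffer avoids truncation (buffer sizes here are illustrative, not the exact ones used by squeue/sview):

```c
#include <stdio.h>

int main(void)
{
	const char *task_str = "123_[1-250,300,400-500%4]"; /* pending tasks */
	char small[16];
	char large[64];

	/* snprintf always NUL-terminates but silently truncates ... */
	snprintf(small, sizeof(small), "%s", task_str);
	/* ... so a 16-byte field drops most of the task list. */
	snprintf(large, sizeof(large), "%s", task_str);

	printf("16-byte field: %s\n", small);	/* "123_[1-250,300," */
	printf("64-byte field: %s\n", large);	/* full string */
	return 0;
}
```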
-
- 08 Sep, 2015 1 commit
David Bigagli authored
-
- 02 Sep, 2015 2 commits
Morris Jette authored
Previous logic would set the avail_node_bitmap when a node was powered down, even if the initial state was DOWN or DRAINED. This made the node available for allocation to a job, which we don't want until the DOWN or DRAIN state is cleared. bug 1893
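In bitmap terms the fix is a guard on the state flags before marking a powered-down node available again. A simplified sketch using ordinary bit flags (Slurm's real code uses its bitstring and node_record structures, which are not reproduced here):

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified node state flags; Slurm's real state encoding differs. */
#define NODE_DOWN   0x01
#define NODE_DRAIN  0x02

struct node {
	const char *name;
	unsigned    state;
	bool        avail;	/* stands in for a bit in avail_node_bitmap */
};

/* When a node is powered down, only mark it available for scheduling
 * if it is not DOWN or DRAINED; those states must be cleared first. */
static void node_powered_down(struct node *n)
{
	if (n->state & (NODE_DOWN | NODE_DRAIN))
		n->avail = false;
	else
		n->avail = true;
}

int main(void)
{
	struct node a = { "idle-node",    0,          false };
	struct node b = { "drained-node", NODE_DRAIN, false };

	node_powered_down(&a);
	node_powered_down(&b);
	printf("%s avail=%d\n", a.name, a.avail);	/* 1 */
	printf("%s avail=%d\n", b.name, b.avail);	/* 0: stays unavailable */
	return 0;
}
```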
-
Morris Jette authored
This reverts commits 7660da9e 5c386455 and f6c5302b
-
- 01 Sep, 2015 4 commits
Brian Christiansen authored
Bug 1741
-
David Bigagli authored
-
Danny Auble authored
-
David Bigagli authored
-
- 28 Aug, 2015 1 commit
Morris Jette authored
This problem is reproducible by launching a job then killing the slurmstepd process. Under those conditions, requeue the job if possible (i.e. batch job with requeue option/configuration). This patch also improves the slurmctld logging when this happens. bug 1889
-
- 27 Aug, 2015 2 commits
Morris Jette authored
Correct RebootProgram logic when executed outside of a maintenance reservation. Previous logic would mark the node up upon response to the reboot RPC (from slurmctld to slurmd) and, when the node actually rebooted, flag that as an unexpected reboot. The new logic checks the node's up time so the compute node is not marked usable until the reboot actually takes place. bug 1866
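The key idea is to treat the node's reported boot time as ground truth: only when the boot time moves past the moment the reboot was requested has the reboot actually happened. A minimal sketch of that check (the function and field names are invented, not Slurm's):

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical per-node record. */
struct node {
	time_t reboot_req_time;	/* when slurmctld sent the reboot RPC */
	time_t boot_time;	/* derived from the node's reported up time */
};

/* Only consider the reboot complete (and the node usable) once the
 * node reports a boot time later than the reboot request. Acknowledging
 * the RPC alone proves nothing: the old instance answered it. */
static bool reboot_complete(const struct node *n)
{
	return n->boot_time > n->reboot_req_time;
}

int main(void)
{
	time_t now = time(NULL);
	struct node n = { .reboot_req_time = now, .boot_time = now - 3600 };

	printf("after RPC ack: %s\n",
	       reboot_complete(&n) ? "usable" : "still waiting");

	n.boot_time = now + 120;	/* node came back with a fresh uptime */
	printf("after reboot:  %s\n",
	       reboot_complete(&n) ? "usable" : "still waiting");
	return 0;
}
```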
-
Danny Auble authored
association manager.
-
- 26 Aug, 2015 1 commit
Morris Jette authored
Prevent a job array task ID from being reported as NO_VAL if the last task in the array gets requeued. The problem is that when that task starts, its task bitmap entry stays set, but the task counter gets decremented. If that job then gets requeued, under some conditions a failure to schedule it results in the array_task_id in the job record getting set to NO_VAL. Then, when building the job info reported for squeue/scontrol, the string showing the pending task IDs is not rebuilt because that counter is zero. All indications are that the job runs fine; only the information reported to squeue/scontrol is wrong. bug 1790
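The inconsistency is between two views of the same pending set: a task bitmap and a cached task counter. If the counter hits zero while a bit is still set, display code that only rebuilds the string "when the counter is non-zero" never runs. A compact illustration of deriving the count from the bitmap instead (generic bit/counter bookkeeping, not Slurm's job_record):

```c
#include <stdio.h>
#include <stdint.h>

/* Toy stand-in for a job array's pending-task bookkeeping. */
struct array_rec {
	uint32_t task_bitmap;	/* bit i set => task i still pending */
	uint32_t task_cnt;	/* cached count of pending tasks */
	char     task_str[64];	/* what squeue/scontrol would display */
};

static uint32_t popcount32(uint32_t v)
{
	uint32_t n = 0;
	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Rebuild the display string from the bitmap itself. Deriving the count
 * from the bitmap (rather than trusting a separately maintained counter)
 * avoids the "counter says zero, bitmap says pending" mismatch. */
static void rebuild_task_str(struct array_rec *a)
{
	size_t off = 0;

	a->task_cnt = popcount32(a->task_bitmap);
	a->task_str[0] = '\0';
	for (uint32_t i = 0; i < 32 && off + 4 < sizeof(a->task_str); i++) {
		if (a->task_bitmap & (1u << i))
			off += snprintf(a->task_str + off,
					sizeof(a->task_str) - off,
					off ? ",%u" : "%u", i);
	}
}

int main(void)
{
	/* Task 3 was requeued, but a stale counter says nothing is pending. */
	struct array_rec a = { .task_bitmap = 1u << 3, .task_cnt = 0 };

	rebuild_task_str(&a);
	printf("pending tasks: [%s] (cnt=%u)\n", a.task_str, a.task_cnt);
	return 0;
}
```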
-