- 22 Sep, 2015 1 commit
-
-
Danny Auble authored
Correct counting for job array limits; a job count limit underflow was possible upon cancellation of the master job record. bug 1952
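A minimal sketch of the kind of guarded decrement that prevents such an underflow, assuming a hypothetical per-association counter (illustrative only, not Slurm's actual accounting code):

    /* Illustrative sketch only: guard a per-association job counter
     * against underflow when a job array's master record is cancelled.
     * The struct and field names here are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    struct assoc_usage {
        uint32_t used_jobs;    /* jobs counted against the limit */
    };

    static void job_count_decr(struct assoc_usage *usage, uint32_t tasks)
    {
        /* Clamp instead of letting an unsigned counter wrap around */
        if (usage->used_jobs >= tasks)
            usage->used_jobs -= tasks;
        else
            usage->used_jobs = 0;
    }

    int main(void)
    {
        struct assoc_usage usage = { .used_jobs = 2 };
        job_count_decr(&usage, 5);    /* would underflow without the clamp */
        printf("used_jobs = %u\n", usage.used_jobs);    /* prints 0 */
        return 0;
    }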
-
- 21 Sep, 2015 2 commits
-
-
Danny Auble authored
Also add a very minor sanity check in job_mgr.c to make sure we at least have a task count. This shouldn't matter, but it keeps the code as robust as possible.
-
Nathan Yee authored
Only 1 job was accounted for (against MaxSubmitJob) when an array was submitted.
-
- 17 Sep, 2015 2 commits
-
-
David Bigagli authored
-
Tommi Tervo authored
-
- 11 Sep, 2015 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
This prevents a step from being launched if the job is killed while the prolog is running. Reproducing the original failure requires use of srun to trigger the prolog and using scancel while that prolog is running. bug 1755
-
Brian Christiansen authored
And add missing documentation. Bug 1921
-
- 10 Sep, 2015 5 commits
-
-
Morris Jette authored
GRES were not being properly tracked for multiple simultaneous steps. A step which could have run later could be rejected as never being able to run. Replacement for commit dd842d79, which was reverted in commit 6f73812875c. bug 1925
-
Morris Jette authored
That commit would address a limited subset of problems and introduce other bugs rather than fixing the root of the problem.
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
When all of the GRES are in use, instead of reporting that the requested configuration isn't available, hold the requesting step until the GRES become available.
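A minimal sketch of the distinction being drawn, with hypothetical names (not Slurm's gres plugin API): a request beyond what is configured can never run, while a request beyond what is currently free should simply wait:

    /* Illustrative sketch only: distinguish "can never run" (request exceeds
     * what is configured) from "must wait" (request exceeds what is
     * currently free). */
    #include <stdio.h>
    #include <stdint.h>

    enum step_test { STEP_CAN_RUN, STEP_WAIT, STEP_NEVER };

    static enum step_test gres_step_test(uint64_t requested, uint64_t configured,
                                         uint64_t in_use)
    {
        if (requested > configured)
            return STEP_NEVER;    /* configuration is not available at all */
        if (requested > (configured - in_use))
            return STEP_WAIT;     /* hold the step until GRES are freed */
        return STEP_CAN_RUN;
    }

    int main(void)
    {
        /* 4 GPUs configured, all 4 busy, step asks for 2: wait, don't reject */
        printf("%d\n", gres_step_test(2, 4, 4));    /* prints 1 (STEP_WAIT) */
        return 0;
    }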
-
- 09 Sep, 2015 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Don't truncate task ID information in "squeue --array/-r" output. Task ID info in sview is also expanded to 64 characters (from ~16 characters).
-
- 08 Sep, 2015 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
At the start of a scheduling cycle, the job's "reason" field can be cleared. If the scheduler failed to reach that job and set a new value, the original reason was lost and state reports would show NoReason. This change saves the last reason for a job being in a pending state and reports that value to the user until there is a new valid reason for it still being in a PENDING state. bug 1919
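The idea can be sketched as follows, using hypothetical field names rather than Slurm's actual job record structure: keep a saved copy of the last pending reason and fall back to it whenever the live reason has been cleared:

    /* Illustrative sketch only: retain the previous pending reason when a
     * scheduling cycle clears it and does not set a new one. */
    #include <stdio.h>

    enum wait_reason { WAIT_NO_REASON = 0, WAIT_PRIORITY, WAIT_RESOURCES };

    struct pending_job {
        enum wait_reason state_reason;       /* cleared each scheduling cycle */
        enum wait_reason last_sched_eval;    /* saved copy used for reporting */
    };

    static enum wait_reason job_reason_get(const struct pending_job *job)
    {
        /* Report the saved reason until the scheduler sets a new one */
        if (job->state_reason == WAIT_NO_REASON)
            return job->last_sched_eval;
        return job->state_reason;
    }

    int main(void)
    {
        struct pending_job job = { WAIT_NO_REASON, WAIT_RESOURCES };
        printf("reason = %d\n", job_reason_get(&job));    /* 2: WAIT_RESOURCES */
        return 0;
    }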
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
bug 1920
-
David Bigagli authored
-
David Bigagli authored
-
- 02 Sep, 2015 4 commits
-
-
Nicolas Joly authored
Do not use a full path for the true utility in the testsuite. Its location differs across systems (/bin/true on Linux and /usr/bin/true on BSD).
-
David Bigagli authored
-
Morris Jette authored
Previous logic would set the avail_node_bitmap when a node was powered down, even if the initial state was DOWN or DRAINED. This made the node available for allocation to a job, which we don't want until the DOWN or DRAIN state is cleared. bug 1893
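A minimal sketch of the intended check, with hypothetical flag names (not Slurm's node state macros): a powered-down node should re-enter the available set only if neither DOWN nor DRAIN is set:

    /* Illustrative sketch only: when a node powers down, make it available
     * for allocation only if it is not DOWN or DRAINED. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NODE_FLAG_DOWN   0x01
    #define NODE_FLAG_DRAIN  0x02

    static bool node_usable_after_power_down(unsigned node_flags)
    {
        /* Keep the node out of the available bitmap until DOWN/DRAIN clears */
        return (node_flags & (NODE_FLAG_DOWN | NODE_FLAG_DRAIN)) == 0;
    }

    int main(void)
    {
        printf("%d\n", node_usable_after_power_down(0));               /* 1 */
        printf("%d\n", node_usable_after_power_down(NODE_FLAG_DRAIN)); /* 0 */
        return 0;
    }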
-
Morris Jette authored
This reverts commits 7660da9e, 5c386455 and f6c5302b.
-
- 01 Sep, 2015 6 commits
-
-
Brian Christiansen authored
Bug 1741
-
David Bigagli authored
-
Danny Auble authored
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
-
- 28 Aug, 2015 4 commits
-
-
Morris Jette authored
This problem is reproducible by launching a job then killing the slurmstepd process. Under those conditions, requeue the job if possible (i.e. batch job with requeue option/configuration). This patch also improves the slurmctld logging when this happens. bug 1889
-
David Bigagli authored
-
David Bigagli authored
-
Morris Jette authored
This is a change in logic from commit 00099596. The original commit corrected the logic from a CPU count to a core count, but used the cores_per_socket value rather than computing the total core count (cores_per_socket * sockets). bug 1830
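The corrected arithmetic is simply the product of the two values; a trivial sketch for illustration:

    /* Illustrative sketch only: compute the total core count per node from
     * sockets and cores_per_socket rather than using cores_per_socket alone. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t total_cores(uint32_t sockets, uint32_t cores_per_socket)
    {
        return sockets * cores_per_socket;
    }

    int main(void)
    {
        /* 2 sockets x 8 cores: the buggy logic would have used 8, not 16 */
        printf("cores = %u\n", total_cores(2, 8));
        return 0;
    }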
-
- 27 Aug, 2015 3 commits
-
-
Morris Jette authored
Correct RebootProgram logic when executed outside of a maintenance reservation. Previous logic would mark the node up upon response to the reboot RPC (from slurmctld to slurmd) and, when the node actually rebooted, flag that as an unexpected reboot. The new logic checks the node's up time so the compute node is not marked as usable until the reboot actually takes place. bug 1866
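The reboot detection can be sketched as a boot-time comparison, with hypothetical names (not Slurm's actual RebootProgram handling): the node only counts as rebooted once its reported boot time is newer than the reboot request:

    /* Illustrative sketch only: treat a reboot as complete only when the
     * node's reported boot time is newer than the time the reboot request
     * was issued. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static bool node_rebooted(time_t boot_time, time_t reboot_req_time)
    {
        /* A boot time older than the request means the node has not
         * actually rebooted yet, even if it answered the reboot RPC. */
        return boot_time > reboot_req_time;
    }

    int main(void)
    {
        time_t req = time(NULL);
        printf("%d\n", node_rebooted(req - 3600, req));    /* 0: still waiting */
        printf("%d\n", node_rebooted(req + 120, req));     /* 1: reboot seen */
        return 0;
    }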
-
Morris Jette authored
For testing purposes, add a slurmd flag so that it appears the compute node has rebooted whenever slurmd restarts.
-
Danny Auble authored
association manager.
-
- 26 Aug, 2015 1 commit
-
-
Morris Jette authored
Prevent a job array task ID from being reported as NO_VAL if the last task in the array gets requeued. The problem is that when that task starts, its task bitmap entry stays set, but the task counter gets decremented. If that job then gets requeued, under some conditions a failure to schedule it results in the array_task_id in the job record being set to NO_VAL. Then when building the job info for squeue/scontrol, the string showing the pending task IDs is not rebuilt because that counter is zero. All indications are that the job runs fine; only the information reported to squeue/scontrol is wrong. bug 1790
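One way to picture the fix, using hypothetical names rather than Slurm's actual array bookkeeping: derive the pending-task count from the task bitmap itself instead of a separately maintained counter that can drift after a requeue:

    /* Illustrative sketch only: count pending array tasks directly from the
     * task bitmap rather than trusting a counter that can fall out of sync. */
    #include <stdio.h>
    #include <stdint.h>

    static unsigned pending_task_count(const uint8_t *task_bitmap, unsigned nbits)
    {
        unsigned i, count = 0;

        for (i = 0; i < nbits; i++)
            if (task_bitmap[i / 8] & (1u << (i % 8)))
                count++;
        return count;
    }

    int main(void)
    {
        uint8_t bitmap[1] = { 0x01 };    /* requeued last task: bit still set */
        /* A stale counter might say 0; the bitmap shows 1 task is pending */
        printf("pending = %u\n", pending_task_count(bitmap, 4));
        return 0;
    }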
-