1. 14 Jul, 2015 2 commits
  2. 13 Jul, 2015 2 commits
    • Morris Jette's avatar
      job array update results in bad task ID · 29a52f60
      Morris Jette authored
      Fix to job array update logic that can result in a task ID of 4294967294.
      To reproduce:
      $ sbatch --exclusive -a 1,3,5 tmp
      Submitted batch job 11825
      $ scontrol update jobid=11825_[3,4,5] timelimit=3
      $ squeue
                   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 11825_3     debug      tmp    jette PD       0:00      1 (None)
                 11825_4     debug      tmp    jette PD       0:00      1 (None)
                 11825_5     debug      tmp    jette PD       0:00      1 (None)
                   11825     debug      tmp    jette PD       0:00      1 (Resources)
      A new job array entry was created for task ID 4 and the "master" job
      array record now has a task ID of 4294967294.
      The buggy logic was testing the wrong variable; a sketch of how that
      sentinel value arises follows this entry.
      bug 1790
      29a52f60
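      For reference, 4294967294 is 0xfffffffe, the NO_VAL sentinel Slurm uses
      for "unset" 32-bit fields. A minimal sketch of how that value surfaces
      when an unset field is treated as a real task ID (illustration only):

      #include <stdint.h>
      #include <stdio.h>

      #define NO_VAL (0xfffffffe)              /* Slurm's "unset" sentinel    */

      int main(void)
      {
          uint32_t array_task_id = NO_VAL;     /* field never given a real ID */
          printf("%u\n", array_task_id);       /* prints 4294967294           */
          return 0;
      }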
    • Gene Soudlenkov's avatar
      Fix segfault when updating timelimit on jobarray task. · 0560d8b2
      Gene Soudlenkov authored
      Bug 1799
      0560d8b2
  3. 10 Jul, 2015 2 commits
  4. 09 Jul, 2015 1 commit
    • Morris Jette's avatar
      Change slurmctld threads count against limit · ad9c2413
      Morris Jette authored
      The slurmctld logic throttles some RPCs so that only one of them
      can execute at a time in order to reduce contention for the job,
      partition and node locks (only one of the affected RPCs can execute
      at any time anyway and this lets other RPC types run). While an
      RPC is stuck in the throttle function, do not count that thread
      against the slurmctld thread limit.
      bug 1794
      ad9c2413
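      A minimal sketch of the idea with made-up names (not the actual
      slurmctld code): give up the active-thread slot while the RPC waits in
      the throttle, and take it back once the RPC actually runs.

      #include <pthread.h>

      static pthread_mutex_t throttle_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t count_lock    = PTHREAD_MUTEX_INITIALIZER;
      static int active_threads;               /* counted against the limit */

      static void run_throttled_rpc(void (*handler)(void *), void *arg)
      {
          pthread_mutex_lock(&count_lock);     /* stop counting this thread */
          active_threads--;
          pthread_mutex_unlock(&count_lock);

          pthread_mutex_lock(&throttle_lock);  /* one such RPC at a time    */

          pthread_mutex_lock(&count_lock);     /* count it again while it   */
          active_threads++;                    /* actually runs             */
          pthread_mutex_unlock(&count_lock);

          handler(arg);
          pthread_mutex_unlock(&throttle_lock);
      }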
  5. 08 Jul, 2015 1 commit
  6. 07 Jul, 2015 3 commits
    • Trey Dockendorf's avatar
      Update job's QOS before partition · f2faa213
      Trey Dockendorf authored
      This patch moves the QOS update of an existing job to be before the
      partition update.  This ensures a new QOS value is the value used when
      doing validations against things like a partition's AllowQOS and DenyQOS.
      
      Currently, if two partitions have AllowQOS values that do not share any
      QOS, the order of updates prevents a job from being moved from one
      partition to the other using something like the following:
      
      scontrol update job=<jobID> partition=<new part> qos=<new qos>
      f2faa213
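      A minimal sketch of the ordering with made-up names (not the actual
      job update code): the QOS change lands first, so the partition
      validation that follows sees the new QOS.

      #include <stdbool.h>
      #include <stddef.h>

      struct job        { const char *qos; const char *partition; };
      struct update_req { const char *qos; const char *partition; };

      /* Stand-in for the real AllowQOS/DenyQOS check: it looks at job->qos. */
      static bool partition_allows_qos(const struct job *job, const char *part)
      {
          return part != NULL && job->qos != NULL;
      }

      int update_job(struct job *job, const struct update_req *req)
      {
          if (req->qos)
              job->qos = req->qos;             /* QOS first ...               */
          if (req->partition) {                /* ... then the partition, so  */
              if (!partition_allows_qos(job, req->partition))
                  return -1;                   /* the check sees the new QOS  */
              job->partition = req->partition;
          }
          return 0;
      }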
    • David Bigagli's avatar
    • Morris Jette's avatar
      Correct pack node logic · 0e0c64de
      Morris Jette authored
      Correct task layout with CR_Pack_Node option and more than 1 CPU per task.
      Previous logic would place one task per CPU and launch too few tasks
      (see the packing sketch below).
      bug 1781
      0e0c64de
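      A minimal sketch of the packing arithmetic (illustration only, not the
      select plugin code): with more than one CPU per task, the number of
      tasks that fit on a node is its usable CPUs divided by --cpus-per-task,
      not one task per CPU.

      #include <stdio.h>

      static int tasks_per_node(int usable_cpus, int cpus_per_task)
      {
          return usable_cpus / cpus_per_task;
      }

      int main(void)
      {
          printf("%d\n", tasks_per_node(16, 4));   /* 4 tasks, not 16 */
          return 0;
      }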
  7. 06 Jul, 2015 2 commits
    • Morris Jette's avatar
      scheduler/backfill enhancements · edfbabe6
      Morris Jette authored
      Backfill scheduler now considers OverTimeLimit and KillWait configuration
      parameters to estimate when running jobs will exit. Initially the job's
      end time is estimated based upon its time limit. After the time limit
      is reached, the end time estimate is based upon the OverTimeLimit and
      KillWait configuration parameters.
      bug 1774
      edfbabe6
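      A minimal sketch of the estimate with assumed parameter names (not the
      actual backfill code): before the limit the job is expected to end at
      start plus its time limit; once past the limit, the estimate allows for
      OverTimeLimit plus KillWait.

      #include <time.h>

      static time_t expected_end_time(time_t start, unsigned limit_min,
                                      unsigned over_time_limit_min,
                                      unsigned kill_wait_sec, time_t now)
      {
          time_t end = start + (time_t)limit_min * 60;
          if (now >= end)                      /* job already past its limit */
              end = start + (time_t)(limit_min + over_time_limit_min) * 60
                          + kill_wait_sec;
          return end;
      }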
    • Morris Jette's avatar
      Add backfill scheduler timeout · 7e944220
      Morris Jette authored
      Backfill scheduler: The configured backfill_interval value (default 30
          seconds) is now interpreted as a maximum run time for the backfill
          scheduler. Once reached, the scheduler will build a new job queue and
          start over, even if not all jobs have been tested.
      bug 1774
      7e944220
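      A minimal sketch of the behaviour (illustration only; job_t and
      try_backfill() are stand-ins): one backfill pass stops once
      backfill_interval seconds have elapsed, and the next pass starts from a
      freshly built job queue.

      #include <time.h>

      typedef struct job job_t;           /* stand-in for the real job record */
      void try_backfill(job_t *job);      /* stand-in for the per-job attempt */

      static void backfill_pass(job_t **queue, int njobs, int backfill_interval)
      {
          time_t start = time(NULL);

          for (int i = 0; i < njobs; i++) {
              if (difftime(time(NULL), start) >= backfill_interval)
                  break;                  /* out of time: rebuild the queue
                                           * and start over on the next pass */
              try_backfill(queue[i]);
          }
      }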
  8. 30 Jun, 2015 2 commits
  9. 29 Jun, 2015 1 commit
  10. 25 Jun, 2015 1 commit
  11. 24 Jun, 2015 1 commit
  12. 23 Jun, 2015 1 commit
  13. 22 Jun, 2015 3 commits
    • Morris Jette's avatar
      Advanced reservation fixes · a6454176
      Morris Jette authored
      Updates of existing bluegene advanced reservations did not work at all.
      Some multi-core configurations resulted in an abort because the
        reservation's core_bitmaps were created with only one bit per node
        rather than one bit per core.
      These bugs were introduced in commit 5f258072
      a6454176
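      A minimal sketch of the sizing rule, using Slurm's internal bitstring
      API (src/common/bitstring.h; counts are made up for illustration): the
      reservation's core_bitmap needs one bit per core, not one bit per node.

      #include "src/common/bitstring.h"

      bitstr_t *make_resv_core_bitmap(int node_cnt, int cores_per_node)
      {
          /* One bit per core, not one bit per node (node_cnt alone). */
          return bit_alloc(node_cnt * cores_per_node);
      }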
    • David Bigagli's avatar
      Update NEWS · c8545598
      David Bigagli authored
      c8545598
    • David Bigagli's avatar
      Update NEWS · 38007f9b
      David Bigagli authored
      38007f9b
  14. 19 Jun, 2015 1 commit
  15. 15 Jun, 2015 1 commit
  16. 12 Jun, 2015 2 commits
  17. 11 Jun, 2015 1 commit
  18. 10 Jun, 2015 1 commit
  19. 09 Jun, 2015 2 commits
    • David Bigagli's avatar
      Search for user in all groups · 93ead71a
      David Bigagli authored
      93ead71a
    • Morris Jette's avatar
      Fix scheduling inconsistency with GRES · e1a00772
      Morris Jette authored
      1. I submit a first job that uses 1 GPU:
      $ srun --gres gpu:1 --pty bash
      $ echo $CUDA_VISIBLE_DEVICES
      0
      
      2. while the first one is still running, a 2-GPU job asking for 1 task per node
      waits (and I don't really understand why):
      $ srun --ntasks-per-node=1 --gres=gpu:2 --pty bash
      srun: job 2390816 queued and waiting for resources
      
      3. whereas a 2-GPU job requesting 1 core per socket (so just 1 socket) actually
      gets GPUs allocated from two different sockets!
      $ srun -n 1  --cores-per-socket=1 --gres=gpu:2 -p testk --pty bash
      $ echo $CUDA_VISIBLE_DEVICES
      1,2
      
      With this change #2 works the same way as #3.
      bug 1725
      e1a00772
  20. 05 Jun, 2015 1 commit
  21. 04 Jun, 2015 2 commits
  22. 03 Jun, 2015 1 commit
    • Morris Jette's avatar
      switch/cray: Refine PMI_CRAY_NO_SMP_ENV set · ef66b2eb
      Morris Jette authored
      switch/cray: Refine logic to set PMI_CRAY_NO_SMP_ENV environment variable.
      Rather than testing for the task distribution option, test the actual
      task IDs to see if they are monotonically increasing across all nodes.
      Based upon idea from Brian Gilmer (Cray).
      ef66b2eb
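      A minimal sketch of such a test with assumed parameter names (not the
      actual switch/cray code): walk the task IDs node by node in launch
      order and verify that they only ever increase.

      #include <stdbool.h>
      #include <stdint.h>

      static bool task_ids_monotonic(uint32_t **tids,
                                     uint16_t *ntasks_per_node, int node_cnt)
      {
          int64_t prev = -1;
          for (int n = 0; n < node_cnt; n++) {
              for (int t = 0; t < ntasks_per_node[n]; t++) {
                  if ((int64_t)tids[n][t] <= prev)
                      return false;       /* IDs jump backwards: not a simple
                                           * block distribution              */
                  prev = tids[n][t];
              }
          }
          return true;
      }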
  23. 02 Jun, 2015 3 commits
  24. 01 Jun, 2015 1 commit
  25. 30 May, 2015 1 commit
  26. 29 May, 2015 1 commit