  1. 15 Jun, 2015 1 commit
  2. 12 Jun, 2015 2 commits
  3. 11 Jun, 2015 1 commit
  4. 10 Jun, 2015 1 commit
  5. 09 Jun, 2015 2 commits
    • Search for user in all groups · 93ead71a
      David Bigagli authored
    • Fix scheduling inconsistency with GRES · e1a00772
      Morris Jette authored
      1. I submit a first job that uses 1 GPU:
      $ srun --gres gpu:1 --pty bash
      $ echo $CUDA_VISIBLE_DEVICES
      0
      
      2. while the first one is still running, a 2-GPU job asking for 1 task per node
      waits (and I don't really understand why):
      $ srun --ntasks-per-node=1 --gres=gpu:2 --pty bash
      srun: job 2390816 queued and waiting for resources
      
      3. whereas a 2-GPU job requesting 1 core per socket (so just 1 socket) actually
      gets GPUs allocated from two different sockets!
      $ srun -n 1  --cores-per-socket=1 --gres=gpu:2 -p testk --pty bash
      $ echo $CUDA_VISIBLE_DEVICES
      1,2
      
      With this change, #2 works the same way as #3 (sketched below).
      bug 1725
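      The constraint at play can be sketched as follows (a minimal
      illustration, not Slurm's actual select/gres code; the per-socket
      layout and all names are assumptions): when a job's tasks are
      confined to a subset of sockets, only the GPUs attached to those
      sockets count toward its GRES request, so a 2-GPU request bound to
      a 1-GPU socket waits.

      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical node layout: one GPU attached to each of two sockets. */
      #define NUM_SOCKETS 2
      static const int gpus_per_socket[NUM_SOCKETS] = { 1, 1 };

      /* Can gpus_wanted be satisfied using only the sockets the job's
       * tasks may use? socket_ok[i] marks a usable socket. */
      static bool gres_fits(const bool socket_ok[], int gpus_wanted)
      {
          int avail = 0;
          for (int i = 0; i < NUM_SOCKETS; i++)
              if (socket_ok[i])
                  avail += gpus_per_socket[i];
          return avail >= gpus_wanted;
      }

      int main(void)
      {
          bool one_socket[NUM_SOCKETS]   = { true, false }; /* case #2 before the fix */
          bool both_sockets[NUM_SOCKETS] = { true, true };  /* cases #2 and #3 after it */

          printf("one socket,   2 GPUs: %s\n", gres_fits(one_socket, 2)   ? "fits" : "waits");
          printf("both sockets, 2 GPUs: %s\n", gres_fits(both_sockets, 2) ? "fits" : "waits");
          return 0;
      }

      Under this reading, the fix amounts to letting case #2 draw GPUs
      from all of the node's sockets, as case #3 already did.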
  6. 05 Jun, 2015 1 commit
  7. 04 Jun, 2015 2 commits
  8. 03 Jun, 2015 1 commit
    • switch/cray: Refine PMI_CRAY_NO_SMP_ENV set · ef66b2eb
      Morris Jette authored
      switch/cray: Refine logic to set PMI_CRAY_NO_SMP_ENV environment variable.
      Rather than testing for the task distribution option, test the actual
      task IDs to see if they are monotonically increasing across all nodes.
      Based upon idea from Brian Gilmer (Cray).
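      The test the message describes can be sketched like this (a
      simplified illustration, not the actual switch/cray code; the
      task-layout arrays are assumptions): walk the global task IDs in
      node order and check that they only ever increase, which holds for
      a block distribution but not a cyclic one.

      #include <stdbool.h>
      #include <stdio.h>

      /* Return true if global task IDs, visited node by node, strictly
       * increase -- i.e. a block distribution, so the SMP ordering PMI
       * expects holds and PMI_CRAY_NO_SMP_ENV need not be set. */
      static bool task_ids_monotonic(int nnodes, const int tasks_per_node[],
                                     const int *task_ids[])
      {
          int prev = -1;
          for (int n = 0; n < nnodes; n++)
              for (int t = 0; t < tasks_per_node[n]; t++) {
                  if (task_ids[n][t] <= prev)
                      return false;      /* cyclic or otherwise interleaved */
                  prev = task_ids[n][t];
              }
          return true;
      }

      int main(void)
      {
          const int b0[] = { 0, 1 }, b1[] = { 2, 3 };  /* block layout */
          const int c0[] = { 0, 2 }, c1[] = { 1, 3 };  /* cyclic layout */
          const int *block[]  = { b0, b1 };
          const int *cyclic[] = { c0, c1 };
          const int tpn[] = { 2, 2 };

          printf("block:  %s\n", task_ids_monotonic(2, tpn, block)  ? "monotonic" : "not");
          printf("cyclic: %s\n", task_ids_monotonic(2, tpn, cyclic) ? "monotonic" : "not");
          return 0;
      }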
  9. 02 Jun, 2015 3 commits
  10. 01 Jun, 2015 1 commit
  11. 30 May, 2015 1 commit
  12. 29 May, 2015 5 commits
  13. 28 May, 2015 1 commit
  14. 27 May, 2015 1 commit
    • Map job --mem-per-cpu=0 to --mem=0. · 33c77302
      Morris Jette authored
      --mem=0 now reflects the appropriate amount of memory in the
      system, while --mem-per-cpu=0 hasn't changed. This mapping allows
      all the memory to be allocated in a cgroup without being
      "consumed", leaving it available for other jobs running on the
      same host.
      Eric Martin, Washington University School of Medicine
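      A hedged sketch of the mapping (the field and flag names are
      assumptions modeled on Slurm's convention of tagging per-CPU
      memory requests with a high bit, not the actual slurmctld code):

      #include <stdint.h>
      #include <stdio.h>

      /* Hypothetical flag bit marking pn_min_memory as a per-CPU value. */
      #define MEM_PER_CPU 0x8000000000000000ULL

      struct job_opts {
          uint64_t pn_min_memory;   /* memory request, possibly flagged per-CPU */
      };

      /* Map --mem-per-cpu=0 to --mem=0: a zero per-CPU request becomes a
       * zero per-node request, which now means all of the node's memory. */
      static void normalize_mem(struct job_opts *job)
      {
          if ((job->pn_min_memory & MEM_PER_CPU) &&
              (job->pn_min_memory & ~MEM_PER_CPU) == 0)
              job->pn_min_memory = 0;
      }

      int main(void)
      {
          struct job_opts job = { .pn_min_memory = MEM_PER_CPU };  /* --mem-per-cpu=0 */
          normalize_mem(&job);
          printf("pn_min_memory = %llu\n", (unsigned long long)job.pn_min_memory);
          return 0;
      }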
  15. 26 May, 2015 1 commit
  16. 22 May, 2015 1 commit
  17. 21 May, 2015 1 commit
  18. 20 May, 2015 2 commits
  19. 19 May, 2015 1 commit
  20. 16 May, 2015 1 commit
  21. 15 May, 2015 2 commits
  22. 14 May, 2015 2 commits
  23. 13 May, 2015 3 commits
  24. 12 May, 2015 1 commit
  25. 11 May, 2015 1 commit
    • Purge old step data on job requeue · beecc7b0
      Morris Jette authored
      Make sure that old step data is purged when a job is requeued.
      Without this logic, if a job terminates abnormally then old step
      data may be left in slurmctld. If the job is then requeued and
      started on a different node, referencing that old job step data
      can result in abnormal events. One specific failure mode: if the
      job is requeued on a node with a different number of cores and the
      step-terminated RPC arrives later, the job and step bitmaps of
      allocated cores can differ in size, generating an abort.
      bug 1660
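      A minimal sketch of the purge (illustrative only; the list and
      record types are assumptions, not slurmctld's internals): on
      requeue, free every old step record so no stale core bitmap can be
      compared against the new allocation.

      #include <stdlib.h>

      /* Hypothetical step record holding a core bitmap sized for the
       * node the step ran on. */
      struct step_record {
          struct step_record *next;
          unsigned char *core_bitmap;
      };

      struct job_record {
          struct step_record *steps;
      };

      /* Free all old step records when the job is requeued. If stale
       * records survived and the job restarted on a node with a
       * different core count, a late step-terminated RPC could compare
       * bitmaps of different sizes and trigger an abort. */
      static void purge_old_steps(struct job_record *job)
      {
          struct step_record *s = job->steps;
          while (s) {
              struct step_record *next = s->next;
              free(s->core_bitmap);
              free(s);
              s = next;
          }
          job->steps = NULL;
      }

      int main(void)
      {
          struct job_record job = { .steps = NULL };
          purge_old_steps(&job);   /* safe even with no steps recorded */
          return 0;
      }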
  26. 08 May, 2015 1 commit