Commits · ea3567c339383ece36ff460d2e34e95e57751159 · Manuel G. Marciani / ces_slurm_simulator

13 Nov, 2013 3 commits

Corrections to advanced reservation logic with overlapping jobs. · d6954b77

Morris Jette authored Nov 13, 2013

This might have worked fine for core reservations or when there
are sufficient idle nodes to use, but the select_g_resv_test()
function clears the node bitmap for nodes that it cannot use
and the reservation create logic did not restore that bitmap
after a failed resource selection attempt. This logic restores
the node bitmap on a failed call to select_g_resv_test() so we
can add nodes to the bitmap of available nodes rather than having
it repeatedly cleared.
The logic also adds some performance enhancements that I will
add to in the next commit.
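
The save-and-restore pattern described above can be sketched as follows. This is a minimal Python illustration, not the Slurm C code: `select_test` is a hypothetical stand-in for select_g_resv_test(), `pick_nodes` is an invented caller, and node bitmaps are modeled as Python sets.

```python
def select_test(avail_nodes, usable, needed):
    """Hypothetical stand-in for select_g_resv_test().

    It clears nodes it cannot use from avail_nodes in place, then
    returns the picked nodes, or None if too few nodes remain.
    """
    avail_nodes &= usable              # in-place: drops unusable nodes
    if len(avail_nodes) < needed:
        return None
    return set(sorted(avail_nodes)[:needed])


def pick_nodes(avail_nodes, candidates, usable, needed):
    """Retry selection, adding one candidate node after each failure."""
    candidates = list(candidates)
    while True:
        saved = set(avail_nodes)       # copy the bitmap before the call
        picked = select_test(avail_nodes, usable, needed)
        if picked is not None:
            return picked
        avail_nodes.clear()            # restore the bitmap after a failure
        avail_nodes |= saved
        if not candidates:
            return None
        avail_nodes.add(candidates.pop(0))
```

Because the bitmap is restored before each retry, the set of available nodes only grows between attempts instead of being cleared by every failed call.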

d6954b77

Update NEWS file for commit dc0c4e29. · 3643c8a9
David Bigagli authored Nov 12, 2013

3643c8a9

Fix bug in job step allocation failing due to memory limit · 21ed817c

Morris Jette authored Nov 12, 2013

This fixes a bug where a system is enforcing memory limits and
the job already has a step running on some of the nodes then
tries to start another step using some of those nodes. For example,
with DefMemPerNode configured and the select plugin enforcing
memory limits, try:
salloc -N2 bash
$ srun -N1 sleep 10&
$ srun -N2 hostname
Without this patch, the second srun would fail instead of pending.
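
The fail-versus-pend distinction can be illustrated with a small sketch. This is an assumed model of the intended behavior, not the Slurm implementation: the function name, its arguments, and the per-node memory dictionaries are all hypothetical.

```python
def step_start_decision(job_mem, used_mem, step_nodes, step_mem):
    """Return "run", "pend", or "fail" for a new job step.

    job_mem:  node name -> MB allocated to the job on that node
    used_mem: node name -> MB already claimed by running steps
    """
    for node in step_nodes:
        if step_mem > job_mem[node]:
            return "fail"    # can never fit within the job's allocation
    for node in step_nodes:
        if step_mem > job_mem[node] - used_mem.get(node, 0):
            return "pend"    # fits once a running step releases memory
    return "run"
```

In the scenario above, the first step holds the memory on one node, so the two-node step should be classified as "pend" rather than "fail".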

21ed817c

09 Nov, 2013 1 commit
- Updated NEWS file. · 7f0d404a
  David Bigagli authored Nov 08, 2013
  
  7f0d404a
08 Nov, 2013 1 commit
- Add news about oom check for task/cgroup and minor formatting · 51862f56
  Danny Auble authored Nov 07, 2013
  
  51862f56
05 Nov, 2013 1 commit

Correct hostlist parsing with two brackets · ed4bc269

Morris Jette authored Nov 05, 2013

Correction to hostlist parsing bug introduced in v2.6.4 for hostlists with
more than one numeric range in brackets (e.g. "rack[0-3]_blade[0-63]").
bug 505
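
To show what a hostlist with two bracketed ranges expands to, here is a minimal Python sketch. It is not Slurm's hostlist parser: `expand_hostlist` is a hypothetical helper that handles only simple `[lo-hi]` ranges and ignores comma lists and zero padding.

```python
import itertools
import re


def expand_hostlist(expr):
    """Expand one host expression with any number of [lo-hi] ranges."""
    # With two capture groups, re.split yields:
    # literal, lo, hi, literal, lo, hi, ..., literal
    parts = re.split(r"\[(\d+)-(\d+)\]", expr)
    literals = parts[0::3]
    ranges = [
        range(int(lo), int(hi) + 1)
        for lo, hi in zip(parts[1::3], parts[2::3])
    ]
    hosts = []
    # The Cartesian product covers every combination of range values.
    for combo in itertools.product(*ranges):
        name = literals[0]
        for number, text in zip(combo, literals[1:]):
            name += str(number) + text
        hosts.append(name)
    return hosts
```

For "rack[0-3]_blade[0-63]" this yields 4 x 64 = 256 names, from rack0_blade0 through rack3_blade63.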

ed4bc269

04 Nov, 2013 6 commits
- Add core specialization web page · 49e5c363
  Morris Jette authored Nov 04, 2013
```
Just a start for now
```
  49e5c363
- Added squeue format option of "%X" (core specialization count). · b4ef1b7b
  Morris Jette authored Nov 04, 2013
  
  b4ef1b7b
- Start NEWS for v13.12.0-pre5 · b74b42c6
  Morris Jette authored Nov 04, 2013
  
  b74b42c6
- Start NEWS for v2.6.5 · 67e7d739
  Morris Jette authored Nov 04, 2013
  
  67e7d739
- Add infrastructure for specialized cores · 4a9cb5a4
  Morris Jette authored Nov 04, 2013
```
Added -S/--core-spec option to salloc, sbatch and srun commands to reserve
    specialized cores for system use. Modify sview and scontrol to set/get
    core_spec
struct job_info / slurm_job_info_t: Added core_spec
struct job_descriptor / job_desc_msg_t: Added core_spec
```
  4a9cb5a4
- Updated the QOS limits documentation and man page. · 1ad0b7f9
  David Bigagli authored Nov 04, 2013
  
  1ad0b7f9
03 Nov, 2013 1 commit

Enlarge max job array task ID to 32-bits · 494f6771

jette authored Nov 02, 2013

The system really can not handle larger job arrays without adding
a job array data structure, but this puts some of the infrastructure
in place now.

494f6771

02 Nov, 2013 1 commit
- Added configuration parameter FairShareDampeningFactor · 790da19c
  Martins Innus authored Nov 01, 2013
```
to offer a greater priority range based upon utilization.
```
  790da19c
01 Nov, 2013 3 commits

Fix for used_cpu_run_secs bad calculation · 0da4d951

Morris Jette authored Nov 01, 2013

Add argument to priority plugin's priority_p_reconfig function to note
when the association and QOS used_cpu_run_secs field has been reset.
Without this flag, we remove time on "scontrol setdebug" or
"scontrol setdebugflag", which can result in used_cpu_run_secs
going negative or otherwise getting bad values.
Correction to logic added in commit 6d793189
bug 423

0da4d951

Fix for used_cpu_run_secs bad calculation · f247ff3a

Morris Jette authored Nov 01, 2013

Add argument to priority plugin's priority_p_reconfig function to note
when the association and QOS used_cpu_run_secs field has been reset.
Without this flag, we remove time on "scontrol setdebug" or
"scontrol setdebugflag", which can result in used_cpu_run_secs
going negative or otherwise getting bad values.
Correction to logic added in commit 6d793189
bug 423

f247ff3a

sched/wiki, sched/wiki2 - Fix to allow job start · 1f2348ab

Morris Jette authored Nov 01, 2013

Fix to work with the scheduling logic change introduced in Slurm
version 2.6.3, which prevented Maui/Moab from starting jobs.

1f2348ab

31 Oct, 2013 1 commit
- Update NEWS file. · 3c078df9
  David Bigagli authored Oct 29, 2013
  
  3c078df9
30 Oct, 2013 3 commits
- Add sgather command · be166e0a
  Matthias Jurenz authored Oct 30, 2013
  
  be166e0a
- qsub command enhancements · f1ac8d11
  Morris Jette authored Oct 29, 2013
```
Add support for -W block=true	(wait for job completion)
Clear PBS_NODEFILE environment variable
Credit to NCSC
```
  f1ac8d11
- Add support for dependency on full job array · 8a09fd4b
  Morris Jette authored Oct 28, 2013
  
  8a09fd4b
29 Oct, 2013 3 commits
- Update NEWS file. · 4a8d73ec
  David Bigagli authored Oct 29, 2013
  
  4a8d73ec
- qsub command enhancements · c269f471
  Morris Jette authored Oct 29, 2013
```
Add support for -W block=true	(wait for job completion)
Clear PBS_NODEFILE environment variable
Credit to NCSC
```
  c269f471
- Add support for dependency on full job array · a4cbabca
  Morris Jette authored Oct 28, 2013
  
  a4cbabca
28 Oct, 2013 5 commits
- Add job array dependency support · 4535783d
  Morris Jette authored Oct 28, 2013
```
Add support for dependencies of job array elements (e.g.
"sbatch --depend=afterok:123_4 ..."). This does not support
dependencies of ALL job array elements, only individual job
array elements.
```
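
The dependency syntax in the message above ("afterok:123_4") can be taken apart as in the following sketch. This is only an illustration of the string format: `parse_dependency` is a hypothetical helper, not a Slurm function, and it handles a single "type:jobid[_taskid]" term.

```python
def parse_dependency(spec):
    """Split "type:jobid[_taskid]" into its parts."""
    dep_type, _, job_part = spec.partition(":")
    job_id, sep, task_id = job_part.partition("_")
    return {
        "type": dep_type,
        "job_id": int(job_id),
        # task_id is None when no array element was named
        "task_id": int(task_id) if sep else None,
    }
```

Under this sketch, "afterok:123_4" names task 4 of job array 123, while "afterok:123" carries no task ID.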
  4535783d
- BLUEGENE - fix issue where node count wasn't set up correctly when srun · cb143377
  Danny Auble authored Oct 28, 2013
```
performs the allocation, regression in 2.6.3.
```
  cb143377
- smap - Avoid invalid memory reference with hidden nodes · 53116d6f
  Morris Jette authored Oct 28, 2013
  
  53116d6f
- Fix sacctmgr modify qos set preempt+/-=. · b7d44bd6
  Danny Auble authored Oct 28, 2013
  
  b7d44bd6
- smap - Avoid invalid memory reference with hidden nodes · a9a70c1a
  Morris Jette authored Oct 28, 2013
  
  a9a70c1a
25 Oct, 2013 4 commits

Permit changes to slurmd debug level · 43643ba7

Morris Jette authored Oct 25, 2013

Previously the SlurmdDebug value in slurm.conf was ignored if the
previous value was not 3/init

43643ba7

Multi-thread the sinfo command (one thread per partition) · 17449c06

Morris Jette authored Oct 25, 2013

Effect is minimal without multiple partitions and larger system sizes.
With 40 partitions and about 600 nodes each, time goes from about
13 secs to 4 secs.
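
The one-thread-per-partition structure can be sketched in Python as below. This shows only the shape of the approach, not sinfo's C code: `summarize_partition` and the node-state dictionaries are hypothetical, and in CPython the global interpreter lock limits any speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_partition(name, nodes):
    """Hypothetical per-partition work: count nodes by state."""
    counts = {}
    for state in nodes.values():
        counts[state] = counts.get(state, 0) + 1
    return name, counts


def summarize_all(partitions):
    """Process every partition in its own worker thread."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(summarize_partition, name, nodes)
            for name, nodes in partitions.items()
        ]
        # Collect results in submission order once each thread finishes.
        return dict(future.result() for future in futures)
```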

17449c06

Sinfo performance improvements · fd3b75a9

Morris Jette authored Oct 25, 2013

This avoids building hostlist information with NodeHostName and
NodeAddr information unless explicitly requested and can improve
performance for the default mode of operation by about 65%.

fd3b75a9

Correct stdout/err name with job id · 2c2fd9f0

Morris Jette authored Oct 24, 2013

Correct sbatch documentation and job_submit/pbs plugin: "%j" is job ID,
not "%J" (which is job_id.step_id).
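
The difference between the two specifiers can be shown with a small sketch. `expand_output_name` is a hypothetical helper that handles only "%j" and "%J" as described above. It is not the sbatch implementation.

```python
import re


def expand_output_name(pattern, job_id, step_id=0):
    """Expand "%j" (job ID) and "%J" (job_id.step_id) in a file name."""
    values = {"j": str(job_id), "J": f"{job_id}.{step_id}"}
    # Replace each specifier with its value; other text is left alone.
    return re.sub(r"%([jJ])", lambda m: values[m.group(1)], pattern)
```

With job ID 1234 and step ID 2, "slurm-%j.out" becomes "slurm-1234.out" and "slurm-%J.out" becomes "slurm-1234.2.out".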

2c2fd9f0

24 Oct, 2013 1 commit

Improve setting of job wait "Reason" field. · cf7ca59b

Morris Jette authored Oct 24, 2013

Without this change, a job's wait reason of WAIT_PART_DOWN,
WAIT_PART_INACTIVE, WAIT_PART_NODE_LIMIT, WAIT_PART_TIME_LIMIT, or
WAIT_QOS_THRES would not be cleared when that reason no longer
applied.

cf7ca59b

23 Oct, 2013 4 commits

proctrack/cgroup - Fix for race condition · c36d564b

Morris Jette authored Oct 22, 2013

Add cgroup create retry logic in case one step is starting at the
same time as another step is ending and the logic to create
and delete cgroups overlaps.
bug 447
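
Retry logic of the kind described above generally looks like the following sketch. This is a generic Python illustration, not the proctrack/cgroup code: the `retry` helper, its attempt count, and its delay are assumptions chosen for the example.

```python
import time


def retry(operation, attempts=5, delay=0.01, retry_on=(OSError,)):
    """Call operation() until it succeeds or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except retry_on as err:
            # A concurrent delete may have raced with the create;
            # wait briefly and try again.
            last_error = err
            time.sleep(delay)
    raise last_error
```

A create call that fails only while another step's cleanup is in progress would succeed on a later attempt, while a persistent error is still reported after the final try.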

c36d564b

Problem allocating threads with GPUs · 52c2e27f

Morris Jette authored Oct 22, 2013

If a node has GRES and multiple threads per core the select/cons_res
plugin can get stuck in an infinite loop.
See bug 475
Contributed by:
PREVOST Ludovic
NEC HPC Europe

52c2e27f

Document latest patch changes in NEWS · e93a6543
Morris Jette authored Oct 22, 2013

e93a6543
Enforce JobRequeue configuration parameter · d7dfa58e
Morris Jette authored Oct 21, 2013
```
Previously a node failure would always requeue the job
```
d7dfa58e

22 Oct, 2013 2 commits

proctrack/cgroup - Fix for race condition · 260c5485

Morris Jette authored Oct 22, 2013

Add cgroup create retry logic in case one step is starting at the
same time as another step is ending and the logic to create
and delete cgroups overlaps.
bug 447

260c5485

Problem allocating threads with GPUs · dab7fb02

Morris Jette authored Oct 22, 2013

If a node has GRES and multiple threads per core the select/cons_res
plugin can get stuck in an infinite loop.
See bug 475
Contributed by:
PREVOST Ludovic
NEC HPC Europe

dab7fb02