Commits · aa8e46a1ef08c67310c7cabd0aa84bcf8306da50 · Manuel G. Marciani / ces_slurm_simulator

08 Dec, 2015 6 commits
- Add KNL documentation · aa8e46a1
  Morris Jette authored Dec 08, 2015
  
  aa8e46a1
- KNL: Add node features based upon slurm.conf · 0c60acef
  Morris Jette authored Dec 08, 2015
  
  0c60acef
- Refactor knl.conf reading logic · 804e2b68
  Morris Jette authored Dec 08, 2015
  
  804e2b68
- job_submit/knl: add job_modify support · cbef7767
  Morris Jette authored Dec 08, 2015
  
  cbef7767
- Set job's KNL features on submit · 383a0422
  Morris Jette authored Dec 08, 2015
  
  383a0422
- Start of KNL infrastructure · c12b3b01
  Morris Jette authored Dec 07, 2015
  
  c12b3b01
07 Dec, 2015 11 commits
- Merge branch 'slurm-15.08' · 5955a91f
  Morris Jette authored Dec 07, 2015
  
  5955a91f
- Document Elasticsearch DebugFlag · a7072f68
  Morris Jette authored Dec 07, 2015
  
  a7072f68
- Fix for commit 02b0923a to work on systems without lua · 10252ab7
  Danny Auble authored Dec 07, 2015
  
  10252ab7
- Rewrite of 9d862587 · 02b0923a
  Danny Auble authored Dec 07, 2015
  
  02b0923a
- Merge branch 'slurm-15.08' · 427fa736
  Morris Jette authored Dec 07, 2015
```
Conflicts:
	src/plugins/jobcomp/elasticsearch/jobcomp_elasticsearch.c
```
  427fa736
- jobcomp/elasticsearch: reduce verbose logging · f25aefd8
  Morris Jette authored Dec 07, 2015
```
Avoid logging every time that some interaction with the elasticsearch
server happens. That would generate too many log messages.
Fix alignment problem in the code without changing logic.
```
  f25aefd8
- Add DEBUG_FLAG_ESEARCH · 0a499a2e
  Alejandro Sanchez authored Dec 07, 2015
  
  0a499a2e
- jobcomp/elasticsearch: limit size of job queue · 494f280c
  Alejandro Sanchez authored Dec 07, 2015
```
Added a limit (MAX_JOBS, 100000) on the number of jobs enqueued and waiting to be indexed
```
  494f280c
- jobcomp/elasticsearch: Reduced MAX_STR_LEN to 10240 (10KB) · d37c72d1
  Alejandro Sanchez authored Dec 07, 2015
  
  d37c72d1
- update job/script file synchronization · 41eead64
  Morris Jette authored Dec 07, 2015
```
When slurmctld is reconfigured, it makes sure that every pending
batch job has a script and, if not, marks the job as failed. If the
script is present with no job, then it purges the file/directory.

This rewrite of the logic takes advantage of hash tables to
improve the performance of the algorithm. It is an improvement
on the logic in commit 862f3c8e and will work with the version
15.08 code base as well.
Time for slurmctld reconfig with 110,000 jobs/files:
37.8 sec with original code
2.2  sec with commit 862f3c8e
0.8  sec with latest code
bug 2220
```
  41eead64
- [PATCH] Fix documentation for AllowUsers in burst_buffer.conf. · df393a48
  Tim Wickberg authored Dec 07, 2015
```
Usernames are comma separated, not colon delimited. Bug #2222.
While here fix a few spelling mistakes.
```
  df393a48
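
The hash-table approach described for commit 41eead64 can be illustrated with a minimal Python sketch. This is not Slurm's C implementation: the function name `sync_jobs_and_scripts`, the state string `"PENDING"`, and the data shapes (a dict of job ID to state, and a list of job IDs that have a script directory) are assumptions made for illustration. Hash-based lookups replace the nested loops over jobs and files.

```python
def sync_jobs_and_scripts(jobs, script_dirs):
    """Return (jobs_to_fail, dirs_to_purge).

    jobs: dict mapping job ID -> state string (hypothetical shape).
    script_dirs: iterable of job IDs that have a script directory on disk.
    """
    script_ids = set(script_dirs)  # hash set: O(1) average membership test

    # A pending batch job without a script is marked as failed.
    jobs_to_fail = sorted(
        job_id
        for job_id, state in jobs.items()
        if state == "PENDING" and job_id not in script_ids
    )
    # A script directory with no matching job (in any state) is purged.
    dirs_to_purge = sorted(d for d in script_ids if d not in jobs)
    return jobs_to_fail, dirs_to_purge
```

Each job and each directory is examined once, so the work grows roughly linearly with the number of jobs plus files rather than with their product.
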
06 Dec, 2015 9 commits

Merge branch 'slurm-15.08' · 4d977cd8
jette authored Dec 06, 2015

4d977cd8
[PATCH] Fix issue in slurm_cred_copy · 672983e6
Sourav Chakraborty authored Dec 06, 2015
```
Fixed a typo that was causing slurm_cred_copy to segfault
```
672983e6

Fix tests for odd configuration · 3d4b308a

jette authored Dec 05, 2015

This adds some job memory limits to avoid tests failing with a
configuration having SelectTypeParameters=CR_Memory and NO
default memory size.

3d4b308a

Merge branch 'slurm-15.08' · 2b0090c2
jette authored Dec 05, 2015

2b0090c2
Fix test for memory limit enforcement · 94250943
jette authored Dec 05, 2015

94250943
Merge branch 'slurm-15.08' · 94b0a179
jette authored Dec 05, 2015
```
Conflicts:
	src/plugins/jobcomp/elasticsearch/jobcomp_elasticsearch.c
```
94b0a179

Insure job signal resent on requeue · d0692ec2

Tim Wickberg authored Dec 05, 2015

If a job is configured to be sent a signal when approaching its
timeout and the job is requeued after the signal is sent, then
resend the signal at the appropriate time after requeue.
bug 1415

d0692ec2
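
The behavior described for commit d0692ec2 can be sketched as follows. This is a hypothetical model, not Slurm code: the `Job` class, its field names, and the idea of tracking a single `warn_sent` flag are assumptions chosen to show why a requeue has to clear the "already signaled" state.

```python
from dataclasses import dataclass


@dataclass
class Job:
    start_time: int         # seconds, when the current run started
    time_limit: int         # seconds the job may run
    warn_time: int          # seconds before the limit to send the signal
    warn_sent: bool = False

    def signal_due(self, now):
        """True if the warning signal should be sent at time `now`."""
        deadline = self.start_time + self.time_limit - self.warn_time
        return not self.warn_sent and now >= deadline

    def mark_sent(self):
        self.warn_sent = True

    def requeue(self, new_start_time):
        """Restart the job; clear the flag so the signal is sent again."""
        self.start_time = new_start_time
        self.warn_sent = False
```

Without the reset in `requeue`, the flag from the first run would suppress the signal for the second run.
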

Speed up backfill schedule logic · 91d0e7f5

jette authored Dec 05, 2015

If the backfill scheduler cannot start at the appointed time, say
because there are too many pending RPCs, then only sleep 1 second
and try again rather than the full bf_interval (30 seconds by default).

91d0e7f5
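
The retry behavior described for commit 91d0e7f5 can be sketched in a few lines of Python. The function and parameter names (`backfill_loop`, `can_start_fn`, `run_pass_fn`, `sleep_fn`) are hypothetical; only the 30-second default for bf_interval and the 1-second retry sleep come from the commit message.

```python
import time

BF_INTERVAL = 30  # seconds, default bf_interval per the commit message
RETRY_DELAY = 1   # seconds, short sleep when the pass cannot start


def backfill_loop(can_start_fn, run_pass_fn, passes, sleep_fn=time.sleep):
    """Run `passes` backfill passes, retrying quickly when blocked.

    can_start_fn: returns False when the scheduler must defer, for
                  example because too many RPCs are pending (assumed check).
    run_pass_fn:  performs one scheduling pass.
    sleep_fn:     injectable so the loop can be exercised without waiting.
    """
    done = 0
    while done < passes:
        if can_start_fn():
            run_pass_fn()
            done += 1
            sleep_fn(BF_INTERVAL)  # normal wait between passes
        else:
            sleep_fn(RETRY_DELAY)  # blocked: retry soon, not a full interval
```
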

Remove redundant function · 88be9c18

jette authored Dec 05, 2015

Same logic was in both backfill plugin and slurmctld/job_scheduler.c
No change in logic

88be9c18

05 Dec, 2015 3 commits
- Add new hook to Task plugin to be able to put adopted processes in the step_extern cgroups. · 7f39ab4f
  Brian Christiansen authored Dec 04, 2015
```
Bug 2130
```
  7f39ab4f
- Let devices step_extern cgroup inherit attributes of job cgroup. · 3101754f
  Brian Christiansen authored Dec 04, 2015
```
Adopted processes didn't have access to the job's devices.

Bug 2130
```
  3101754f
- Minor doc updates. · e50cba7a
  Brian Christiansen authored Dec 04, 2015
  
  e50cba7a
04 Dec, 2015 6 commits

Streamline job array/file test logic · 862f3c8e

Morris Jette authored Dec 04, 2015

When slurmctld processes a SIGHUP or "scontrol reconfig" it
makes sure that every pending batch job has a directory containing
script and environment files, cancelling any job that lacks the
files and purging any files without a job. When there are a large
number of jobs (in any state) and a large number of files, the
nested loops to do this can run for a very long time (minutes).
This change streamlines the processing for job arrays, which now
avoid creating their own script and environment files by using those
of their job array master record.

862f3c8e

Fix abort of srun if using PrologFlags=NoHold · 64550ed1

Danny Auble authored Dec 04, 2015

Full revert of c2fbf88f. Commit 13b64c35 had caught part of this, but
this will revert it completely. The code just wasn't needed in modern
Slurm. It appears the patch came from an older version of Slurm that
didn't handle this correctly.

64550ed1

Revert "Wait for prolog is already running." · dbe9de84

David Bigagli authored Dec 03, 2015

This reverts commit 29f25688.

Conflicts:
	NEWS

Looks like this isn't needed. Commit c2fbf88f doesn't appear to be needed
and is what is causing this issue. c2fbf88f was added from an older
version of Slurm where this was already handled correctly in commit
815e5a44.

dbe9de84

rewrite of . It appeared with the referenced commit we would · 9f98610d

Danny Auble authored Dec 04, 2015

get messages like

slurmctld: error: _handle_assoc_tres_run_secs: job 33355: assoc 1 TRES node grp_used_tres_run_secs underflow, tried to remove 21 seconds when only 0 remained.

which are bad.

9f98610d

jobcomp/elasticsearch: limit script pack size to 1MB · 04fb8d40
Morris Jette authored Dec 04, 2015

04fb8d40
jobcomp/elasticsearch: Add timer to retry logic · 8d6232ba
Alejandro Sanchez authored Dec 04, 2015
```
If a store fails, do not retry for 30 seconds
```
8d6232ba
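
The retry timer described for commit 8d6232ba can be sketched as a small gate object. This is an illustration only: the class `RetryGate` and its method names are hypothetical and do not come from the jobcomp/elasticsearch plugin; the 30-second wait is the only value taken from the commit message.

```python
RETRY_WAIT = 30  # seconds to wait after a failed store, per the commit message


class RetryGate:
    """Blocks new store attempts for a fixed time after a failure."""

    def __init__(self, wait=RETRY_WAIT):
        self.wait = wait
        self.last_failure = None  # timestamp of the most recent failure

    def may_attempt(self, now):
        """True if a store may be attempted at time `now` (seconds)."""
        if self.last_failure is None:
            return True
        return now - self.last_failure >= self.wait

    def record_failure(self, now):
        self.last_failure = now

    def record_success(self):
        self.last_failure = None
```

A caller would check `may_attempt` before each store and call `record_failure` or `record_success` with the outcome.
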

03 Dec, 2015 5 commits

Change cray NHC timing · 0f4dd55f

Morris Jette authored Dec 03, 2015

Cray job NHC delayed until after burst buffer released and epilog completes
    on all allocated nodes.
bugs 2099 and 2192

0f4dd55f

Merge remote-tracking branch 'origin/slurm-15.08' · 993958e2
Brian Christiansen authored Dec 03, 2015

993958e2

burst_buffer work · 2782d824

Morris Jette authored Dec 03, 2015

Add new buffer state of BB_STATE_POST_RUN indicating the post_run
  operation has been started
Add new bb_p_job_test_post_run function to test status of post_run
  operation
Execute post_run operation before the stage_out operation to optimize
  use of compute nodes (rather than burst buffer space)

2782d824

Move license release logic · 2e3b58ed

Morris Jette authored Dec 03, 2015

Release a job's allocated licenses only after epilog runs on all nodes
    rather than at start of termination process.
bug 2192

2e3b58ed
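
The ordering described for commit 2e3b58ed (licenses are released only once the epilog has run on every node) can be sketched as below. The dict layout and the function name `on_epilog_complete` are assumptions for illustration and do not reflect slurmctld's data structures.

```python
def on_epilog_complete(job, node, pool):
    """Record that the epilog finished on `node`; release licenses when
    no node of the job still has an epilog pending.

    job:  dict with keys "epilog_pending" (set of node names),
          "licenses" (dict name -> count), "licenses_released" (bool).
    pool: dict mapping license name -> number currently available.
    """
    job["epilog_pending"].discard(node)
    if job["epilog_pending"] or job["licenses_released"]:
        return  # still waiting on other nodes, or already released
    for name, count in job["licenses"].items():
        pool[name] = pool.get(name, 0) + count
    job["licenses_released"] = True
```

The `licenses_released` flag guards against returning the same licenses twice if a completion message is processed more than once.
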

CompleteWait fix · 3d86ac42

Morris Jette authored Dec 02, 2015

sched/backfill - Delay backfill scheduler for completing jobs only if
    CompleteWait configuration parameter is set (make code match documentation).

3d86ac42