Commits · a034e4ab8080b49c8725b467ce2b7401a576b71d · Manuel G. Marciani / ces_slurm_simulator

19 Jun, 2017 3 commits
- Better checking when a job is finishing to avoid underflow on job's · a034e4ab
  Danny Auble authored Jun 19, 2017
```
submitted to a QOS/association.

Bug 3849
```
  a034e4ab
- Add comment to explain concerning assignment. · ceb491a9
  Danny Auble authored Jun 19, 2017
  
  ceb491a9
- Correct ClusterName state file check logging · 64c3b421
  Morris Jette authored Jun 19, 2017
```
Correct error message when ClusterName in configuration files does not match
    the name in the slurmctld daemon's state save file.
```
  64c3b421
15 Jun, 2017 2 commits
- Fix --ntasks-per-core option/environment variable parsing to set · 1b2a6dc3
  Danny Auble authored Jun 15, 2017
```
the requested value, instead of always setting one.

This would make --hint=multithread not work at all.

See Bug 3855 (commit 3c852da1)

Issue originated from commit 82a959a8.
```
  1b2a6dc3
- Fix for job step task layout with --cpus-per-task option · 81372cc0
  Dominik Bartkiewicz authored Jun 15, 2017
```
bug 3447
```
  81372cc0
14 Jun, 2017 3 commits
- Only make the extern step at job creation. · 9d32c100
  Danny Auble authored Jun 14, 2017
```
Turns out if the extern step is created here and the job was killed long
beforehand, the step is made erroneously and can cause an assert just lines
later.

Bug 3898
```
  9d32c100
- Continuation of last commit. This allows srun inside an allocation to · 25099c1a
  Danny Auble authored Jun 13, 2017
```
specify an alternative --ntasks-per-*
```
  25099c1a
- Make sure srun inside an allocation gets --ntasks-per-[core|socket] · a3a3d368
  Tim Shaw authored Jun 13, 2017
```
set correctly.

Bug 3858
```
  a3a3d368
13 Jun, 2017 4 commits

Add missing NEWS entry for 23721c4c. · 7add853c
Tim Wickberg authored Jun 13, 2017

7add853c

Add LaunchParameters option of cray_net_exclusive. · 23721c4c

Tim Wickberg authored May 19, 2017

Changes the alpsc_configure_nic() call to set the exclusive flag,
and 100 for both the cpu and memory scaling values.

Should only be used with exclusive jobs without concurrent steps
running on a node, otherwise oversubscription of the GNI resources
can occur leading to performance issues.

Bug 3713.

23721c4c

Improve slurmd startup on large systems (> 10000 nodes) · 9b7210ef

Danny Auble authored Jun 13, 2017

What this does is populate the node_hash_table as nodes are being read in
instead of after the node_record_table_ptr has been fully populated.

This speeds up a start of a slurmd with a system of 10000 nodes from
> 1 minute to less than a second.

In 17.11 we will remove the linear xstrcmp check as it should no longer be
needed.

Bug 3885

9b7210ef
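
As a rough illustration of the change described in 9b7210ef, the Python sketch below compares indexing each node as it is read with building the index only after the whole table is loaded. The per-node lookup during loading, the function name `load_nodes`, and the comparison counter are assumptions made for illustration; the actual Slurm code is C and is not reproduced here.

```python
def load_nodes(names, index_while_reading):
    """Load node names and count string comparisons spent on lookups.

    `records` loosely stands in for node_record_table_ptr and `index`
    for node_hash_table. The lookup performed for every node during
    loading is a hypothetical stand-in for whatever work needs to find
    a node by name while the table is still being read.
    """
    records = []
    index = {}
    comparisons = 0

    for name in names:
        records.append(name)
        if index_while_reading:
            # New behaviour: the index is usable immediately.
            index[name] = len(records) - 1

        pos = index.get(name)
        if pos is None:
            # Index not populated yet: fall back to a linear scan,
            # comparable to a linear string-compare check.
            for i, rec in enumerate(records):
                comparisons += 1
                if rec == name:
                    pos = i
                    break
        assert records[pos] == name

    if not index_while_reading:
        # Old behaviour: build the index only once everything is read.
        index = {name: i for i, name in enumerate(records)}

    return index, comparisons
```

Under these assumptions the late-index variant performs n(n+1)/2 comparisons for n nodes, while the incremental variant performs none, which is the kind of quadratic-versus-linear gap that would be consistent with the startup times quoted in the message.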

Move function to be in the static functions (and defined that way). No real · 2b2d5e33
Danny Auble authored Jun 13, 2017
```
code change.

Bug 3885
```
2b2d5e33

12 Jun, 2017 5 commits
- Typo, double the word "any" · 4d851688
  bamb0u authored Jun 12, 2017
  
  4d851688
- Note when a job finishes in the slurmd to avoid a race when launching a · 55bcec87
  Danny Auble authored Jun 12, 2017
```
batch job takes longer than it takes to finish.

Bug 3833
```
  55bcec87
- Increase number of jobs that are tracked in the slurmd as finishing at one · ccfe2552
  Danny Auble authored Jun 12, 2017
```
time.

Bug 3833
```
  ccfe2552
- Fix bug in task/affinity that could result in slurmd fatal error · 6dd2be3b
  Morris Jette authored Jun 12, 2017
```
An array was only being partially cleared due to bad logic
bug 3876
```
  6dd2be3b
- Only set kmem cgroup limit if ConstrainKmemSpace=yes · ba32ac48
  Tim Wickberg authored Jun 09, 2017
```
Bug 3874.
```
  ba32ac48
09 Jun, 2017 1 commit
- Correction to commit 47b5fe60 to eliminate memory leak · a54455c4
  Morris Jette authored Jun 09, 2017
  
  a54455c4
08 Jun, 2017 2 commits

Improve preempted job selection logic · 47b5fe60

Dominik Bartkiewicz authored Jun 08, 2017

Improve selection of jobs to preempt when there are multiple partitions
with jobs subject to preemption.
bug 3824

47b5fe60

Handle update of blocking QOS pointers correctly. · 5e92a3f5
Dominik Bartkiewicz authored Jun 08, 2017
```
Prevent segfault from pointer dereference to the QOS that is
being deleted.

Fix to commit 3e8aa451.
```
5e92a3f5

07 Jun, 2017 2 commits
- Update NEWS for 17.02.5. · 02106184
  Tim Wickberg authored Jun 07, 2017
  
  02106184
- Update META for v17.02.4 tag · 3db13036
  Tim Wickberg authored Jun 07, 2017
  
  3db13036
06 Jun, 2017 1 commit
- Add X11 SPANK plugin build instructions to FAQ · 39d0a10c
  Morris Jette authored Jun 06, 2017
  
  39d0a10c
03 Jun, 2017 1 commit

Fix issue with sacctmgr show where user='' · bef69448

Danny Auble authored Jun 02, 2017

Fix regression from commit c05dcb8a (bug 1923) that doesn't take
into consideration a blank char * as a valid option.

This fixes the scenario like

sacctmgr list associations user=''

which would only print account associations.
Bug 3862

bef69448
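
The distinction in bef69448 between "no value given" and "a blank value given" can be sketched in Python, where `None` plays the role of a missing option and `""` the role of a blank `char *`. This reads the commit message as saying that `user=''` is meant to select only account-level associations (those with no user); that reading, the sample data, and both function names are assumptions for illustration and not sacctmgr code.

```python
# Hypothetical association records; user == "" marks an account-level entry.
ASSOCIATIONS = [
    {"account": "physics", "user": ""},
    {"account": "physics", "user": "alice"},
    {"account": "chem", "user": ""},
    {"account": "chem", "user": "bob"},
]


def select_buggy(assocs, user=None):
    """Treats a blank string the same as no filter at all."""
    if not user:  # "" is falsy, so the blank filter is silently dropped
        return list(assocs)
    return [a for a in assocs if a["user"] == user]


def select_fixed(assocs, user=None):
    """Accepts a blank string as a valid filter value."""
    if user is None:  # only a truly absent option disables the filter
        return list(assocs)
    return [a for a in assocs if a["user"] == user]
```

With this sample data, `select_fixed(ASSOCIATIONS, user="")` returns only the two account-level records, whereas `select_buggy` returns all four.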

02 Jun, 2017 2 commits

If trying to cancel a step that hasn't started yet for some reason return · eed76f85

Danny Auble authored Jun 02, 2017

a good return code.

This also fixes the situation where the step was ending but not yet ended
so it sends the KILL_TASK_FAILED error instead of JOB_NOTRUNNING.

Also it removes the abort in favor of exit, which it should have been anyway.

Bug 3758

eed76f85

Fix regression from commit 3e8aa451. (wrong list given in · 59a820ad
Dominik Bartkiewicz authored Jun 02, 2017
```
list_for_each)
```
59a820ad

01 Jun, 2017 10 commits

Fix --ntasks-per-core parsing in sbatch command. · 3c852da1
Mark Klein authored Jun 01, 2017
```
Inadvertently set to one when requested.

Bug 3855.
```
3c852da1
Always generate core in slurmd, even on non-developer builds. · 7d488e2b
Tim Wickberg authored Jun 01, 2017
```
Bug 3857.
```
7d488e2b
Note why we use list_dequeue here. · dd0f7e4e
Danny Auble authored Jun 01, 2017

dd0f7e4e
While not needed, put a list_destroy function in the creation of · 7d94f6d3
Danny Auble authored Jun 01, 2017
```
purge_files_list.
```
7d94f6d3
Put in the word 'extern' for consistency's sake. · 6c45c680
Danny Auble authored Jun 01, 2017

6c45c680

Handle file deletion for purge_old_job() in a separate thread. · b9719be2

Tim Wickberg authored May 24, 2017

File deletion can be slow, especially when StateSaveLocation is on
NFS or other network filesystems. Since purge_old_job() holds all
the slurmctld write locks, this is especially performance sensitive.

Moving this to an independent thread lets the slower filesystem
cleanup happen without owning these locks. purge_old_job() then
results in the purged job ids being queued in the purge_list.

A race with the job id potentially wrapping around again is already
prevented by _dup_job_file_test() in get_next_job_id().

Bug 3763.

b9719be2
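
The pattern described in b9719be2 (drop the job records while holding the lock, queue the file cleanup, and let another thread do the slow deletes) can be sketched in Python as follows. The names `state_lock`, `purge_queue`, `purge_old_jobs`, `purge_worker` and the `job.<id>` file layout are hypothetical and chosen for illustration; this is not the slurmctld implementation.

```python
import os
import queue
import tempfile
import threading

state_lock = threading.Lock()  # stands in for the slurmctld write locks
purge_queue = queue.Queue()    # stands in for the list of purged job ids


def purge_old_jobs(jobs, state_dir):
    """Remove finished jobs from the table; only queue their files."""
    with state_lock:
        finished = [job_id for job_id, done in jobs.items() if done]
        for job_id in finished:
            del jobs[job_id]
            # Cheap while locked: no filesystem access happens here.
            purge_queue.put(os.path.join(state_dir, f"job.{job_id}"))


def purge_worker():
    """Delete queued files without holding state_lock."""
    while True:
        path = purge_queue.get()
        if path is None:  # sentinel: stop the worker
            break
        try:
            os.remove(path)  # the potentially slow (e.g. NFS) operation
        except FileNotFoundError:
            pass


def run_demo():
    """Create three job files, purge the finished ones, report what is left."""
    with tempfile.TemporaryDirectory() as state_dir:
        jobs = {1: True, 2: False, 3: True}  # job id -> finished?
        for job_id in jobs:
            open(os.path.join(state_dir, f"job.{job_id}"), "w").close()

        worker = threading.Thread(target=purge_worker, daemon=True)
        worker.start()
        purge_old_jobs(jobs, state_dir)
        purge_queue.put(None)
        worker.join()
        return sorted(jobs), sorted(os.listdir(state_dir))
```

The point of the design is that the critical section only touches in-memory state, so a slow filesystem delays the worker thread rather than everything waiting on the lock.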

Make _delete_job_details a static function. · ce2cd1b2
Tim Wickberg authored May 24, 2017
```
Only called from _list_delete_job once the MinJobAge has
passed.
```
ce2cd1b2

Remove timeout code from job_purge_old. · 843e5d38

Tim Wickberg authored May 24, 2017

This will need to be handled differently. The timeout can
lead to the purge process falling further and further behind
on high throughput systems if the number of job scripts that
can be deleted within a second is lower than the job submission
and completion rate of the cluster, eventually leading to
the MaxJobCount limit being reached.

Bug 3763.

843e5d38
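
The falling-behind argument in 843e5d38 is simple arithmetic, sketched below in Python. The model (one purge pass per second with a fixed deletion budget) and all of the numbers are assumptions for illustration, not measurements from any cluster.

```python
def backlog_after(seconds, completed_per_sec, deletable_per_sec, start=0):
    """Return how many purgeable jobs are still waiting after `seconds`.

    Each simulated second, `completed_per_sec` jobs become purgeable and
    the time-limited purge pass removes at most `deletable_per_sec`.
    """
    backlog = start
    for _ in range(seconds):
        backlog += completed_per_sec
        backlog -= min(backlog, deletable_per_sec)
    return backlog
```

For example, with 50 jobs completing per second and a budget of 40 deletions per second, `backlog_after(3600, 50, 40)` gives 36000 undeleted jobs after an hour, and the backlog keeps growing toward whatever job-count limit applies. With the rates reversed, the backlog stays at 0.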

Better commit from last · cff4e661
Danny Auble authored Jun 01, 2017

cff4e661
Update test to print warning about rxvt/aterm ignoring SIGFPE. · 3b20bc1d
Danny Auble authored Jun 01, 2017

3b20bc1d

31 May, 2017 4 commits
- Fix test to print out the slurmd node name instead of gethostname so · 9f2b3ffe
  Danny Auble authored May 31, 2017
```
it works better on multi-slurmd installs.
```
  9f2b3ffe
- Docs - clarify performance issues from ConstrainRAMSpace. · 2e833147
  Tim Wickberg authored May 31, 2017
```
Revert some of my b50f4661. Elaborate on tradeoffs, and point
to HTC page as well which is a better location for this info.
```
  2e833147
- Add warning about libcurl-devel not being installed during configure. · 0e582365
  Danny Auble authored May 31, 2017
  
  0e582365
- Docs - remove reference to ConstrainRAMSpace in HTC. · b50f4661
  Tim Wickberg authored May 31, 2017
```
This is better discussed in the high_throughput.shtml doc.

Also, "Contrain" is misspelled adding to the confusion.
```
  b50f4661