Commits · f2757d9a1fba579c358535eabe9e4bc74bedef4d · Manuel G. Marciani / ces_slurm_simulator

29 Sep, 2016 8 commits
- Add a couple of examples for adding user to account · f2757d9a
  Morris Jette authored Sep 29, 2016
  
  f2757d9a
- Docs - add Mellanox to contributing organizations list. · 71a39bf4
  Tim Wickberg authored Sep 29, 2016
  
  71a39bf4
- Docs - update SchedMD mailing address. · a561f386
  Tim Wickberg authored Sep 29, 2016
  
  a561f386
- mpi/pmix: provide temp directory for the application. · 901aeeb3
  Artem Polyakov authored Sep 04, 2016
```
Bug 3051.
```
  901aeeb3
- Ignore changes to GRES/QOS when the value is the same as before. · c11e092a
  Tim Wickberg authored Sep 29, 2016
```
Otherwise updates would be rejected for running jobs even if there
would be no impact. Most common when the job_submit plugin is overriding
QOS/GRES values on everything; without this change an update to "comment"
or other fields would fail with ESLURM_JOB_NOT_PENDING.

Bug 3117.
```
  c11e092a
- Cray - add NHC_ABSOLUTELY_NO to SelectTypeParameters. · 7a379920
  Tim Wickberg authored Sep 29, 2016
```
Never run NHC, not even in the edge case where NHC_NO would still launch NHC
afterwards.

Bug 3105.
```
  7a379920
- Prevent update to last_part_update by partition uid access updates when no change made. · b9af7b9c
  Tim Wickberg authored Sep 29, 2016
```
Switch to list_for_each, and check if access list actually changed
after each update before updating last_part_update. This prevents
the backfill scheduler from resetting mid-cycle unnecessarily.

Bug 3123.
```
  b9af7b9c
- Correct buffer allocation size · 7e08feae
  Josko Plazonic authored Sep 28, 2016
```
Buffer was being allocated based upon the size of the wrong structure.
The buffer would have been larger than required, so this bug
should not have caused any failures.
bug 3127
```
  7e08feae
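
The partition access-list change in b9af7b9c above follows a common pattern: recompute a derived value, compare it with the stored copy, and move the "last update" timestamp only when something really differs. A minimal Python sketch of that idea follows. Slurm itself is written in C, and the `Partition` class, the `refresh_allowed_uids` function, and all field names here are hypothetical illustrations, not Slurm's actual data structures or code.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical stand-in for a partition record (not Slurm's struct)."""
    name: str
    allow_groups: list
    allowed_uids: list = field(default_factory=list)


def refresh_allowed_uids(partitions, group_members, last_update, now=None):
    """Rebuild each partition's uid list and return the "last update" time.

    The timestamp only moves forward if at least one uid list actually
    changed, so a consumer polling it is not disturbed by no-op refreshes.
    """
    changed = False
    for part in partitions:
        new_uids = sorted(
            {uid for grp in part.allow_groups for uid in group_members.get(grp, [])}
        )
        if new_uids != part.allowed_uids:
            part.allowed_uids = new_uids
            changed = True
    if not changed:
        return last_update
    return time.time() if now is None else now


parts = [Partition("debug", ["staff"])]
groups = {"staff": [1001, 1002]}
# First refresh fills the list, so the timestamp advances.
t1 = refresh_allowed_uids(parts, groups, last_update=0.0, now=100.0)
# Second refresh finds nothing new, so the timestamp is left alone.
t2 = refresh_allowed_uids(parts, groups, last_update=t1, now=200.0)
```

In this sketch a scheduler that restarts its cycle whenever the timestamp moves would restart after the first refresh but not after the second.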
28 Sep, 2016 2 commits

Fix test to work with DefMemNode configured · 494814e2

Morris Jette authored Sep 28, 2016

Add memory limits to the job and step. Without these, only one
  step may be able to run at a time, which breaks the test.

494814e2

Add configuration parameter to control default wait-all-nodes option · 68299d7d

Morris Jette authored Sep 27, 2016

Add "sbatch_wait_nodes" to SchedulerParameters to control default sbatch
    behaviour with respect to waiting for all allocated nodes to be ready for
    use. A job can override the configuration option using the --wait-all-nodes=#
    option.
bug 3120

68299d7d

27 Sep, 2016 3 commits

Validate salloc/sbatch --wait-all-nodes argument · afffca2a

Morris Jette authored Sep 27, 2016

Prior logic would treat an execute line like this:
$ sbatch --wait-all-nodes -N3 tmp
with "-N3" as being the argument to the "--wait-all-nodes" option.
See bug 3120

afffca2a
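
As a rough illustration of the validation described in afffca2a, the sketch below rejects any `--wait-all-nodes` value other than `0` or `1` (the two values documented in the sbatch man page), so a stray `-N3` is reported instead of being silently consumed as the argument. It is written in Python for brevity. The function names `parse_wait_all_nodes` and `try_parse` are hypothetical, and this is not Slurm's C implementation.

```python
def parse_wait_all_nodes(value):
    """Return 0 or 1 for a --wait-all-nodes value, else raise ValueError."""
    if value not in ("0", "1"):
        raise ValueError(f"invalid --wait-all-nodes value: {value!r}")
    return int(value)


def try_parse(value):
    """Demo helper: return the parsed value, or the error text on failure."""
    try:
        return parse_wait_all_nodes(value)
    except ValueError as exc:
        return str(exc)


ok = try_parse("1")     # a valid value is accepted
bad = try_parse("-N3")  # the next option is not mistaken for a value
```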

Add job option --use-min-nodes · a5686458

Morris Jette authored Sep 27, 2016

Add salloc/sbatch/srun option --use-min-nodes to prefer smaller node counts
    when a range of node counts is specified (e.g. "-N 2-4").
bug 2996

a5686458

Docs - change html page titles to 'Slurm Workload Manager' · 0763d8a3
Tim Wickberg authored Sep 27, 2016
```
Switch a few SLURM mentions for Slurm as well.
```
0763d8a3

26 Sep, 2016 6 commits
- Correct source of error message, correct function name · 4cb084a3
  Morris Jette authored Sep 26, 2016
  
  4cb084a3
- Move --partition into right place in salloc man page · 3c5e5115
  Morris Jette authored Sep 26, 2016
```
It was out of alphabetic order before (e.g. after --power).
```
  3c5e5115
- Add salloc/sbatch/srun --priority=top option · 62b9884f
  Morris Jette authored Sep 26, 2016
```
Add salloc/sbatch/srun --priority option of "TOP" to set job priority to
    the highest possible value. This option is only available to Slurm operators
    and administrators.
bug 3115
```
  62b9884f
- Change error message to info("warning: .." · 808f8cf2
  Morris Jette authored Sep 26, 2016
```
The problem reported is just a configuration warning and not an
 error. Also change the test from ">=" to ">".
bug 3086
```
  808f8cf2
- Correct KNL web page NUMA description · af49ca7d
  Morris Jette authored Sep 25, 2016
  
  af49ca7d
- Fix CPU mapping for heterogeneous NUMA · ade1be4f
  Morris Jette authored Sep 25, 2016
```
This patch finally resolves absolute/relative CPU mapping for nodes
where the NUMA (or sockets) have different core counts (e.g. KNL SNC4).
```
  ade1be4f
25 Sep, 2016 1 commit
- Fix typo (variable name) in last commit · 11332b5b
  Morris Jette authored Sep 25, 2016
  
  11332b5b
24 Sep, 2016 2 commits
- Task affinity fix for KNL · d3dad218
  Morris Jette authored Sep 23, 2016
  
  d3dad218
- Correct KNL task affinity logic · 40d8672b
  Morris Jette authored Sep 23, 2016
```
bug 3090
```
  40d8672b
23 Sep, 2016 2 commits
- Update SLUG16 agenda to add CSCS site report. · 48480db3
  Tim Wickberg authored Sep 23, 2016
  
  48480db3
- Fix cray requeue bug while step cleaning · fbbc4bdb
  Morris Jette authored Sep 22, 2016
```
Make sure no attempt is made to schedule a requeued job until all steps are
    cleaned (Node Health Check completes for all steps on a Cray).
bug 3082
```
  fbbc4bdb
22 Sep, 2016 10 commits
- BlueGene - scale node count when enforcing MaxNodes limit on a partition. · cb7ed937
  Dominik Bartkiewicz authored Sep 22, 2016
```
Otherwise the limit check compares the node count against the midplane count.

Bug 3049.
```
  cb7ed937
- Testsuite - fix test1.83 to handle gaps in node name with --contiguous. · f9cbc9c4
  Alejandro Sanchez authored Sep 22, 2016
```
Check if node names are contiguous with respect to the node list
assigned to the partition, rather than just monotonically increasing.

Bug 3006.
```
  f9cbc9c4
- Ignore rc of xcgroup_delete when cleaning up · 2279d37b
  Brian Christiansen authored Aug 02, 2016
```
xcgroup_delete will print a log message if there is a problem.
```
  2279d37b
- Remove cpuset & devices cgroups after steps finish · 66beca68
  Janne Blomqvist authored Jul 29, 2016
```
Bugs 2681 and 2703

Conflicts:
	NEWS
```
  66beca68
- Fix variable used in sizeof function · 630afd8f
  Morris Jette authored Sep 22, 2016
  
  630afd8f
- [PATCH 2/2] drop call to PMI2_Job_GetId from testpmixring.c · cb572e38
  Adam Moody authored Sep 22, 2016
  
  cb572e38
- [PATCH 1/2] define PMIX_Ring behavior in singleton mode · a4155bbd
  Adam Moody authored Sep 22, 2016
  
  a4155bbd
- Fix typos in docs and "sacct --help". · fbac3e4b
  Gennaro Oliva authored Sep 22, 2016
  
  fbac3e4b
- Updates to KNL web page · a36b4035
  Morris Jette authored Sep 21, 2016
  
  a36b4035
- Fix squeue filter by job license when a job has requested more than 1 · cbd1ffad
  Alejandro Sanchez authored Sep 21, 2016
```
license of a certain type.
```
  cbd1ffad
21 Sep, 2016 5 commits

Docs - fix two typos. · f835d8b6
Tim Wickberg authored Sep 21, 2016

f835d8b6

Increase default CapmcTimeout from 10 to 60 seconds · 1fe5c7cc

Morris Jette authored Sep 21, 2016

node_features/knl_cray plugin: Increase default CapmcTimeout parameter from
    10 to 60 seconds.
bug 3100

1fe5c7cc

capmc_suspend/resume fail rather than operate on individual nodes · f07482fd

Morris Jette authored Sep 20, 2016

capmc_suspend/resume - If a request to modify NUMA or MCDRAM state on a set of
    nodes, or to reboot a set of nodes, fails, then just requeue the job and abort
    the entire operation rather than trying to operate on individual nodes.
bug 3100

f07482fd

Allow clearing of node PowerUp state flag · 61b14031
Morris Jette authored Sep 20, 2016
```
Allow a node's PowerUp state flag to be cleared using update_node RPC.
bug 3100
```
61b14031

Pass SLURM_JOB_ID to ResumeProgram · 45ca1dd4

Morris Jette authored Sep 20, 2016

When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode)
    then pass to the ResumeProgram the job ID assigned to the nodes in the
    SLURM_JOB_ID environment variable.
bug 3100

45ca1dd4

20 Sep, 2016 1 commit

Remove error message on valid state · 85c136bc

Morris Jette authored Sep 19, 2016

Don't log error for job end_time being zero if node health check is still
    running.
bug 3053

85c136bc