Commits · 38f072ed87d1150fabc37dd04ac5daf20bdb472b · Manuel G. Marciani / ces_slurm_simulator

05 Oct, 2016 1 commit

Remove KNL features from non-KNL node · 38f072ed

Morris Jette authored Oct 04, 2016

node_features/knl_cray plugin: Remove any KNL MCDRAM or NUMA features from
node's configuration if capmc does NOT report the node as being KNL.
For example, we don't want a non-KNL node with features="quad,cache".

38f072ed
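The filtering described in this commit can be sketched as follows. This is a minimal illustration, not the plugin's code: the function name and the mode sets are assumptions. Only "quad" and "cache" appear in the commit message.

```python
# Illustrative mode names only: "quad" and "cache" come from the commit
# message; the remaining entries are assumed for this sketch.
KNL_MCDRAM_MODES = {"cache", "equal", "flat"}
KNL_NUMA_MODES = {"a2a", "hemi", "quad", "snc2", "snc4"}


def strip_knl_features(features: str, is_knl: bool) -> str:
    """Return the feature string with KNL modes removed for non-KNL nodes."""
    if is_knl:
        return features
    knl_modes = KNL_MCDRAM_MODES | KNL_NUMA_MODES
    # Keep every configured feature that is not a KNL MCDRAM/NUMA mode.
    kept = [f for f in features.split(",") if f and f not in knl_modes]
    return ",".join(kept)
```

Here `is_knl` stands in for whatever capmc reports about the node. Features that are not KNL modes pass through unchanged.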

04 Oct, 2016 1 commit

add knl.conf parameter CapmcRetries · 5cb90497

Morris Jette authored Oct 04, 2016

Add new knl.conf configuration parameter CapmcRetries
Modify capmc_suspend and capmc_resume to retry operations when
  Cray State Manager is down.
Add retry logic to node_features/knl_cray to handle Cray State
  Manager being down.
bug 3100

5cb90497
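A retry loop of the kind this commit describes might look like the sketch below. The callable `operation` is a hypothetical stand-in for one capmc invocation. Treating the retry count as "retries in addition to the first attempt" is an assumption, not something the commit message states.

```python
import time
from typing import Callable


def run_with_retries(operation: Callable[[], bool], retries: int,
                     delay_seconds: float = 0.0) -> bool:
    """Call operation until it succeeds or the retry budget is spent.

    operation is a hypothetical stand-in for one capmc call; it returns
    True on success and False when the state manager is unavailable.
    """
    attempts = retries + 1  # assumed: first try plus `retries` retries
    for attempt in range(attempts):
        if operation():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # pause before the next attempt
    return False
```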

03 Oct, 2016 1 commit
- Fix sbatch wait short option 'W' Bug 3137 · 53d456ae
  Dominik Bartkiewicz authored Oct 03, 2016
  
  53d456ae
30 Sep, 2016 4 commits
- Add client side range checks for --nice values. · 2aefc66b
  Alejandro Sanchez authored Sep 30, 2016
```
Otherwise they'll truncate when packed into the RPC and end up
as some bizarre value at the controller.

Bug 3098.
```
  2aefc66b
- Search for pending jobs as well with 'sacctmgr show RunawayJobs'. · 00735315
  Dominik Bartkiewicz authored Sep 30, 2016
```
Set completed time for pending/running runaway jobs to the max of
(start, eligible, submit) times.

Bug 3075
```
  00735315
- mpi/pmix - improve the performance of point to point communication · ee4a9776
  Artem Polyakov authored Sep 30, 2016
```
Avoid using slurm_forward_data because it causes thread spawn that
introduces unwanted delays.

Bug 3102.
```
  ee4a9776
- Docs - correct default value for GroupUpdateForce is 0. · 55555e8a
  Tim Wickberg authored Sep 30, 2016
  
  55555e8a
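The rule from commit 00735315 above (completed time set to the max of the start, eligible and submit times) reduces to a one-line computation. In this sketch the function name is hypothetical, and using epoch seconds with 0 for "not set" is an assumption.

```python
def runaway_end_time(start: int, eligible: int, submit: int) -> int:
    """Pick a completion time for a runaway job that has none recorded.

    Times are epoch seconds; 0 is assumed to mean "not set", so a pending
    job that never started falls back to its eligible or submit time.
    """
    return max(start, eligible, submit)
```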
29 Sep, 2016 6 commits
- Start NEWS for v16.05.6 release · 1a862096
  Morris Jette authored Sep 29, 2016
  
  1a862096
- Fix display for nice and priority values in sprio/scontrol/squeue. · df70b651
  Alejandro Sanchez authored Sep 29, 2016
```
Also correct the value of NICE_OFFSET used within the perl API.

Bug 3098.
```
  df70b651
- mpi/pmix: provide temp directory for the application. · 901aeeb3
  Artem Polyakov authored Sep 04, 2016
```
Bug 3051.
```
  901aeeb3
- Ignore changes to GRES/QOS when the value is the same as before. · c11e092a
  Tim Wickberg authored Sep 29, 2016
```
Otherwise updates would be rejected for running jobs even if there
would be no impact. Most common when the job_submit plugin is overriding
QOS/GRES values on everything; without this change an update to "comment"
or other fields would fail with ESLURM_JOB_NOT_PENDING.

Bug 3117.
```
  c11e092a
- Cray - add NHC_ABSOLUTELY_NO to SelectTypeParameters. · 7a379920
  Tim Wickberg authored Sep 29, 2016
```
Never run NHC, even in the edge case where NHC_NO would still launch NHC
afterwards.

Bug 3105.
```
  7a379920
- Prevent update to last_part_update by partition uid access updates when no change made. · b9af7b9c
  Tim Wickberg authored Sep 29, 2016
```
Switch to list_for_each, and check if access list actually changed
after each update before updating last_part_update. This prevents
the backfill scheduler from resetting mid-cycle unnecessarily.

Bug 3123.
```
  b9af7b9c
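The idea in commit b9af7b9c above, touching the timestamp only when an access list really changed, can be sketched like this. The class, its fields and the `lookup_uids` callback are hypothetical; only the name `last_part_update` is taken from the commit.

```python
class PartitionTable:
    """Toy model of partitions and their allowed-uid lists."""

    def __init__(self, access):
        # access maps partition name -> list of allowed uids
        self.access = {name: list(uids) for name, uids in access.items()}
        self.last_part_update = 0.0

    def refresh(self, lookup_uids, now):
        """Recompute uid lists; bump the timestamp only if one changed."""
        changed = False
        for name in self.access:
            new_uids = list(lookup_uids(name))
            if new_uids != self.access[name]:
                self.access[name] = new_uids
                changed = True
        if changed:
            self.last_part_update = now
        return changed
```

A refresh that finds identical lists leaves `last_part_update` untouched, so anything keyed on that timestamp (the backfill scheduler, in the commit's case) has no reason to restart.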
28 Sep, 2016 1 commit

Add configuration parameter to control default wait-all-nodes option · 68299d7d

Morris Jette authored Sep 27, 2016

Add "sbatch_wait_nodes" to SchedulerParameters to control default sbatch
    behaviour with respect to waiting for all allocated nodes to be ready for
    use. Job can override the configuration option using the --wait-all-nodes=#
    option.
bug 3120

68299d7d
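How the configured default and the per-job option interact can be sketched as below. The function name is hypothetical. The sketch assumes that the default is 0 (do not wait) when "sbatch_wait_nodes" is absent and that SchedulerParameters is a comma-separated string.

```python
from typing import Optional


def effective_wait_all_nodes(scheduler_parameters: str,
                             job_option: Optional[int]) -> int:
    """Resolve whether sbatch waits for all nodes (1) or not (0)."""
    if job_option is not None:
        # --wait-all-nodes=# on the job overrides the configuration.
        return job_option
    params = [p.strip() for p in scheduler_parameters.split(",")]
    # Assumed default of 0 when the parameter is not configured.
    return 1 if "sbatch_wait_nodes" in params else 0
```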

27 Sep, 2016 2 commits

Validate salloc/sbatch --wait-all-nodes argument · afffca2a

Morris Jette authored Sep 27, 2016

Prior logic would treat an execute line like this:
$ sbatch --wait-all-nodes -N3 tmp
with "-N3" as being the argument to the "--wait-all-nodes" option.
See bug 3120

afffca2a
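A validation step of this kind might be sketched as follows. The function name is hypothetical, and the sketch assumes the only valid arguments are "0" and "1".

```python
def parse_wait_all_nodes(value: str) -> int:
    """Accept only "0" or "1"; anything else (e.g. "-N3") is an error."""
    if value not in ("0", "1"):
        raise ValueError(f"invalid --wait-all-nodes value: {value!r}")
    return int(value)
```

With this check, a stray "-N3" following the option is rejected rather than silently consumed as its argument.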

Add job option --use-min-nodes · a5686458

Morris Jette authored Sep 27, 2016

Add salloc/sbatch/srun option --use-min-nodes to prefer smaller node counts
    when a range of node counts is specified (e.g. "-N 2-4").
bug 2996

a5686458

26 Sep, 2016 1 commit

Add salloc/sbatch/srun --priority=top option · 62b9884f

Morris Jette authored Sep 26, 2016

Add salloc/sbatch/srun --priority option of "TOP" to set job priority to
    the highest possible value. This option is only available to Slurm operators
    and administrators.
bug 3115

62b9884f

24 Sep, 2016 1 commit
- Correct KNL task affinity logic · 40d8672b
  Morris Jette authored Sep 23, 2016
```
bug 3090
```
  40d8672b
23 Sep, 2016 1 commit

Fix cray requeue bug while step cleaning · fbbc4bdb

Morris Jette authored Sep 22, 2016

Make sure no attempt is made to schedule a requeued job until all steps are
    cleaned (Node Health Check completes for all steps on a Cray).
bug 3082

fbbc4bdb

22 Sep, 2016 5 commits
- BlueGene - scale node count when enforcing MaxNodes limit on a partition. · cb7ed937
  Dominik Bartkiewicz authored Sep 22, 2016
```
Otherwise the limit is checking the node count against the midplane count.

Bug 3049.
```
  cb7ed937
- Testsuite - fix test1.83 to handle gaps in node name with --contiguous. · f9cbc9c4
  Alejandro Sanchez authored Sep 22, 2016
```
Check if node names are contiguous with respect to the node list
assigned to the partition, rather than just monotonically increasing.

Bug 3006.
```
  f9cbc9c4
- Remove cpuset & devices cgroups after steps finish · 66beca68
  Janne Blomqvist authored Jul 29, 2016
```
Bugs 2681 and 2703

Conflicts:
	NEWS
```
  66beca68
- [PATCH 2/2] drop call to PMI2_Job_GetId from testpmixring.c · cb572e38
  Adam Moody authored Sep 22, 2016
  
  cb572e38
- Fix squeue filter by job license when a job has requested more than 1 · cbd1ffad
  Alejandro Sanchez authored Sep 21, 2016
```
license of a certain type.
```
  cbd1ffad
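The check described in commit f9cbc9c4 above (contiguity relative to the partition's node list, not to the node names) can be sketched as below. The function name and list-based inputs are assumptions and do not reflect how the Expect test is actually written.

```python
def is_contiguous(assigned, partition_nodes):
    """True if `assigned` occupies consecutive slots in `partition_nodes`.

    Raises ValueError if an assigned node is not in the partition list.
    """
    positions = sorted(partition_nodes.index(n) for n in assigned)
    # Adjacent positions must differ by exactly one.
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))
```

For a partition holding n1, n2, n5 and n6, the allocation n2,n5 counts as contiguous even though the names skip n3 and n4.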
21 Sep, 2016 4 commits

Increase default CapmcTimeout from 10 to 60 seconds · 1fe5c7cc

Morris Jette authored Sep 21, 2016

node_features/knl_cray plugin: Increase default CapmcTimeout parameter from
    10 to 60 seconds.
bug 3100

1fe5c7cc

capmc_suspend/resume fail rather than operate on individual nodes · f07482fd

Morris Jette authored Sep 20, 2016

capmc_suspend/resume - If a request to modify NUMA or MCDRAM state on a set of
    nodes, or to reboot a set of nodes, fails, then just requeue the job and abort
    the entire operation rather than trying to operate on individual nodes.
bug 3100

f07482fd

Allow clearing of node PowerUp state flag · 61b14031
Morris Jette authored Sep 20, 2016
```
Allow a node's PowerUp state flag to be cleared using update_node RPC.
bug 3100
```
61b14031

Pass SLURM_JOB_ID to ResumeProgram · 45ca1dd4

Morris Jette authored Sep 20, 2016

When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode)
    then pass to the ResumeProgram the job ID assigned to the nodes in the
    SLURM_JOB_ID environment variable.
bug 3100

45ca1dd4
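From the point of view of a resume script, the change could be consumed as in the sketch below. The function name is hypothetical. Passing the node list as the first command-line argument is an assumption of this sketch, and SLURM_JOB_ID is treated as optional because not every power-up is tied to a job.

```python
def describe_resume_request(argv, environ):
    """Summarise what a resume script was asked to do."""
    # Assumed: the node list arrives as the first argument.
    nodes = argv[1] if len(argv) > 1 else ""
    job_id = environ.get("SLURM_JOB_ID")  # absent when no job triggered it
    if job_id is None:
        return f"resume {nodes}"
    return f"resume {nodes} for job {job_id}"
```

A real script would call this with `sys.argv` and `os.environ`.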

20 Sep, 2016 1 commit

Remove error message on valid state · 85c136bc

Morris Jette authored Sep 19, 2016

Don't log error for job end_time being zero if node health check is still
    running.
bug 3053

85c136bc

17 Sep, 2016 1 commit

Restore ability to manually power down nodes · da722a89

Morris Jette authored Sep 16, 2016

Restore ability to manually power down nodes, broken in 15.08.12
in commit b4904661

The patch introduced in commit b4904661 (not powering down dead node) has a bad side effect. Adding the "(node_ptr->last_idle != 0)" condition prevents nodes from being powered down with the following command:

scontrol update nodename=nX state=power_down

because the state update function relies on zeroing the "last_idle" variable when a power_down is requested (see src/slurmctld/node_mgr.c, line 1589).

Reverting this commit should solve the problem...but I let you decide...

Didier GAZEN

da722a89

16 Sep, 2016 1 commit

Update KNL modes for out-of-band reboot · 3a465f80

Morris Jette authored Sep 15, 2016

node_features/knl_cray: If a node is rebooted outside of Slurm's direction,
    update its active features with current MCDRAM and NUMA mode information.
bug 3071

3a465f80

15 Sep, 2016 2 commits
- node_features/knl_cray: Fix MCDRAM state source · bbaa48b2
  Morris Jette authored Sep 15, 2016
```
Fix race condition that could result in MCDRAM state information coming
  from capmc rather than cnselect (used state for next boot rather than
  latest boot).
bug 3080
```
  bbaa48b2
- Docs - assorted spelling corrections. · 09d20dc3
  Nicolas Joly authored Sep 15, 2016
  
  09d20dc3
14 Sep, 2016 2 commits
- Silence srun warning message when intentionally lowering task-per-node count for a step. · daacf5af
  Alejandro Sanchez authored Sep 14, 2016
```
No functional change, just silencing the warning message in this instance.

Bug 3079.
```
  daacf5af
- Docs - mention SLURM_MEM_PER_CPU/NODE environment variables and allowed suffixes. · f62d3e70
  Alejandro Sanchez authored Sep 14, 2016
```
Bug 3073.
```
  f62d3e70
09 Sep, 2016 3 commits
- Modify srun task completion handling · 166e3bec
  Morris Jette authored Sep 09, 2016
```
Modify srun task completion handling to only build the task/node string for
    logging purposes if it is needed. Modified for performance purposes.
bug 3044
```
  166e3bec
- Revert "Fix issue filtering licenses for output with squeue." · 9c4eabed
  Tim Wickberg authored Sep 09, 2016
```
This reverts commit 1ec2a4ae.
```
  9c4eabed
- Fix issue filtering licenses for output with squeue. · 1ec2a4ae
  Alejandro Sanchez authored Sep 09, 2016
```
Bug 3063.
```
  1ec2a4ae
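The optimisation in commit 166e3bec above (build the task/node string only when it will be logged) follows a common pattern, sketched here with Python's standard logging module. The logger name, function name and message format are hypothetical.

```python
import logging

logger = logging.getLogger("srun_sketch")


def log_task_exit(task_ids, node_name, level=logging.DEBUG):
    """Build the task/node string only when the message will be emitted."""
    if not logger.isEnabledFor(level):
        return None  # skip the string construction entirely
    text = f"tasks {','.join(map(str, task_ids))} on {node_name} exited"
    logger.log(level, text)
    return text
```

When many tasks exit at once, skipping the join for suppressed log levels avoids per-task string work that nobody would read.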
08 Sep, 2016 1 commit

Restructure srun task_exit logic · 6b6d4e1a

Morris Jette authored Sep 08, 2016

Restructure srun command locking for task_exit processing logic for improved
  parallelism. This change decreases the amount of time consumed by serial
  logic by 2 orders of magnitude.
bug 3044

6b6d4e1a

07 Sep, 2016 1 commit

Preserve node "RESERVATION" state · 5eee1d28

Morris Jette authored Sep 07, 2016

Preserve node "RESERVATION" state when one of multiple overlapping
    reservations ends. Previous logic would clear the node's
    RESERVATION state flag when any one of the reservations on the
    node ended rather than keeping the node in RESERVATION state
    until the last reservation ended.
bug 3057

5eee1d28
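The corrected behaviour amounts to deriving the flag from the set of reservations still covering the node. The class below is a hypothetical model for illustration, not the controller's data structure.

```python
class Node:
    """Toy node that tracks which reservations currently cover it."""

    def __init__(self):
        self.active_reservations = set()

    @property
    def reserved(self):
        """RESERVATION flag: set while any reservation still covers the node."""
        return bool(self.active_reservations)

    def start_reservation(self, name):
        self.active_reservations.add(name)

    def end_reservation(self, name):
        # Ending one reservation leaves the flag set if others remain.
        self.active_reservations.discard(name)
```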