Commits · c6217cdef8a9e3990acc5a584bdf1f7c0d419a43 · Manuel G. Marciani / ces_slurm_simulator

07 Oct, 2016 2 commits

Optimize allocation for --spread-job job option · c6217cde

Morris Jette authored Oct 07, 2016

Previous logic would result in sub-optimal job allocations and
  under some conditions invalid memory references resulting in
  the slurmctld daemon crashing or aborting.
bug 2732

c6217cde

Update NEWS for commit 0e5b2786 · badf3477
Morris Jette authored Oct 07, 2016

badf3477

06 Oct, 2016 5 commits
- Fix waiting reason if a job is waiting for a specific limit instead of always just AccountingPolicy. · 8983e920
  Danny Auble authored Oct 06, 2016
  
  8983e920
- Fix configuring pmix support when you have your lib dir symlinked to lib64. · daf91a80
  Danny Auble authored Oct 06, 2016
  
  daf91a80
- Fix typo when an error occurs when discovering pmix version on configure. · d75f49e3
  Danny Auble authored Oct 06, 2016
  
  d75f49e3
- Added srun/salloc/sbatch tests with --use-min-nodes option. · 37161057
  Alejandro Sanchez authored Oct 06, 2016
```
Bug 3124.
```
  37161057
- node_features plugin - add "mode" argument · 85e2326e
  Morris Jette authored Oct 05, 2016
```
node_features plugin - Add "mode" argument to node_features_p_node_xlate()
    function to fix some bugs updating a node's features using the node update
    RPC. Without this change it is impossible to clear the active features of
    a node or reset non-KNL node features.
```
  85e2326e
05 Oct, 2016 6 commits

node_features/knl_cray: Validate nodes at start and reconfig · 397752b3

Morris Jette authored Oct 04, 2016

node_features/knl_cray plugin: drain any node not reported by
    "capmc node_status" on startup or reconfig. Also re-tests
    on failed node restart for job.

397752b3

Remove KNL features from non-KNL node · 75049242

Morris Jette authored Oct 04, 2016

node_features/knl_cray plugin: Remove any KNL MCDRAM or NUMA features from
node's configuration if capmc does NOT report the node as being KNL.
For example, we don't want a non-KNL node with features="quad,cache".

75049242

add knl.conf parameter CapmcRetries · 9218a40f

Morris Jette authored Oct 04, 2016

Add new knl.conf configuration parameter CapmcRetries
Modify capmc_suspend and capmc_resume to retry operations when
  Cray State Manager is down.
Add retry logic to node_features/knl_cray to handle Cray State
  Manager being down.
bug 3100

9218a40f
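
The retry behaviour described for CapmcRetries can be pictured with a minimal Python sketch. This is not the plugin's code (which is written in C): `run_with_retries`, the `capmc_retries` parameter name, and the fixed pause between attempts are assumptions made for illustration. Only the idea of retrying a capmc operation while the Cray State Manager is down comes from the commit message.

```python
import time


def run_with_retries(operation, capmc_retries=4, pause_seconds=0.0):
    """Call operation() until it succeeds or the retry budget is spent.

    operation is any callable returning True on success. capmc_retries is
    a hypothetical stand-in for the CapmcRetries setting.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while attempts <= capmc_retries:  # one initial try plus capmc_retries retries
        attempts += 1
        if operation():
            return True, attempts
        if attempts <= capmc_retries:
            # Give the (assumed) state manager time to come back.
            time.sleep(pause_seconds)
    return False, attempts
```

With `capmc_retries=2`, an operation that never succeeds is attempted three times before the caller gets `(False, 3)` and can decide how to handle the node.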

node_features/knl_cray plugin: streamline node update · db23662d

Morris Jette authored Oct 04, 2016

node_features/knl_cray plugin: Substantially streamline and speed up logic
to load current node state on reconfigure failure or unexpected node boot.
Completely eliminate capmc calls and just use cnselect to load current
node mode information.

db23662d

node_features/knl_cray: Validate nodes at start and reconfig · 59b118bf

Morris Jette authored Oct 04, 2016

node_features/knl_cray plugin: drain any node not reported by
    "capmc node_status" on startup or reconfig. Also re-tests
    on failed node restart for job.

59b118bf

Remove KNL features from non-KNL node · 38f072ed

Morris Jette authored Oct 04, 2016

38f072ed

04 Oct, 2016 1 commit

add knl.conf parameter CapmcRetries · 5cb90497

Morris Jette authored Oct 04, 2016

Add new knl.conf configuration parameter CapmcRetries
Modify capmc_suspend and capmc_resume to retry operations when
  Cray State Manager is down.
Add retry logic to node_features/knl_cray to handle Cray State
  Manager being down.
bug 3100

5cb90497

03 Oct, 2016 1 commit
- Fix sbatch wait short option 'W' Bug 3137 · 53d456ae
  Dominik Bartkiewicz authored Oct 03, 2016
  
  53d456ae
30 Sep, 2016 8 commits
- Add client side range checks for --nice values. · 2aefc66b
  Alejandro Sanchez authored Sep 30, 2016
```
Otherwise they'll truncate when packed into the RPC and end up
as some bizarre value at the controller.

Bug 3098.
```
  2aefc66b
- Search for pending jobs as well with 'sacctmgr show RunawayJobs'. · 00735315
  Dominik Bartkiewicz authored Sep 30, 2016
```
Set completed time for pending/running runaway jobs to the max of
(start, eligible, submit) times.

Bug 3075
```
  00735315
- Add new SchedulerParameters option · 63c2214b
  Morris Jette authored Sep 30, 2016
```
Added new SchedulerParameters options step_retry_count and step_retry_time
    to control scheduling behaviour of job steps waiting for resources.
bug 3121
```
  63c2214b
- Rewrite srun pending step logic · 8f2ef7a5
  Morris Jette authored Sep 30, 2016
```
This change can reduce the number of RPCs sent by srun by up to 50%.
```
  8f2ef7a5
- mpi/pmix - improve the performance of point to point communication · ee4a9776
  Artem Polyakov authored Sep 30, 2016
```
Avoid using slurm_forward_data because it causes thread spawn that
introduces unwanted delays.

Bug 3102.
```
  ee4a9776
- Docs - correct default value for GroupUpdateForce is 0. · 55555e8a
  Tim Wickberg authored Sep 30, 2016
  
  55555e8a
- Avoid creating duplicate pending step records · 3a51703e
  Morris Jette authored Sep 29, 2016
```
This can happen due to signal/retry logic. The result is
  vestigial srun ports that are no longer valid, with slurmctld
  sending messages to those invalid ports to notify them about available
  resources as other steps complete.
```
  3a51703e
- Add srun host & PID to job step data structures · 5a95e7ff
  Morris Jette authored Sep 29, 2016
  
  5a95e7ff
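
The client-side range check for --nice values (2aefc66b, in the 30 Sep list above) can be sketched as follows. The bounds `NICE_MIN` and `NICE_MAX` and the function name `parse_nice` are hypothetical. The real limits depend on the Slurm version and on the width of the field that carries the value in the RPC.

```python
# Hypothetical limits, chosen only for illustration.
NICE_MIN = -10000
NICE_MAX = 10000


def parse_nice(text, nice_min=NICE_MIN, nice_max=NICE_MAX):
    """Parse a --nice argument and reject values outside the range."""
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"invalid nice value: {text!r}") from None
    if not nice_min <= value <= nice_max:
        raise ValueError(
            f"nice value {value} out of range [{nice_min}, {nice_max}]"
        )
    return value
```

Rejecting the value before it is packed avoids the truncation the commit message warns about: if the field were 16 bits wide, for instance, 70000 would arrive at the controller as 4464.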
29 Sep, 2016 9 commits
- Update NEWS for start of v17.02.0-pre3 · 513b23ba
  Morris Jette authored Sep 29, 2016
  
  513b23ba
- Start NEWS for v16.05.6 release · 1a862096
  Morris Jette authored Sep 29, 2016
  
  1a862096
- Fix display for nice and priority values in sprio/scontrol/squeue. · df70b651
  Alejandro Sanchez authored Sep 29, 2016
```
Also correct the value of NICE_OFFSET used within the perl API.

Bug 3098.
```
  df70b651
- mpi/pmix: provide temp directory for the application. · 901aeeb3
  Artem Polyakov authored Sep 04, 2016
```
Bug 3051.
```
  901aeeb3
- Ignore changes to GRES/QOS when the value is the same as before. · c11e092a
  Tim Wickberg authored Sep 29, 2016
```
Otherwise updates would be rejected for running jobs even if there
would be no impact. Most common when the job_submit plugin is overriding
QOS/GRES values on everything; without this change an update to "comment"
or other fields would fail with ESLURM_JOB_NOT_PENDING.

Bug 3117.
```
  c11e092a
- Cray - add NHC_ABSOLUTELY_NO to SelectTypeParameters. · 7a379920
  Tim Wickberg authored Sep 29, 2016
```
Never run NHC, even in the edge cases where NHC_NO would still launch NHC
afterwards.

Bug 3105.
```
  7a379920
- Prevent update to last_part_update by partition uid access updates when no change made. · b9af7b9c
  Tim Wickberg authored Sep 29, 2016
```
Switch to list_for_each, and check if access list actually changed
after each update before updating last_part_update. This prevents
the backfill scheduler from resetting mid-cycle unnecessarily.

Bug 3123.
```
  b9af7b9c
- Prevent update to last_part_update by partition uid access updates when no change made. · 9680c9c1
  Tim Wickberg authored Sep 29, 2016
```
Switch to list_for_each, and check if access list actually changed
after each update before updating last_part_update. This prevents
the backfill scheduler from resetting mid-cycle unnecessarily.

Bug 3123.
```
  9680c9c1
- Add protocol version to slurmstepd startup message · d7c30f44
  Morris Jette authored Sep 28, 2016
```
Add protocol version to slurmd startup communications for slurmstepd to
    permit changes in the protocol.
```
  d7c30f44
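
The last_part_update change (b9af7b9c / 9680c9c1, in the 29 Sep list above) amounts to advancing a timestamp only when something really changed. A minimal sketch of that pattern is shown here. The `Partition` class, `refresh_access_lists`, and the `resolve_uids` callback are hypothetical stand-ins, not Slurm's data structures.

```python
import time


class Partition:
    """Hypothetical stand-in for a partition record."""

    def __init__(self, name, allowed_uids=()):
        self.name = name
        self.allowed_uids = frozenset(allowed_uids)


def refresh_access_lists(partitions, resolve_uids, last_part_update, now=None):
    """Recompute each partition's uid access list.

    Returns the timestamp to store as last_part_update: it only moves
    forward when at least one access list actually changed.
    """
    changed = False
    for part in partitions:
        new_uids = frozenset(resolve_uids(part.name))
        if new_uids != part.allowed_uids:
            part.allowed_uids = new_uids
            changed = True
    if changed:
        return time.time() if now is None else now
    return last_part_update
```

A consumer that restarts its cycle whenever the timestamp moves, as the commit message says the backfill scheduler does, is then left alone by refreshes that change nothing.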
28 Sep, 2016 3 commits

Add task launch flag field · 3251406c

Morris Jette authored Sep 28, 2016

Add "flag" field to launch_tasks_request_msg. Remove the following fields
    (moved into flags): multi_prog, task_flags, user_managed_io, pty,
    buffered_stdio, and labelio. More flags to be added later.

3251406c
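
Folding several boolean fields into one flags field, as this commit does for launch_tasks_request_msg, can be illustrated with a short Python sketch. The `LaunchFlags` name and the bit assignments are hypothetical, since the real values are defined in Slurm's C headers. Only the field names come from the commit message.

```python
from enum import IntFlag


class LaunchFlags(IntFlag):
    """Hypothetical bit assignments, one bit per former boolean field."""
    MULTI_PROG = 1 << 0
    USER_MANAGED_IO = 1 << 1
    PTY = 1 << 2
    BUFFERED_STDIO = 1 << 3
    LABELIO = 1 << 4


def pack_flags(**options):
    """Combine keyword booleans (e.g. pty=True) into a single flags value."""
    flags = LaunchFlags(0)
    for name, enabled in options.items():
        if enabled:
            flags |= LaunchFlags[name.upper()]
    return flags
```

For example, `pack_flags(pty=True, labelio=True)` sets two bits in one integer, and `LaunchFlags.PTY in flags` tests a single option. Adding a new option later costs one more bit rather than another message field.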

Remove BlueGene/L and BlueGene/P support. · b818dd9d
Tim Wickberg authored Sep 28, 2016
```
Remove from build system, and delete L/P specific files.
Run autogen.sh as well.
```
b818dd9d

Add configuration parameter to control default wait-all-nodes option · 68299d7d

Morris Jette authored Sep 27, 2016

Add "sbatch_wait_nodes" to SchedulerParameters to control default sbatch
    behaviour with respect to waiting for all allocated nodes to be ready for
    use. Job can override the configuration option using the --wait-all-nodes=#
    option.
bug 3120

68299d7d

27 Sep, 2016 2 commits

Validate salloc/sbatch --wait-all-nodes argument · afffca2a

Morris Jette authored Sep 27, 2016

Prior logic would treat an execute line like this:
$ sbatch --wait-all-nodes -N3 tmp
as if "-N3" were the argument to the "--wait-all-nodes" option.
See bug 3120

afffca2a
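
A sketch of the kind of validation this commit describes, assuming the option accepts only the values 0 and 1. The function name `parse_wait_all_nodes` is hypothetical, and the actual check in salloc/sbatch is written in C.

```python
def parse_wait_all_nodes(arg):
    """Return 0 or 1, or raise ValueError for anything else (such as '-N3')."""
    if arg not in ("0", "1"):
        raise ValueError(f"invalid --wait-all-nodes value: {arg!r}")
    return int(arg)
```

With a check like this, the "-N3" in the command line above is rejected instead of being silently consumed as the option's value.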

Add job option --use-min-nodes · a5686458

Morris Jette authored Sep 27, 2016

Add salloc/sbatch/srun option --use-min-nodes to prefer smaller node counts
    when a range of node counts is specified (e.g. "-N 2-4").
bug 2996

a5686458
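
The effect of --use-min-nodes on a node range such as "-N 2-4" can be pictured with a simplified sketch. It assumes that, without the option, a job takes as many nodes as are available up to the maximum. `parse_node_range` and `pick_node_count` are hypothetical helpers, and the real selection logic is considerably more involved.

```python
def parse_node_range(spec):
    """Parse text such as '3' or '2-4' into (min_nodes, max_nodes)."""
    low, sep, high = spec.partition("-")
    min_nodes = int(low)
    max_nodes = int(high) if sep else min_nodes
    if min_nodes < 1 or max_nodes < min_nodes:
        raise ValueError(f"invalid node range: {spec!r}")
    return min_nodes, max_nodes


def pick_node_count(spec, idle_nodes, use_min_nodes=False):
    """Choose a node count from the range, given idle_nodes available nodes.

    Returns None when even the minimum cannot be satisfied.
    """
    min_nodes, max_nodes = parse_node_range(spec)
    if idle_nodes < min_nodes:
        return None
    if use_min_nodes:
        return min_nodes  # prefer the smallest count in the range
    return min(max_nodes, idle_nodes)  # assumed default: take as many as possible
```

Under these assumptions, a request for "2-4" nodes with ten idle nodes gets four nodes by default and two with `use_min_nodes=True`.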

26 Sep, 2016 1 commit

Add salloc/sbatch/srun --priority=top option · 62b9884f

Morris Jette authored Sep 26, 2016

Add salloc/sbatch/srun --priority option of "TOP" to set job priority to
    the highest possible value. This option is only available to Slurm operators
    and administrators.
bug 3115

62b9884f

24 Sep, 2016 2 commits
- Correct KNL task affinity logic · 40d8672b
  Morris Jette authored Sep 23, 2016
```
bug 3090
```
  40d8672b
- Fix cray requeue bug while step cleaning · 1a3cc77a
  Morris Jette authored Sep 22, 2016
```
Make sure no attempt is made to schedule a requeued job until all steps are
    cleaned (Node Health Check completes for all steps on a Cray).
bug 3082
```
  1a3cc77a