Commits · d1335bf84f464c25cbcdceaac201c8a8a55cf428 · Manuel G. Marciani / ces_slurm_simulator

18 Aug, 2017 1 commit
- Enable SPANK plugin support on a per-heterogeneous job component basis · d1335bf8
  Morris Jette authored Aug 18, 2017
  
  d1335bf8
17 Aug, 2017 1 commit

Clean up partial step creation · 08bca019

Morris Jette authored Aug 17, 2017

In srun, if only some steps are allocated and one step allocation fails,
 then delete all allocated steps.

08bca019

16 Aug, 2017 2 commits

Set SLURM_NTASKS for pack job · 540804f5

Morris Jette authored Aug 16, 2017

Set SLURM_NTASKS environment variable to reflect global task count
(needed by MPI).

540804f5

Set SLURM_PROCID for pack job · eb319e30

Morris Jette authored Aug 16, 2017

Set SLURM_PROCID environment variable to reflect global task rank
(needed by MPI).

eb319e30
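
The two commits above make SLURM_NTASKS and SLURM_PROCID reflect the job as a whole rather than a single component. A minimal sketch of that arithmetic follows, assuming global ranks are handed out component by component in order; the helper names are hypothetical and are not Slurm functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Slurm API): total task count across all
 * components of a pack job, i.e. the value SLURM_NTASKS would carry. */
static unsigned pack_global_ntasks(const unsigned *ntasks, size_t ncomp)
{
	unsigned total = 0;

	for (size_t i = 0; i < ncomp; i++)
		total += ntasks[i];
	return total;
}

/* Hypothetical helper: global rank of task `local_rank` in component
 * `comp`. ASSUMPTION: ranks are contiguous and follow component order. */
static unsigned pack_global_rank(const unsigned *ntasks, size_t ncomp,
				 size_t comp, unsigned local_rank)
{
	unsigned offset = 0;

	assert(comp < ncomp && local_rank < ntasks[comp]);
	for (size_t i = 0; i < comp; i++)
		offset += ntasks[i];	/* tasks in all earlier components */
	return offset + local_rank;
}
```

With component sizes {2, 3, 1}, the first task of the second component gets global rank 2 and the global task count is 6.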

15 Aug, 2017 4 commits
- Start v17.11.0-pre3 NEWS · 395b6eec
  Morris Jette authored Aug 15, 2017
  
  395b6eec
- Revert commit 6dc64628 · ef4d83c2
  Morris Jette authored Aug 15, 2017
```
bug 3217
```
  ef4d83c2
- Start NEWS for v 17.02.8 · 0de4a43b
  Morris Jette authored Aug 15, 2017
  
  0de4a43b
- propagate pack job application info · a51f4799
  Morris Jette authored Aug 15, 2017
```
If srun lacks application specification for some component, the next one
      specified will be used for earlier components.
```
  a51f4799
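
The propagation rule in a51f4799 (a component without an application inherits the next one specified) can be sketched as a single right-to-left pass. This is an illustration only: `propagate_app` is a hypothetical name, and leaving trailing unspecified components as NULL is an assumption, since the commit message does not say what srun does in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not srun code): apps[i] is the application
 * given for component i, or NULL if none was given. Each NULL entry
 * is replaced by the next non-NULL entry to its right.
 * ASSUMPTION: entries after the last specified application stay NULL. */
static void propagate_app(const char **apps, size_t ncomp)
{
	const char *next = NULL;

	assert(apps || ncomp == 0);
	/* Walk from the last component to the first, carrying the most
	 * recently seen application backwards. */
	for (size_t i = ncomp; i-- > 0; ) {
		if (apps[i])
			next = apps[i];
		else
			apps[i] = next;
	}
}
```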
14 Aug, 2017 3 commits
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · f151c6c0
  Morris Jette authored Aug 14, 2017
  
  f151c6c0
- Revert "CRAY - Fix BB to handle type= correctly, regression in 17.02.6." · 80a6fa49
  Danny Auble authored Aug 14, 2017
```
This reverts commit 00a691b9.
```
  80a6fa49
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · 00a691b9
  Morris Jette authored Aug 08, 2017
  
  00a691b9
12 Aug, 2017 1 commit

scontrol job update for specific pack job components · bb8b4214

Morris Jette authored Aug 11, 2017

Modify scontrol job hold/release and update to operate with heterogeneous
      job id specification (e.g. "scontrol hold 123+4").

bb8b4214

11 Aug, 2017 5 commits
- Sview stderr fix · 6dc64628
  Alejandro Sanchez authored Aug 11, 2017
```
Fix sview to avoid messages to stderr when modifying a block, partition,
    or reservation.
bug 3217
```
  6dc64628
- Add Dell option to the node_features/knl_generic plugin. · ab5c0900
  Danny Auble authored Aug 08, 2017
```
This will allow dell's custom syscfg to work correctly.

NOTE: Dell calls flat memory just memory.

Bug 4034
```
  ab5c0900
- Disable cancellation of individual component while the job is pending · 5335dd9a
  Morris Jette authored Aug 11, 2017
```
Doing so would break the current scheduling logic.
```
  5335dd9a
- Fix overlapping reservation resize · 9af4a934
  Danny Auble authored Aug 11, 2017
```
Bug 4059
```
  9af4a934
- Fix incorrect lock levels when creating or updating a reservation · 605d7e1f
  Dominik Bartkiewicz authored Aug 11, 2017
  
  605d7e1f
10 Aug, 2017 2 commits
- Set up pack-job debugger data structures · ad2fe42d
  Morris Jette authored Aug 10, 2017
  
  ad2fe42d
- Add logic to support srun --mpi-combine option · 5a3cd962
  Morris Jette authored Aug 10, 2017
  
  5a3cd962
07 Aug, 2017 2 commits
- Make it so the cray/switch plugin grabs new DebugFlags on a reconfigure. · d30f79d1
  Danny Auble authored Aug 07, 2017
  
  d30f79d1
- Close race condition on Slurm structures when setting DebugFlags. · 13b78dd2
  Dominik Bartkiewicz authored Aug 07, 2017
```
Bug 4019
```
  13b78dd2
04 Aug, 2017 6 commits
- Correct buffer size used in determining specialized cores to avoid possible · 75bb7c40
  Morris Jette authored Aug 04, 2017
```
truncation of core specification and not reserving the specified cores.

Fixes Coverity CID 45174 and 45175

Bug 4053
```
  75bb7c40
- NEWS comment for last commit. · bf4ac7ee
  Danny Auble authored Aug 04, 2017
  
  bf4ac7ee
- Sort TRES id's on limits when getting them from the database. · 7e55acf7
  Danny Auble authored Aug 04, 2017
  
  7e55acf7
- Fix inherited association 'max' TRES limits combining multiple limits in · ab24f8b4
  Danny Auble authored Aug 04, 2017
```
the tree.

Bug 4050
```
  ab24f8b4
- Signal full pack job · fa3daf39
  Morris Jette authored Aug 03, 2017
```
Modify launch/slurm plugin to signal all components of a pack job rather
   than just the one (modify to use a list of step context records).
```
  fa3daf39
- Add step signal retry logic · 3301a6b1
  Morris Jette authored Aug 03, 2017
```
If prolog is running when attempting to signal a step, then return EAGAIN
   and retry rather than simply returning SLURM_ERROR and aborting.
```
  3301a6b1
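
The retry behavior described in 3301a6b1 (return EAGAIN while the prolog runs, then try again) follows a common pattern, sketched below. The function names, the retry limit, and the fixed delay are assumptions for illustration; `fake_step_signal` is a stand-in for the real signal call, not Slurm code.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Callback type: returns 0 on success or an errno value on failure. */
typedef int (*signal_fn)(void *arg);

/* Hypothetical retry wrapper: call `fn` until it returns something
 * other than EAGAIN, or until `max_tries` attempts have been made. */
static int signal_with_retry(signal_fn fn, void *arg, int max_tries,
			     unsigned delay_sec)
{
	int rc = EAGAIN;

	assert(fn && max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		rc = fn(arg);
		if (rc != EAGAIN)
			break;			/* success or a hard error */
		if (i + 1 < max_tries)
			sleep(delay_sec);	/* give the prolog time to finish */
	}
	return rc;
}

/* Stand-in for the real signal call: reports EAGAIN while a simulated
 * prolog still has `*remaining` polls left to run. */
static int fake_step_signal(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return EAGAIN;
	}
	return 0;
}
```

The caller still sees EAGAIN if the prolog outlasts every attempt, so it can decide whether to give up or keep waiting.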
03 Aug, 2017 1 commit

pack job step I/O race condition fix · 71a34f56

Morris Jette authored Aug 03, 2017

Fix I/O race condition on step termination for srun launching multiple
pack job groups. Without this change application output might be
lost and/or the srun command might hang after some tasks exit.

71a34f56

02 Aug, 2017 4 commits

Fix starting ctld w/out existing StateSaveLocation · ec78d45a

Marshall Garey authored Aug 02, 2017

Startup would fail when trying to create the clustername file because the
StateSaveLocation path didn't exist yet.

Bug 3988

ec78d45a

Fix srun jobs to run in high prio partition · 948de46b

Marshall Garey authored Aug 02, 2017

srun jobs that could start immediately and requested multiple partitions
didn't run in the highest priority partition if the highest priority
partition wasn't listed first.

As a side effect, scontrol show jobs may now show the partition list
in priority order, because the job's partition list gets sorted by
priority.

Bug 4015

948de46b
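
Sorting a job's partition list by priority, as described in 948de46b, can be sketched with qsort(). The `struct part` layout and function names are hypothetical and do not match Slurm's own partition records; the sketch also assumes a larger number means a higher priority.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical partition record, for illustration only. */
struct part {
	const char *name;
	unsigned priority;	/* ASSUMPTION: larger value = higher priority */
};

/* qsort() comparator: highest priority first. */
static int cmp_priority_desc(const void *a, const void *b)
{
	const struct part *pa = a, *pb = b;

	return (pa->priority < pb->priority) - (pa->priority > pb->priority);
}

/* Reorder the list so the highest priority partition is tried first,
 * regardless of the order in which the user listed the partitions. */
static void sort_parts_by_priority(struct part *parts, size_t n)
{
	assert(parts || n == 0);
	if (n > 1)
		qsort(parts, n, sizeof(*parts), cmp_priority_desc);
}
```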

Update some documentation around GroupUpdateForce/GroupUpdateTime changes. · 3203961f
Tim Wickberg authored Aug 02, 2017
```
Bug 3956.
```
3203961f

pack job accounting work · 2b4495ec

Morris Jette authored Aug 02, 2017

Add pack_job_id and pack_job_offset to accounting database.
Modified sacct to accept pack job ID specification using "#+#" notation.
Modified sstat to accept pack job ID specification using "#+#" notation.

2b4495ec
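
The "#+#" notation accepted by sacct and sstat (and the "123+4" form used with scontrol) pairs a job ID with a component offset. A minimal parsing sketch follows; `parse_pack_job_id` is a hypothetical function, not the parser Slurm uses, and reporting a missing offset as -1 is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical parser for "<job_id>" or "<job_id>+<offset>".
 * Returns 0 on success, -1 on malformed input.
 * ASSUMPTION: *offset is set to -1 when no "+<offset>" part is given. */
static int parse_pack_job_id(const char *s, unsigned long *job_id,
			     long *offset)
{
	char *end;

	assert(s && job_id && offset);
	if (*s < '0' || *s > '9')	/* must start with a digit */
		return -1;
	errno = 0;
	*job_id = strtoul(s, &end, 10);
	if (errno)
		return -1;
	*offset = -1;
	if (*end == '\0')		/* plain job ID, no component */
		return 0;
	if (*end != '+')
		return -1;
	s = end + 1;
	if (*s < '0' || *s > '9')	/* "+" must be followed by a digit */
		return -1;
	errno = 0;
	*offset = strtol(s, &end, 10);
	if (errno || *end != '\0')
		return -1;
	return 0;
}
```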

01 Aug, 2017 3 commits

Increase buffer to handle long /proc/<pid>/stat output · 9f3b04c0
Tim Shaw authored Aug 01, 2017
```
Bug 3999
```
9f3b04c0

Handle GroupUpdateForce option correctly. · 6813e486

Tim Shaw authored Aug 01, 2017

Default to 1, unless set to 0. Allow it to be set to 0 even if
GroupUpdateTime was not set before.

Move down to alphabetical position in read_config.c as well.

Bug 3956.

6813e486

Fix GRES selection with CPU binding · e94fdf2e

Dominik Bartkiewicz authored Aug 01, 2017

Fix bug in selection of GRES bound to specific CPUs where the GRES count
    is 2 or more. Previous logic could allocate CPUs not available to the job.

bug 4029

e94fdf2e

31 Jul, 2017 1 commit
- Document inconsistent behavior of GroupUpdateForce option. · 333932bc
  Tim Shaw authored Jul 31, 2017
```
This will be fixed before 17.11, but is being left as-is on 17.02.

Bug 3956.
```
  333932bc
28 Jul, 2017 3 commits

Fix issue when using an alternate munge key when communicating on a persistent · 591dc036
Danny Auble authored Jul 28, 2017
```
connection.

Bug 4009
```
591dc036

jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a

Alejandro Sanchez authored Jul 28, 2017

jobcomp/elasticsearch saves and loads its state to and from elasticsearch_state.
Since the jobcomp API isn't designed with save/load state operations, the plugin's
_save_state() isn't extern and isn't available from outside the plugin itself,
so it is tightly coupled to the fini() function. This state doesn't follow the
same execution path as the rest of the Slurm state, which is saved piece by
piece, each independently scheduled, in save_all_state(). So we save it manually
here on an RPC of type REQUEST_CONTROL.

As a result, when the Primary ctld issues a REQUEST_CONTROL to the Backup,
which is currently in controller mode, the Backup saves the state, and when
the Primary assumes control again it can process the saved pending jobs. The
reverse direction was already handled: when the Primary is running in
controller mode and the Backup issues a REQUEST_CONTROL, the Primary shuts
down, and on breaking out of the while(1) loop in the ctld main() function
there was already a g_slurm_jobcomp_fini() call in place.

Bug 3908

8944b77a

Perform pack job limits check at submit time · 058b99b6

Morris Jette authored Jul 28, 2017

Perform limit check on heterogeneous job as a whole at submit time to
   reject jobs that will never be able to run. Accepting pack jobs
   that can never start will have a significant effect on scheduling
   in general (blocking the queue).

058b99b6
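
The point of checking a heterogeneous job as a whole is that every component can fit under a limit individually while the sum does not. The sketch below shows that case for a single aggregate CPU cap. It is a simplification under stated assumptions: the function name is hypothetical, and the real check covers more than one kind of limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical submit-time check (not Slurm code): true if the CPU
 * requests of all components together fit under `max_cpus`.
 * ASSUMPTION: the limit applies to the pack job as a whole. */
static bool pack_job_within_limit(const uint32_t *cpus, size_t ncomp,
				  uint64_t max_cpus)
{
	uint64_t total = 0;	/* 64-bit so the sum cannot wrap */

	assert(cpus || ncomp == 0);
	for (size_t i = 0; i < ncomp; i++)
		total += cpus[i];
	return total <= max_cpus;
}
```

For components requesting 16, 32, and 8 CPUs under a cap of 55, each component fits on its own but the job as a whole (56) does not, so it would be rejected at submit time instead of blocking the queue.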

27 Jul, 2017 1 commit

Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5

Alejandro Sanchez authored Jul 27, 2017

When more than 1 ping cycle is spawned simultaneously (for instance
REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes),
we do not track a separate ping_start time for each cycle. When ping_begin()
is called, the information about the previous ping cycle is lost. Then when
ping_end() is called for the first of the two cycles, we set ping_start=0,
which is incorrectly used to see if the last cycle ran for more than
PING_TIMEOUT seconds (100s), thus incorrectly triggering the:

error("Node ping apparently hung, many nodes may be DOWN or configured "
"SlurmdTimeout should be increased");

Bug 3914

f7463ef5
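
One way to avoid losing track of overlapping ping cycles is to count the active cycles and clear the start time only when the last one ends. The sketch below illustrates that idea; it is not necessarily how the commit fixes the bug. The struct and function names are hypothetical, and only the most recent start time is kept, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define PING_TIMEOUT 100	/* seconds, as quoted in the commit message */

/* Hypothetical tracking state, for illustration only. */
struct ping_state {
	int active;		/* number of ping cycles in flight */
	time_t last_start;	/* start of the most recent cycle, 0 if none */
};

static void ping_cycle_begin(struct ping_state *st, time_t now)
{
	assert(st);
	st->active++;
	st->last_start = now;
}

static void ping_cycle_end(struct ping_state *st)
{
	assert(st && st->active > 0);
	st->active--;
	if (st->active == 0)
		st->last_start = 0;	/* clear only when no cycle remains */
}

/* A hang is reported only while a cycle is actually in flight. */
static bool ping_cycle_hung(const struct ping_state *st, time_t now)
{
	assert(st);
	return st->active > 0 &&
	       difftime(now, st->last_start) > PING_TIMEOUT;
}
```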