Commits · 748ebdde0989577ed8b1b4ddff0fb04d6eb2d5c5 · Manuel G. Marciani / ces_slurm_simulator

10 Aug, 2017 1 commit
- Add logic to support srun --mpi-combine option · 5a3cd962
  Morris Jette authored Aug 10, 2017
  
  5a3cd962

04 Aug, 2017 2 commits

Morris Jette authored Aug 03, 2017

Modify launch/slurm plugin to signal all components of a pack job rather
   than just the one (modify to use a list of step context records).

fa3daf39

Add step signal retry logic · 3301a6b1

Morris Jette authored Aug 03, 2017

If prolog is running when attempting to signal a step, then return EAGAIN
   and retry rather than simply returning SLURM_ERROR and aborting.

3301a6b1
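
The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `signal_step_with_retry`, `signal_step_fn`, and the stub `fake_signal_step` are hypothetical names, and the retry count and delay are assumed parameters.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical callback type: returns 0 on success, or an errno-style
 * code such as EAGAIN while the prolog is still running. */
typedef int (*signal_step_fn)(void *arg);

/* Retry while the callee reports EAGAIN, up to max_tries attempts.
 * Returns 0 on success, otherwise the last error code seen. */
int signal_step_with_retry(signal_step_fn fn, void *arg,
                           int max_tries, unsigned int delay_sec)
{
    int rc = EAGAIN;

    for (int i = 0; i < max_tries && rc == EAGAIN; i++) {
        if (i > 0 && delay_sec > 0)
            sleep(delay_sec);   /* give the prolog time to finish */
        rc = fn(arg);
    }
    return rc;
}

/* Stand-in for the real request (assumption for illustration only):
 * reports EAGAIN until the counter behind arg reaches 0. */
int fake_signal_step(void *arg)
{
    int *busy = arg;

    if (*busy > 0) {
        (*busy)--;
        return EAGAIN;
    }
    return 0;
}
```

Any error other than EAGAIN ends the loop immediately, so only the "prolog still running" case is retried.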

03 Aug, 2017 1 commit

pack job step I/O race condition fix · 71a34f56

Morris Jette authored Aug 03, 2017

Fix I/O race condition on step termination for srun launching multiple
pack job groups. Without this change application output might be
lost and/or the srun command might hang after some tasks exit.

71a34f56

02 Aug, 2017 2 commits

Update some documentation around GroupUpdateForce/GroupUpdateTime changes. · 3203961f
Tim Wickberg authored Aug 02, 2017
```
Bug 3956.
```
3203961f

pack job accounting work · 2b4495ec

Morris Jette authored Aug 02, 2017

Add pack_job_id and pack_job_offset to accounting database.
Modified sacct to accept pack job ID specification using "#+#" notation.
Modified sstat to accept pack job ID specification using "#+#" notation.

2b4495ec
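
As a rough illustration of the "#+#" notation, the sketch below reads a specification as `<pack_job_id>+<pack_job_offset>`. That reading is an assumption based on the two field names in this commit message; `parse_pack_job_spec` and `parse_u32` are hypothetical helpers, not the sacct or sstat parser.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse an unsigned decimal number that must start with a digit. */
static int parse_u32(const char *s, char **end, uint32_t *out)
{
    unsigned long v;

    if (!isdigit((unsigned char)*s))
        return -1;              /* rejects "", "+", "-1", leading spaces */
    errno = 0;
    v = strtoul(s, end, 10);
    if (errno != 0 || v > UINT32_MAX)
        return -1;
    *out = (uint32_t)v;
    return 0;
}

/* Parse "<id>+<offset>" (for example "1234+2") or a plain "<id>".
 * Returns 0 on success and -1 on malformed input. */
int parse_pack_job_spec(const char *spec, uint32_t *job_id,
                        uint32_t *offset, int *has_offset)
{
    char *end = NULL;

    *has_offset = 0;
    *offset = 0;
    if (parse_u32(spec, &end, job_id) != 0)
        return -1;
    if (*end == '\0')
        return 0;               /* plain job ID, no offset given */
    if (*end != '+' || parse_u32(end + 1, &end, offset) != 0 || *end != '\0')
        return -1;
    *has_offset = 1;
    return 0;
}
```

For example, "1234+2" would yield job ID 1234 with offset 2, while "12+" and "+3" are rejected.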

01 Aug, 2017 3 commits

Increase buffer to handle long /proc/<pid>/stat output · 9f3b04c0
Tim Shaw authored Aug 01, 2017
```
Bug 3999
```
9f3b04c0

Handle GroupUpdateForce option correctly. · 6813e486

Tim Shaw authored Aug 01, 2017

Default to 1, unless set to 0. Allow to be set to 0 even if
GroupUpdateTime was not set before.

Move down to alphabetical position in read_config.c as well.

Bug 3956.

6813e486

Fix GRES selection with CPU binding · e94fdf2e

Dominik Bartkiewicz authored Aug 01, 2017

Fix bug in selection of GRES bound to specific CPUs where the GRES count
    is 2 or more. Previous logic could allocate CPUs not available to the job.

bug 4029

e94fdf2e

31 Jul, 2017 1 commit
- Document inconsistent behavior of GroupUpdateForce option. · 333932bc
  Tim Shaw authored Jul 31, 2017
```
This will be fixed before 17.11, but is being left as-is on 17.02.

Bug 3956.
```
  333932bc

28 Jul, 2017 3 commits

Fix issue when using an alternate munge key when communicating on a persistent · 591dc036
Danny Auble authored Jul 28, 2017
```
connection.

Bug 4009
```
591dc036

jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a

Alejandro Sanchez authored Jul 28, 2017

jobcomp/elasticsearch saves/loads its state to/from elasticsearch_state.  Since
the jobcomp API isn't designed with save/load state operations, the plugin's
_save_state() isn't extern and is not available from outside the plugin itself,
so it is tightly coupled to the fini() function. This state doesn't follow the
same execution path as the rest of the Slurm states, which are all scheduled
independently in save_all_state(). So we save it manually here on an RPC
of type REQUEST_CONTROL.

With this change, when the Primary ctld issues a REQUEST_CONTROL to the Backup
which is currently in controller mode, the Backup will save the state, and when
the Primary assumes control again it can process the saved pending jobs.  The
reverse case was already handled: when the Primary is running
in controller mode and the Backup issues a REQUEST_CONTROL, the Primary is
shut down, and on breaking out of the while(1) loop in the ctld main() function
there was already a g_slurm_jobcomp_fini() call in place.

Bug 3908

8944b77a

Perform pack job limits check at submit time · 058b99b6

Morris Jette authored Jul 28, 2017

Perform limit check on heterogeneous job as a whole at submit time to
   reject jobs that will never be able to run. Accepting pack jobs
   that can never start will have a significant effect on scheduling
   in general (blocking the queue).

058b99b6

27 Jul, 2017 1 commit

Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5

Alejandro Sanchez authored Jul 27, 2017

When more than 1 ping cycle is spawned simultaneously (for instance
REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes),
we do not track a separate ping_start time for each cycle. When ping_begin()
is called, the information about the previous ping cycle is lost. Then when
ping_end() is called for the first of the two cycles, we set ping_start=0,
which is incorrectly used to see if the last cycle ran for more than
PING_TIMEOUT seconds (100s), thus incorrectly triggering the:

error("Node ping apparently hung, many nodes may be DOWN or configured "
"SlurmdTimeout should be increased");

Bug 3914

f7463ef5
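
One way to avoid the problem described in this commit message is to count outstanding cycles and clear the start time only when the last one ends. The sketch below illustrates that idea under stated assumptions; it is not the code from the commit. `ping_count` and the `sketch_*` functions are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```c
#include <time.h>

#define PING_TIMEOUT 100        /* seconds, as quoted in the message above */

static int    ping_count = 0;   /* outstanding ping cycles (hypothetical) */
static time_t ping_start = 0;   /* start of the oldest outstanding cycle */

/* A cycle begins: only the first outstanding cycle sets the start time. */
void sketch_ping_begin(time_t now)
{
    if (ping_count++ == 0)
        ping_start = now;
}

/* A cycle ends: clear the start time only when no cycle is left. */
void sketch_ping_end(void)
{
    if (ping_count > 0 && --ping_count == 0)
        ping_start = 0;
}

/* Report a hang only while a cycle is really outstanding. */
int sketch_ping_hung(time_t now)
{
    return ping_count > 0 && (now - ping_start) > PING_TIMEOUT;
}
```

With two overlapping cycles, ending the first leaves `ping_start` untouched, so a zero start time is never compared against the current time.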

26 Jul, 2017 4 commits
- Fix issue with UnkillableStepProgram if step was in an ending state. · 9f48e07c
  Danny Auble authored Jul 26, 2017
  
  9f48e07c
- Add configuration parameters for daemons to write to syslog · 05ee90f1
  Isaac Hartung authored Jul 26, 2017
```
 -- Add slurm.conf configuration parameters SlurmctldSyslogDebug and
    SlurmdSyslogDebug to control which messages from the slurmctld and
    slurmd daemons get written to syslog.
 -- Add slurmdbd.conf configuration parameter DebugLevelSyslog to
    control which messages from the slurmdbd daemon get written to syslog.
bug 3933
```
  05ee90f1
- Fix minor memory leak if launch fails in the slurmstepd. · 558d7c1a
  Danny Auble authored Jul 24, 2017
  
  558d7c1a
- If failing after switch_g_job_init happened make sure switch_g_job_fini is called. · 488c7c36
  Danny Auble authored Jul 24, 2017
```
Bug 3865
```
  488c7c36

25 Jul, 2017 2 commits

pack job backfill scheduling fixes · 98435044
Morris Jette authored Jul 25, 2017
```
Adds association and QOS limits for the pack job as a whole
```
98435044

Clear job's reason of BeginTime · e4cb80f5

Morris Jette authored Jul 25, 2017

Clear a job's "wait reason" value of "BeginTime" after that time has
passed. Previously a reason of "BeginTime" could be reported long
after the job's requested begin time had passed (for so long as the
current reason is "None").

e4cb80f5

24 Jul, 2017 4 commits
- CRAY - Throttle step creation if trying to create too many steps at once. · f9f13a86
  Morris Jette authored Jul 24, 2017
  
  f9f13a86
- pack job scheduling work · 0b7a7b4c
  Morris Jette authored Jul 24, 2017
```
Add support to sched/backfill for concurrent allocation of all pack job
      components including support of --time-min option.
```
  0b7a7b4c
- Set Reason=dependency over Reason=JobArrayTaskLimit for pending jobs. · ad0b7c27
  Dominik Bartkiewicz authored Jul 05, 2017
```
Bug 3953
```
  ad0b7c27
- Fix memory leak in slurmctld when agent queue to the DBD has filled up. · 6c7b9ba1
  Danny Auble authored Jul 24, 2017
```
Pretty much fix the entire purpose of this max_agent_queue.
```
  6c7b9ba1

21 Jul, 2017 3 commits
- Serialize updates from the dbd to the slurmctld. · 24375cb8
  Danny Auble authored Jul 21, 2017
```
Bug 3159
```
  24375cb8
- Fixed truncation on scontrol show config output. · 5b1983d5
  Tim Shaw authored Jul 21, 2017
```
Bug 3956
```
  5b1983d5
- Better debug when slurmdbd queue is filling up in the slurmctld. · 4a46c5a6
  Danny Auble authored Jul 21, 2017
```
Bug 3967
```
  4a46c5a6

19 Jul, 2017 3 commits
- Fix race condition when using jobacct_gather/cgroup where the memory of the · e5c05549
  Danny Auble authored Jul 19, 2017
```
step wasn't always gathered correctly.

Bug 3531
```
  e5c05549
- Prevent slurmctld abort with gres socket binding · c850ccf4
  Morris Jette authored Jul 19, 2017
```
Fix for possible slurmctld abort with use of salloc/sbatch/srun
    --gres-flags=enforce-binding option.
bug 4008
```
  c850ccf4
- Clarify item in NEWS · ac896302
  Morris Jette authored Jul 19, 2017
```
Update from commit b40bd8d3
```
  ac896302

18 Jul, 2017 3 commits

Fix issue with starting multiple jobs from an array. · b40bd8d3

Dominik Bartkiewicz authored Jul 18, 2017

By removing the real locks we can get into a race condition where the prolog
starts and finishes before we get here and then we end up waiting forever.

Making the mutex a static seemed to help in many cases, but didn't
completely close the window.  Changing slurm_cond_wait to
slurm_cond_timedwait fixed the scenario where we would hit the window,
without degrading the performance the original commit provides.

There were also spots where, if the job or step didn't exist, the
condition would not be signaled, which was another place this could get
stuck without starting the job.

Fix regression from commit 52ce3ff0

Bug 3977

b40bd8d3

Add srun env vars for pack job · b4992871
Morris Jette authored Jul 18, 2017

b4992871

Updated RELEASE_NOTES file for v17.11 · d5bd9ced
Morris Jette authored Jul 18, 2017

d5bd9ced

17 Jul, 2017 1 commit

refactor task output logic · 57447402

Morris Jette authored Jul 17, 2017

Avoid interleaving labels and output from various components of
   a pack job

57447402

14 Jul, 2017 4 commits

Add PrivateData=events · 880ed315
Tim Shaw authored Jul 14, 2017

880ed315

task state containers modified for pack jobs · dba6a2ce

Morris Jette authored Jul 14, 2017

Major re-write of task state container logic to support a list of
  containers rather than one container per srun command.

dba6a2ce

Daemons re-open log files on receipt of SIGUSR2 · dc6e5ec2

Isaac Hartung authored Jul 14, 2017

Modify all daemons to re-open log files on receipt of SIGUSR2 signal.  This
    is much faster than using SIGHUP to re-read the configuration file and rebuild
    various tables.
bug 3070

dc6e5ec2

Fix issue with whole gres not being printed out with Slurm tools. · 028bf3e1

Danny Auble authored Jul 13, 2017

This is a regression from commit fec995e0.

It turns out using tok here was erroneous for situations where the gres had
a type and name and potentially a count (i.e. network:gigabit:1)

_get_gres_req_cnt() would alter the incoming char *config which is what tok
was.  So when we print it back to the requested string it would only have
what was there to the first ':'.  As we didn't need to \0 out the first char
as we skip over it anyway I just kept track of what the replaced \0 was for
the number portion and put it back when we are done copying it.

Related to bug 3521

028bf3e1
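
The "remember the replaced character and put it back" approach from this commit message can be illustrated as below. This is a simplified sketch, not _get_gres_req_cnt(): `split_gres_count` is a hypothetical helper and it assumes the specification ends in a numeric count, as in network:gigabit:1.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy the part of spec before its last ':' into out, terminating spec
 * in place only temporarily. Returns the number after that ':' (0 if
 * there is no ':' or no number). spec is unchanged on return. */
unsigned long split_gres_count(char *spec, char *out, size_t out_len)
{
    char *sep = strrchr(spec, ':');
    unsigned long cnt;
    char saved;

    if (!sep) {
        snprintf(out, out_len, "%s", spec);
        return 0;
    }
    cnt = strtoul(sep + 1, NULL, 10);
    saved = *sep;               /* remember what we are about to overwrite */
    *sep = '\0';                /* split in place for the copy */
    snprintf(out, out_len, "%s", spec);
    *sep = saved;               /* restore so callers still see the whole spec */
    return cnt;
}
```

After the call, the caller's string still reads network:gigabit:1 in full, so printing it back does not stop at the first ':'.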

13 Jul, 2017 2 commits
- Cosmetic changes to task launch · ce2c09b1
  Morris Jette authored Jul 13, 2017
```
No changes to logic
```
  ce2c09b1
- Update typo in NEWS · 311dfd67
  Morris Jette authored Jul 13, 2017
  
  311dfd67