1. 12 Sep, 2017 3 commits
  2. 08 Sep, 2017 2 commits
  3. 07 Sep, 2017 2 commits
  4. 01 Sep, 2017 2 commits
  5. 24 Aug, 2017 1 commit
    • Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a
      subsequent call to bit_nclear() with start=0 and stop=-1, which
      triggers the ABRT.
      
      This scenario happened when cgroup.conf had ConstrainDevices=yes and
      task_cgroup_devices_create() tried to collect the GRES devices while
      gres_cpu_cnt=0, so p->cpus_bitmap = bit_alloc(gres_cpu_cnt); created
      a zero-size bitmap that was then passed by argument to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but neither GresTypes nor a Gres option is set in slurm.conf / the
      node configuration definition (see the guard sketch below).
      
      Bug 3974
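      
      A minimal sketch of the kind of guard that avoids the crash, using
      the names quoted above plus hypothetical surroundings (the actual
      change in 3e1fffb6 may differ):
      
          /* Only build and parse the CPUs bitmap when CPUs are known
           * for this GRES: bit_unfmt() on a zero-size bitmap ends up
           * in bit_nclear(bitmap, 0, -1) and aborts the slurmstepd. */
          if (gres_cpu_cnt > 0) {
              p->cpus_bitmap = bit_alloc(gres_cpu_cnt);
              if (bit_unfmt(p->cpus_bitmap, p->cpus))
                  error("Invalid GRES CPUs specification: %s", p->cpus);
          }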
  6. 23 Aug, 2017 1 commit
    • jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the buffer generated in slurm_jobcomp_log_record() was
      xstrdup'ed to the corresponding job_node->serialized_job, but the
      originally generated buffer was never freed afterwards. The fix
      changes the transfer: instead of xstrdup'ing the char *, just assign
      the pointer and NULL the source buffer, as sketched below.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
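      
      A minimal sketch of the ownership transfer described above (variable
      names follow the description; surrounding code is illustrative):
      
          /* Before: duplicate the buffer, leaking the original. */
          job_node->serialized_job = xstrdup(buffer);
      
          /* After: hand the buffer over and NULL the source pointer.
           * job_node->serialized_job is still xfree'd later, once the
           * job has been indexed, so nothing leaks or double-frees. */
          job_node->serialized_job = buffer;
          buffer = NULL;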
  7. 22 Aug, 2017 2 commits
  8. 21 Aug, 2017 1 commit
    • select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d
      Alejandro Sanchez authored
      Given a configuration with TopologyParam including the Dragonfly
      option, if a job requested a --switches count, the count timeout
      specified by either the job request or the max_switch_wait
      SchedulerParameters option was not respected. This was due to the
      leaf_switch_count variable not being incremented in
      _eval_nodes_dfly() when needed, as is done in _eval_nodes_topo(),
      the latter being an execution path that already waited correctly for
      the switch count timeout (see the sketch below).
      
      Bug 4056
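      
      A minimal sketch of the missing accounting, with a hypothetical
      switch_used array and loop context; _eval_nodes_topo() shows the
      working pattern:
      
          /* Count each leaf switch newly selected for the job so the
           * requested --switches count and its timeout can later be
           * evaluated against leaf_switch_count. */
          if (!switch_used[i]) {
              switch_used[i] = true;
              leaf_switch_count++;
          }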
  9. 17 Aug, 2017 1 commit
  10. 16 Aug, 2017 1 commit
  11. 15 Aug, 2017 1 commit
  12. 14 Aug, 2017 3 commits
  13. 11 Aug, 2017 3 commits
  14. 07 Aug, 2017 2 commits
  15. 04 Aug, 2017 4 commits
  16. 02 Aug, 2017 2 commits
    • Fix starting ctld w/out existing StateSaveLocation · ec78d45a
      Marshall Garey authored
      slurmctld would fail when trying to create the clustername file
      because the StateSaveLocation path didn't exist yet (see the sketch
      below).
      
      Bug 3988
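      
      A minimal sketch of the general pattern, assuming a hypothetical
      state_save_loc path variable (the actual fix in ec78d45a may
      differ):
      
          /* Create the StateSaveLocation directory if it is missing,
           * before writing the clustername file inside it. */
          if ((mkdir(state_save_loc, 0755) < 0) && (errno != EEXIST))
              fatal("Unable to create StateSaveLocation %s: %m",
                    state_save_loc);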
    • Fix srun jobs to run in high prio partition · 948de46b
      Marshall Garey authored
      srun jobs that could start immediately and requested multiple partitions
      didn't run in the highest priority partition if the highest priority
      partition wasn't listed first.
      
      As a side effect, scontrol show job may now display the partition
      list in priority order, since the job's partition list gets sorted
      by priority, as sketched below.
      
      Bug 4015
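      
      A minimal sketch of sorting a job's partition list by descending
      priority; the comparator shape follows Slurm's list_sort(), but the
      type and field names here are illustrative:
      
          /* Higher priority partitions sort first. */
          static int _sort_part_prio(void *x, void *y)
          {
              part_record_t *p1 = *(part_record_t **) x;
              part_record_t *p2 = *(part_record_t **) y;
      
              return (int) (p2->priority - p1->priority);
          }
      
          list_sort(job_ptr->part_ptr_list, _sort_part_prio);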
  17. 01 Aug, 2017 2 commits
  18. 31 Jul, 2017 1 commit
  19. 28 Jul, 2017 2 commits
    • Fix issue when using an alternate munge key while communicating on a persistent connection. · 591dc036
      Danny Auble authored
      
      Bug 4009
    • jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a
      Alejandro Sanchez authored
      jobcomp/elasticsearch saves/loads its state to/from
      elasticsearch_state. Since the jobcomp API isn't designed with
      save/load state operations, the plugin's _save_state() isn't extern
      and is not available from outside the plugin itself; it is thus
      tightly coupled to the fini() function. This state doesn't follow
      the same execution path as the rest of the Slurm states, which are
      all independently scheduled in save_all_state(). So we save it
      manually here on an RPC of type REQUEST_CONTROL.
      
      This ensures that when the primary ctld issues a REQUEST_CONTROL to
      a backup currently in controller mode, the backup saves the state,
      so that when the primary assumes control again it can process the
      saved pending jobs. The other direction was already handled: when
      the primary is running in controller mode and the backup issues a
      REQUEST_CONTROL, the primary is shut down, and on breaking out of
      the ctld main() while(1) loop there was already a
      g_slurm_jobcomp_fini() call in place (see the sketch below).
      
      Bug 3908
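      
      A minimal sketch of the idea in the REQUEST_CONTROL handler; only
      g_slurm_jobcomp_fini() is taken from the description above, its
      placement here is illustrative:
      
          /* Before handing control back to the primary, run the
           * jobcomp plugin's fini() so jobcomp/elasticsearch gets a
           * chance to call its internal _save_state(). */
          g_slurm_jobcomp_fini();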
  20. 27 Jul, 2017 1 commit
    • Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5
      Alejandro Sanchez authored
      When more than one ping cycle is spawned simultaneously (for
      instance REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the
      selected nodes), we do not track a separate ping_start time for each
      cycle. When ping_begin() is called, the information about the
      previous ping cycle is lost. Then, when ping_end() is called for the
      first of the two cycles, ping_start is set to 0, which is then
      incorrectly used to check whether the last cycle ran for more than
      PING_TIMEOUT seconds (100s), spuriously triggering the:
      
       error("Node ping apparently hung, many nodes may be DOWN or configured "
             "SlurmdTimeout should be increased");
      
      Bug 3914
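      
      A minimal sketch of the flawed check, with illustrative
      surroundings; a fix needs one start time per spawned cycle rather
      than the single shared ping_start:
      
          /* ping_start is shared by all in-flight cycles. ping_end()
           * for the first cycle resets it to 0, so this test computes
           * (now - 0) > PING_TIMEOUT and fires spuriously. */
          if ((time(NULL) - ping_start) > PING_TIMEOUT)
              error("Node ping apparently hung, many nodes may be DOWN "
                    "or configured SlurmdTimeout should be increased");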
  21. 26 Jul, 2017 3 commits