Commits · 5457580423de93a27134be72391ff2b5ad1b5b9e · Manuel G. Marciani / ces_slurm_simulator

13 Oct, 2017 5 commits

CRAY - Add rpath logic to work for the alpscomm libs · 54575804
Morris Jette authored Oct 13, 2017

54575804

Add empty hashes in perl api for hidden nodes · 0898ff91

Brian Christiansen authored Oct 13, 2017

The controller will return node records with a NULL name for nodes that
are hidden. This is so that you can map a partition_info's nodes --
using its node_inx[] -- to a node record in the returned node_array
from slurm_load_node().

Previously the perl api would leave an undefined object in the
node_array if hidden nodes were found before a real node in the
node_array, and any hidden nodes at the end of the array from the
controller wouldn't be accounted for in the perl node_array. This patch
adds empty hashes for hidden nodes and preserves the record_count and
node_array from the slurmctld.

Bug 4250

0898ff91
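
The mapping described in this commit can be sketched as follows. This is an illustrative Python model, not the perl api or the Slurm C code: it assumes `node_inx` holds inclusive (start, end) index pairs terminated by -1 and that hidden nodes appear in `node_array` as empty dicts (the "empty hashes" of the patch). The function name `nodes_for_partition` and the sample data are hypothetical.

```python
from typing import Dict, List


def nodes_for_partition(node_inx: List[int], node_array: List[Dict]) -> List[Dict]:
    """Return the visible node records covered by a partition's index ranges.

    Assumption: node_inx is a flat list of inclusive (start, end) pairs
    terminated by -1, and hidden nodes are empty dicts in node_array.
    """
    visible = []
    for i in range(0, len(node_inx) - 1, 2):
        start, end = node_inx[i], node_inx[i + 1]
        if start == -1:
            break  # terminator reached
        for idx in range(start, end + 1):
            record = node_array[idx]
            if record:  # an empty dict marks a hidden node; skip it
                visible.append(record)
    return visible


if __name__ == "__main__":
    # Hypothetical data: hidden nodes at the start and end of the array.
    sample = [{}, {"name": "n1"}, {"name": "n2"}, {}]
    print([n["name"] for n in nodes_for_partition([0, 3, -1], sample)])
```

Because hidden nodes keep their slot, the indices in `node_inx` stay valid for every position in the array, including leading and trailing hidden entries.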

Skip over undefined nodes in pbsnodes · 0fac18c8

Brian Christiansen authored Oct 13, 2017

The perl api leaves undefined objects in the node_array returned by
load_nodes() for any node that is hidden.

Bug 4250

0fac18c8

Avoid error on Cray duplicate setup of core spec · 525cde12
Morris Jette authored Oct 13, 2017
```
Bug 4003
```
525cde12

Gracefully handle race condition when reading /proc · ba32aa01
Morris Jette authored Oct 13, 2017
```
as process exits

Bug 4003
```
ba32aa01

10 Oct, 2017 4 commits
- Fix sorting of case insensitive strings · 1f3d0eb1
  Brian Christiansen authored Oct 10, 2017
```
when using xstrcasecmp. Matching up with other xstrcmp() functions.
```
  1f3d0eb1
- Add aliased name for xstrncmp · 9aaf6331
  Brian Christiansen authored Oct 10, 2017
```
was missing
```
  9aaf6331
- Set TRES limits with case insensitive TRES names · 26307387
  Isaac Hartung authored Oct 10, 2017
```
Bug 4226
```
  26307387
- bit_fmt is a function, which is why the compiler didn't complain · f293dea5
  Tim Wickberg authored Oct 10, 2017
```
that bit_fmt was out of scope on the xfree. Passing
a function address to xfree() predictably does not work very well.

Change the variable name to avoid confusion.

Bug 4241
```
  f293dea5
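
The case-insensitive sorting fix above (1f3d0eb1) concerns comparator behaviour. A minimal sketch of a comparator of that kind, written in Python for illustration only: the function name `casecmp` is hypothetical, and the ordering of missing values (None sorts first) is an assumption rather than a statement about Slurm's xstrcasecmp().

```python
from functools import cmp_to_key
from typing import Optional


def casecmp(a: Optional[str], b: Optional[str]) -> int:
    """Three-way, case-insensitive comparison that tolerates None.

    Assumption: None sorts before any string, and two Nones are equal.
    """
    if a is None and b is None:
        return 0
    if a is None:
        return -1
    if b is None:
        return 1
    la, lb = a.lower(), b.lower()
    return (la > lb) - (la < lb)  # -1, 0 or 1


if __name__ == "__main__":
    names = ["mem", "CPU", None, "Billing", "cpu"]
    print(sorted(names, key=cmp_to_key(casecmp)))
```

Returning a consistent negative, zero, or positive value is what lets a sort routine use the comparator directly.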
05 Oct, 2017 1 commit

Show correct MaxTRESPerNode limit assoc reasons · 6e806f2d

Brian Christiansen authored Oct 05, 2017

Before:
$ sbatch --wrap="sleep 300"
Submitted batch job 228
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
               228     debug     wrap    brian PD       0:00      1 (AssocMaxUnknownPerNode)

Fixed:
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
               229     debug     wrap    brian PD       0:00      1 (AssocMaxCpuPerNode)

$ sacctmgr mod account stuff set maxtrespernode=cpu=-1,mem=1
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
               229     debug     wrap    brian PD       0:00      1 (AssocMaxMemPerNode)

$ sbatch --wrap="sleep 300" --gres=blah:2 -pgpu
Submitted batch job 235
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
               235       gpu     wrap    brian PD       0:00      1 (AssocMaxGRESPerNode)

6e806f2d
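
The before/after output above amounts to choosing a reason string from the TRES type that hit the limit. A hedged sketch of that idea as a lookup table with a fallback: the reason strings are taken from the squeue output in this commit message, while the function name and the lower-case type keys are hypothetical and do not reflect the actual Slurm code.

```python
# Reason strings copied from the squeue output above; the keys are assumed.
_REASONS = {
    "cpu": "AssocMaxCpuPerNode",
    "mem": "AssocMaxMemPerNode",
    "gres": "AssocMaxGRESPerNode",
}


def max_tres_per_node_reason(tres_type: str) -> str:
    """Map a TRES type to a pending reason, falling back to the Unknown reason."""
    return _REASONS.get(tres_type.lower(), "AssocMaxUnknownPerNode")


if __name__ == "__main__":
    for tres in ("cpu", "mem", "gres", "license"):
        print(tres, "->", max_tres_per_node_reason(tres))
```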

04 Oct, 2017 1 commit

burst_buffer/cray plugin updated for Cray UP06 software · 859f6c82

Morris Jette authored Oct 04, 2017

burst_buffer/cray plugin modified to work with changes in Cray UP06
   software release.
Specific changes: Cray software now returns an error if a stage_in
   or stage_out script is processed that doesn't actually request a
   stage in or out (previously silently ignored).
Also the warning message about tearing down a buffer that is already
   gone changed.

859f6c82

02 Oct, 2017 2 commits
- Prevent sacctmgr segfault if no associations specified to update. · 85d3bfc2
  Dominik Bartkiewicz authored Oct 01, 2017
```
Move the check up a bit more where it'll do some good.

Bug 4184.
```
  85d3bfc2
- Change warnings from info to debug2 in _part_access_check. · 34486791
  Dominik Bartkiewicz authored Oct 01, 2017
```
Bug 4146.
```
  34486791
29 Sep, 2017 2 commits
- Make scontrol work correctly with job update timelimit [+|-]=. · b37e58bf
  Danny Auble authored Sep 29, 2017
```
Bug 3467
```
  b37e58bf
- Fix validating time spec to correctly validate various time formats. · 138fe83f
  Danny Auble authored Sep 29, 2017
```
Bug 3567
```
  138fe83f
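
For context on the time-spec validation commit (138fe83f), here is a small Python sketch that parses the time formats Slurm documents for time limits: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". It is an illustration under those assumptions, not the validation logic from the commit; the function name `parse_time_spec` is hypothetical and no range checks (for example minutes above 59) are applied.

```python
def parse_time_spec(spec: str) -> int:
    """Convert a time spec to seconds; raise ValueError if it is malformed."""
    days = 0
    rest = spec
    has_days = "-" in spec
    if has_days:
        day_part, rest = spec.split("-", 1)
        if not day_part.isdigit():
            raise ValueError(f"invalid day field in {spec!r}")
        days = int(day_part)

    fields = rest.split(":")
    if len(fields) > 3 or not all(f.isdigit() for f in fields):
        raise ValueError(f"invalid time spec {spec!r}")
    nums = [int(f) for f in fields]

    if has_days:
        # days-hours[:minutes[:seconds]]
        hours, minutes, seconds = (nums + [0, 0])[:3]
    elif len(nums) == 1:
        hours, minutes, seconds = 0, nums[0], 0        # minutes
    elif len(nums) == 2:
        hours, minutes, seconds = 0, nums[0], nums[1]  # minutes:seconds
    else:
        hours, minutes, seconds = nums                 # hours:minutes:seconds

    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for spec in ("30", "30:15", "1:30:00", "1-12", "2-0:30"):
        print(spec, "->", parse_time_spec(spec), "seconds")
```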
27 Sep, 2017 2 commits
- Fix issue where the slurmstepd would fatal on job launch if you have no · edb0ea61
  Danny Auble authored Sep 27, 2017
```
gres listed in your slurm.conf but some in gres.conf.

Bug 3974
```
  edb0ea61
- Fix issue that would deny the stepd access to /dev/null where GRES has a · 2e0cee37
  Danny Auble authored Sep 27, 2017
```
'type' but no file defined.
```
  2e0cee37
19 Sep, 2017 3 commits
- Make extremely verbose info messages debug2 messages in the task/cgroup · 86ababbb
  Danny Auble authored Sep 18, 2017
```
plugin when constraining devices.
```
  86ababbb
- Fix memory leaks in the task/cgroup plugin when constraining devices. · e8d8dc31
  Danny Auble authored Sep 18, 2017
  
  e8d8dc31
- Handle old 32bit values stored in the database for requested memory · 7bf6ade8
  Danny Auble authored Sep 13, 2017
```
correctly in sacct.
```
  7bf6ade8
14 Sep, 2017 1 commit

Prevent a second PMI2_Init call from leaving a hung slurmstepd process. · b2aa25d5

Tim Wickberg authored Sep 14, 2017

A second PMI2_Init() within the same step is invalid, and cannot succeed.

Return an error code back to the client end, and close the fd to force the
step to terminate immediately.

Due to a bug in our libpmi code, just returning a cmd=response_to_init with
an appropriate rc number will not tear down the connection properly, so
send back something else that will trigger the error path.

Bug 3520.

b2aa25d5

13 Sep, 2017 1 commit
- Document NewName option to sacctmgr. · d08f34f2
  Josh Samuelson authored Sep 12, 2017
```
Bug 4154.
```
  d08f34f2
12 Sep, 2017 3 commits
- Fix default location for cgroup_allowed_devices_file.conf to use correct · 1e78c111
  Danny Auble authored Sep 12, 2017
```
default path.

This makes it so you don't always have to put AllowedDevicesFile in your
cgroup.conf file if your etc dir is anything other than /etc/slurm.
```
  1e78c111
- Fix autoconf test for libcurl when clang is the compiler. · d670de2d
  Tim Wickberg authored Sep 12, 2017
```
Adding a newline prevents this error:
conftest.c:154:8: error: if statement has empty body [-Werror,-Wempty-body]
```
  d670de2d
- If creating/altering a core based reservation with scontrol/sview on a · 3b3e67e1
  Alejandro Sanchez authored Sep 12, 2017
```
remote cluster correctly determine the select type.

Bug 2329
```
  3b3e67e1
08 Sep, 2017 2 commits

Fix two GCC 7.1 warnings. · 901c3aec

Dominik Bartkiewicz authored Sep 08, 2017

If /proc was inaccessible proc_name would leak.

Put an explicit length cap in sprintf to avoid the warning. The
size is checked immediately before this point, so this just makes
the 10-char limit explicit.

Bug 4062.

901c3aec

Address some build warnings from GCC 7.1. · 919138f4
Dominik Bartkiewicz authored Sep 07, 2017
```
Bug 4062.
```
919138f4

07 Sep, 2017 2 commits
- Optimization enhancements for partition based job preemption · 0f501359
  Dominik Bartkiewicz authored Sep 07, 2017
```
bug 3824
```
  0f501359
- Cray: Don't run step NHC on external step · a6407a68
  Morris Jette authored Sep 07, 2017
```
Do not run the Node Health Check on termination of the external
  step as this happens when the job allocation ends and the job
  NHC will be executed anyway.
Bug 4074
```
  a6407a68
01 Sep, 2017 2 commits
- Check multiple partition limits when scheduling a job that were previously only · e566cf39
  Danny Auble authored Sep 01, 2017
```
checked on submit.

This only mattered when submitting a job to multiple partitions.

Bug 4066
```
  e566cf39
- Fix sbatch --signal to signal all MPI ranks in a step instead of just those · d8485b0d
  Danny Auble authored Aug 31, 2017
```
on node 0.

Bug 4035
```
  d8485b0d
24 Aug, 2017 1 commit

Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6

Alejandro Sanchez authored Aug 24, 2017

Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
call to bit_nclear() with start=0 and stop=-1, leading to the ABRT.

This scenario happened when cgroup.conf has ConstrainDevices=yes and
task_cgroup_devices_create() tries to collect the GRES devices
but gres_cpu_cnt=0. The result is a zero-size
p->cpus_bitmap = bit_alloc(gres_cpu_cnt), which is then passed as an
argument to bit_unfmt().

gres_cpu_cnt is 0 because we have defined a gres.conf like this:

Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3

but have neither a GresTypes nor a GRES option in the slurm.conf / node config definition.

Bug 3974

3e1fffb6
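
The failure mode described above can be modelled in a few lines. This is a Python stand-in for the C bitstring routines, written only to show why a zero-size bitmap produces the range start=0, stop=-1: the `Bitmap` class, its methods, and the guarded `clear_all` helper are hypothetical, an `IndexError` stands in for the ABRT, and the guard shown is one possible fix rather than the change made in this commit.

```python
class Bitmap:
    """Tiny model of a fixed-size bitmap (not Slurm's bitstring API)."""

    def __init__(self, nbits: int) -> None:
        if nbits < 0:
            raise ValueError("bitmap size must be non-negative")
        self.bits = [False] * nbits

    def size(self) -> int:
        return len(self.bits)

    def nclear(self, start: int, stop: int) -> None:
        """Clear bits start..stop inclusive; reject invalid ranges."""
        if not (0 <= start <= stop < self.size()):
            raise IndexError(f"invalid range {start}..{stop} for size {self.size()}")
        for i in range(start, stop + 1):
            self.bits[i] = False


def clear_all(bitmap: Bitmap) -> None:
    """Clear every bit, doing nothing for an empty bitmap."""
    if bitmap.size() == 0:
        return  # avoids the start=0, stop=-1 range
    bitmap.nclear(0, bitmap.size() - 1)


if __name__ == "__main__":
    empty = Bitmap(0)  # models bit_alloc(gres_cpu_cnt) with gres_cpu_cnt == 0
    try:
        empty.nclear(0, empty.size() - 1)  # stop becomes -1
    except IndexError as exc:
        print("unguarded call fails:", exc)
    clear_all(empty)  # guarded call is a no-op
    print("guarded call succeeded")
```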

23 Aug, 2017 1 commit

jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df

Alejandro Sanchez authored Aug 23, 2017

Running slurmctld under valgrind while operating with jobcomp/elasticsearch
reported the following bytes definitely lost:

==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
the corresponding job_node->serialized_job, but the originally generated buffer
wasn't freed afterwards. The fix changes the transfer so that instead
of xstrdup'ing the char * we just assign the pointer and NULL the buffer.

The job_node->serialized_job was already xfree'd properly later when the job
was indexed.

Discovered while working on Bug 4065.

8172b7df

22 Aug, 2017 2 commits
- Strip trailing slashes from the JobCompLoc for jobcomp/elasticsearch. · 60eed77f
  Alejandro Sanchez authored Aug 22, 2017
```
Otherwise the resulting URL may be invalid. Update documentation
while here as well.

Bug 4065.
```
  60eed77f
- In salloc with --uid option, drop supplementary groups before changing UID · 1efbd459
  Philip Kovacs authored Aug 22, 2017
```
bug 4095
```
  1efbd459
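
The JobCompLoc commit above (60eed77f) notes that trailing slashes can yield an invalid URL. A minimal Python sketch of the idea, assuming the location is a base URL to which a path is appended: the function names, the example host, and the `slurm/jobcomp` path are hypothetical and are not taken from the plugin.

```python
def normalize_jobcomp_loc(loc: str) -> str:
    """Strip any trailing slashes from the configured base location."""
    return loc.rstrip("/")


def index_url(loc: str, path: str) -> str:
    """Join the base location and a path with exactly one slash."""
    return f"{normalize_jobcomp_loc(loc)}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Without normalisation, the join would produce ".../9200//slurm/jobcomp".
    print(index_url("http://localhost:9200//", "slurm/jobcomp"))
```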
21 Aug, 2017 1 commit

select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d

Alejandro Sanchez authored Aug 21, 2017

Given a configuration with TopologyParam including the Dragonfly option, if a
job requested --switches count, the count timeout specified by either
the job request or max_switch_wait SchedulerParameters was not respected.
This was due to the leaf_switch_count variable not being incremented in
the _eval_nodes_dfly() function when needed, as we do in _eval_nodes_topo(),
the latter being an execution path that already waits correctly for the
switch count timeout.

Bug 4056

46c0919d

17 Aug, 2017 1 commit
- mpi/mvapich - Buffer being only partially cleared. No failures observed. · e7831316
  Morris Jette authored Aug 16, 2017
```
Coverity CID 44649

Bug 4085
```
  e7831316
16 Aug, 2017 1 commit
- Add 'slurmdbd:' to the accounting plugin to indicate the message is from dbd · 8014b5a4
  Danny Auble authored Aug 15, 2017
```
instead of local.

Bug 3546
```
  8014b5a4
15 Aug, 2017 1 commit
- Start NEWS for v 17.02.8 · 0de4a43b
  Morris Jette authored Aug 15, 2017
  
  0de4a43b
14 Aug, 2017 1 commit
- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. · f151c6c0
  Morris Jette authored Aug 14, 2017
  
  f151c6c0