Commits · 8c5abdd747f0139a55784441e8aba259d6a99d01 · Manuel G. Marciani / ces_slurm_simulator

08 Sep, 2017 1 commit
- Allow nodes to be rebooted while in maint resv · 8c5abdd7
  Marshall Garey authored Sep 07, 2017
```
Bug 4084
```
  8c5abdd7
07 Sep, 2017 2 commits
- Optimization enhancements for partition based job preemption · 0f501359
  Dominik Bartkiewicz authored Sep 07, 2017
```
bug 3824
```
  0f501359
- Cray: Don't run step NHC on external step · a6407a68
  Morris Jette authored Sep 07, 2017
```
Do not run the Node Health Check on termination of the external
  step as this happens when the job allocation ends and the job
  NHC will be executed anyway.
Bug 4074
```
  a6407a68
05 Sep, 2017 1 commit
- Update docs for sched_min_interval default change · 623bd0fd
  Brian Christiansen authored Sep 05, 2017
```
From ea621ec4
```
  623bd0fd
04 Sep, 2017 1 commit

Fix to test job mem against MaxMemPer[CPU|Node] limits at scheduling time. · 24365514

Alejandro Sanchez authored Sep 04, 2017

Initially job mem limits were tested at submission time through
_validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested
again at scheduling time, thus leading to jobs incorrectly being scheduled
against partitions where the job exceeded their MaxMemPer* limit
(which can in turn be inherited from the system wide limit too).

NOTE: New WAIT_PN_MEM_LIMIT job_state_reason enum component added to support
this new waiting reason.

Bug 2291.

24365514

02 Sep, 2017 1 commit

Make all steps from an srun have same ID · e89155c5

Morris Jette authored Sep 01, 2017

Modify step create logic so that all components of a heterogeneous job
      launched by a single srun command have the same step ID value.

e89155c5

01 Sep, 2017 4 commits

- Check multiple partition limits when scheduling a job that were previously only · e566cf39
  Danny Auble authored Sep 01, 2017
```
checked on submit.

This only mattered when submitting a job to multiple partitions.

Bug 4066
```
  e566cf39
- Fix sbatch --signal to signal all MPI ranks in a step instead of just those · d8485b0d
  Danny Auble authored Aug 31, 2017
```
on node 0.

Bug 4035
```
  d8485b0d

pack job allocation work · 7d46d43e

Morris Jette authored Sep 01, 2017

Prevent a heterogeneous job allocation from including the same nodes in
      multiple components (required by MPI jobs spanning components).

7d46d43e
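
The constraint described in 7d46d43e (no node may appear in more than one component of a heterogeneous job) can be illustrated with a minimal sketch. This is not Slurm's implementation, which is written in C and works on node bitmaps; the function name `overlapping_nodes` and the list-of-node-names representation are assumptions made for illustration only.

```python
from typing import Iterable, List, Set


def overlapping_nodes(components: List[Iterable[str]]) -> Set[str]:
    """Return the node names that appear in more than one component.

    Hypothetical helper: each component is modelled as an iterable of
    node names. An empty result means the allocation is disjoint.
    """
    seen: Set[str] = set()
    overlap: Set[str] = set()
    for nodes in components:
        current = set(nodes)
        # Any node already claimed by an earlier component is a conflict.
        overlap |= seen & current
        seen |= current
    return overlap


if __name__ == "__main__":
    print(overlapping_nodes([["n1", "n2"], ["n2", "n3"]]))
```

A scheduler following this idea would reject or retry an allocation whenever the returned set is non-empty.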

Do not print TmpDisk as part of 'slurmd -C' line. · 6aecc72d

Tim Wickberg authored Aug 31, 2017

To avoid a reliance on the slurm.conf file (as this command is
designed to help you construct said file) the path to check the
space was hard-coded as /tmp. But, if TmpFS is meant to be elsewhere
we'd emit the wrong value here.

Bug 3272.

6aecc72d

31 Aug, 2017 1 commit
- Fix - Raise an error back to the user when trying to update currently · 038f885d
  Alejandro Sanchez authored Aug 30, 2017
```
unsupported core-based reservations.
```
  038f885d
30 Aug, 2017 3 commits

Fix statically linked applications to CRAY's PMI. · 85d83258

David Gloe authored Aug 30, 2017

Statically linked Cray PMI applications still expect to use some file paths
containing the old SLURM_ID_HASH format. Some Cray customers have
certification requirements that make recompilation difficult.

The attached patch defines a macro to convert the new SLURM_ID_HASH
to the old format, and writes the files and symlinks necessary for statically
linked Cray PMI applications to work.

Bug 4114

85d83258

Add LaunchParameters=batch_step_set_cpu_freq to allow the setting of the cpu · 8f1d06c3
Danny Auble authored Aug 30, 2017
```
frequency on the batch step.

Bug 4073
Also see Bug 3510
```
8f1d06c3
Add missing NEWS item for MaxQueryTimeLimit option. · 25bc337c
Tim Wickberg authored Aug 29, 2017

25bc337c

29 Aug, 2017 5 commits
- Make the UsageFactor of a QOS work when a qos has the nodecay flag. · f51a77fa
  Brian Christiansen authored Aug 29, 2017
```
Bug 4090
```
  f51a77fa
- Add WorkDir to the job record in the database. · 3c8ec590
  Danny Auble authored Aug 29, 2017
  
  3c8ec590
- Change --workdir in sbatch to be --chdir as in all other commands (salloc, srun) · fddc9853
  Danny Auble authored Aug 29, 2017
  
  fddc9853
- By default have Slurm dynamically link to libslurm.so instead of static linking · b493efc3
  Danny Auble authored Aug 29, 2017
  
  b493efc3
- Make it so a backup DBD doesn't attempt to create database tables and · fc7de2ee
  Danny Auble authored Aug 29, 2017
```
relies on the primary to do so.

There is a potential race condition if the backup DBD tries to create/check the
database at the same time as the primary.  This patch removes this race by not
allowing the backup to do the check/create.

Bug 3827
```
  fc7de2ee
25 Aug, 2017 2 commits
- Set SLURM_GTIDS and SLURM_NODEID env vars for pack jobs · 605fb6d6
  Morris Jette authored Aug 25, 2017
```
These are required by OpenMPI
```
  605fb6d6
- Fix for srun --mpi=list output · c125875f
  Morris Jette authored Aug 25, 2017
```
Modify output of "--mpi=list" to avoid duplicates for version numbers in
    mpi/pmix plugin names.
```
  c125875f
24 Aug, 2017 2 commits

Add file bcast support for pack jobs · 58b21490

Morris Jette authored Aug 24, 2017

Modify sbcast command and srun's --bcast option to support heterogeneous
      jobs.
bug 4099

58b21490

Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6

Alejandro Sanchez authored Aug 24, 2017

Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
call to bit_nclear() with start=0 and stop=-1, leading to the ABRT.

This scenario happened when cgroup.conf has ConstrainDevices=yes and
task_cgroup_devices_create() tries to collect the GRES devices
but gres_cpu_cnt=0, thus creating a p->cpus_bitmap = bit_alloc(gres_cpu_cnt);
of zero size which is passed by argument to bit_unfmt().

gres_cpu_cnt is 0 because we have defined a gres.conf like this:

Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3

but have no GresTypes nor GRES option in the slurm.conf / node config def.

Bug 3974

3e1fffb6

23 Aug, 2017 1 commit

jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df

Alejandro Sanchez authored Aug 23, 2017

Running slurmctld under valgrind while operating with jobcomp/elasticsearch
reported the following bytes definitely lost:

==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
the corresponding job_node->serialized_job, but the originally generated buffer
wasn't freed afterwards. The fix changes the transfer so that instead
of xstrdup'ing the char * we just assign the pointer and NULL the buffer.

The job_node->serialized_job was already xfree'd properly later when the job
was indexed.

Discovered while working on Bug 4065.

8172b7df

22 Aug, 2017 2 commits
- Strip trailing slashes from the JobCompLoc for jobcomp/elasticsearch. · 60eed77f
  Alejandro Sanchez authored Aug 22, 2017
```
Otherwise the resulting URL may be invalid. Update documentation
while here as well.

Bug 4065.
```
  60eed77f
- In salloc with --uid option, drop supplementary groups before changing UID · 1efbd459
  Philip Kovacs authored Aug 22, 2017
```
bug 4095
```
  1efbd459
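
To show why 60eed77f strips trailing slashes from JobCompLoc, here is a small sketch of joining a base location with a path. The helper name `build_index_url` and the `index/type` path are hypothetical placeholders, not the plugin's actual code or its actual index path.

```python
def build_index_url(job_comp_loc: str, path: str) -> str:
    """Join a base location and a path with exactly one slash between them.

    Hypothetical helper: without the rstrip, a JobCompLoc such as
    "http://localhost:9200/" would produce a double slash in the URL.
    """
    return job_comp_loc.rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    print(build_index_url("http://localhost:9200/", "index/type"))
```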
21 Aug, 2017 2 commits

Print numbers using exponential format as needed · c125759d

Isaac Hartung authored Aug 21, 2017

Print numbers using exponential format if required to fit in allocated
    field width. The sacctmgr and sshare commands are impacted.
bug 1749

c125759d
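
The behaviour described in c125759d, falling back to exponential notation when a number does not fit its column, might look roughly like the sketch below. The function name `fit_number`, the precision search, and the `#` overflow marker are assumptions for illustration; the commit message does not describe how sacctmgr and sshare choose the precision.

```python
def fit_number(value: float, width: int) -> str:
    """Format value right-aligned in at most `width` characters.

    Plain notation is used when it fits; otherwise the highest-precision
    exponential form that fits is chosen. Hypothetical helper.
    """
    plain = f"{value:.0f}" if float(value).is_integer() else f"{value}"
    if len(plain) <= width:
        return plain.rjust(width)
    # Try progressively fewer mantissa digits until the text fits.
    for precision in range(width, -1, -1):
        candidate = f"{value:.{precision}e}"
        if len(candidate) <= width:
            return candidate.rjust(width)
    # Nothing fits: mark the field as overflowed (an assumption).
    return "#" * width


if __name__ == "__main__":
    print(fit_number(123, 8))
    print(fit_number(123456789012, 8))
```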

select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d

Alejandro Sanchez authored Aug 21, 2017

Given a configuration with TopologyParam including Dragonfly option, if a
job requested --switches count, the count timeout specified by either
the job request or max_switch_wait SchedulerParameters was not respected.
This was due to the leaf_switch_count variable not being incremented in the
_eval_nodes_dfly() function when needed, as is done in _eval_nodes_topo(),
the latter being an execution path that already waits correctly for the
switch count timeout.

Bug 4056

46c0919d

18 Aug, 2017 3 commits
- Enable SPANK plugin support on a per-heterogeneous job component basis · d1335bf8
  Morris Jette authored Aug 18, 2017
  
  d1335bf8
- Fix QOS usage factor applying to TRES run mins · d2f08d4a
  Brian Christiansen authored Aug 18, 2017
  
  d2f08d4a
- jobcomp/script - Add a bunch of new fields. · 2c1f44b0
  Alejandro Sanchez authored Aug 18, 2017
```
Add the following fields as environment variables:
CLUSTER, DEPENDENCY, DERIVEDEC, EXITCODE, GROUPNAME, QOS, RESERVATION,
USERNAME.

LIMIT env variable value format (which means the TimeLimit of the job)
has been modified to D-HH:MM:SS.

Bug 3942
```
  2c1f44b0
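
As an illustration of the D-HH:MM:SS format that 2c1f44b0 adopts for the LIMIT variable, here is a minimal sketch. It assumes the time limit is held in whole minutes and that a zero day count is still printed; neither detail is stated in the commit message, and `format_time_limit` is a hypothetical name.

```python
def format_time_limit(minutes: int) -> str:
    """Render a limit given in minutes as D-HH:MM:SS.

    Hypothetical helper: the seconds field is always 00 because the
    input is assumed to have minute resolution.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{mins:02d}:00"


if __name__ == "__main__":
    print(format_time_limit(90))
    print(format_time_limit(1500))
```

A jobcomp script reading LIMIT can then split on `-` and `:` instead of interpreting a bare number.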
17 Aug, 2017 2 commits
- Clean up partial step creation · 08bca019
  Morris Jette authored Aug 17, 2017
```
In srun, if only some steps are allocated and one step allocation fails,
 then delete all allocated steps.
```
  08bca019
- mpi/mvapich - Buffer being only partially cleared. No failures observed. · e7831316
  Morris Jette authored Aug 16, 2017
```
Coverity CID 44649

Bug 4085
```
  e7831316
16 Aug, 2017 3 commits
- Set SLURM_NTASKS for pack job · 540804f5
  Morris Jette authored Aug 16, 2017
```
Set SLURM_NTASKS environment variable to reflect global task count
(needed by MPI).
```
  540804f5
- Set SLURM_PROCID for pack job · eb319e30
  Morris Jette authored Aug 16, 2017
```
Set SLURM_PROCID environment variable to reflect global task rank
(needed by MPI).
```
  eb319e30
- Add 'slurmdbd:' to the accounting plugin to notify message is from dbd · 8014b5a4
  Danny Auble authored Aug 15, 2017
```
instead of local.

Bug 3546
```
  8014b5a4
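
Commits 540804f5 and eb319e30 set SLURM_NTASKS and SLURM_PROCID to global values across a pack job. One way to picture this is the sketch below, which assumes ranks are assigned contiguously in component order; the commit messages do not state the ordering, and `global_ranks` is a hypothetical helper rather than Slurm code.

```python
from itertools import accumulate
from typing import List


def global_ranks(tasks_per_component: List[int]) -> List[List[int]]:
    """Map each component's local task indices to global ranks.

    Assumes component k starts where component k-1 ended.
    """
    # Offset of each component = total tasks in all earlier components.
    offsets = [0] + list(accumulate(tasks_per_component))[:-1]
    return [
        [offset + local for local in range(count)]
        for offset, count in zip(offsets, tasks_per_component)
    ]


if __name__ == "__main__":
    layout = [2, 3]
    print(global_ranks(layout))  # per-component global ranks
    print(sum(layout))           # global task count
```

Under these assumptions the global task count is simply the sum of the per-component counts, which is the value an MPI library would expect to see.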
15 Aug, 2017 4 commits
- Start v17.11.0-pre3 NEWS · 395b6eec
  Morris Jette authored Aug 15, 2017
  
  395b6eec
- Revert commit 6dc64628 · ef4d83c2
  Morris Jette authored Aug 15, 2017
```
bug 3217
```
  ef4d83c2
- Start NEWS for v 17.02.8 · 0de4a43b
  Morris Jette authored Aug 15, 2017
  
  0de4a43b
- propagate pack job application info · a51f4799
  Morris Jette authored Aug 15, 2017
```
If srun lacks application specification for some component, the next one
      specified will be used for earlier components.
```
  a51f4799
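
The propagation rule in a51f4799 amounts to a backward fill: a component without an application inherits the next one specified after it. A minimal sketch follows, with `None` standing in for a missing specification; the function name and data model are hypothetical, not taken from srun.

```python
from typing import List, Optional


def propagate_apps(apps: List[Optional[str]]) -> List[Optional[str]]:
    """Fill each missing entry with the next specified application.

    Trailing entries with nothing specified after them stay None.
    """
    result = list(apps)
    next_app: Optional[str] = None
    # Walk from the last component to the first, carrying the most
    # recently seen application backwards.
    for index in range(len(result) - 1, -1, -1):
        if result[index] is None:
            result[index] = next_app
        else:
            next_app = result[index]
    return result


if __name__ == "__main__":
    print(propagate_apps([None, "a.out", None, "b.out"]))
```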