- 06 May, 2016 1 commit
-
-
Marco Ehlert authored
I would like to mention a problem which seems to be a calculation bug of used_cores in Slurm version 15.08.7. If a node is divided into 2 partitions using MaxCPUsPerNode, as in this slurm.conf configuration:

```
NodeName=n1 CPUs=20
PartitionName=cpu NodeName=n1 MaxCPUsPerNode=16
PartitionName=gpu NodeName=n1 MaxCPUsPerNode=4
```

I run into a strange scheduling situation. The situation occurs after a fresh restart of the slurmctld daemon, when I start jobs one by one.

Case 1:

```
systemctl restart slurmctld.service
sbatch -n 16 -p cpu cpu.sh
sbatch -n 1 -p gpu gpu.sh
sbatch -n 1 -p gpu gpu.sh
sbatch -n 1 -p gpu gpu.sh
sbatch -n 1 -p gpu gpu.sh
```

Problem: the gpu jobs are kept in PENDING state.

This picture changes if I start the jobs this way.

Case 2:

```
systemctl restart slurmctld.service
sbatch -n 1 -p gpu gpu.sh
scancel <gpu job_id>
sbatch -n 16 -p cpu cpu.sh
sbatch -n 1 -p gpu gpu.sh
sbatch -n 1 -p gpu gpu.sh
sbatch -n 1 -p gpu gpu.sh
sbatch -n 1 -p gpu gpu.sh
```

and all jobs run fine.

By looking into the code I figured out a wrong calculation of used_cores in the function _allocate_sc() in plugins/select/cons_res/job_test.c:

```c
for (c = core_begin; c < core_end; c++) {
	i = (uint16_t) (c - core_begin) / cores_per_socket;
	if (bit_test(core_map, c)) {
		free_cores[i]++;
		free_core_count++;
	} else {
		used_cores[i]++;
	}
	if (part_core_map && bit_test(part_core_map, c))
		used_cpu_array[i]++;
}
```

This code only works correctly if a part_core_map already exists for the partition, or on a completely free node. In case 1, however, no part_core_map has been created for the gpu partition yet. When a gpu job starts, core_map contains only the 4 cores left over from the cpu job, so all non-free cores of the cpu partition are counted as used cores in the gpu partition, and the condition in the subsequent code

```c
free_cpu_count + used_cpu_count > job_ptr->part_ptr->max_cpus_per_node
```

incorrectly matches, which is definitely wrong.

As soon as a part_core_map appears, meaning a gpu job was started on a free node (case 2), there is no problem at all. To make case 1 work I changed the above code to the following, and everything works fine:

```c
for (c = core_begin; c < core_end; c++) {
	i = (uint16_t) (c - core_begin) / cores_per_socket;
	if (bit_test(core_map, c)) {
		free_cores[i]++;
		free_core_count++;
	} else if (part_core_map && bit_test(part_core_map, c)) {
		used_cpu_array[i]++;
		used_cores[i]++;
	}
}
```

I am not sure this code change is really good, but it fixes my problem.
-
- 05 May, 2016 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
Do not attempt to power down a node which has never responded if the slurmctld daemon restarts without state. bug 2698
-
Danny Auble authored
they are in a step.
-
- 04 May, 2016 3 commits
-
-
Tim Wickberg authored
1) step_ptr->step_layout has already been dereferenced plenty of times. 2) Can't possibly have rpc_version >= MIN_PROTOCOL_VERSION and < 8, so this code is dead.
-
Morris Jette authored
Issue the "node_reinit" command on all nodes identified in a single call to capmc. Only if that fails will individual nodes be restarted using multiple pthreads. This improves efficiency while retaining the ability to operate on individual nodes when some failure occurs. bug 2659
-
Danny Auble authored
-
- 03 May, 2016 5 commits
-
-
Danny Auble authored
-
Brian Christiansen authored
E.g. info, debug, etc.
-
Brian Christiansen authored
-
Tim Wickberg authored
-
Eric Martin authored
-
- 29 Apr, 2016 6 commits
-
-
Danny Auble authored
Backport of commit cca1616b from 16.05
-
Tim Wickberg authored
The MCS plugin should not have been retroactively added to the 15.08 RPCs; doing so caused 'scontrol show config' from a 15.08 scontrol to a 16.05 slurmctld to fail.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Brian Christiansen authored
-
- 28 Apr, 2016 4 commits
-
-
Artem Polyakov authored
See bug 2672 for details
-
Tim Wickberg authored
-
Danny Auble authored
of Slurm.
-
Morris Jette authored
Use TaskPluginParam for default task binding if the user specified no CPU binding. The user's --cpu_bind option takes precedence over the default. No longer report an error if the user's --cpu_bind option does not match TaskPluginParam. bug 2655
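As an illustrative sketch of the resulting behavior (the specific values below are assumptions, not taken from this commit):

```
# slurm.conf: default binding granularity used when a job gives no --cpu_bind
TaskPlugin=task/affinity
TaskPluginParam=Cores

# A user-supplied binding overrides the default, and a mismatch with
# TaskPluginParam no longer produces an error:
#   srun --cpu_bind=threads ./a.out
```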
-
- 27 Apr, 2016 4 commits
-
-
Tim Wickberg authored
The compiler errors out, preventing these 13 from running, unless the implied int type for main is fixed.
-
Tim Wickberg authored
Do not use AC_CONFIG_FILES as this may not expand all variables at config time. Loosely based on recommendations from http://www.gnu.org/software/autoconf/manual/autoconf.html#Makefile-Substitutions Run autogen.sh to pick up changes as well. Bug 2247/2298.
-
Morris Jette authored
Prior logic only supported ntasks_per_core. bug 2655
-
Morris Jette authored
Avoid error message of "Requested cpu_bind option requires entire node to be allocated; disabling affinity" being generated in some cases where task/affinity and task/cgroup plugins used together.
-
- 26 Apr, 2016 4 commits
-
-
Danny Auble authored
restart of the slurmctld.
-
Sam Gallop authored
Otherwise the miscalculated limit will lead to job cancellation even when well inside the allocated amount. Bug 2660.
-
Brian Christiansen authored
Bug 2386
-
Brian Christiansen authored
Bug 2386
-
- 25 Apr, 2016 1 commit
-
-
Tim Wickberg authored
Also remove the misleading note "Unless PreemptType=preempt/partition_prio the partition Priority is not critical"; it does still impact scheduling when nodes overlap partitions.
-
- 23 Apr, 2016 1 commit
-
-
Tim Wickberg authored
in the slurmdbd segfaulting. Bug 2656
-
- 21 Apr, 2016 2 commits
-
-
Brian Christiansen authored
-
Morris Jette authored
burst_buffer/cray - Don't call Datawarp "paths" function if script includes only create or destroy of persistent burst buffer. Some versions of Datawarp software return an error for such scripts, causing the job to be held. bug 2624
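For illustration, a job script of the kind described, containing only a persistent-buffer create directive, might look like this (the buffer name and size are placeholders; the "#BB" directive syntax follows the burst_buffer/cray convention):

```
#!/bin/bash
#BB create_persistent name=mybb capacity=100GB access=striped type=scratch
exit 0
```

Such a script stages no data in or out, so there is nothing for the Datawarp "paths" function to report, and skipping the call avoids the spurious error that held the job.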
-
- 20 Apr, 2016 3 commits
-
-
Morris Jette authored
burst_buffer/cray - Don't call Datawarp "paths" function if script includes only create or destroy of persistent burst buffer. Some versions of Datawarp software return an error for such scripts, causing the job to be held. bug 2624
-
Janne Blomqvist authored
I noticed that the CpuFreqDef config option was only partially implemented. The value was parsed but then never used. So I took the liberty of re-purposing it to mean sort of the opposite, namely the frequency governor to use when running a job step in case the job doesn't explicitly provide any --cpu-freq option. I also changed the default of the CpuFreqGovernors option to be "ondemand,performance", since ondemand isn't available with the intel_pstate driver. Otherwise the patch should be relatively straightforward and only changes a few minor things here and there.
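A sketch of the corresponding slurm.conf settings under this change (the governor chosen for CpuFreqDef is illustrative):

```
# Governor applied to a job step when the user passes no --cpu-freq option
CpuFreqDef=Performance
# Allowed governors; ondemand,performance is the new default because the
# intel_pstate driver does not provide ondemand
CpuFreqGovernors=OnDemand,Performance
```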
-
Tim Wickberg authored
-
- 15 Apr, 2016 1 commit
-
-
Morris Jette authored
Add TopologyParam option of "TopoOptional" to optimize network topology only for jobs requesting it. bug 2567
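A sketch of how this might be configured (reading "jobs requesting it" as jobs that make an explicit topology request, e.g. via --switches; that interpretation is an assumption here):

```
# slurm.conf
TopologyPlugin=topology/tree
TopologyParam=TopoOptional
```

With this set, only jobs that ask for topology-aware placement (e.g. sbatch --switches=1 job.sh) pay the cost of the topology optimization pass; other jobs are scheduled without it.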
-
- 14 Apr, 2016 2 commits
-
-
Tim Wickberg authored
Time out stalled transfers and clean up the related data structures. Default to waiting five minutes since the last update. Hook onto the registration/ping message type to trigger cleanup in a minimally invasive manner. While here, restructure certain functions to use list_* functions rather than iterating on the structures.
-
Tim Wickberg authored
Otherwise --mail-type=ALL will send an unexpected stage_out message back to the user. Bug 2541.
-