Commits · 27f2c352eb5dbb518c28da6d4a589cf5072f5d94 · Manuel G. Marciani / ces_slurm_simulator

15 Sep, 2017 2 commits
- Merge in slurm_thread_create branch. Also mention MPI plugin cleanup. · f52402fc
  Tim Wickberg authored Sep 15, 2017
  
- Fix _is_gres_cnt_zero() to return false for improper input strings, e.g. xxx:xx:x, xxx: · 3758d426
  Dominik Bartkiewicz authored Sep 15, 2017
  
14 Sep, 2017 2 commits

Prevent a second PMI2_Init call from leaving a hung slurmstepd process. · b2aa25d5

Tim Wickberg authored Sep 14, 2017

A second PMI2_Init() within the same step is invalid, and cannot succeed.

Return an error code back to the client end, and close the fd to force the
step to terminate immediately.

Due to a bug in our libpmi code, just returning a cmd=response_to_init with
an appropriate rc number will not tear down the connection properly, so
send back something else that will trigger the error path.
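
The behavior described above can be sketched as a small state machine. This is a Python illustration only, not the slurmstepd C code: the `StepConnection` class and the `"ok"`/`"error"` replies are hypothetical stand-ins, not the PMI2 wire protocol.

```python
class StepConnection:
    """Hypothetical stand-in for one client connection within a step."""

    def __init__(self):
        self.initialized = False
        self.closed = False

    def handle_init(self):
        """Return a reply for an init request; reject any repeat."""
        if self.closed:
            raise RuntimeError("connection already closed")
        if self.initialized:
            # A second init within the same step cannot succeed: reply with
            # an error and close the connection so the step ends at once.
            self.closed = True
            return "error"
        self.initialized = True
        return "ok"
```

The first call succeeds; a second call on the same connection gets the error reply and leaves the connection closed.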

Bug 3520.


Pack step cancel work · 20e660e1

Morris Jette authored Sep 14, 2017

A request to cancel a pack step leader will result in that step
  being cancelled on all pack job components. Needed by MPI.


13 Sep, 2017 3 commits
- mpi/pmi2 - fix shutdown race condition · b018470f
  Morris Jette authored Sep 13, 2017
```
mpi/pmi2 plugin - a vestigial pointer could be referenced at shutdown,
    resulting in an invalid memory reference.
```
- Document NewName option to sacctmgr. · d08f34f2
  Josh Samuelson authored Sep 12, 2017
```
Bug 4154.
```
- Automatically set partition for a reservation if not specified. · be393a86
  Dominik Bartkiewicz authored Sep 12, 2017
```
Use the partition specified in the reservation, rather than
using the cluster's default.

Bug 4030.
```
12 Sep, 2017 7 commits

Fix default location for cgroup_allowed_devices_file.conf to use correct · 1e78c111

Danny Auble authored Sep 12, 2017

default path.

This makes it so you don't always have to put AllowedDevicesFile in your
cgroup.conf file if your etc dir is anything other than /etc/slurm.


Pack job debugger synchronization · a8d2a04f

Morris Jette authored Sep 12, 2017

Don't flag a job as "SPAWNED" for debugger until all process
  information is available for all pack job components.


Fix autoconf test for libcurl when clang is the compiler. · d670de2d

Tim Wickberg authored Sep 12, 2017

Adding a newline prevents this error:
conftest.c:154:8: error: if statement has empty body [-Werror,-Wempty-body]


If creating/altering a core based reservation with scontrol/sview on a · 3b3e67e1
Alejandro Sanchez authored Sep 12, 2017
```
remote cluster correctly determine the select type.

Bug 2329
```

Modify srun --mpi=list output · 83058057

Morris Jette authored Sep 11, 2017

Modify "srun --mpi=list" output to match valid option input by removing the
    "mpi/" prefix on each line of output.

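The output change amounts to dropping the plugin-class prefix from each name. A minimal Python sketch, assuming a hypothetical helper `strip_plugin_prefix` and using only the two MPI types named elsewhere in this log:

```python
def strip_plugin_prefix(plugin_type, prefix="mpi/"):
    """Drop the leading plugin-class prefix so the name matches --mpi input."""
    if plugin_type.startswith(prefix):
        return plugin_type[len(prefix):]
    return plugin_type


# Illustrative input only; the real list depends on the installed plugins.
names = [strip_plugin_prefix(p) for p in ["mpi/pmi2", "mpi/none"]]
print(names)
```

Each resulting name can then be passed back as the value of `--mpi=`.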

Disable heterogeneous steps by default · fe2daac7
Morris Jette authored Sep 11, 2017
```
Enable them only with SchedulerParameters=enable_hetero_jobs OR
  MPI type is "none"
```
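
The enabling rule above can be written as a single predicate. This is a Python sketch with a hypothetical function name; it assumes only that SchedulerParameters is a comma-separated option list.

```python
def hetero_steps_enabled(scheduler_parameters, mpi_type):
    """True if enable_hetero_jobs is set or the MPI type is "none"."""
    options = [opt.strip() for opt in scheduler_parameters.split(",")]
    return "enable_hetero_jobs" in options or mpi_type == "none"
```

With neither condition met, heterogeneous steps stay disabled, which is the new default.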

Speedup arbitrary distribution algorithm · 011db71c

Brian Christiansen authored Sep 11, 2017

Do pointer comparisons rather than strcmps.
~80x speedup
Bug 3529
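
As a rough analogy in Python (the actual change is in Slurm's C code and is not reproduced here), the idea is to resolve each host name to a node record once and then compare references in the inner loop instead of comparing strings. The `Node` class and both function names are hypothetical, and host names in the allocation are assumed unique.

```python
class Node:
    """Hypothetical node record; only its identity matters here."""

    def __init__(self, name):
        self.name = name


def layout_by_name(task_hosts, alloc_hosts):
    """Original style: find each task's node with string comparisons."""
    counts = [0] * len(alloc_hosts)
    for host in task_hosts:
        for i, name in enumerate(alloc_hosts):
            if host == name:
                counts[i] += 1
                break
    return counts


def layout_by_identity(task_hosts, alloc_hosts):
    """Resolve names to records once, then compare references only."""
    nodes = [Node(name) for name in alloc_hosts]
    by_name = {node.name: node for node in nodes}
    task_nodes = [by_name[host] for host in task_hosts]  # one lookup per task
    counts = [0] * len(nodes)
    for task_node in task_nodes:
        for i, node in enumerate(nodes):
            if task_node is node:  # identity check, no string comparison
                counts[i] += 1
                break
    return counts
```

Both functions return the same per-node task counts; only the comparison in the inner loop differs.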

e.g.
1000 nodes
8000 tasks
[Sep 11 14:24:15.873639 20992 srvcn        0x7f8c1cdda700] _task_layout_hostfile: hostfile processing took usec=2152678 (orig)
[Sep 11 14:27:46.173424 20992 srvcn        0x7f8c1c6d3700] _task_layout_hostfile: hostfile processing took usec=2142997 (orig)
[Sep 11 14:32:32.245420  4037 srvcn        0x7f12de4e4700] _task_layout_hostfile: hostfile processing took usec=26198 (node ptrs)
[Sep 11 14:36:12.88769   4037 srvcn        0x7f12de6e6700] _task_layout_hostfile: hostfile processing took usec=25515 (node ptrs)
[Sep 11 14:41:38.339162  4037 srvcn        0x7f132c8d5700] _task_layout_hostfile: hostfile processing took usec=27459 (node ptrs)
[Sep 11 15:16:59.575189  1874 srvcn        0x7f3dae3f0700] _task_layout_hostfile: hostfile processing took usec=30129 (node ptrs)
[Sep 11 15:20:50.365004  1874 srvcn        0x7f3dc8b34700] _task_layout_hostfile: hostfile processing took usec=29884 (node ptrs)
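
The quoted speedup can be checked against the timings logged above. A short Python calculation using only the `usec` values from those lines:

```python
# usec values copied from the log lines above
orig_usec = [2152678, 2142997]
node_ptr_usec = [26198, 25515, 27459, 30129, 29884]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


speedup = mean(orig_usec) / mean(node_ptr_usec)
print(f"mean speedup: {speedup:.1f}x")
```

The ratio of the means comes out at roughly 77x, in line with the ~80x quoted.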


11 Sep, 2017 2 commits
- Remove srun --mpi-combine option · 00f02892
  Morris Jette authored Sep 11, 2017
  
- Add cluster name to smail output · d26370f8
  Ole H Nielsen authored Sep 11, 2017
  
08 Sep, 2017 6 commits

Fix two GCC 7.1 warnings. · 901c3aec

Dominik Bartkiewicz authored Sep 08, 2017

If /proc was inaccessible proc_name would leak.

Put an explicit length cap in sprintf to avoid warning. The
size is checked immediately before here, so this is just making
the 10-char limit explicit.

Bug 4062.


Add burst buffer support for heterogeneous jobs · 241f31d7
Morris Jette authored Sep 08, 2017


Harden slurmctld HA to mitigate certain HA issues. · e69449b6

Tim Wickberg authored Sep 07, 2017

If the network path to shared storage used for the StateSaveLocation
is separate from that used to communicate with the cluster, both the
primary and backup controllers can end up acting as master on loss
of the cluster network.

Alter the HA takeover code path to make sure that the job state
save file is not still being updated by the primary slurmctld.
If it is, refuse to take over and retry later.
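
The commit message does not say how "still being updated" is detected. One plausible approach, sketched here in Python as an assumption, is to look at the state file's modification time; the function names and the `quiet_secs` threshold are hypothetical.

```python
import os
import time


def state_file_recently_updated(path, quiet_secs, now=None):
    """True if the file at `path` was modified within the last `quiet_secs`."""
    if now is None:
        now = time.time()
    return (now - os.stat(path).st_mtime) < quiet_secs


def may_take_over(path, quiet_secs, now=None):
    """Backup may take over only if the primary has stopped writing state."""
    return not state_file_recently_updated(path, quiet_secs, now)
```

A backup that gets `False` from `may_take_over()` would wait and check again later rather than assume the primary is gone.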

Bug 3592.


Address some build warnings from GCC 7.1. · 919138f4
Dominik Bartkiewicz authored Sep 07, 2017
```
Bug 4062.
```
Show nodes as down when in a maint resv · 6c2e132c
Marshall Garey authored Sep 07, 2017

Allow nodes to be rebooted while in maint resv · 8c5abdd7
Marshall Garey authored Sep 07, 2017
```
Bug 4084
```

07 Sep, 2017 2 commits
- Optimization enhancements for partition based job preemption · 0f501359
  Dominik Bartkiewicz authored Sep 07, 2017
```
Bug 3824
```
- Cray: Don't run step NHC on external step · a6407a68
  Morris Jette authored Sep 07, 2017
```
Do not run the Node Health Check on termination of the external
  step as this happens when the job allocation ends and the job
  NHC will be executed anyway.
Bug 4074
```
05 Sep, 2017 1 commit
- Update docs for sched_min_interval default change · 623bd0fd
  Brian Christiansen authored Sep 05, 2017
```
From ea621ec4
```
04 Sep, 2017 1 commit

Fix to test job mem against MaxMemPer[CPU|Node] limits at scheduling time. · 24365514

Alejandro Sanchez authored Sep 04, 2017

Initially job mem limits were tested at submission time through
_validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested
again at scheduling time, thus leading to jobs incorrectly being scheduled
against partitions where the job exceeded their MaxMemPer* limit
(which can in turn be inherited from the system wide limit too).

NOTE: New WAIT_PN_MEM_LIMIT job_state_reason enum component added to support
this new waiting reason.
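
A simplified Python sketch of rechecking the limit at scheduling time follows. It is not the slurmctld implementation: the function names are hypothetical, memory is a plain number of megabytes, and the per-CPU versus per-node distinction is ignored.

```python
def mem_limit_reason(job_mem_mb, partition_max_mb, system_max_mb=None):
    """Return "WAIT_PN_MEM_LIMIT" if the job exceeds the limit, else None."""
    # A partition without its own limit inherits the system-wide one.
    limit = partition_max_mb if partition_max_mb is not None else system_max_mb
    if limit is not None and job_mem_mb > limit:
        return "WAIT_PN_MEM_LIMIT"
    return None


def schedulable_partitions(job_mem_mb, partitions, system_max_mb=None):
    """Scheduling-time recheck: keep only partitions whose limit the job fits."""
    return [name for name, max_mb in partitions.items()
            if mem_limit_reason(job_mem_mb, max_mb, system_max_mb) is None]
```

Running the same check again in the scheduling path is what keeps a job from starting in a partition whose limit it exceeds.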

Bug 2291.


02 Sep, 2017 1 commit

Make all steps from an srun have same ID · e89155c5

Morris Jette authored Sep 01, 2017

Modify step create logic so that all components of a heterogeneous job
      launched by a single srun command have the same step ID value.


01 Sep, 2017 4 commits

Check multiple partition limits when scheduling a job that were previously only · e566cf39
Danny Auble authored Sep 01, 2017
```
checked on submit.

This only mattered when submitting a job to multiple partitions.

Bug 4066
```
Fix sbatch --signal to signal all MPI ranks in a step instead of just those · d8485b0d
Danny Auble authored Aug 31, 2017
```
on node 0.

Bug 4035
```

pack job allocation work · 7d46d43e

Morris Jette authored Sep 01, 2017

Prevent a heterogeneous job allocation from including the same nodes in
      multiple components (required by MPI jobs spanning components).
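
The constraint above is that the components' node sets must be pairwise disjoint. A minimal Python check, assuming a hypothetical helper name and plain lists of node names:

```python
def overlapping_nodes(components):
    """Return the set of node names that appear in more than one component."""
    seen = set()
    overlap = set()
    for nodes in components:
        for node in set(nodes):  # ignore repeats within one component
            if node in seen:
                overlap.add(node)
            seen.add(node)
    return overlap
```

An empty result means no node is shared between components.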


Do not print TmpDisk as part of 'slurmd -C' line. · 6aecc72d

Tim Wickberg authored Aug 31, 2017

To avoid a reliance on the slurm.conf file (as this command is
designed to help you construct said file) the path to check the
space was hard-coded as /tmp. But, if TmpFS is meant to be elsewhere
we'd emit the wrong value here.

Bug 3272.


31 Aug, 2017 1 commit
- Fix - Raise an error back to the user when trying to update currently · 038f885d
  Alejandro Sanchez authored Aug 30, 2017
```
unsupported core-based reservations.
```
30 Aug, 2017 3 commits

Fix statically linked applications to CRAY's PMI. · 85d83258

David Gloe authored Aug 30, 2017

Statically linked Cray PMI applications still expect to use some file paths
containing the old SLURM_ID_HASH format. Some Cray customers have
certification requirements that make recompilation difficult.

The attached patch defines a macro to convert the new SLURM_ID_HASH
to the old format, and writes the files and symlinks necessary for statically
linked Cray PMI applications to work.

Bug 4114


Add LaunchParameters=batch_step_set_cpu_freq to allow the setting of the cpu · 8f1d06c3
Danny Auble authored Aug 30, 2017
```
frequency on the batch step.

Bug 4073
Also see Bug 3510
```
Add missing NEWS item for MaxQueryTimeLimit option. · 25bc337c
Tim Wickberg authored Aug 29, 2017


29 Aug, 2017 5 commits
- Make the UsageFactor of a QOS work when a qos has the nodecay flag. · f51a77fa
  Brian Christiansen authored Aug 29, 2017
```
Bug 4090
```
- Add WorkDir to the job record in the database. · 3c8ec590
  Danny Auble authored Aug 29, 2017
  
- Change --workdir in sbatch to be --chdir as in all other commands (salloc, srun) · fddc9853
  Danny Auble authored Aug 29, 2017
  
- By default have Slurm dynamically link to libslurm.so instead of static linking · b493efc3
  Danny Auble authored Aug 29, 2017
  
- Make it so a backup DBD doesn't attempt to create database tables and · fc7de2ee
  Danny Auble authored Aug 29, 2017
```
relies on the primary to do so.

There is a potential race condition if the backup DBD tries to create/check the
database at the same time as the primary.  This patch removes this race by not
allowing the backup to do the check/create.

Bug 3827
```