Commits · c8dd9790878dff20137ea461d46784bdc332e4a7 · Manuel G. Marciani / ces_slurm_simulator

07 Mar, 2016 1 commit

Added per job array task dependencies · c8dd9790

Dominik Bartkiewicz authored Mar 07, 2016

Added new job dependency type of "aftercorr" which will start a task of a
    job array after the corresponding task of another job array completes.
bug 2460
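A minimal sketch of the "aftercorr" matching rule, assuming dense array task IDs 0..n-1 and treating only a successful completion as satisfying the dependency. The `task_state` enum, `job_array` struct and `aftercorr_ready()` function are hypothetical illustrations, not Slurm's data structures or scheduler code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task states; not Slurm's job state enum. */
enum task_state { TASK_PENDING, TASK_RUNNING, TASK_COMPLETED, TASK_FAILED };

/* Hypothetical view of one job array: states indexed by array task ID. */
struct job_array {
	const enum task_state *tasks;
	size_t ntasks;
};

/*
 * Return true if task `task_id` of the dependent array may start under an
 * "aftercorr" dependency on `other`: the task with the same ID in `other`
 * must have completed.  A missing corresponding task is treated as "not
 * satisfied"; that choice is an assumption of this sketch.
 */
static bool aftercorr_ready(const struct job_array *other, size_t task_id)
{
	assert(other != NULL);
	if (task_id >= other->ntasks)
		return false;
	return other->tasks[task_id] == TASK_COMPLETED;
}
```

Real job arrays can have sparse task IDs (for example 1,3,5), so an actual implementation would look up the corresponding task by ID rather than by position.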

05 Mar, 2016 7 commits
- Fix some node reboot timing · e7cd9c24
  Morris Jette authored Mar 04, 2016
```
Fix some timing issues with respect to rebooting a node, especially
KNL nodes that need a reboot to change configuration.
```
- Make it so jobs/steps track ':' named gres/tres, beforehand gres/gpu:tesla · 0cd69296
  Danny Auble authored Mar 04, 2016
```
would only track gres/gpu, now it will track both gres/gpu and
gres/gpu:tesla as separate gres if configured like
AccountingStorageTRES=gres/gpu,gres/gpu:tesla
```
- Merge remote-tracking branch 'origin/slurm-15.08' · 7c9cc617
  Danny Auble authored Mar 04, 2016
- Continuation to commit b294f81b to do the right thing for jobs. · 35f7a262
  Danny Auble authored Mar 04, 2016
- Merge remote-tracking branch 'origin/slurm-15.08' · 0a8e2d43
  Danny Auble authored Mar 04, 2016
- Fixed double read lock on getting job's gres/tres. · b23a57cf
  Danny Auble authored Mar 04, 2016
- Move common code into a single function. This also allows requests like · 7153bbfd
  Danny Auble authored Mar 04, 2016
```
--gres=gpu:tesla

before you needed to give a count

--gres=gpu:tesla:1

now both should work.
```
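The two accepted request forms can be illustrated with a small parser. This is a sketch, not Slurm's GRES parsing code: `struct gres_spec` and `parse_gres_spec()` are hypothetical, and a missing count defaults to 1 so that `gpu:tesla` and `gpu:tesla:1` parse to the same result. It handles a single item only (no comma-separated lists or count suffixes).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of one --gres item. */
struct gres_spec {
	char name[32];
	char type[32];     /* empty string when no type was given */
	uint64_t count;
};

static bool all_digits(const char *s)
{
	if (*s == '\0')
		return false;
	for (; *s; s++)
		if (!isdigit((unsigned char)*s))
			return false;
	return true;
}

/* Copy `len` bytes of `src` into `dst`; fail if empty or too long. */
static bool copy_field(char *dst, size_t dstsz, const char *src, size_t len)
{
	if (len == 0 || len >= dstsz)
		return false;
	memcpy(dst, src, len);
	dst[len] = '\0';
	return true;
}

/* Parse "name", "name:count", "name:type" or "name:type:count". */
static int parse_gres_spec(const char *spec, struct gres_spec *out)
{
	assert(spec != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	out->count = 1;                      /* default when no count given */

	const char *c1 = strchr(spec, ':');
	if (!c1)
		return copy_field(out->name, sizeof(out->name), spec,
				  strlen(spec)) ? 0 : -1;
	if (!copy_field(out->name, sizeof(out->name), spec, (size_t)(c1 - spec)))
		return -1;

	const char *rest = c1 + 1;
	const char *c2 = strchr(rest, ':');
	const char *count_str = NULL;
	if (c2) {                            /* "gpu:tesla:1" */
		if (!copy_field(out->type, sizeof(out->type), rest,
				(size_t)(c2 - rest)))
			return -1;
		count_str = c2 + 1;
	} else if (all_digits(rest)) {       /* "gpu:2" */
		count_str = rest;
	} else if (!copy_field(out->type, sizeof(out->type), rest,
			       strlen(rest))) {  /* "gpu:tesla" */
		return -1;
	}

	if (count_str) {
		if (!all_digits(count_str))
			return -1;
		errno = 0;
		unsigned long long v = strtoull(count_str, NULL, 10);
		if (errno == ERANGE)
			return -1;
		out->count = (uint64_t)v;
	}
	return 0;
}
```

The design choice here is that a purely numeric second field is read as a count (`gpu:2`) and anything else as a type, which is what lets the count be omitted after a type.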
04 Mar, 2016 9 commits
- Merge remote-tracking branch 'origin/slurm-15.08' · 90935701
  Danny Auble authored Mar 04, 2016
- Continuation of commit 7f0bdc84 · 55a678dd
  Danny Auble authored Mar 04, 2016
```
Step GRES value changed from type "int" to "int64_t" to support larger
values.

Signed-off-by: Danny Auble <da@schedmd.com>
```
- Merge remote-tracking branch 'origin/slurm-15.08' · cdb6ab5d
  Danny Auble authored Mar 04, 2016
- Fix issue where steps weren't always getting the gres/tres involved. · b294f81b
  Danny Auble authored Mar 04, 2016
- parsing of scheduling parameters · 9beeb3a6
  Morris Jette authored Mar 04, 2016
```
These changes apply to both the main scheduling logic and backfill
scheduler. If a SchedulerParameters value was configured, slurmctld was
started, the value was then completely removed, and slurmctld was
reconfigured, the value would not be reset to its default; the originally
configured value would persist until slurmctld restarted.
```
- Fix NEWS entry. · d2b913a2
  Brian Christiansen authored Mar 03, 2016
```
Continuation of 31225a82
```
- Fix for empty node bitmap · 7c1d0c5d
  Morris Jette authored Mar 03, 2016
```
Harden code to not fail if node_bitmap passed to _update_node_gres()
has no bits set.
```
- Merge branch 'tasks_per_core' · 31225a82
  Brian Christiansen authored Mar 03, 2016
- Fix for tasks being packed onto core when --ntasks-per-core=1 and --cpus-per-task > threads. · b11ec103
  Brian Christiansen authored Mar 03, 2016
```
Bug 2430
```
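The SchedulerParameters fix in 9beeb3a6 amounts to rebuilding the settings from their defaults on every reconfiguration instead of only overwriting the values that are present. A minimal sketch, assuming a hypothetical `struct sched_params`, a hypothetical `load_sched_params()` function, and illustrative defaults for the backfill options `bf_interval` and `bf_window`; it is not the slurmctld code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of scheduler settings; defaults are illustrative. */
struct sched_params {
	int bf_interval;
	int bf_window;
};

static const struct sched_params sched_defaults = { 30, 1440 };

/* Return the integer after "key=" in a comma-separated list, or -1. */
static int find_int_param(const char *params, const char *key)
{
	size_t klen = strlen(key);
	const char *p = params;

	while (p && *p) {
		if (strncmp(p, key, klen) == 0 && p[klen] == '=')
			return atoi(p + klen + 1);
		p = strchr(p, ',');
		if (p)
			p++;                 /* skip the comma */
	}
	return -1;
}

/*
 * Called on startup and on every reconfigure.  Starting from the defaults,
 * rather than from the previously loaded values, makes a parameter that
 * was removed from the configuration fall back to its default.
 */
static void load_sched_params(const char *params, struct sched_params *out)
{
	assert(out != NULL);
	*out = sched_defaults;               /* reset first */
	if (!params)
		return;

	int v;
	if ((v = find_int_param(params, "bf_interval")) > 0)
		out->bf_interval = v;
	if ((v = find_int_param(params, "bf_window")) > 0)
		out->bf_window = v;
}
```

atoi() is used only for brevity; a real parser would reject malformed numbers.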
03 Mar, 2016 14 commits

Add slurmstepd logging just before fork/exec · 916b5e3e
Morris Jette authored Mar 03, 2016
```
This may be helpful for timing purposes. Added at Cray's request.
```

find job path once · 84023f27

Morris Jette authored Mar 03, 2016

Unless a job is running in --multi-prog mode, modify the logic to
resolve the job's path once rather than once for each task. This
may slightly improve performance (requested by Cray).
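A sketch of the idea behind 84023f27: search PATH once and reuse the result for every task, rather than repeating the search per task. `resolve_path()` and `prepare_task_paths()` are hypothetical names, and the `/usr/bin:/bin` fallback when PATH is unset is an assumption. In --multi-prog mode each task may run a different program, so this shortcut would not apply there.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: find executable `cmd` in $PATH and write its full
 * path to `out`.  Names containing '/' are used as given; empty PATH
 * components are skipped.  Returns 0 on success, -1 if not found.
 */
static int resolve_path(const char *cmd, char *out, size_t outsz)
{
	if (strchr(cmd, '/')) {
		if (access(cmd, X_OK) != 0)
			return -1;
		snprintf(out, outsz, "%s", cmd);
		return 0;
	}
	const char *path = getenv("PATH");
	if (!path)
		path = "/usr/bin:/bin";      /* fallback is an assumption */
	while (*path) {
		size_t len = strcspn(path, ":");
		int n = snprintf(out, outsz, "%.*s/%s", (int)len, path, cmd);
		if (len > 0 && n > 0 && (size_t)n < outsz &&
		    access(out, X_OK) == 0)
			return 0;
		path += len;
		if (*path == ':')
			path++;
	}
	return -1;
}

/* Resolve once, then give every task the same path (non --multi-prog). */
static int prepare_task_paths(const char *cmd, int ntasks, char paths[][256])
{
	char resolved[256];

	if (resolve_path(cmd, resolved, sizeof(resolved)) != 0)
		return -1;
	for (int i = 0; i < ntasks; i++)
		strcpy(paths[i], resolved);  /* no per-task PATH search */
	return 0;
}
```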

Simplify the bcast code to use the normal send_resv_msgs logic instead · a5eb66a6
Danny Auble authored Mar 03, 2016
```
of its own, nearly identical version.
```

Defer slurmd registration until NodeHealthCheck · 7fb0c981

Thomas Hamel authored Mar 03, 2016

We want to introduce a new behavior in the way slurmd uses the
HealthCheckProgram. The idea is to avoid a race condition between the
first HealthCheckProgram run and the node accepting jobs. The slurmd
daemon will initialize and then loop on HealthCheckProgram execution
before registering with slurmctld. It will stay in this loop until
the HealthCheckProgram returns successfully (the node remains DOWN
until then).

On our clusters we are using NHC as a HealthCheckProgram. NHC drains
the node if it fails and removes the drain if it succeeds; this
behavior fits our purpose well. It lets us start slurmd at boot
without setting up a complex boot sequence in the init system:
slurmd just waits for the node to be ready before registering.

The HealthCheckProgram is not run during slurmd startup if
HealthCheckInterval is 0.
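The startup sequence in 7fb0c981 can be sketched as a blocking loop that runs before registration. This is an illustration only: Slurm launches HealthCheckProgram through its own code, whereas this sketch calls system() for brevity, and `wait_for_health_check()` and its `retry_delay` parameter are hypothetical (the commit does not say how long slurmd waits between attempts).

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `program` through the shell; true if it exited with status 0. */
static int run_health_check(const char *program)
{
	int status = system(program);
	return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/*
 * Block until the health check passes and return the number of runs, so
 * the caller can then register with slurmctld.  With a health check
 * interval of 0 the check is skipped entirely, as the commit describes.
 */
static unsigned wait_for_health_check(unsigned hc_interval,
				      unsigned retry_delay,
				      const char *program)
{
	unsigned runs = 0;

	assert(program != NULL);
	if (hc_interval == 0)
		return 0;
	for (;;) {
		runs++;
		if (run_health_check(program))
			return runs;
		sleep(retry_delay);          /* node stays DOWN; retry */
	}
}
```

Because the loop only returns on success, a node whose check never passes never registers, which matches the intent of keeping it out of scheduling until it is ready.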

Merge remote-tracking branch 'origin/slurm-15.08' · 50286191
Danny Auble authored Mar 03, 2016

Fix issue with sbcast not doing a correct fanout. · 72f13426
Danny Auble authored Mar 03, 2016

Fix getting reservations to database when database is down. · 5c43d754
Brian Christiansen authored Mar 03, 2016
```
Bug 2507
```

KNL HBM as a GRES starting to work · a862aa15
Morris Jette authored Mar 03, 2016

Merge branch 'slurm-15.08' · fa068ad2
Morris Jette authored Mar 03, 2016

Increase step GRES variable size · 7f0bdc84

Morris Jette authored Mar 03, 2016

Step GRES value changed from type "int" to "int64_t" to support larger
values. The previous logic could fail for step allocation values that
do not fit in 32 bits. Other GRES values are already 64-bit.
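Why the width matters: on the usual platforms `int` is 32 bits, so a step GRES total above 2147483647 cannot be stored in it. A sketch with hypothetical helpers that total per-node counts in `int64_t` and check whether a value would have survived a 32-bit `int`; the byte counts below are only illustrative large values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: total a step's per-node GRES counts in 64 bits. */
static int64_t step_gres_total(const int64_t *per_node, size_t nnodes)
{
	int64_t total = 0;

	for (size_t i = 0; i < nnodes; i++)
		total += per_node[i];
	return total;
}

/* Would `v` have fit in a 32-bit signed int? */
static int fits_in_int32(int64_t v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}
```

For example, two nodes each contributing 16 GiB expressed in bytes (17179869184) give a total of 34359738368, which `fits_in_int32()` rejects.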

Merge branch 'slurm-15.08' · 3ea10e24
Tim Wickberg authored Mar 03, 2016

Replace local function _str_cmp from wiki2/get_nodes with xstrcmp. · 0578d63f
Tim Wickberg authored Mar 02, 2016

Force close on exec on first 256 file descriptors when launching a · f502f1e5

Danny Auble authored Mar 02, 2016

slurmstepd to close potential open ones.

It was pointed out that slurmd, when using acct_gather_energy/ipmi, links
to freeipmi, which could open /dev/ipmi0 as root without the close-on-exec
flag set while launching a step, leaving it open in the user's app.

This change sets the flag on the first 256 file descriptors to mitigate
the concern.

Reported by Maksym Planeta.

Bug 2506
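A sketch of the mitigation using the POSIX `fcntl()` commands `F_GETFD`/`F_SETFD` and the `FD_CLOEXEC` flag. `set_cloexec_range()` is a hypothetical helper, not the Slurm function, and leaving descriptors 0-2 untouched is an assumption of this sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Set FD_CLOEXEC on descriptors [first, last) so they are closed when the
 * process calls exec().  Descriptors that are not open make F_GETFD fail
 * (EBADF) and are skipped.  Returns how many open descriptors were updated.
 */
static int set_cloexec_range(int first, int last)
{
	int updated = 0;

	assert(first >= 0 && first <= last);
	for (int fd = first; fd < last; fd++) {
		int flags = fcntl(fd, F_GETFD);
		if (flags == -1)
			continue;                    /* not open */
		if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0)
			updated++;
	}
	return updated;
}
```

Calling it as `set_cloexec_range(3, 256)` before exec would cover the rest of the first 256 descriptors; descriptors above 255 would still need another mechanism.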

node_feature GRES work · a204799f

Morris Jette authored Mar 02, 2016

This enables the node_feature plugin to add GRES to nodes. Specifically
it is intended for the node_feature/knl_cray plugin to build a GRES
containing the MCDRAM size currently configured on the node. More
work is needed for full functionality.

02 Mar, 2016 9 commits

Correctly package capmc_suspend/resume · bf4f5759

Morris Jette authored Mar 02, 2016

Make sure that capmc_suspend and capmc_resume are properly packaged
  in an RPM if a non-standard sbin location is configured

OPENLAVA: Smarter logic to determine if the executable is an actual · d40389d4
Danny Auble authored Mar 02, 2016
```
bsub batch script or not. If it isn't, we will wrap the script to avoid
issues where $0 is used inside the script.
```

Merge branch 'slurm-15.08' · 787dcbf1
Morris Jette authored Mar 02, 2016
```
Conflicts:
	src/plugins/sched/backfill/backfill.c
```

Backfill scheduler to validate correct job partition · efd9d35e

Gary B Skouson authored Mar 02, 2016

Previous logic tested whatever the job's partition pointer indicated
rather than the partition we are trying to run the job in. This bug
was introduced in Slurm version 15.08.5, Nov 16, 2015, commit
94f0e948
bug 2499

Move definition to only place used to avoid confusion, continuation of · f257976a
Danny Auble authored Mar 02, 2016
```
patch 2d5066e7
```

Power save mode configure refactor · 743cabc7

Morris Jette authored Mar 02, 2016

Add a new function that can read power save configuration information
  before starting the power save thread. This lets us confirm
  that power save mode is configured to run earlier in the slurmctld
  start up logic and report an error at an earlier point if power
  save is not configured to run, but node_feature/knl_cray (which
  needs it) is configured.

Update documentation for the change of the default cgroup mount to /sys/fs/cgroup · 7da25924
Tim Wickberg authored Mar 02, 2016

Change default CgroupMountpoint (in cgroup.release example) · 1f84f3a2

Thomas Cadeau authored Mar 02, 2016

Introduced in c97e08a0
Change default CgroupMountpoint (in cgroup.conf) from "/cgroup" to
    "/sys/fs/cgroup" to match current standard.
For details, see https://wiki.freedesktop.org/www/Software/systemd/PaxControlGroups/

Merge branch 'slurm-15.08' · bd436fe8
Tim Wickberg authored Mar 02, 2016