Commits · 40a9bc5fc147f7f416f679b81ee887d2ff325841 · Manuel G. Marciani / ces_slurm_simulator

02 Mar, 2017 9 commits

Update NEWS for start of v17.02.2 work · 40a9bc5f
Morris Jette authored Mar 02, 2017

Start NEWS for v16.05.11 · 90f45b7a
Morris Jette authored Mar 02, 2017

Fix for CPU binding for job steps run under a batch job · 5e80c115

Morris Jette authored Mar 02, 2017

This is a partial reversion of commit 69684648
NOTE: sbatch does not support --cpu_bind (although the documentation does list
  the option) and the --mem_bind options set SBATCH_* environment variables
  that nothing ever looks at. In other words, it needs some work.
Bugs 3519 and 3188

Added "SyscfgTimeout" parameter to knl.conf configuration file · 32ded0c3
Felip Moll authored Mar 02, 2017

bug 3525

Update NEWS for v16.05 mods · 3f649cec

Morris Jette authored Mar 02, 2017

Copy NEWS item updates from v16.05 applied since v17.02.0 tag to
  NEWS for v17.02.1

power management data structure change · 79b76b40

Morris Jette authored Mar 02, 2017

Convert a slurmctld power management data structure from array to list in
    order to eliminate the possibility of zombie child suspend/resume
    processes.
bug 3516

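The motivation for moving from an array to a list can be illustrated with a small sketch. This is not Slurm's code (slurmctld is written in C): `ChildTracker` and its methods are hypothetical names, and the sketch assumes the zombie risk came from children that did not fit into a fixed-size table and so were never waited on. With an unbounded list, every spawned child is recorded and can later be reaped.

```python
import subprocess
import sys


class ChildTracker:
    """Hypothetical tracker: keeps every spawned child in an unbounded list."""

    def __init__(self):
        self.children = []  # grows as needed, unlike a fixed-size array

    def spawn(self, argv):
        """Start a child process and remember it so it can be reaped later."""
        proc = subprocess.Popen(argv)
        self.children.append(proc)
        return proc

    def reap(self):
        """Collect exit codes of finished children; keep the running ones."""
        finished = [p for p in self.children if p.poll() is not None]
        self.children = [p for p in self.children if p.returncode is None]
        return [p.returncode for p in finished]


if __name__ == "__main__":
    tracker = ChildTracker()
    for _ in range(12):  # more children than a 10-slot table could hold
        tracker.spawn([sys.executable, "-c", "pass"])
    for proc in list(tracker.children):
        proc.wait()  # demo only: block until each child exits
    print(tracker.reap())
```

In a long-running daemon, `reap()` would be called periodically instead of blocking on each child as the demo does.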

Do not set cpu frequency / governor on batch or extern steps. · 6ea92f5d

Tim Wickberg authored Mar 02, 2017

This now matches the behavior documented in sbatch.

This resolves a problem where the maximum cpu frequency would be set
to the minimum available on the node by the batch step. This is due
to the batch step leaving cpu_freq_{min,max,gov} uninitialized to zero,
which is then translated to a request to set the frequency to the lowest
available in the node. This did not impact 16.05 or earlier, as a request
for a zero frequency was ignored by a quirk of _cpu_freq_freqspec_num.
This quirk was removed by commit f40e1c01 before 17.02.0-rc1.

Bug 3510.

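The failure mode described above can be sketched as follows. This is a simplified illustration, not the Slurm implementation: `resolve_freq`, `apply_step_freq`, and the step-kind strings are hypothetical names, and the rule "pick the highest available frequency not above the request" is an assumption made only to show how a zero request can collapse to the lowest available frequency.

```python
BATCH_STEP = "batch"
EXTERN_STEP = "extern"


def resolve_freq(requested_khz, available_khz):
    """Map a request onto an available frequency (assumed rule).

    Picks the highest available frequency not above the request and
    falls back to the lowest available one, so a request of 0 ends up
    selecting the minimum.
    """
    candidates = [f for f in available_khz if f <= requested_khz]
    return max(candidates) if candidates else min(available_khz)


def apply_step_freq(step_kind, requested_khz, available_khz):
    """Skip frequency handling entirely for batch and extern steps."""
    if step_kind in (BATCH_STEP, EXTERN_STEP):
        return None  # nothing is set, so a zeroed field is harmless
    return resolve_freq(requested_khz, available_khz)


if __name__ == "__main__":
    available = [1200000, 1800000, 2400000]
    print(resolve_freq(0, available))              # zero request hits the minimum
    print(apply_step_freq("batch", 0, available))  # guarded: no change requested
```

Guarding on the step kind, rather than special-casing a zero value, matches the commit's stated intent that batch and extern steps never set the frequency or governor.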

Refactor slurmctld agent logic to eliminate some pthreads · e58c2282
Morris Jette authored Mar 02, 2017

bug 3516

Increase number of current ResumePrograms that can be supported · d87f6de5
Morris Jette authored Mar 01, 2017

from 10 to 100.
bug 3516

01 Mar, 2017 3 commits
- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly. · f338c4eb
  Alejandro Sanchez authored Mar 01, 2017
- Print formatted tres string when creating/updating a reservation. · 8dfffa28
  Danny Auble authored Feb 28, 2017
- Fix print of consumed energy in sstat when no energy is being collected. · 9a168d20
  Danny Auble authored Feb 28, 2017

28 Feb, 2017 4 commits

If gres is NULL on a job don't try to process it when returning detailed · f7a24285
Dominik Bartkiewicz authored Feb 28, 2017

information about a job to scontrol.

Fix missing locks in gres logic to avoid potential memory race. · 58a2f450
Dominik Bartkiewicz authored Feb 28, 2017

Remove unneeded job lock when running assoc_mgr cache. This lock could · b17e2aee

Danny Auble authored Feb 28, 2017

cause potential deadlock when/if TRES changed in the database and the
slurmctld wasn't made aware of the change.  This would be very rare.

The lock was originally there to keep new jobs from grabbing the assoc
information.  If the lock was done afterwards the worst case is we get the new
information.

Fix deadlock scenario when dumping configuration in the slurmctld. · 0c5e3508

Danny Auble authored Feb 28, 2017

It was determined we didn't need the write locks on the job and no locks were
needed on the node either.

Doing these different locks beforehand would create a window where you could
get a config write lock

27 Feb, 2017 3 commits

Update slurm.spec file to note obsolete RPMs. · 95cf960a
Daniel Letai authored Feb 27, 2017

Reset job update time when job is held after begin failure · e62d288d

Morris Jette authored Feb 27, 2017

This will be triggered after either a burst buffer job_begin function
  or select plugin job_begin function fails. Without this change, the
  "squeue -i" and "scontrol show job" commands can report old job
  state information.
bug 3504

Burst_buffer/cray - Prevent slurmctld abort · 733b57dc

Tim Wickberg authored Feb 27, 2017

Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
    fails. Now job will be held.
bug 3504

24 Feb, 2017 6 commits
- job_submit/lua - Add job "bitflags" field · f9fcae35
  Josko Plazonic authored Feb 24, 2017

  bug 3182
- Add %x to sbatch/srun filename pattern to represent the job name. · 1a4237a3
  Tim Shaw authored Feb 24, 2017
- Update to sbatch/srun man pages to explain the "filename pattern" more clearly · 047b991d
  Tim Shaw authored Feb 24, 2017
- Modify pam module to work when configured NodeName and NodeHostname differ · 1ff7252b
  Don Lipari authored Feb 24, 2017

  bug 3473
- Add 17.02.1 to NEWS · d8376bec
  Danny Auble authored Feb 23, 2017
- Update META for 17.02.0 tag · 2c5d4afc
  Danny Auble authored Feb 23, 2017
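The `%x` filename pattern added in commit 1a4237a3 can be illustrated with a minimal expansion sketch. This is not the sbatch/srun implementation: `expand_filename_pattern` is a hypothetical helper, and it handles only `%x` (job name, per the commit), `%j` (job ID), and `%%` (a literal percent sign), assuming the latter two behave as described in the sbatch man page.

```python
import re


def expand_filename_pattern(pattern, job_id, job_name):
    """Expand %j (job ID), %x (job name) and %% (literal percent)."""
    replacements = {"%": "%", "j": str(job_id), "x": job_name}

    def substitute(match):
        spec = match.group(1)
        # Specifiers this sketch does not know are left untouched.
        return replacements.get(spec, match.group(0))

    return re.sub(r"%(.)", substitute, pattern)


if __name__ == "__main__":
    print(expand_filename_pattern("%x-%j.out", 1234, "train"))
```

With a pattern such as `%x-%j.out`, output files from differently named jobs no longer need the job ID alone to tell them apart.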
23 Feb, 2017 6 commits
- Fix packing of NULL slurmdb_reservation_cond_t · df133644
  Brian Christiansen authored Feb 23, 2017
- Fix packing of NULL slurmdb_clus_res_rec_t · 2260e158
  Brian Christiansen authored Feb 23, 2017
- Fix squeue to not limit the size of partition, burst_buffer, exec_host, or · 5a4a6044
  Danny Auble authored Feb 23, 2017

  reason to 32 chars.
- Propagate NEWS from v15.08 to v16.05 · f49dba56
  Morris Jette authored Feb 23, 2017
- Correct job resize script · f42f6943
  Morris Jette authored Feb 23, 2017

  For job resize, correct logic to build "resize" script with new values.
  Previously the scripts were based upon the original job size.
  bug 3498
- slurm.spec - only install init scripts if service scripts aren't. · faf9b413
  Tim Wickberg authored Feb 22, 2017

  Do not enable init scripts if not present.

  Please note that, unlike the init scripts, service files are not
  automatically enabled at this time.

  Bug 3371.

22 Feb, 2017 3 commits

Fix node reboot timing bug · 8431929d

Morris Jette authored Feb 22, 2017

If a node boot is in progress when the slurmctld daemon is restarted, then allow
    sufficient time for the reboot to complete and do not prematurely DOWN the node as
    "Not responding".
bug 3494

Fix for possible squeue parsing failure · 7b226965
Morris Jette authored Feb 21, 2017

Could result in squeue abort
Coverity error CID 44969

squeue to load new data if job_id or user_id specified · dbf9a211

Morris Jette authored Feb 21, 2017

Reduces possibility of old data if job_id or user_id option specified
  with iterate option
Coverity error CID 44783

21 Feb, 2017 1 commit

Increased maximum file size supported by sbcast · ee5fea6d

Morris Jette authored Feb 21, 2017

Increased maximum file size supported by sbcast from 2 GB (32-bit integer)
    to a 64-bit size. This required changing the file broadcast RPC and several
    internal variables.
bug 3485

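The 2 GB ceiling follows directly from the width of the size field. The sketch below uses Python's `struct` module to show that a value of 2^31 bytes no longer fits a signed 32-bit field but fits a 64-bit one. It assumes the old field was a signed 32-bit integer (the commit says only "32-bit integer"), and `fits_in_field` is a hypothetical helper, not part of sbcast.

```python
import struct

INT32 = "!i"  # signed 32-bit, network byte order
INT64 = "!q"  # signed 64-bit, network byte order


def fits_in_field(size_bytes, fmt):
    """Return True if size_bytes can be packed with the given struct format."""
    try:
        struct.pack(fmt, size_bytes)
    except struct.error:
        return False
    return True


if __name__ == "__main__":
    two_gib = 2 ** 31
    print(fits_in_field(two_gib - 1, INT32))  # largest size the old field holds
    print(fits_in_field(two_gib, INT32))      # one byte more overflows it
    print(fits_in_field(two_gib, INT64))      # the wider field accepts it
```

Because the field width is part of the wire format, widening it also means changing the RPC, which is consistent with the commit's note about the file broadcast RPC.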

18 Feb, 2017 2 commits

Added ability to override the invoking uid for "scontrol update job" · 1e42df07
Tim Shaw authored Feb 17, 2017

by specifying "--uid=<uid>|-u <uid>".

Fix controller/cmds talking to a pre-released DBD · ec350f17

Brian Christiansen authored Feb 17, 2017

A 17.02 controller or sacctmgr couldn't talk to a "master/17.11" DBD
because the 17.02 client was attempting to talk to the DBD with
17.02's MIN_PROTOCOL_VERSION, which was 15.08 and is more than 2
versions behind master. Master's MIN_PROTOCOL_VERSION is 16.05,
so it couldn't unpack the messages.

The controller should always communicate with the DBD at its current
protocol version.

For federations, it's possible that a higher version controller could
talk to a lower version controller. So the cluster needs to talk to the
remote cluster using the remote cluster's protocol version -- which is
given back from the DBD.

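The version mismatch described in this commit can be modeled in a few lines. This is a sketch under stated assumptions, not Slurm's RPC code: versions are written as hypothetical `(major, minor)` tuples rather than Slurm's real protocol constants, and `can_unpack`, `version_for_dbd`, and `version_for_remote_cluster` are illustrative names.

```python
# Hypothetical (major, minor) labels, not Slurm's real protocol constants.
V15_08, V16_05, V17_02, V17_11 = (15, 8), (16, 5), (17, 2), (17, 11)


def can_unpack(receiver_min_version, message_version):
    """A receiver can only unpack messages at or above its minimum version."""
    return message_version >= receiver_min_version


def version_for_dbd(own_version):
    """Per the commit: always talk to the DBD at the sender's current protocol."""
    return own_version


def version_for_remote_cluster(remote_version_from_dbd):
    """Per the commit: use the remote cluster's version as reported by the DBD."""
    return remote_version_from_dbd


if __name__ == "__main__":
    master_dbd_min = V16_05
    # Old behavior: a 17.02 client sent at its own minimum (15.08).
    print(can_unpack(master_dbd_min, V15_08))
    # Fixed behavior: the client sends at its current version (17.02).
    print(can_unpack(master_dbd_min, version_for_dbd(V17_02)))
```

The sketch deliberately ignores the upper bound (a receiver older than the sender), which is the case the federation paragraph addresses by using the remote cluster's own version.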

17 Feb, 2017 3 commits

Add 'preempt_youngest_order' option to preempt/partition_prio plugin. · 4e045105

Dominik Bartkiewicz authored Feb 17, 2017

Enable through SchedulerParameters. Will sort by youngest jobs first,
rather than based on priority. Use alongside 'preempt_strict_order' if
you don't want the plugin to try to further optimize the preemption
list.

Bug 3457.

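The ordering change can be sketched as a sort on start time. This is an illustration only, not the preempt/partition_prio plugin: the `Job` class and both functions are hypothetical, "youngest" is assumed to mean the most recently started job, and the lowest-priority-first default is likewise an assumption made for contrast. In slurm.conf the options would be combined as `SchedulerParameters=preempt_youngest_order,preempt_strict_order`.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int
    start_time: int  # seconds since the epoch


def priority_order(candidates):
    """Assumed default for contrast: lowest-priority jobs are preempted first."""
    return sorted(candidates, key=lambda job: job.priority)


def youngest_first(candidates):
    """Order preemption candidates by most recent start time first."""
    return sorted(candidates, key=lambda job: job.start_time, reverse=True)


if __name__ == "__main__":
    jobs = [
        Job(job_id=1, priority=10, start_time=1000),
        Job(job_id=2, priority=30, start_time=3000),
        Job(job_id=3, priority=20, start_time=2000),
    ]
    print([job.job_id for job in priority_order(jobs)])
    print([job.job_id for job in youngest_first(jobs)])
```

Preempting the youngest jobs first tends to discard the least completed work, which is one plausible reason to prefer it over a priority-based order.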

Fix potential race condition in job_time_limit. · cc82087a

Dominik Bartkiewicz authored Feb 16, 2017

Introduced by commit 059275f6 when the timer is triggered.
Releasing the locks means that job_ptr may point to an element that was
deleted by a different thread in the meantime. Restructuring the code
to advance the iterator prevents this - the iterator itself does not have
this issue as the List structure will manage the position during the
sleep().

While here, move the reservation update handling outside of this
loop to simplify operation. This does not need to piggy-back on the
scan of the job_list - switching to using list_for_each should
mitigate some of the performance loss by needing a second full pass.

Bug 3414.

job_submit/lua - remove access to reservation job_run_cnt/job_pend_cnt fields. · 7489e3fe
Tim Wickberg authored Feb 16, 2017

These were mis-calculated previously, and are internal implementation details
that weren't meant to be exposed.