Commits · d85cdcc77c6acc5cf8ed234520a9c41e4d9de7df · Manuel G. Marciani / ces_slurm_simulator

16 Mar, 2016 2 commits

Send burst buffer teardown immediately · d85cdcc7

Morris Jette authored Mar 16, 2016

Generate burst buffer use completion email immediately after teardown
    completes rather than at job purge time (likely minutes later).
bug 2539

d85cdcc7

Modify burst buffer stage out message · fae4c3d3

Morris Jette authored Mar 16, 2016

Change burst buffer use completion message from
"SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
"SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"

fae4c3d3

15 Mar, 2016 2 commits
- acct_gather_energy/ipmi - add threshold for message logging · 18608974
  Alejandro Sanchez authored Mar 15, 2016
  
  18608974
- Check that bb_state.tres_pos is set correctly to avoid overwriting CPU TRES. · 5708037d
  Tim Wickberg authored Mar 15, 2016
  
  Bug 2543.
  
  5708037d
14 Mar, 2016 2 commits
- Add option for TopologyParam=NoInAddrAnyCtld to make the slurmctld listen · 775c46de
  Danny Auble authored Mar 14, 2016
  
  on only one port like TopologyParam=NoInAddrAny does for everything else.
  
  775c46de
- FreeBSD - set_oom_adj is Linux-specific, stub out to avoid errors. · b3f2359f
  Tim Wickberg authored Mar 14, 2016
  
  There's no /proc on *BSD, and BSD handles OOM in a completely different way.
  
  b3f2359f
11 Mar, 2016 1 commit

Fix job array step function printout. · 03d29e24

Tim Wickberg authored Mar 11, 2016

Return [0-100:2] formatting, rather than [0,2,4,6,8,...] when using
a step function.

Was inadvertently broken in 14.11 with commit 5ffdca92.

Bug 2535.

03d29e24
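
The difference between the two printouts can be illustrated with a small sketch. This is not Slurm's implementation; `format_task_ids` is a hypothetical helper that collapses evenly spaced task IDs into `[first-last:step]` notation and falls back to a comma-separated list otherwise.

```python
def format_task_ids(ids):
    """Render task IDs compactly, using first-last:step when evenly spaced."""
    ids = sorted(set(ids))
    as_list = "[" + ",".join(str(i) for i in ids) + "]"
    if len(ids) < 3:
        return as_list
    step = ids[1] - ids[0]
    # Only collapse when every gap between neighbours equals the first gap.
    if any(b - a != step for a, b in zip(ids, ids[1:])):
        return as_list
    if step == 1:
        return f"[{ids[0]}-{ids[-1]}]"
    return f"[{ids[0]}-{ids[-1]}:{step}]"
```

With this sketch, `format_task_ids(range(0, 101, 2))` yields `[0-100:2]` instead of listing all 51 IDs.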

10 Mar, 2016 1 commit
- Add NEWS for commit 3bb2e602 · a0be0dc5
  Morris Jette authored Mar 10, 2016
  
  a0be0dc5
09 Mar, 2016 2 commits

cray job requeue bug · fec5e03b

Morris Jette authored Mar 09, 2016

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees that the "cleaned" flag
is already set, logs
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns without launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.

Bug 2384

fec5e03b
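
The failure mode described above can be modelled in a few lines. This is a toy model, not the select/cray plugin: the class and function names are hypothetical, and resetting the flag on requeue is only one plausible way to avoid the problem (the commit message does not say how the fix is implemented).

```python
from enum import Enum


class CleanState(Enum):
    NONE = 0
    CLEANING = 1
    CLEANED = 2


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.clean_state = CleanState.NONE
        self.resources_held = False

    def start(self):
        self.resources_held = True

    def requeue(self, reset_clean_state):
        # Assumption: clearing the stale flag lets the next termination run the NHC.
        if reset_clean_state:
            self.clean_state = CleanState.NONE


def job_fini(job):
    """Return True if the node health check (NHC) was launched."""
    if job.clean_state is CleanState.CLEANED:
        return False  # stale flag from the previous run: NHC is skipped
    job.clean_state = CleanState.CLEANING
    return True


def nhc_complete(job):
    """NHC completion marks the job cleaned and releases its resources."""
    job.clean_state = CleanState.CLEANED
    job.resources_held = False
```

Without the reset, the second call to `job_fini` returns `False` and `resources_held` stays `True`, which corresponds to the nodes remaining unusable.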

Correctly parse nids in slurmconfgen_smw.py · 88ccc111

David Gloe authored Mar 09, 2016

An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
On some systems those values differ, causing the generated slurm.conf file to
be incorrect.

Bug 2532.

88ccc111

08 Mar, 2016 2 commits

Fix route/topology plugin to prevent segfault in sbcast. · 897c4b27

Bill Brophy authored Mar 08, 2016

route_p_split_hostlist was not thread-safe, and would cause
one of several segfaults depending on where in the initialization
code each thread was.

Bug 2495.

897c4b27
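
A common way to make one-time initialization safe under concurrent callers is to guard it with a lock. The sketch below shows that general pattern only; the names (`get_topology`, `_build_topology`, `stats`) are hypothetical and the commit's actual change to `route_p_split_hostlist` may differ.

```python
import threading

_init_lock = threading.Lock()
_topology = None          # shared state, built once
stats = {"builds": 0}     # counts how often initialization actually ran


def _build_topology():
    # Stand-in for expensive, non-reentrant initialization work.
    stats["builds"] += 1
    return {"switch0": ["node0", "node1"], "switch1": ["node2", "node3"]}


def get_topology():
    """Return the shared topology, initializing it exactly once."""
    global _topology
    with _init_lock:
        # Only the first thread to get here builds the structure;
        # later threads see the finished object, never a partial one.
        if _topology is None:
            _topology = _build_topology()
        return _topology
```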

Fix displayed value for RoutePlugin. · 14c51e65
Tim Wickberg authored Mar 08, 2016

Was incorrectly displaying "(null)" even when loaded successfully.

14c51e65

05 Mar, 2016 1 commit
- Fixed double read lock on getting job's gres/tres. · b23a57cf
  Danny Auble authored Mar 04, 2016
  
  b23a57cf
04 Mar, 2016 1 commit
- Fix issue where steps weren't always getting the gres/tres involved. · b294f81b
  Danny Auble authored Mar 04, 2016
  
  b294f81b
03 Mar, 2016 4 commits

Fix issue with sbcast not doing a correct fanout. · 72f13426
Danny Auble authored Mar 03, 2016

72f13426
Fix getting reservations to database when database is down. · 5c43d754
Brian Christiansen authored Mar 03, 2016

Bug 2507

5c43d754

Increase step GRES variable size · 7f0bdc84

Morris Jette authored Mar 03, 2016

Step GRES value changed from type "int" to "int64_t" to support larger
values. Previous logic could fail for step allocation values that do not
fit in 32 bits. Other GRES values are 64-bit.

7f0bdc84
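
To see why a 32-bit variable is a problem here, the sketch below uses Python's `ctypes` fixed-width integers to mimic storing a large count in a C `int` (assumed to be 32 bits wide) versus an `int64_t`. The helper names are illustrative only.

```python
import ctypes


def store_as_int32(value):
    """Mimic assigning value to a 32-bit signed C int (wraps silently)."""
    return ctypes.c_int32(value).value


def store_as_int64(value):
    """Mimic assigning value to a 64-bit signed integer."""
    return ctypes.c_int64(value).value


big = 5_000_000_000            # larger than 2**31 - 1
print(store_as_int32(big))     # wrapped value, no longer equal to big
print(store_as_int64(big))     # preserved
```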

Force close on exec on first 256 file descriptors when launching a · f502f1e5

Danny Auble authored Mar 02, 2016

slurmstepd to close potential open ones.

It was pointed out that a slurmd using acct_gather_energy/ipmi links to
freeipmi, which could open /dev/ipmi0 as root without the close-on-exec
flag set while launching a step, leaving it open in the user's application.

This sets the flag on the first 256 file descriptors to mitigate the concern.

Reported by Maksym Planeta.

Bug 2506

f502f1e5
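
The mitigation can be sketched in Python, where marking a descriptor non-inheritable is the equivalent of setting the close-on-exec flag. This is an illustration of the idea, not the slurmd code; the function name and its `first`/`limit` parameters are assumptions.

```python
import os


def set_cloexec(first=0, limit=256):
    """Mark every open descriptor in [first, limit) as close-on-exec.

    Returns the descriptors whose flag was changed.
    """
    changed = []
    for fd in range(first, limit):
        try:
            if os.get_inheritable(fd):
                # Non-inheritable means the descriptor is closed on exec.
                os.set_inheritable(fd, False)
                changed.append(fd)
        except OSError:
            # Descriptor is not open; nothing to do.
            continue
    return changed
```

Sweeping a fixed range is a blunt but cheap guard: it covers descriptors opened by third-party libraries that the caller never sees directly.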

02 Mar, 2016 2 commits

Backfill scheduler to validate correct job partition · efd9d35e

Gary B Skouson authored Mar 02, 2016

Previous logic tested whatever the job's partition pointer indicated
rather than the partition we are trying to run the job in. This bug
was introduced in Slurm version 15.08.5, Nov 16, 2015, commit 94f0e948.
Bug 2499

efd9d35e
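
The distinction matters when a job may run in more than one partition. The sketch below is a simplified model with hypothetical field names (`partitions`, `max_nodes`); it validates the job against each candidate partition rather than against a single partition stored on the job.

```python
def can_run_in(job, part):
    """Check the job against the limits of the partition being considered."""
    return job["nodes"] <= part["max_nodes"]


def backfill_candidates(job, partitions):
    """Return names of requested partitions whose limits the job satisfies."""
    return [
        part["name"]
        for part in partitions
        if part["name"] in job["partitions"] and can_run_in(job, part)
    ]
```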

Remove a duplicate xmalloc · 2d5066e7
Thomas Cadeau authored Feb 26, 2016

2d5066e7

01 Mar, 2016 2 commits

Update NEWS as well. · a058ff4a
Tim Wickberg authored Mar 01, 2016

a058ff4a

Defer suspend until launch completes · 52fe3de1

Morris Jette authored Feb 29, 2016

Ensure that a job is completely launched before trying to suspend it.
Previous logic would start suspend logic early in the life of the
slurmstepd process, after its listening socket was open but before
the tasks were launched. This defers the suspend logic until after
all prologs and setup completes and the tasks are launched. This is
important in the case of gang scheduling, in which newly launched
jobs can be immediately suspended.
bug 2494

52fe3de1
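
Deferring one action until another has finished is often done with an event that the launcher sets once setup is complete. The following is a minimal sketch of that pattern, assuming a hypothetical `Step` class; it does not reflect how slurmstepd is structured internally.

```python
import threading


class Step:
    def __init__(self):
        self.launched = threading.Event()
        self.suspended = False

    def finish_launch(self):
        """Called after prologs and setup are done and tasks are running."""
        self.launched.set()

    def suspend(self, timeout=None):
        """Suspend only once launch has completed.

        Returns False if launch did not complete within the timeout.
        """
        if not self.launched.wait(timeout):
            return False
        self.suspended = True
        return True
```

A suspend request that arrives early simply blocks (or times out) instead of acting on a half-initialized step.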

26 Feb, 2016 2 commits
- Set correct reason when a QOS' MaxTresMins is violated. · 745568f2
  Danny Auble authored Feb 26, 2016
  
  745568f2
- Add note to slurm.conf man page about SallocDefaultCommand and TaskPlugins. · b5b349b0
  Tim Wickberg authored Feb 25, 2016
  
  Add note to slurm.conf man page about setting "--cpu_bind=no" as part
  of SallocDefaultCommand if a TaskPlugin is in use.
  
  b5b349b0
25 Feb, 2016 1 commit
- Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS= · fcae2193
  Danny Auble authored Feb 24, 2016
  
  was also given.
  
  fcae2193
24 Feb, 2016 5 commits
- Make it so scontrol update part qos= will take away a partition QOS from · 3a7470ae
  Danny Auble authored Feb 24, 2016
  
  a partition.
  
  3a7470ae
- Make it possible to change CPUsPerTask with scontrol. · de28c13a
  Danny Auble authored Feb 24, 2016
  
  This also reverts most of commit fa331e30 as well as commit bd9fa830
  which would try to set the pn_min_cpus every time a job was updated.
  If a job didn't request node counts then they were hosed.
  
  This commit takes away the magic which was screwing things up.  Now the
  person gets what they asked for without magic changing things.
  
  Bug 2302
  Bug 2742
  Bug 2478
  
  de28c13a
- Fix issue where when updating a job the pn_min_cpus was updated · bd9fa830
  Danny Auble authored Feb 24, 2016
  
  erroneously.
  
  bd9fa830
- BGQ - Tighter locks around structures when nodes/cables change state. · c5925f41
  Danny Auble authored Feb 23, 2016
  
  c5925f41
- BGQ - Remove redeclaration of job_read_lock. · fd3dedda
  Danny Auble authored Feb 23, 2016
  
  fd3dedda
23 Feb, 2016 1 commit

Fix issue with resizing jobs and limits not being kept track of correctly. · 92ac0dcd

Danny Auble authored Feb 22, 2016

This whole process could probably be done better by keeping track of
old values and new values and only calling one function instead of a
pre and post function, but that can probably wait for future generations
of the code as it works now and is probably adequate for the time being.

Bug 2352

92ac0dcd

19 Feb, 2016 2 commits

BurstBuffer/cray pre-run race condition fix · e8959ae9

Morris Jette authored Feb 19, 2016

BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
    operation in progress to avoid inconsistent state due to multiple calls
    to job termination functions.
bug 2454

e8959ae9

Update NEWS for start of v15.08.9 · 4aa2e3c2
Morris Jette authored Feb 18, 2016

4aa2e3c2

18 Feb, 2016 5 commits

MYSQL - Avoid having multiple default accounts when a user is added to · eafc3f71
Danny Auble authored Feb 18, 2016

a new account and making it a default all at once.

Bug 2428

eafc3f71

Prevent segfault in acct_gather_energy_cray if data requested before set. · c6771b1e
Alejandro Sanchez authored Feb 18, 2016

Match acct_gather_energy/rapl plugin. Bug 2397.

c6771b1e

Add new assoc_limit_continue option to SchedulerParameters · 0615d809

Tim Wickberg authored Feb 18, 2016

Control whether the scheduler will continue to try to run jobs in
a partition if a higher priority job is stuck due to an association
limit.

Can cause starvation for larger jobs, but will improve throughput and
utilization for systems that have extensively divvyed up their resources
through association/QOS limits.

Bug 2388 and 2452.

0615d809
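
The trade-off described above can be shown with a toy scheduling loop. This is a deliberately simplified model, not the Slurm scheduler: the job fields (`cpus`, `over_assoc_limit`) and the `schedule` function are hypothetical, and only the option name `assoc_limit_continue` comes from the commit.

```python
def schedule(queue, free_cpus, assoc_limit_continue=False):
    """Walk a priority-ordered queue and return the IDs of started jobs."""
    started = []
    for job in queue:
        if job["over_assoc_limit"]:
            if assoc_limit_continue:
                continue  # skip the stuck job and keep trying lower-priority jobs
            break         # default: a stuck higher-priority job blocks the partition
        if job["cpus"] <= free_cpus:
            free_cpus -= job["cpus"]
            started.append(job["id"])
    return started
```

With the option enabled, small jobs behind a blocked large job can start, which is exactly how throughput improves and how the large job can be starved.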

Fix inadequate locks when updating a partition's TRES. · b223e52c
Danny Auble authored Feb 18, 2016

Bug 2453

b223e52c

When determining if a job can fit into a TRES time limit after resources · 244b34e0

Tim Wickberg authored Feb 17, 2016

have been selected set the time limit appropriately if the job didn't
request one.

If the partition has no DefaultTime setting, and no time_limit was given
for the job, job_ptr->time_limit == NO_VAL.

With AccountingStorageEnforce=safe this will prevent jobs from ever
starting if the association has any limit set for CPUMins.
(NO_VAL * cpus is a very large number, but if no time_limit is given
anywhere that is what they get :))

Bug 2388.

244b34e0
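
The arithmetic behind the problem is easy to reproduce. In the sketch below, the value of `NO_VAL` is an assumption (check `slurm.h` for the real definition), and deriving a time limit by dividing the CPU-minutes limit by the CPU count is only one plausible reading of "set the time limit appropriately"; the function names are hypothetical.

```python
NO_VAL = 0xFFFFFFFE  # assumed sentinel meaning "no value set"


def exceeds_cpu_mins(time_limit, cpus, cpu_mins_limit):
    """True if the job's CPU-minutes would exceed the association limit."""
    return time_limit * cpus > cpu_mins_limit


def effective_time_limit(time_limit, cpus, cpu_mins_limit):
    """Pick a usable time limit (minutes) when the job did not request one."""
    if time_limit == NO_VAL:
        # Assumption: use the largest limit that still fits the CPU-minutes cap.
        return cpu_mins_limit // cpus
    return time_limit
```

Treating the sentinel as a real number of minutes makes every job look like it needs billions of CPU-minutes, so any finite limit rejects it.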

17 Feb, 2016 2 commits
- Make sure count of boards is restored when slurmctld has option -R. · 0ac68394
  Danny Auble authored Feb 17, 2016
  
  0ac68394
- Fix sreport's truncation of columns with large TRES and not using · 39095f0d
  Danny Auble authored Feb 16, 2016
  
  a parsing option.
  
  Dynamically set column widths based on the largest number in each column.
  
  39095f0d
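
Sizing columns from the data rather than using fixed widths can be sketched as follows. This is a generic illustration, not sreport's code; `format_table` is a hypothetical helper.

```python
def format_table(headers, rows):
    """Right-align each column to the width of its widest cell or header."""
    widths = [
        max([len(header)] + [len(str(row[i])) for row in rows])
        for i, header in enumerate(headers)
    ]
    lines = [" ".join(h.rjust(w) for h, w in zip(headers, widths))]
    for row in rows:
        lines.append(" ".join(str(v).rjust(w) for v, w in zip(row, widths)))
    return lines


for line in format_table(["TRES", "Used"], [["cpu", 123456789012], ["mem", 42]]):
    print(line)
```

Because each width is computed from the largest value, a very large TRES count widens its column instead of being truncated.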