Commits · 97ed25445403cb14f5d72fba4d89ae135e018473 · Manuel G. Marciani / ces_slurm_simulator

01 Oct, 2019 1 commit

Felip Moll authored Aug 06, 2019

Increase the maximum array length allowed to be packed/unpacked by one order
of magnitude, since the current value proved insufficient when an MPI
program spawns a considerable number of tasks over a large set of nodes.

This limit was introduced in 627928f4.
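
As a rough illustration of the kind of guard such a limit implies, the sketch below rejects an unpacked array length above a cap before allocating anything. The names `SKETCH_MAX_ARRAY_LEN` and `sketch_unpack_u32_array`, the buffer layout, and the cap value are all assumptions made for this example; they are not Slurm's actual definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cap; the real limit and its value live in Slurm's pack code. */
#define SKETCH_MAX_ARRAY_LEN (1024u * 1024u)

/*
 * Read a host-order uint32_t element count followed by that many uint32_t
 * values from buf. Returns 0 on success and stores a malloc'd array in *out
 * (NULL when the count is 0), or -1 if the count exceeds the cap, the buffer
 * is too short, or allocation fails.
 */
int sketch_unpack_u32_array(const unsigned char *buf, size_t buf_len,
			    uint32_t **out, uint32_t *out_len)
{
	uint32_t len;

	assert(buf && out && out_len);
	*out = NULL;
	*out_len = 0;
	if (buf_len < sizeof(len))
		return -1;
	memcpy(&len, buf, sizeof(len));
	if (len > SKETCH_MAX_ARRAY_LEN)
		return -1;	/* refuse oversized lengths before allocating */
	if ((buf_len - sizeof(len)) / sizeof(uint32_t) < len)
		return -1;	/* declared length does not fit in the buffer */
	if (len == 0)
		return 0;
	*out = malloc((size_t)len * sizeof(uint32_t));
	if (!*out)
		return -1;
	memcpy(*out, buf + sizeof(len), (size_t)len * sizeof(uint32_t));
	*out_len = len;
	return 0;
}
```

With a cap of this kind, raising the constant is all that is needed to accept larger arrays, which matches the shape of the change described above.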

Bug 7495

f7bed728

30 Sep, 2019 4 commits

Disallow coordinators to show/fix runaways. · 3cf4418b
Danny Auble authored Sep 25, 2019
```
There was never any security to allow for this,
so we are just removing it.

Bug 7765
```
3cf4418b

Allow Operator users to show/fix runaways · 933affa7

Albert Gil authored Sep 20, 2019

Admin/Operator users were not able to skip MaxQueryTimeRange when trying
to show/fix runaway jobs.
This commit uses _validate_operator() instead of
_validate_slurm_user() in _get_jobs_cond() as well as check for operators
in _fix_runaway_jobs().

Bug 7765

933affa7

Fix memory leaks when using multiple features and preemption · c2a57967
Dominik Bartkiewicz authored Sep 16, 2019
```
Bug 7708
```
c2a57967
Fix preemption jobs with complex features. · f2fcf3af
Dominik Bartkiewicz authored Sep 16, 2019
```
Don't remove jobs from preemptee_candidates List.

Bug 7708
```
f2fcf3af

26 Sep, 2019 3 commits
- nss_slurm - fix file descriptor leaks. · d491a42d
  Georg Rath authored Sep 26, 2019
```
Since this happens inside the user process, this can inadvertently
cause the user's job to die by running out of file descriptors.

Bug 7814.

Co-authored-by: William Arndt <warndt@lbl.gov>
```
  d491a42d
- Do not remove batch host when resizing/shrinking a batch job · 18279136
  Marshall Garey authored Sep 25, 2019
```
Bug 7499
```
  18279136
- Fix sdiag backfill last and mean queue length stats. · 4c8a10df
  Dominik Bartkiewicz authored Sep 02, 2019
```
Regression introduced in fb26b706.

Bug 7675
```
  4c8a10df
25 Sep, 2019 1 commit

Fix scancel --full for proctrack/cgroups · 4dfb3ad6

Albert Gil authored Sep 12, 2019

Now the signaling of the batch step and the handling of the flags is totally
handled in _kill_all_active_steps() in slurmd, and _handle_signal_container()
in stepd to ensure that:
- if KILL_JOB_BATCH then only batch container is signaled
- if KILL_FULL_JOB then batch script and its children are also signaled
- if both of the above then only the batch script and its children are signaled

We no longer rely on proctrack_g_signal() to handle the batch step
signaling, so it works the same for all proctrack plugins.

This commit also includes minor related fixes in other code handling such
signaling flags, and documentation improvement.
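
The three flag cases listed in this commit message can be sketched as a small decision function. This is only an illustration: the `SKETCH_*` flag values, the enum, and `sketch_pick_batch_target` are hypothetical names, and the case where neither flag is set is not described in the message, so it is left as "unspecified" here.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; the real definitions are in Slurm's headers. */
#define SKETCH_KILL_JOB_BATCH 0x1u
#define SKETCH_KILL_FULL_JOB  0x2u

static_assert((SKETCH_KILL_JOB_BATCH & SKETCH_KILL_FULL_JOB) == 0,
	      "flag bits must not overlap");

enum sketch_batch_target {
	SKETCH_TARGET_UNSPECIFIED,	/* neither flag: not covered by the message */
	SKETCH_TARGET_BATCH_CONTAINER,	/* only the batch container */
	SKETCH_TARGET_ALSO_BATCH_TREE,	/* batch script and children also signaled */
	SKETCH_TARGET_ONLY_BATCH_TREE	/* only the batch script and children */
};

/* Map the two flags onto the three cases from the commit message. */
enum sketch_batch_target sketch_pick_batch_target(uint32_t flags)
{
	int batch = (flags & SKETCH_KILL_JOB_BATCH) != 0;
	int full = (flags & SKETCH_KILL_FULL_JOB) != 0;

	if (batch && full)
		return SKETCH_TARGET_ONLY_BATCH_TREE;
	if (batch)
		return SKETCH_TARGET_BATCH_CONTAINER;
	if (full)
		return SKETCH_TARGET_ALSO_BATCH_TREE;
	return SKETCH_TARGET_UNSPECIFIED;
}
```

Checking the combined case first matters, because either single-flag test would otherwise match it.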

Bug 7282

4dfb3ad6

23 Sep, 2019 1 commit
- Remove inadvertent duplicate NEWS entry. · 23bafc8f
  Tim Wickberg authored Sep 23, 2019
  
  23bafc8f
20 Sep, 2019 2 commits
- Fix uninitialized errors when compiling with CFLAGS="--coverage" · f16b1c3b
  Brian Christiansen authored Sep 05, 2019
```
Signed-off-by: Tim Wickberg <tim@schedmd.com>

Bug 7697
```
  f16b1c3b
- Add NEWS for previous commit · c90b6834
  Michael Hinton authored Sep 20, 2019
```
1cd43fce

Bug 7630
```
  c90b6834
16 Sep, 2019 1 commit
- Remove redefinition of global variable in gres.c. · 2abd2a3d
  Robert Tweedy authored Sep 16, 2019
```
Bug 7727

This was missed in commit 6ac4ce84.
```
  2abd2a3d
12 Sep, 2019 3 commits
- select/cons_tres - fix gres code infinite loop. · c15f0f3c
  Marcin Stolarek authored Sep 06, 2019
```
Incorrect logic in the variables holding available cores in the
gres_plugin_job_core_filter3() function led to a potential infinite
"while (avail_cores_tot > req_cores)" loop, leaving slurmctld unresponsive.

Bug 7685.
```
  c15f0f3c
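
To illustrate how a `while (avail_cores_tot > req_cores)` loop can spin forever, and one generic way to guard against it, here is a minimal sketch. It is not the actual fix in gres_plugin_job_core_filter3(); the function `sketch_trim_cores`, its parameters, and the per-socket floor are assumptions made for the example.

```c
#include <assert.h>

/*
 * Remove cores until the total is no larger than req_cores, taking one core
 * per pass from the socket that currently has the most, but never going below
 * min_per_sock on any socket. Returns the resulting total.
 */
int sketch_trim_cores(int *cores_per_sock, int sock_cnt, int min_per_sock,
		      int req_cores)
{
	int avail_cores_tot = 0;

	assert(cores_per_sock && sock_cnt > 0);
	for (int s = 0; s < sock_cnt; s++)
		avail_cores_tot += cores_per_sock[s];

	while (avail_cores_tot > req_cores) {
		int best = -1;

		for (int s = 0; s < sock_cnt; s++) {
			if (cores_per_sock[s] <= min_per_sock)
				continue;	/* nothing removable on this socket */
			if (best < 0 || cores_per_sock[s] > cores_per_sock[best])
				best = s;
		}
		if (best < 0)
			break;	/* no progress possible: leave instead of spinning */
		cores_per_sock[best]--;
		avail_cores_tot--;
	}
	return avail_cores_tot;
}
```

Without the `best < 0` check, a pass that removes nothing would leave `avail_cores_tot` unchanged and the loop condition true indefinitely.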
- Fix race condition preventing held array job from getting db_index · 6d6711b3
  Brian Christiansen authored Sep 11, 2019
```
Bug 7719

Signed-off-by: Danny Auble <da@schedmd.com>
```
  6d6711b3
- Fix preemption issue when picking nodes for a feature job request. · 0d432cae
  Dominik Bartkiewicz authored Sep 11, 2019
```
Regression caused by 72736af2.

Bug 7708.
```
  0d432cae
10 Sep, 2019 1 commit

Deprecate FastSchedule. · 2c44fcf6

Danny Auble authored Sep 05, 2019



FastSchedule will be removed in 20.02.
FastSchedule=2 functionality has been moved to
SlurmdParameters=config_overrides.

Bug 7496.


Signed-off-by: Tim Wickberg <tim@schedmd.com>

2c44fcf6

06 Sep, 2019 2 commits
- Correct "extern" definition of variable if compiling with __APPLE__ · 25e6f4d8
  Brian Christiansen authored Sep 06, 2019
```
Bug 7699
```
  25e6f4d8
- Fix checking for flag with logical AND · 9083eed2
  Danny Auble authored Sep 06, 2019
```
Continuation of 64876087

Bug 7698
```
  9083eed2
04 Sep, 2019 3 commits
- Have safe_[read|write] handle EAGAIN and EINTR. · 9c0a206f
  Danny Auble authored Sep 04, 2019
```
Bug 4781
```
  9c0a206f
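
For readers unfamiliar with the pattern, a read loop that tolerates EINTR and EAGAIN can be sketched as below. Slurm's safe_read/safe_write are its own helpers and are not reproduced here; `sketch_read_full` is a hypothetical stand-alone function, and waiting with poll() on EAGAIN is one possible choice, not necessarily what the commit does.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly len bytes from fd. Returns 0 on success, -1 on EOF or on an
 * error other than EINTR/EAGAIN.
 */
int sketch_read_full(int fd, void *buf, size_t len)
{
	char *p = buf;

	assert(buf || len == 0);
	while (len > 0) {
		ssize_t rc = read(fd, p, len);

		if (rc > 0) {
			p += rc;
			len -= (size_t)rc;
		} else if (rc == 0) {
			return -1;	/* peer closed before all bytes arrived */
		} else if (errno == EINTR) {
			continue;	/* interrupted by a signal: retry */
		} else if (errno == EAGAIN || errno == EWOULDBLOCK) {
			struct pollfd pfd = { .fd = fd, .events = POLLIN };

			/* Non-blocking fd with no data yet: wait, then retry. */
			(void) poll(&pfd, 1, -1);
		} else {
			return -1;
		}
	}
	return 0;
}
```

A write loop would follow the same shape with write() and POLLOUT.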
- sched/backfill - clear estimated sched_nodes as done for start_time. · 32feee0b
  Dominik Bartkiewicz authored Sep 04, 2019
```
Otherwise, there could be time frames where printed schednodes
information could be obsolete.

Bug 7676.
```
  32feee0b
- Properly enforce a job's mem-per-cpu option when allocating the node · 1c22ed8b
  Dominik Bartkiewicz authored Aug 26, 2019
```
exclusively to that job.

Bug 7510
```
  1c22ed8b
03 Sep, 2019 4 commits
- Fix create_resv() · c55f6d65
  Dominik Bartkiewicz authored Aug 06, 2019
```
use correct start_time for TIME_FLOAT reservation
in _job_overlap()

Bug 7458
```
  c55f6d65
- Set missing resv_desc.flags before calling _select_nodes() · 6d89c126
  Dominik Bartkiewicz authored Jul 31, 2019
```
Bug 7458
```
  6d89c126
- Look forward one week when creating a new reservation · d72a02c6
  Dominik Bartkiewicz authored Jul 30, 2019
```
Bug 7458
```
  d72a02c6
- Fix job_resv_check() · 10043b8a
  Dominik Bartkiewicz authored Jul 26, 2019
```
Move _validate_node_choice() before prolog/epilog check

Bug 7458
```
  10043b8a
29 Aug, 2019 4 commits

gres/mic - add missing init() and fini() calls. · 45406061
Michael Hinton authored Aug 29, 2019
```
Free the gres_devices list to avoid a valgrind warning on exit.

Bug 7644.
```
45406061
Fix getting batch_host after it's been set when requesting --batch · fe2bf9bf
Brian Christiansen authored Aug 21, 2019
```
Continuation of 30bbc11d



Bug 7445

Signed-off-by: Dominik Bartkiewicz <bart@schedmd.com>
```
fe2bf9bf

Make --batch requests wait for all nodes to boot before launching · 76956c87

Brian Christiansen authored Aug 20, 2019



When --batch=<feature> is used, the batch_host isn't chosen until the
job is being launched -- because the features could be different on boot
(e.g. KNL nodes). Thus if the job is allocated nodes that need to be
booted, it needs to wait till they are all booted so it can make a
decision at launch time.

Bug 7445

Signed-off-by: Dominik Bartkiewicz <bart@schedmd.com>

76956c87

Don't assume the first node of a job is the batch host · 875cbf9c
Dominik Bartkiewicz authored Jun 18, 2019
```
This is a continuation to 7da439b4

Bug 7445
```
875cbf9c

28 Aug, 2019 1 commit

Don't update [min|max]_exit_code on job array task requeue. · 0e42eb87

Alejandro Sanchez authored Aug 28, 2019

Only do so once the task actually finishes. Otherwise, a requeued task
could set an incorrect max_exit_code even if it completed with exit code 0
after re-running, leading to problems with, e.g., other jobs with an
afterok type of dependency on such an array relying on the incorrectly set
max_exit_code.
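
The rule described here can be sketched as follows. The struct and function names are hypothetical and only model the idea of skipping the min/max update while a task is being requeued; they do not mirror slurmctld's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_array_stats {
	uint32_t min_exit_code;
	uint32_t max_exit_code;
	bool seen_any;	/* false until the first task has really finished */
};

/* Record a task's exit code unless the task is about to be requeued. */
void sketch_task_ended(struct sketch_array_stats *st, uint32_t exit_code,
		       bool requeued)
{
	assert(st);
	if (requeued)
		return;	/* the task will run again; its final code is unknown */
	if (!st->seen_any) {
		st->min_exit_code = st->max_exit_code = exit_code;
		st->seen_any = true;
		return;
	}
	if (exit_code < st->min_exit_code)
		st->min_exit_code = exit_code;
	if (exit_code > st->max_exit_code)
		st->max_exit_code = exit_code;
}
```

In this model, a task that fails, is requeued, and later exits with 0 leaves max_exit_code at 0, so an afterok-style check on the array would not be tripped by the earlier attempt.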

Bug 7552.

0e42eb87

23 Aug, 2019 1 commit

valid_feature_counts should not handle XOR/XAND features · 1c051c61

Marcin Stolarek authored Aug 01, 2019

For features like cpu&fastio&[knl|westmere], the additional bit_or
resulted in returning something like (cpu&fastio)|knl|westmere, which
is obviously wrong. XOR/XAND features are handled properly in
_get_req_features.
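
The effect of the extra OR can be shown with plain integer bitmasks standing in for node bitmaps. This is a toy model under stated assumptions: the `sketch_*` names and the four-node example are invented for illustration, and Slurm's own bitstring API is not used.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per node; a set bit means the node has the feature. */
typedef uint32_t sketch_node_mask;

/* Effect of the extra OR described above: (cpu & fastio) | knl | westmere. */
sketch_node_mask sketch_with_extra_or(sketch_node_mask cpu,
				      sketch_node_mask fastio,
				      sketch_node_mask knl,
				      sketch_node_mask westmere)
{
	return (cpu & fastio) | knl | westmere;
}

/* AND-only part; the bracketed [knl|westmere] group is left to other code. */
sketch_node_mask sketch_and_only(sketch_node_mask cpu, sketch_node_mask fastio)
{
	return cpu & fastio;
}

/* Worked example with four nodes (bit 0 is node 0). */
void sketch_feature_example(void)
{
	sketch_node_mask cpu = 0x7, fastio = 0x3, knl = 0x8, westmere = 0x1;

	/* Node 3 has neither cpu nor fastio, yet the extra OR lets it through. */
	assert(sketch_with_extra_or(cpu, fastio, knl, westmere) == 0xB);
	assert(sketch_and_only(cpu, fastio) == 0x3);
}
```

In the example, node 3 only has knl, so it should never satisfy a request that requires both cpu and fastio.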

Bug 7378

1c051c61

20 Aug, 2019 2 commits

Handle situation where a slurmctld tries to communicate with slurmdbd more... · af7b4531

Danny Auble authored Aug 12, 2019

Handle situation where a slurmctld tries to communicate with slurmdbd more than once at the same time.

What can happen here is that the slurmdbd/slurmctld connection gets hung
up somehow. If the slurmctld is restarted, a new connection is made
alongside the old connection. When the old connection gets unwedged, it
clears out the registration of the slurmctld, so that no updates are
sent to that slurmctld.

This commit checks for old connections when a registration message
comes in. If we find one, we print an error, set rem_port = 0, and
remove it from the list. That way, when it gets unwedged, we just
close the socket instead of removing the registration.
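
A toy model of this bookkeeping, assuming a fixed-size list and hypothetical names (`sketch_conn`, `sketch_register`, `sketch_conn_closed`), might look like the following. It only mirrors the logic described in the message (mark the old connection with rem_port = 0, drop it from the list, and skip unregistration when it finally closes); slurmdbd's real structures and locking are not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_CONNS 8

struct sketch_conn {
	char cluster[32];
	int rem_port;	/* set to 0 once a newer connection supersedes this one */
};

/* Connections that currently receive updates; NULL slots are free. */
struct sketch_conn *sketch_list[SKETCH_MAX_CONNS];

/* Registration arrived on conn (not yet in the list): drop older entries. */
int sketch_register(struct sketch_conn *conn)
{
	int free_slot = -1;

	assert(conn && conn->rem_port != 0);
	for (int i = 0; i < SKETCH_MAX_CONNS; i++) {
		struct sketch_conn *old = sketch_list[i];

		if (old && !strcmp(old->cluster, conn->cluster)) {
			fprintf(stderr, "error: stale connection for %s\n",
				old->cluster);
			old->rem_port = 0;	/* mark as superseded */
			sketch_list[i] = NULL;	/* remove it from the list */
			old = NULL;
		}
		if (!old && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1;
	sketch_list[free_slot] = conn;
	return 0;
}

/* Connection closed or unwedged: only a live one clears the registration. */
void sketch_conn_closed(struct sketch_conn *conn)
{
	assert(conn);
	if (conn->rem_port == 0)
		return;	/* superseded: just close the socket */
	for (int i = 0; i < SKETCH_MAX_CONNS; i++) {
		if (sketch_list[i] &&
		    !strcmp(sketch_list[i]->cluster, conn->cluster))
			sketch_list[i] = NULL;
	}
}

/* Return 1 if some connection for the cluster still receives updates. */
int sketch_is_registered(const char *cluster)
{
	for (int i = 0; i < SKETCH_MAX_CONNS; i++) {
		if (sketch_list[i] && !strcmp(sketch_list[i]->cluster, cluster))
			return 1;
	}
	return 0;
}
```

The key point is the early return in `sketch_conn_closed`: without it, the late close of the old connection would wipe the entry that the restarted slurmctld had just registered.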

Bug 5213

af7b4531

Fix NEWS entry for the previous commit a04eea2e. · d0729247
Alejandro Sanchez authored Aug 20, 2019
```
Bug 7360.
```
d0729247

19 Aug, 2019 2 commits

Detach threads once they are done to avoid having to join them · a04eea2e
Danny Auble authored Aug 19, 2019
```
in track scripts code.

Bug 7360


Signed-off-by: Alejandro Sanchez <alex@schedmd.com>
```
a04eea2e

Fix unaccounted TRESRunMins usage from HetJobs · 1da9c5d0

Broderick Gardner authored Jun 26, 2019

The implementation of priority_p_job_end in priority/multifactor
expects the job state to be set to complete or completing in order to
properly remove some job usage from the assoc and qos. This must be
simulated by the pack job run check code, or the check-time usage is not
removed.

Bug 7284

1da9c5d0

16 Aug, 2019 1 commit
- job_submit/lua - fix problem where nil was expected for min_mem_per_cpu. · 2d017875
  Chad Vizino authored Aug 12, 2019
```
It wasn't properly set under certain conditions.

Bug 7276
```
  2d017875
15 Aug, 2019 1 commit
- Cray - fix contribs slurm.conf.j2 with updated cray_aries plugin names. · e945917d
  Marcin Stolarek authored Aug 15, 2019
```
Bug 7410.
```
  e945917d
14 Aug, 2019 2 commits

COMPLETING nodes available immediately for job will-run test · 0666db61

Morris Jette authored Jun 12, 2019

Consider jobs in COMPLETING state as being available immediately for
a job will-run evaluation. This assumes the completion will happen
very soon after the test is run.

Bug 6769

0666db61

Avoid select plugin resource usage underflow from duplicate job free · 2dd1f448

Morris Jette authored Jul 29, 2019

All of the select plugins were performing a duplicate resource free
for jobs in completing state when performing a will-run test for
new jobs. This would frequently result in underflow messages.
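
The underflow mechanism can be sketched with a per-node counter of used CPUs. The names below (`sketch_job`, `sketch_free_job_cpus`) and the "already freed" flag are assumptions for illustration; the select plugins' real accounting is more involved and the commit's actual approach may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_job {
	uint32_t cpus;
	bool resources_freed;	/* set once the job's CPUs were given back */
};

/* Give a job's CPUs back to the node at most once. Returns false if skipped. */
bool sketch_free_job_cpus(uint32_t *node_used_cpus, struct sketch_job *job)
{
	assert(node_used_cpus && job);
	if (job->resources_freed)
		return false;	/* duplicate free: would underflow the counter */
	if (*node_used_cpus < job->cpus)
		*node_used_cpus = 0;	/* clamp rather than wrap around */
	else
		*node_used_cpus -= job->cpus;
	job->resources_freed = true;
	return true;
}
```

With an unsigned counter, subtracting the same job's CPUs twice would wrap to a huge value, which is the kind of condition an underflow message reports.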

Bug 6769

2dd1f448