Commits · adb730b4c45caa97f905cbecd2936d6633be836a · Manuel G. Marciani / ces_slurm_simulator

23 May, 2019 2 commits
- Fix packing pack_jobid in an sbcast. · adb730b4
  Albert Gil authored May 23, 2019
```
After 1d66b395 18.08 and 17.11 are the same so we can just reuse the
18.08 block instead of making a new one.

Bug 7080
```
  adb730b4
- Fix issue with a 17.11 sbcast call to a 18.08 daemon. · 1d66b395
  Albert Gil authored May 23, 2019
```
Bug 7080
```
  1d66b395
22 May, 2019 5 commits

Add 19.05 NEWS line for e7d4d593 from 18.08 · 5bbc2543
Brian Christiansen authored May 22, 2019
```
Bug 6467
```
5bbc2543

Use correct rank for cloud stepd's. · e7d4d593

Marshall Garey authored Apr 18, 2019

Job steps that run on cloud nodes and use the alias_list - in other
words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
directly back to the slurmctld. To make that happen, we set the parent
rank of each stepd to -1. However, we also set the rank of each stepd to
0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
the slurmctld, they would tell slurmctld to clean up node 0 in the step
allocation. So, multi-node step allocations weren't cleaning up after
the steps completed and would cause subsequent job steps to hang. The
step allocations would only clean up properly at the end of the job.

Ensure that each stepd uses the correct rank so that job steps are
properly cleaned up after each step completes.

Bug 6467.

e7d4d593
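
The effect described in e7d4d593 can be illustrated with a minimal sketch. It assumes a simplified controller that tracks running nodes by rank and clears whichever rank a completion message names; the function name and data model are hypothetical, not the slurmctld implementation.

```python
# Hypothetical model: the controller keeps the set of node ranks that are
# still running in a step and clears the rank named in each completion message.
def remaining_after_completion(node_count, reported_ranks):
    """Return the set of node ranks still marked as running."""
    running = set(range(node_count))
    for rank in reported_ranks:
        running.discard(rank)  # clean up the node this stepd reported
    return running

# Before the fix: every stepd reports rank 0, so only node 0 is cleaned up.
stuck = remaining_after_completion(3, [0, 0, 0])

# After the fix: each stepd reports its own rank, so the step fully completes.
clean = remaining_after_completion(3, [0, 1, 2])
```

In this model, nodes 1 and 2 stay marked as running until something else (here, the end of the job) clears them, which matches the hang the commit message describes.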

Copy two 18.08 NEWS entries to 19.05. · a7084228
Alejandro Sanchez authored May 22, 2019
```
They were associated with these two commits:

b4d7de48
6871185a

Bug 5562.
```
a7084228
Move two NEWS entries to appropriate maintenance release. · 09a7da34
Alejandro Sanchez authored May 22, 2019
```
They were associated with these two commits:

b4d7de48
6871185a

Bug 5562.
```
09a7da34
cons_tres/dist_tasks - fix variable usage in cyclic distribution. · abb732c8
Morris Jette authored May 22, 2019
```
Bug 6998.
```
abb732c8

21 May, 2019 8 commits
- job_preempt_check() - consider only jobs in an overlapping partition · 2d309f2e
  Dominik Bartkiewicz authored May 06, 2019
```
Bug 6822
```
  2d309f2e
- Change index use for lower overhead and better clarity · 23c827d9
  Moe Jette authored May 21, 2019
```
Bug 7061
```
  23c827d9
- Add 18.08.8 NEWS to 19.05.9rc2 NEWS · 09ec07ef
  Brian Christiansen authored May 21, 2019
  
  09ec07ef
- Correctly set unlimited sched_job_limit · 69621444
  Dominik Bartkiewicz authored May 06, 2019
```
An unlimited value could get overwritten with the default queue depth,
preventing the whole queue from being looked at -- especially in a
high-throughput environment.

Bug 6822

Co-authored-by: Morris Jette <jette@schedmd.com>
```
  69621444
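
A rough sketch of the kind of mistake 69621444 describes, assuming a sentinel for "unlimited" that a naive "unset" check also catches. The sentinel value, the default depth, and all function names are hypothetical and chosen only for illustration.

```python
UNLIMITED = -1             # hypothetical sentinel meaning "no limit"
DEFAULT_QUEUE_DEPTH = 100  # hypothetical default queue depth

def limit_buggy(configured):
    # Treats every non-positive value as "unset", which clobbers UNLIMITED.
    if configured is None or configured <= 0:
        return DEFAULT_QUEUE_DEPTH
    return configured

def limit_fixed(configured):
    # Check the sentinel first so an explicit "unlimited" survives.
    if configured == UNLIMITED:
        return UNLIMITED
    if configured is None or configured <= 0:
        return DEFAULT_QUEUE_DEPTH
    return configured

def jobs_examined(queue_length, limit):
    """Number of queued jobs the scheduler would look at under this limit."""
    return queue_length if limit == UNLIMITED else min(queue_length, limit)
```

With a 5000-job queue, the buggy variant examines only the default depth, while the fixed variant examines the whole queue.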
- cons_res/job_test - fix to consider a node's current allocated memory. · b4d7de48
  Alejandro Sanchez authored Apr 11, 2019
```
Node memory overallocation wouldn't be properly detected since we would
just be interpreting the available memory as RealMemory - MemSpecLimit,
ignoring other jobs' memory usage.

Bug 5562.
```
  b4d7de48
- cons_res/job_test - prevent a job from overallocating a node memory. · 6871185a
  Alejandro Sanchez authored Apr 11, 2019
```
This compares a job's memory request against each selected node's available
memory, interpreting the latter for now as RealMemory - MemSpecLimit.

Bug 5562.
```
  6871185a
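
The two cons_res/job_test commits above (6871185a, then b4d7de48) can be summarized in a small sketch. It assumes all quantities are in MB and that the memory already allocated to other jobs on the node is known; the function names are hypothetical.

```python
def available_memory(real_memory, mem_spec_limit, allocated_memory=0):
    """Memory a new job may still use on a node, in MB."""
    return real_memory - mem_spec_limit - allocated_memory

def job_fits(job_memory, real_memory, mem_spec_limit, allocated_memory=0):
    """True if the job's request does not overallocate the node."""
    return job_memory <= available_memory(
        real_memory, mem_spec_limit, allocated_memory)

# Node with RealMemory=64000 and MemSpecLimit=4000, 50000 MB already in use.
# 6871185a alone: compares against RealMemory - MemSpecLimit only.
fits_ignoring_usage = job_fits(20000, 64000, 4000)
# With b4d7de48: other jobs' allocations are subtracted as well.
fits_with_usage = job_fits(20000, 64000, 4000, allocated_memory=50000)
```

The first check passes because 60000 MB appear free; the second rejects the job because only 10000 MB actually remain.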
- Fix wrongly setting start_time to 0 for multi-part jobs. · 457e7517
  Dominik Bartkiewicz authored May 21, 2019
```
Bug 6508
```
  457e7517
- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests. · 8a1e5a52
  Alejandro Sanchez authored May 09, 2019
```
Previously when no memory was explicitly requested the job was assigned
the DefMemPer[CPU|Node] from the first partition in the list (or the
cluster-wide value if the partition wasn't configured with it), even
when evaluating against a different partition.

Bug 6950.
```
  8a1e5a52
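
A minimal sketch of the behavior 8a1e5a52 describes, assuming partitions are plain dictionaries and only DefMemPerCPU is modeled; the helper names and partition data are hypothetical.

```python
def default_mem_per_cpu(partition, cluster_default):
    """Partition-level default if configured, else the cluster-wide value."""
    return partition.get("DefMemPerCPU", cluster_default)

def job_memory(requested, job_partitions, evaluating_index, cluster_default):
    """Memory to use when testing the job against one of its partitions."""
    if requested is not None:
        return requested  # an explicit request always wins
    # Use the partition being evaluated, not job_partitions[0].
    return default_mem_per_cpu(job_partitions[evaluating_index],
                               cluster_default)

partitions = [
    {"name": "debug", "DefMemPerCPU": 1000},
    {"name": "batch", "DefMemPerCPU": 4000},
    {"name": "misc"},  # no partition-level default configured
]
per_partition = [job_memory(None, partitions, i, 2000) for i in range(3)]
```

The previous behavior corresponds to always passing index 0, which would yield 1000 for every partition in this example.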
17 May, 2019 2 commits
- Fix NEWS from previous commit. · 438ffc1c
  Tim Wickberg authored May 16, 2019
```
This is select/cons_res, not select/cons_tres.
```
  438ffc1c
- Only allocate 1 CPU per node with the --overcommit and --nodelist options · 46197135
  Morris Jette authored May 10, 2019
```
Previous select/cons_res logic would allocate one CPU per task on the node

Bug 6981
```
  46197135
16 May, 2019 5 commits

Only allocate 1 CPU per node with the --overcommit option · dd7775ef
Morris Jette authored May 10, 2019
```
Previous select/cons_tres logic would allocate one CPU per task on the node

Bug 6981
```
dd7775ef

modify task layout with --overcommit · 42d7e312

Morris Jette authored May 10, 2019

Modify task layout with --overcommit option plus a heterogeneous job
allocation so that a cyclic task distribution can start happening before
all CPUs on all nodes are fully allocated. The number of tasks per node
will be unchanged from the previous algorithm, but tasks will be distributed
in a cyclic fashion first and then extra tasks placed on nodes with more
CPUs. Previously all CPUs would be fully allocated in a cyclic fashion,
then excess tasks distributed evenly across all allocated nodes.

Bug 6981

42d7e312

Store reservation flags in slurmdbd in a uint64_t. · 46d55dd4

Dominik Bartkiewicz authored May 16, 2019

Add warning to slurm.h.in that no new reservation flags can be
stored in slurmdbd in 19.05. (Although they could still be used by
slurmctld without issue.)

Note that the underlying RPC still uses uint32_t, but this will be
changed before 20.02 on master, and changing the column to uint32_t
in 19.05 just to change it again in 20.02 is best avoided.

Bug 6969.

46d55dd4

Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor. · 2038469f
Nathan Rini authored May 16, 2019
```
Free format_list, plugin_id_select_list, rpc_version_list in
 _free_cluster_cond_members().

Bug 7020.
```
2038469f

Fix archive loading events. · 0d0f9deb

Marshall Garey authored May 15, 2019

There was a syntax error in the MySQL query for inserting the event records
into the event table caused by commit 3d61b6aa. The syntax error was
a semicolon in the middle of the query, for example:

insert into "voyager_event_table" (time_start, time_end, node_name,
cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
'1539298628', 'v1', '', 'cold-start', '1017', '0',
'1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...

Bug 7025.

0d0f9deb
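
For comparison with the malformed query quoted in 0d0f9deb, here is a sketch of building a well-formed multi-row insert: value tuples are joined with commas and a single semicolon ends the statement. The helper is hypothetical and quotes values naively; real code would need proper escaping or bound parameters.

```python
def build_insert(table, columns, rows):
    """Build one multi-row INSERT with a single terminating semicolon."""
    column_list = ", ".join(columns)
    tuples = []
    for row in rows:
        # Naive quoting, for illustration only.
        tuples.append("(" + ", ".join("'" + str(v) + "'" for v in row) + ")")
    return 'insert into "{}" ({}) values {};'.format(
        table, column_list, ", ".join(tuples))

query = build_insert(
    "voyager_event_table",
    ["time_start", "node_name"],
    [("1538669453", "v1"), ("1539298628", "v2")],
)
```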

15 May, 2019 1 commit

Avoid call to slurm_get_slurmd_user_id() in _step_connect() if not slurmd. · 0a4c5234

Tim Wickberg authored May 15, 2019

For a stray socket, this call would cause nss_slurm to deadlock,
as any calling path leads to slurm_conf_lock(), which will call
getpwuid(), which will re-enter the nss_slurm code, which will end up
back here but with the slurm_conf_lock already held, at which point
the process will never continue.

For nss_slurm, this means a node rebooting with stale sockets will hang
in the middle of the init process, which is a rather unpleasant experience.

So - only handle the stray socket cleanup within the slurmd process itself.

Bug 7030

0a4c5234

13 May, 2019 1 commit
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
10 May, 2019 3 commits

Prevent leak of cluster_str in sacctmgr_list_runaway_jobs(). · bb9d5e79
Nate Rini authored May 06, 2019
```
Bug 6952.
```
bb9d5e79

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.

Bug 6033.

ddd49896
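
The batching idea in ddd49896 can be sketched as follows, assuming the records are already in memory as a list; the constant and function names are hypothetical.

```python
ARCHIVE_BATCH = 50_000  # records per archive pass, per the commit message

def archive_in_batches(records, batch_size=ARCHIVE_BATCH):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# 120,000 records are archived in three passes.
batch_sizes = [len(batch) for batch in archive_in_batches(list(range(120_000)))]
```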

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already exists, we would overwrite
the old archive file with the assumption that these are duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.

Bug 6033.

1e234c3d
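
A sketch of the "append a number instead of overwriting" approach from 1e234c3d. The `.N` suffix format and the function name are assumptions made for illustration; the commit message does not specify the actual naming scheme.

```python
import os
import tempfile

def unique_archive_name(path):
    """Return path if unused, else the first free 'path.N' (N = 1, 2, ...)."""
    if not os.path.exists(path):
        return path
    n = 1
    while os.path.exists("{}.{}".format(path, n)):
        n += 1
    return "{}.{}".format(path, n)

# Archive the same time period three times without losing earlier files.
with tempfile.TemporaryDirectory() as tmpdir:
    base = os.path.join(tmpdir, "job_archive")
    created = []
    for _ in range(3):
        name = unique_archive_name(base)
        open(name, "w").close()  # create the file so the name is taken
        created.append(os.path.basename(name))
```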

08 May, 2019 1 commit
- remove double occurrence of "the" in comments and docs · 1d31fe95
  Bas Nijholt authored May 07, 2019
  
  1d31fe95
07 May, 2019 3 commits
- Add missing locks around slurmctld_config.server_thread_count. · 94b81685
  Alejandro Sanchez authored May 07, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  94b81685
- Revert "Add missing locks protecting slurmctld_config.server_thread_count access." · b82e9531
  Alejandro Sanchez authored May 07, 2019
```
This reverts commit f3d678d4.
```
  b82e9531
- Add missing locks protecting slurmctld_config.server_thread_count access. · f3d678d4
  Alejandro Sanchez authored Jan 10, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  f3d678d4
06 May, 2019 1 commit

Fix seff memory display overflow · bab13dfd

Felip Moll authored Apr 15, 2019

When the tres_usage_in_max field is empty, it is recorded as '' in the database,
which leads find_tres_count_in_string() to return INFINITE64. Seff treats
INFINITE64 as a valid value. This patch fixes this issue.

Bug 6817

bab13dfd
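
The overflow in bab13dfd can be illustrated with a simplified stand-in for find_tres_count_in_string(). The sketch assumes TRES strings of the form `id=count,id=count`, TRES id 2 for memory, and that an absent value should be displayed as 0; that last choice is an assumption, not necessarily what the patch does.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # 64-bit "not found / infinite" sentinel

def find_tres_count(tres_str, tres_id):
    """Simplified stand-in: return the count for tres_id, else INFINITE64."""
    for item in tres_str.split(","):
        key, sep, value = item.partition("=")
        if sep and key == str(tres_id):
            return int(value)
    return INFINITE64

def max_memory_for_display(tres_usage_in_max, mem_tres_id=2):
    """Treat the sentinel as 'no usage recorded' instead of a real value."""
    count = find_tres_count(tres_usage_in_max, mem_tres_id)
    return 0 if count == INFINITE64 else count
```

Without the sentinel check, an empty field would be reported as roughly 1.8e19 units of memory.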

03 May, 2019 3 commits
- Free memory before exiting in sacctmgr_list_runaway_jobs(). · f56fc717
  Nate Rini authored May 03, 2019
```
Bug 6880/6952.
```
  f56fc717
- Ignore screen part of local DISPLAY in x11_get_display(). · e87ed60e
  Dominik Bartkiewicz authored May 03, 2019
```
Bug 6959.
```
  e87ed60e
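
A sketch of what ignoring the screen part of DISPLAY means, assuming the usual `host:display.screen` form. The parser is a hypothetical helper, not the x11_get_display() implementation.

```python
def parse_display(display):
    """Split DISPLAY into (host, display_number), dropping any '.screen'."""
    host, sep, rest = display.rpartition(":")
    if not sep:
        raise ValueError("DISPLAY has no ':' separator: " + repr(display))
    number = rest.split(".", 1)[0]  # ':0.0' and ':0' both give '0'
    return host, int(number)

local = parse_display(":0.0")               # local display with a screen part
forwarded = parse_display("localhost:10.0")
plain = parse_display(":1")
```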
- Add timers to site_factor plugin API calls to warn about slow plugins. · 2097c159
  Nate Rini authored May 02, 2019
```
Bug 6944.
```
  2097c159
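
The timing idea in 2097c159 can be sketched as a wrapper that measures a call and warns when it exceeds a threshold. The wrapper name, the threshold, and the stand-in plugin function are all hypothetical.

```python
import time

def timed_call(func, *args, warn_after=0.1, warn=print):
    """Call func(*args) and emit a warning if it runs longer than warn_after."""
    start = time.monotonic()
    result = func(*args)
    elapsed = time.monotonic() - start
    if elapsed > warn_after:
        warn("{} took {:.3f}s".format(func.__name__, elapsed))
    return result

def slow_site_factor(job_id):
    """Stand-in for a slow plugin call."""
    time.sleep(0.05)
    return job_id * 2

warnings = []
slow_result = timed_call(slow_site_factor, 21,
                         warn_after=0.01, warn=warnings.append)
fast_result = timed_call(abs, -3, warn_after=10.0, warn=warnings.append)
```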
02 May, 2019 5 commits
- NVML - remove unneeded {}. · 36e2a615
  Danny Auble authored May 02, 2019
  
  36e2a615
- NVML - Fix clang warning about unneeded variable initialization. · 7894dd83
  Broderick Gardner authored Apr 08, 2019
```
This is the same because xstrdup() returns NULL when passed NULL.

Bug 6812
```
  7894dd83
- NVML - Get rid of unneeded * when passing nvmlDevice_t to functions. · 55575ad3
  Danny Auble authored May 02, 2019
```
No real code change.
```
  55575ad3
- Fix deprecated group by clause to use order by. · 4bc94a6d
  Tim Wickberg authored May 02, 2019
```
It appears this is really what this was supposed to be anyway.

Bug 5950
```
  4bc94a6d
- Fix resubmit to sibling default on fed requeue · 822fe77e
  Broderick Gardner authored Apr 18, 2019
```
On requeue, the origin cluster's job record is copied for submission
to sibling clusters. If the job was originally submitted
to accept the cluster's default account, partition, etc., those fields
are now filled in on the origin. Here we add flags to indicate
that those fields need to be cleared on resubmission to siblings.

Bug 6064
```
  822fe77e