Commits · c421e5ee56197582f2d95f5d8ec933eb5c00aa5f · Manuel G. Marciani / ces_slurm_simulator

23 May, 2019 12 commits
- Fix gres-per-task logic for gres not bound to sockets · 7db6fb1e
  Morris Jette authored May 22, 2019
```
If GRES are not bound to specific sockets in a multi-socket
node then the sock_gres->sock_cnt variable will be zero and
the logic will find no usable GRES on a node.

Bug 7095
```
  7db6fb1e
- In multi-node systems make sure GRES are found on node when not bound to · aad94c35
  Moe Jette authored May 23, 2019
```
specific sockets.

Bug 7019
```
  aad94c35
- Revert "In multi-node systems make sure GRES are found on node when not bound to" · e95b599f
  Danny Auble authored May 23, 2019
```
This reverts commit 9cd7e5f4.
```
  e95b599f
- In multi-node systems make sure GRES are found on node when not bound to · 9cd7e5f4
  Moe Jette authored May 23, 2019
```
specific sockets.

Bug 7095
```
  9cd7e5f4
- Update 19.05.0rc2 NEWS with 18.08.8 NEWS · 163897af
  Brian Christiansen authored May 23, 2019
```
Continuation of 45bfc4dc

Bug 6926
```
  163897af
- Add 18.08.8 NEWS lines to 19.05.0rc2 NEWS · d5b7b956
  Brian Christiansen authored May 23, 2019
```
Commits:
c2bc255c
f591f0c9
```
  d5b7b956
- Fix sacctmgr --parsable2 output for reservations and tres · 45bfc4dc
  Dominik Bartkiewicz authored Apr 30, 2019
```
Bug 6926
```
  45bfc4dc
- Fix error messages in _convert_to_name(). · 288a8e9d
  Felip Moll authored May 23, 2019
```
The name variable hasn't been set yet, so this is always NULL. Print the
uid/gid instead. While here, treat uid/gid as uint32_t, and use strtoul()
rather than atoi() to avoid issues with high-number uid/gid values.

Fixes GCC 9 warning.

Bug 7101.
```
  288a8e9d
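The switch from atoi() to strtoul() described in this commit can be illustrated with a minimal sketch. The helper name `parse_id` is hypothetical and is not taken from the Slurm source; it only shows how strtoul() allows a uid/gid above INT_MAX to be parsed into a uint32_t with error checking, which atoi() does not offer.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the Slurm source): parse a decimal
 * uid/gid string into a uint32_t. Returns false on empty input, a
 * leading sign or space, trailing garbage, or a value that does not
 * fit in 32 bits.
 */
static bool parse_id(const char *str, uint32_t *id_out)
{
    char *end = NULL;
    unsigned long val;

    /* Require the string to start with a digit. */
    if (!str || !isdigit((unsigned char) *str))
        return false;

    errno = 0;
    val = strtoul(str, &end, 10);
    if (errno != 0 || *end != '\0' || val > UINT32_MAX)
        return false;

    *id_out = (uint32_t) val;
    return true;
}
```

A value such as "4000000000" is accepted here, while it is outside the range that atoi() can represent on platforms with a 32-bit int.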
- Make it so dependent jobs reset the AccrueTime and do not count against any AccrueTime limits. · c2bc255c
  Alejandro Sanchez authored May 22, 2019
```
Continuation of 89b791bf.

Bug 7045.
```
  c2bc255c
- Add new job bit_flags of JOB_DEPENDENT. · f591f0c9
  Alejandro Sanchez authored May 21, 2019
```
To indicate that a job is dependent or has an invalid dependency.
Not used for now, just added and removed according to its meaning.

Bug 7045.
```
  f591f0c9
- Fix packing pack_jobid in an sbcast. · adb730b4
  Albert Gil authored May 23, 2019
```
After 1d66b395 18.08 and 17.11 are the same so we can just reuse the
18.08 block instead of making a new one.

Bug 7080
```
  adb730b4
- Fix issue with a 17.11 sbcast call to a 18.08 daemon. · 1d66b395
  Albert Gil authored May 23, 2019
```
Bug 7080
```
  1d66b395
22 May, 2019 5 commits
- Add 19.05 NEWS line for e7d4d593 from 18.08 · 5bbc2543
  Brian Christiansen authored May 22, 2019
```
Bug 6467
```
  5bbc2543
- Use correct rank for cloud stepd's. · e7d4d593
  Marshall Garey authored Apr 18, 2019

  Job steps that run on cloud nodes and use the alias_list - in other
  words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
  directly back to the slurmctld. To make that happen, we set the parent
  rank of each stepd to -1. However, we also set the rank of each stepd
  to 0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC
  to the slurmctld, they would tell slurmctld to clean up node 0 in the
  step allocation. So, multi-node step allocations weren't cleaning up
  after the steps completed and would cause subsequent job steps to hang.
  The step allocations would only clean up properly at the end of the job.

  Ensure that each stepd uses the correct rank so that job steps are
  properly cleaned up after each step completes.

  Bug 6467.

  e7d4d593
- Copy two 18.08 NEWS entries to 19.05. · a7084228
  Alejandro Sanchez authored May 22, 2019
```
They were associated with these two commits:

b4d7de48
6871185a

Bug 5562.
```
  a7084228
- Move two NEWS entries to appropriate maintenance release. · 09a7da34
  Alejandro Sanchez authored May 22, 2019
```
They were associated with these two commits:

b4d7de48
6871185a

Bug 5562.
```
  09a7da34
- cons_tres/dist_tasks - fix variable usage in cyclic distribution. · abb732c8
  Morris Jette authored May 22, 2019
```
Bug 6998.
```
  abb732c8
21 May, 2019 8 commits
- job_preempt_check() - consider only jobs in an overlapping partition · 2d309f2e
  Dominik Bartkiewicz authored May 06, 2019
```
Bug 6822
```
  2d309f2e
- Change index use for lower overhead and better clarity · 23c827d9
  Moe Jette authored May 21, 2019
```
Bug 7061
```
  23c827d9
- Add 18.08.8 NEWS to 19.05.0rc2 NEWS · 09ec07ef
  Brian Christiansen authored May 21, 2019
  
  09ec07ef
- Correctly set unlimited sched_job_limit · 69621444
  Dominik Bartkiewicz authored May 06, 2019
```
unlimited could get overwritten with default queue depth preventing the
whole queue from being looked at -- especially in a high-throughput
environment.

Bug 6822

Co-authored-by: Morris Jette <jette@schedmd.com>
```
  69621444
- cons_res/job_test - fix to consider a node's current allocated memory. · b4d7de48
  Alejandro Sanchez authored Apr 11, 2019
```
Node memory overallocation wouldn't be properly detected since we would
just be interpreting the available memory as RealMemory - MemSpecLimit,
ignoring other jobs' memory usage.

Bug 5562.
```
  b4d7de48
- cons_res/job_test - prevent a job from overallocating a node memory. · 6871185a
  Alejandro Sanchez authored Apr 11, 2019
```
This compares a job's memory request against each selected node's available
memory, interpreting the latter for now as RealMemory - MemSpecLimit.

Bug 5562.
```
  6871185a
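The two cons_res/job_test commits above describe the available memory of a node first as RealMemory - MemSpecLimit and then, after the fix, as that value minus the memory already allocated to other jobs. A minimal sketch of the check follows. The struct and function names are hypothetical and greatly simplified compared to the plugin's real data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a node (not Slurm's node record). */
struct node_mem {
    uint64_t real_memory;    /* RealMemory, in MB */
    uint64_t mem_spec_limit; /* MemSpecLimit, in MB */
    uint64_t alloc_memory;   /* memory already allocated to jobs, in MB */
};

/* Memory still available: RealMemory - MemSpecLimit - allocated memory. */
static uint64_t node_avail_mem(const struct node_mem *n)
{
    uint64_t reserved = n->mem_spec_limit + n->alloc_memory;

    return (reserved >= n->real_memory) ? 0 : n->real_memory - reserved;
}

/* True if a job requesting job_mem MB would not overallocate the node. */
static bool job_fits_node(const struct node_mem *n, uint64_t job_mem)
{
    return job_mem <= node_avail_mem(n);
}
```

With alloc_memory left at zero the check reduces to the earlier RealMemory - MemSpecLimit interpretation, which is why overallocation could go undetected before the second fix.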
- Fix wrongly setting start_time to 0 for multi-part jobs. · 457e7517
  Dominik Bartkiewicz authored May 21, 2019
```
Bug 6508
```
  457e7517
- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests. · 8a1e5a52
  Alejandro Sanchez authored May 09, 2019
```
Previously when no memory was explicitly requested the job was assigned
the DefMemPer[CPU|Node] from the first partition in the list (or the
cluster-wide value if the partition wasn't configured with it), even
when evaluating against a different partition.

Bug 6950.
```
  8a1e5a52
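The behavior described in this commit, choosing the default memory from the partition being evaluated rather than from the first partition in the list, could be sketched as below. All names are hypothetical, and using 0 to mean "not configured" is an assumption made for the illustration only.

```c
#include <stdint.h>

#define NO_DEF_MEM 0 /* assumed marker: partition has no DefMemPer* set */

/* Hypothetical, simplified partition record. */
struct part_cfg {
    const char *name;
    uint64_t def_mem; /* DefMemPer[CPU|Node] of this partition, in MB */
};

/*
 * Default memory to use when the job is evaluated against `part`:
 * the partition's own value if configured, else the cluster-wide one.
 */
static uint64_t default_mem_for(const struct part_cfg *part,
                                uint64_t cluster_def_mem)
{
    return (part->def_mem != NO_DEF_MEM) ? part->def_mem : cluster_def_mem;
}
```

The point of the sketch is that the lookup takes the partition under evaluation as its argument, so a job submitted to several partitions gets a different default for each of them.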
17 May, 2019 2 commits
- Fix NEWS from previous commit. · 438ffc1c
  Tim Wickberg authored May 16, 2019
```
This is select/cons_res, not select/cons_tres.
```
  438ffc1c
- Only allocate 1 CPU per node with the --overcommit and --nodelist options · 46197135
  Morris Jette authored May 10, 2019
```
Previous select/cons_res logic would allocate one CPU per task on the node

Bug 6981
```
  46197135
16 May, 2019 5 commits
- Only allocate 1 CPU per node with the --overcommit option · dd7775ef
  Morris Jette authored May 10, 2019
```
Previous select/cons_tres logic would allocate one CPU per task on the node

Bug 6981
```
  dd7775ef
- modify task layout with --overcommit · 42d7e312
  Morris Jette authored May 10, 2019

  Modify task layout with --overcommit option plus a heterogeneous job
  allocation so that a cyclic task distribution can start happening before
  all CPUs on all nodes are fully allocated. The number of tasks per node
  will be unchanged from the previous algorithm, but tasks will be distributed
  in a cyclic fashion first and then extra tasks placed on nodes with more
  CPUs. Previously all CPUs would be fully allocated in a cyclic fashion,
  then excess tasks distributed evenly across all allocated nodes.

  Bug 6981

  42d7e312
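One way to picture "distributed in a cyclic fashion first and then extra tasks placed on nodes with more CPUs" is a round-robin pass over fixed per-node task counts. The sketch below is an illustration under that assumption only. It is not the layout code from this commit, and `cyclic_layout` and its parameters are hypothetical names.

```c
#include <stdlib.h>

/*
 * Assign task IDs 0..task_cnt-1 to nodes round-robin, never exceeding
 * tasks_per_node[n] on node n. Nodes with larger counts receive the
 * remaining tasks once the smaller nodes are full.
 * Returns the number of tasks placed, or -1 on allocation failure.
 */
static int cyclic_layout(const int *tasks_per_node, int node_cnt,
                         int *task_to_node, int task_cnt)
{
    int *placed;
    int task = 0;

    if (node_cnt <= 0)
        return 0;

    placed = calloc((size_t) node_cnt, sizeof(*placed));
    if (!placed)
        return -1;

    while (task < task_cnt) {
        int progress = 0;

        for (int n = 0; n < node_cnt && task < task_cnt; n++) {
            if (placed[n] < tasks_per_node[n]) {
                task_to_node[task++] = n;
                placed[n]++;
                progress = 1;
            }
        }
        if (!progress)
            break; /* per-node counts sum to fewer than task_cnt */
    }

    free(placed);
    return task;
}
```

For per-node counts of 3, 1 and 2, the six tasks land on nodes 0, 1, 2, 0, 2, 0: the per-node totals are fixed in advance and only the order of placement is cyclic.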

- Store reservation flags in slurmdbd in a uint64_t. · 46d55dd4
  Dominik Bartkiewicz authored May 16, 2019

  Add warning to slurm.h.in that no new reservation flags can be
  stored in slurmdbd in 19.05. (Although they could still be used by
  slurmctld without issue.)

  Note that the underlying RPC still uses uint32_t, but this will be
  changed before 20.02 on master, and changing the column to uint32_t
  in 19.05 just to change it again in 20.02 is best avoided.

  Bug 6969.

  46d55dd4
- Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor. · 2038469f
  Nathan Rini authored May 16, 2019
```
Free format_list, plugin_id_select_list, rpc_version_list in
 _free_cluster_cond_members().

Bug 7020.
```
  2038469f
- Fix archive loading events. · 0d0f9deb
  Marshall Garey authored May 15, 2019

  There was a syntax error in the MySQL statement for inserting the event
  records into the event table caused by commit 3d61b6aa. The syntax error
  was a semicolon in the middle of the query, for example:

  insert into "voyager_event_table" (time_start, time_end, node_name,
  cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
  '1539298628', 'v1', '', 'cold-start', '1017', '0',
  '1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...

  Bug 7025.

  0d0f9deb
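The malformed query above has a semicolon after every record. A minimal sketch of building a multi-row insert with only comma separators between records and a single terminating semicolon follows. The function `build_insert` is hypothetical and does not reproduce the slurmdbd code; the value tuples are assumed to be already formatted and escaped.

```c
#include <stdio.h>

/*
 * Build one multi-row INSERT from `cnt` pre-formatted value tuples.
 * Records are joined with ", " and a single ';' ends the statement.
 * Returns 0 on success, -1 if there are no rows or `buf` is too small.
 */
static int build_insert(char *buf, size_t len, const char *table,
                        const char *cols, const char **rows, int cnt)
{
    int off;

    if (cnt <= 0)
        return -1;

    off = snprintf(buf, len, "insert into \"%s\" (%s) values ", table, cols);
    if (off < 0 || (size_t) off >= len)
        return -1;

    for (int i = 0; i < cnt; i++) {
        int n = snprintf(buf + off, len - (size_t) off, "%s%s",
                         i ? ", " : "", rows[i]);

        if (n < 0 || (size_t) n >= len - (size_t) off)
            return -1;
        off += n;
    }

    /* Need room for the terminating ';' and the NUL byte. */
    if ((size_t) off + 1 >= len)
        return -1;
    buf[off++] = ';';
    buf[off] = '\0';
    return 0;
}
```

Keeping the separator logic in one place (a comma before every record except the first) makes it harder for a stray statement terminator to end up between records.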

15 May, 2019 1 commit
- Avoid call to slurm_get_slurmd_user_id() in _step_connect() if not slurmd. · 0a4c5234
  Tim Wickberg authored May 15, 2019

  For a stray socket, this call would cause nss_slurm to deadlock,
  as any calling path leads to slurm_conf_lock(), which will call
  getpwuid(), which will re-enter the nss_slurm code, which will end up
  back here but with the slurm_conf_lock already held, at which point
  the process will never continue.

  For nss_slurm, this means a node rebooting with stale sockets will hang
  in the middle of the init process, which is a rather unpleasant experience.

  So - only handle the stray socket cleanup within the slurmd process itself.

  Bug 7030

  0a4c5234
13 May, 2019 1 commit
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
10 May, 2019 3 commits
- Prevent leak of cluster_str in sacctmgr_list_runaway_jobs(). · bb9d5e79
  Nate Rini authored May 06, 2019
```
Bug 6952.
```
  bb9d5e79
- Only archive 50k records at a time. · ddd49896
  Marshall Garey authored Apr 24, 2019

  Trying to archive too many records at once can result in archive files
  that are too big to read or even too big to be written. Only archive 50k
  records at a time, like we only purge 50k records at a time.

  Bug 6033.

  ddd49896
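The batching idea in this commit can be sketched as a loop that hands at most 50,000 records to an archive step per iteration. The constant name, the callback type and both functions below are hypothetical; the real slurmdbd code works on database queries rather than on an index range.

```c
#include <stddef.h>

#define MAX_ARCHIVE_BATCH 50000 /* assumed name for the 50k limit */

/* Hypothetical callback: archive records [start, start + cnt). */
typedef int (*archive_fn)(size_t start, size_t cnt, void *arg);

/*
 * Archive `total` records in batches of at most MAX_ARCHIVE_BATCH.
 * Returns the number of batches written, or -1 if a batch fails.
 */
static int archive_in_batches(size_t total, archive_fn fn, void *arg)
{
    size_t start = 0;
    int batches = 0;

    while (start < total) {
        size_t cnt = total - start;

        if (cnt > MAX_ARCHIVE_BATCH)
            cnt = MAX_ARCHIVE_BATCH;
        if (fn(start, cnt, arg) != 0)
            return -1;
        start += cnt;
        batches++;
    }
    return batches;
}

/* Example callback: only tallies how many records it was handed. */
static int tally_records(size_t start, size_t cnt, void *arg)
{
    (void) start;
    *(size_t *) arg += cnt;
    return 0;
}
```

Archiving 120,000 records this way produces three batches (50,000, 50,000 and 20,000), so no single archive file has to hold the full set.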

- Handle duplicate archive file names. · 1e234c3d
  Marshall Garey authored Apr 24, 2019

  The time period of the archive file currently depends on submit or start
  time and whether the purge period is in hours, days, or months.
  Previously, if the archive file name already exists, we would overwrite
  the old archive file with the assumption that these are duplicate
  records being archived after an archive load. However, that could result
  in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
    time period and are later fixed and then purged, the old file would
    be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
    that time period that are pending or running, the pending or running
    jobs and steps won't be purged. When they finish and are purged, the
    old file would be overwritten.

  Instead of overwriting the old file, we append a number to the file name
  to create a new file. This will also be important in an upcoming commit.

  Bug 6033.

  1e234c3d
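Appending a number to an existing archive file name, as this commit describes, might look like the sketch below. The ".N" suffix format, the `MAX_SUFFIX` cap and the function name are assumptions for illustration; the commit message only says that a number is appended.

```c
#include <stdio.h>
#include <unistd.h>

#define MAX_SUFFIX 9999 /* assumed upper bound on the appended number */

/*
 * Write into `out` the first name among "base", "base.1", "base.2", ...
 * that does not exist yet. Returns 0 on success, -1 if `out` is too
 * small or no free name was found.
 */
static int unique_archive_name(const char *base, char *out, size_t len)
{
    int n = snprintf(out, len, "%s", base);

    if (n < 0 || (size_t) n >= len)
        return -1;

    for (int i = 1; access(out, F_OK) == 0; i++) {
        if (i > MAX_SUFFIX)
            return -1;
        n = snprintf(out, len, "%s.%d", base, i);
        if (n < 0 || (size_t) n >= len)
            return -1;
    }
    return 0;
}
```

Checking with access() and creating the file afterwards leaves a window in which another process could create the same name. Opening the file with O_CREAT | O_EXCL and retrying on EEXIST would close that window, at the cost of a slightly longer example.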

08 May, 2019 1 commit
- remove double occurrence of "the" in comments and docs · 1d31fe95
  Bas Nijholt authored May 07, 2019
  
  1d31fe95
07 May, 2019 2 commits
- Add missing locks around slurmctld_config.server_thread_count. · 94b81685
  Alejandro Sanchez authored May 07, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  94b81685
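The race reported by valgrind --tool=drd concerns unsynchronized loads of a shared counter. A minimal sketch of guarding both writes and reads with a mutex follows. The struct is a hypothetical stand-in for slurmctld_config with only the fields needed here, and the accessor names are invented for the example.

```c
#include <pthread.h>

/* Hypothetical stand-in for slurmctld_config (fields used here only). */
struct ctld_config {
    pthread_mutex_t thread_count_lock;
    int server_thread_count;
};

static struct ctld_config config = {
    .thread_count_lock = PTHREAD_MUTEX_INITIALIZER,
    .server_thread_count = 0,
};

static void server_thread_incr(void)
{
    pthread_mutex_lock(&config.thread_count_lock);
    config.server_thread_count++;
    pthread_mutex_unlock(&config.thread_count_lock);
}

static void server_thread_decr(void)
{
    pthread_mutex_lock(&config.thread_count_lock);
    if (config.server_thread_count > 0)
        config.server_thread_count--;
    pthread_mutex_unlock(&config.thread_count_lock);
}

/* Reads take the lock too, so loads cannot race with updates. */
static int server_thread_get(void)
{
    int cnt;

    pthread_mutex_lock(&config.thread_count_lock);
    cnt = config.server_thread_count;
    pthread_mutex_unlock(&config.thread_count_lock);
    return cnt;
}
```

Routing every access through small accessors like these is one way to make sure no read of the counter is left outside the lock.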
- Revert "Add missing locks protecting slurmctld_config.server_thread_count access." · b82e9531
  Alejandro Sanchez authored May 07, 2019
```
This reverts commit f3d678d4.
```
  b82e9531