Commits · d30ab20193fdc37f9562bd7f4240e67e32d46013 · Manuel G. Marciani / ces_slurm_simulator

23 May, 2019 20 commits
- Store event's node state as uint32_t · f9f78f8e
  Brian Christiansen authored May 13, 2019
```
Node state is 32bit. Have to wait till 20.02 to change packing routines.

See 845ff7d4

Bug 6964
```
  f9f78f8e
- Don't write "(null)" to event table · 63ac136a
  Brian Christiansen authored May 16, 2019
```
Bug 6964
```
  63ac136a
- Fix reboot failure message getting to event database · ee9db31c
  Brian Christiansen authored May 16, 2019
```
The reason was being set after the message was sent to the db. Also
clear the draining and reboot states before the message is sent so that
the event state will show DOWN.

Bug 6964
```
  ee9db31c
- Create node reboot event in database · f84f60d0
  Brian Christiansen authored May 16, 2019
```
Bug 6964
```
  f84f60d0
- On reboot ASAP clear node from avail_node_bitmap · 7fc23eff
  Brian Christiansen authored May 16, 2019
```
so that new jobs can't get on the node.

Bug 6964
```
  7fc23eff
- Fix issue when --gpus plus --cpus-per-gres was forcing socket binding · 94c4aeab
  Morris Jette authored May 23, 2019
```
unnecessarily.

Bug 7106
```
  94c4aeab
- Simplify code to work as it does in other places (35ef5386). · 123cb8e2
  Danny Auble authored May 14, 2019
```
Bug 6927
```
  123cb8e2
- Prevent slurmctld from potential segfault after job_start_data() called · 3238e02d
  Dominik Bartkiewicz authored Apr 30, 2019
```
for completing job.

Bug 6927
```
  3238e02d
- Fix gres-per-task logic for gres not bound to sockets · 7db6fb1e
  Morris Jette authored May 22, 2019
```
If GRES are not bound to specific sockets in a multi-socket
node then the sock_gres->sock_cnt variable will be zero and
find no usable GRES on a node.

Bug 7095
```
  7db6fb1e
- In multi-node systems make sure GRES are found on node when not bound to · aad94c35
  Moe Jette authored May 23, 2019
```
specific sockets.

Bug 7019
```
  aad94c35
- Revert "In multi-node systems make sure GRES are found on node when not bound to" · e95b599f
  Danny Auble authored May 23, 2019
```
This reverts commit 9cd7e5f4.
```
  e95b599f
- In multi-node systems make sure GRES are found on node when not bound to · 9cd7e5f4
  Moe Jette authored May 23, 2019
```
specific sockets.

Bug 7095
```
  9cd7e5f4
- Update 19.05.0rc2 NEWS with 18.08.8 NEWS · 163897af
  Brian Christiansen authored May 23, 2019
```
Continuation of 45bfc4dc

Bug 6926
```
  163897af
- Add 18.08.8 NEWS lines to 19.05.0rc2 NEWS · d5b7b956
  Brian Christiansen authored May 23, 2019
```
Commits:
c2bc255c
f591f0c9
```
  d5b7b956
- Fix sacctmgr --parsable2 output for reservations and tres · 45bfc4dc
  Dominik Bartkiewicz authored Apr 30, 2019
```
Bug 6926
```
  45bfc4dc
- Fix error messages in _convert_to_name(). · 288a8e9d
  Felip Moll authored May 23, 2019
```
The name variable hasn't been set yet, so this is always NULL. Print the
uid/gid instead. While here, treat uid/gid as uint32_t, and use strtoul()
rather than atoi() to avoid issues with high-number uid/gid values.

Fixes GCC 9 warning.

Bug 7101.
```
  288a8e9d
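
The move from atoi() to strtoul() described in 288a8e9d can be illustrated with a minimal sketch. This is not the code from the commit: `parse_id` is a hypothetical helper, and the only point shown is that strtoul() plus an explicit range check handles ids above INT_MAX and reports malformed input, neither of which atoi() can do.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of Slurm): parse a decimal uid/gid
 * string into a uint32_t.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_id(const char *str, uint32_t *out)
{
	char *end = NULL;
	unsigned long val;

	assert(out != NULL);

	/* Require a leading digit: strtoul() itself would accept
	 * leading whitespace and a sign, which we do not want here. */
	if (str == NULL || str[0] < '0' || str[0] > '9')
		return -1;

	errno = 0;
	val = strtoul(str, &end, 10);
	if (errno != 0 || *end != '\0')
		return -1;	/* overflow or trailing garbage */
	if (val > UINT32_MAX)
		return -1;	/* fits in unsigned long but not in 32 bits */

	*out = (uint32_t) val;
	return 0;
}
```

A value such as "4294967294" is parsed correctly by this helper, whereas atoi() returns an int and has undefined behavior when the value is not representable.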
- Make it so dependent jobs reset the AccrueTime and do not count against any AccrueTime limits. · c2bc255c
  Alejandro Sanchez authored May 22, 2019
```
Continuation of 89b791bf.

Bug 7045.
```
  c2bc255c
- Add new job bit_flags of JOB_DEPENDENT. · f591f0c9
  Alejandro Sanchez authored May 21, 2019
```
To indicate that a job is dependent or has an invalid dependency.
Not used for now, just added and removed according to its meaning.

Bug 7045.
```
  f591f0c9
- Fix packing pack_jobid in an sbcast. · adb730b4
  Albert Gil authored May 23, 2019
```
After 1d66b395 18.08 and 17.11 are the same so we can just reuse the
18.08 block instead of making a new one.

Bug 7080
```
  adb730b4
- Fix issue with a 17.11 sbcast call to a 18.08 daemon. · 1d66b395
  Albert Gil authored May 23, 2019
```
Bug 7080
```
  1d66b395
22 May, 2019 5 commits

- Add 19.05 NEWS line for e7d4d593 from 18.08 · 5bbc2543
  Brian Christiansen authored May 22, 2019
```
Bug 6467
```
  5bbc2543

- Use correct rank for cloud stepd's. · e7d4d593
  Marshall Garey authored Apr 18, 2019

  Job steps that run on cloud nodes and use the alias_list - in other
  words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
  directly back to the slurmctld. To make that happen, we set the parent
  rank of each stepd to -1. However, we also set the rank of each stepd
  to 0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC
  to the slurmctld, they would tell slurmctld to clean up node 0 in the
  step allocation. So, multi-node step allocations weren't cleaning up
  after the steps completed and would cause subsequent job steps to hang.
  The step allocations would only clean up properly at the end of the job.

  Ensure that each stepd uses the correct rank so that job steps are
  properly cleaned up after each step completes.

  Bug 6467.

  e7d4d593

- Copy two 18.08 NEWS entries to 19.05. · a7084228
  Alejandro Sanchez authored May 22, 2019
```
They were associated to these two commits:

b4d7de48
6871185a

Bug 5562.
```
  a7084228
- Move two NEWS entries to appropriate maintenance release. · 09a7da34
  Alejandro Sanchez authored May 22, 2019
```
They were associated to these two commits:

b4d7de48
6871185a

Bug 5562.
```
  09a7da34
- cons_tres/dist_tasks - fix variable usage in cyclic distribution. · abb732c8
  Morris Jette authored May 22, 2019
```
Bug 6998.
```
  abb732c8

21 May, 2019 8 commits
- job_preempt_check() - consider only jobs in an overlapping partition · 2d309f2e
  Dominik Bartkiewicz authored May 06, 2019
```
Bug 6822
```
  2d309f2e
- Change index use for lower overhead and better clarity · 23c827d9
  Moe Jette authored May 21, 2019
```
Bug 7061
```
  23c827d9
- Add 18.08.8 NEWS to 19.05.0rc2 NEWS · 09ec07ef
  Brian Christiansen authored May 21, 2019
  
  09ec07ef
- Correctly set unlimited sched_job_limit · 69621444
  Dominik Bartkiewicz authored May 06, 2019
```
unlimited could get overwritten with default queue depth preventing the
whole queue from being looked at -- especially in a high-throughput
environment.

Bug 6822

Co-authored-by: Morris Jette <jette@schedmd.com>
```
  69621444
- cons_res/job_test - fix to consider a node's current allocated memory. · b4d7de48
  Alejandro Sanchez authored Apr 11, 2019
```
Node memory overallocation wouldn't be properly detected since we would
just be interpreting the available memory as RealMemory - MemSpecLimit,
ignoring other jobs' memory usage.

Bug 5562.
```
  b4d7de48
- cons_res/job_test - prevent a job from overallocating a node memory. · 6871185a
  Alejandro Sanchez authored Apr 11, 2019
```
This compares a job memory request against each selected node's available
memory, interpreting the latter for now as RealMemory - MemSpecLimit.

Bug 5562.
```
  6871185a
- Fix wrongly setting start_time to 0 for multi-part jobs. · 457e7517
  Dominik Bartkiewicz authored May 21, 2019
```
Bug 6508
```
  457e7517
- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests. · 8a1e5a52
  Alejandro Sanchez authored May 09, 2019
```
Previously when no memory was explicitly requested the job was assigned
the DefMemPer[CPU|Node] from the first partition in the list (or the
cluster-wide value if the partition wasn't configured with it), even
when evaluating against a different partition.

Bug 6950.
```
  8a1e5a52
17 May, 2019 2 commits
- Fix NEWS from previous commit. · 438ffc1c
  Tim Wickberg authored May 16, 2019
```
This is select/cons_res, not select/cons_tres.
```
  438ffc1c
- Only allocate 1 CPU per node with the --overcommit and --nodelist options · 46197135
  Morris Jette authored May 10, 2019
```
Previous select/cons_res logic would allocate one CPU per task on the node

Bug 6981
```
  46197135
16 May, 2019 5 commits

- Only allocate 1 CPU per node with the --overcommit option · dd7775ef
  Morris Jette authored May 10, 2019
```
Previous select/cons_tres logic would allocate one CPU per task on the node

Bug 6981
```
  dd7775ef

- modify task layout with --overcommit · 42d7e312
  Morris Jette authored May 10, 2019

  Modify task layout with --overcommit option plus a heterogeneous job
  allocation so that a cyclic task distribution can start happening before
  all CPUs on all nodes are fully allocated. The number of tasks per node
  will be unchanged from the previous algorithm, but tasks will be distributed
  in a cyclic fashion first and then extra tasks placed on nodes with more
  CPUs. Previously all CPUs would be fully allocated in a cyclic fashion,
  then excess tasks distributed evenly across all allocated nodes.

  Bug 6981

  42d7e312

- Store reservation flags in slurmdbd in a uint64_t. · 46d55dd4
  Dominik Bartkiewicz authored May 16, 2019

  Add warning to slurm.h.in that no new reservation flags can be
  stored in slurmdbd in 19.05. (Although they could still be used by
  slurmctld without issue.)

  Note that the underlying RPC still uses uint32_t, but this will be
  changed before 20.02 on master, and changing the column to uint32_t
  in 19.05 just to change it again in 20.02 is best avoided.

  Bug 6969.

  46d55dd4

- Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor. · 2038469f
  Nathan Rini authored May 16, 2019
```
Free format_list, plugin_id_select_list, rpc_version_list in
 _free_cluster_cond_members().

Bug 7020.
```
  2038469f

- Fix archive loading events. · 0d0f9deb
  Marshall Garey authored May 15, 2019

  There was a syntax error in the MySQL statement for inserting the event
  records into the event table caused by commit 3d61b6aa. The syntax error
  was a semicolon in the middle of the query, for example:

  insert into "voyager_event_table" (time_start, time_end, node_name,
  cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
  '1539298628', 'v1', '', 'cold-start', '1017', '0',
  '1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...

  Bug 7025.

  0d0f9deb
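
The error described in 0d0f9deb comes from terminating each record with a semicolon instead of terminating the whole statement once. A minimal sketch of the intended shape follows. It is not the slurmdbd code: `build_insert` is a hypothetical helper that joins the value tuples with ", " and appends a single ";" at the end.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from slurmdbd): build one multi-row INSERT.
 * "prefix" is everything up to and including "values"; each entry in
 * "rows" is the comma-separated content of one value tuple.
 * Returns 0 on success, -1 if "buf" is too small.
 */
static int build_insert(char *buf, size_t len, const char *prefix,
			const char *const *rows, size_t row_cnt)
{
	size_t used;
	int n;

	assert(buf != NULL && prefix != NULL);

	n = snprintf(buf, len, "%s", prefix);
	if (n < 0 || (size_t) n >= len)
		return -1;
	used = (size_t) n;

	for (size_t i = 0; i < row_cnt; i++) {
		/* Separate tuples with ", " and never with ";". */
		n = snprintf(buf + used, len - used, "%s(%s)",
			     i ? ", " : " ", rows[i]);
		if (n < 0 || (size_t) n >= len - used)
			return -1;
		used += (size_t) n;
	}

	/* Exactly one terminator, at the very end of the statement. */
	n = snprintf(buf + used, len - used, ";");
	if (n < 0 || (size_t) n >= len - used)
		return -1;
	return 0;
}
```

Real code would also need to escape the values or use a prepared statement; the sketch only shows where the separators and the terminator go.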