Commits · c5482f48a4fb4f10da5579d181bd9b76ba6b5fec · Manuel G. Marciani / ces_slurm_simulator

21 May, 2019 8 commits
- job_preempt_check() - consider only jobs in an overlapping partition · 2d309f2e
  Dominik Bartkiewicz authored May 06, 2019
```
Bug 6822
```
  2d309f2e
- Change index use for lower overhead and better clarity · 23c827d9
  Moe Jette authored May 21, 2019
```
Bug 7061
```
  23c827d9
- Add 18.08.8 NEWS to 19.05.9rc2 NEWS · 09ec07ef
  Brian Christiansen authored May 21, 2019
  
  09ec07ef
- Correctly set unlimited sched_job_limit · 69621444
  Dominik Bartkiewicz authored May 06, 2019
```
unlimited could get overwritten with default queue depth preventing the
whole queue from being looked at -- especially in a high-throughput
environment.

Bug 6822

Co-authored-by: Morris Jette <jette@schedmd.com>
```
  69621444
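The sched_job_limit fix above can be illustrated with a small sketch. It is written in Python for brevity and is not Slurm's C code; `DEFAULT_QUEUE_DEPTH`, `UNLIMITED`, and both function names are hypothetical. It assumes that "not configured" and "unlimited" are kept as distinct values, so the default only ever replaces the former.

```python
import math

DEFAULT_QUEUE_DEPTH = 100  # hypothetical default, for illustration only
UNLIMITED = math.inf       # hypothetical sentinel meaning "no limit"


def effective_job_limit(configured):
    """Return how many jobs the scheduler may examine per cycle.

    None means "not configured" and falls back to the default depth.
    UNLIMITED is a real setting and must not be overwritten.
    """
    if configured is None:
        return DEFAULT_QUEUE_DEPTH
    return configured


def jobs_to_examine(queue, configured):
    """Return the slice of the queue the scheduler would look at."""
    jobs = list(queue)
    limit = effective_job_limit(configured)
    if limit == UNLIMITED:
        return jobs
    return jobs[:limit]
```

With an unlimited setting the whole queue is returned, while an unset value falls back to the default depth.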
- cons_res/job_test - fix to consider a node's current allocated memory. · b4d7de48
  Alejandro Sanchez authored Apr 11, 2019
```
Node memory overallocation wouldn't be properly detected since we would
just be interpreting the available memory as RealMemory - MemSpecLimit,
ignoring other jobs' memory usage.

Bug 5562.
```
  b4d7de48
- cons_res/job_test - prevent a job from overallocating a node's memory. · 6871185a
  Alejandro Sanchez authored Apr 11, 2019
```
This compares a job's memory request against each selected node's available
memory, interpreting the latter for now as RealMemory - MemSpecLimit.

Bug 5562.
```
  6871185a
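Taken together, the two cons_res/job_test commits above describe a per-node memory check. A minimal Python sketch of that arithmetic follows; the function and parameter names are hypothetical, and the real logic lives in Slurm's C code. It assumes all values are in the same unit (for example MB).

```python
def available_memory(real_memory, mem_spec_limit, allocated_memory=0):
    """Memory still free for new jobs on one node.

    The earlier commit used RealMemory - MemSpecLimit only; the later
    one also subtracts memory already allocated to other jobs.
    """
    return max(real_memory - mem_spec_limit - allocated_memory, 0)


def node_can_fit(job_mem_request, real_memory, mem_spec_limit,
                 allocated_memory=0):
    """True if the job's memory request fits on the node."""
    free = available_memory(real_memory, mem_spec_limit, allocated_memory)
    return job_mem_request <= free
```

For a node with 64000 of RealMemory and a MemSpecLimit of 2000, a request of 40000 fits on an idle node but not once 30000 is already allocated to other jobs.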
- Fix wrongly setting start_time to 0 for multi-part jobs. · 457e7517
  Dominik Bartkiewicz authored May 21, 2019
```
Bug 6508
```
  457e7517
- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests. · 8a1e5a52
  Alejandro Sanchez authored May 09, 2019
```
Previously when no memory was explicitly requested the job was assigned
the DefMemPer[CPU|Node] from the first partition in the list (or the
cluster-wide value if the partition wasn't configured with it), even
when evaluating against a different partition.

Bug 6950.
```
  8a1e5a52
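The DefMemPer[CPU|Node] fix above amounts to resolving the default per partition rather than once for the whole request. A hedged Python sketch, assuming partitions are plain dictionaries and using hypothetical names (`def_mem`, `default_mem_for`, `mem_per_partition`):

```python
def default_mem_for(partition, cluster_default):
    """Use the partition's own default, else the cluster-wide value."""
    value = partition.get("def_mem")
    return cluster_default if value is None else value


def mem_per_partition(requested_mem, partitions, cluster_default):
    """Map each candidate partition to the memory value to evaluate with.

    An explicit request always wins; otherwise each partition is
    evaluated with its own default, not the first partition's.
    """
    result = {}
    for part in partitions:
        if requested_mem is not None:
            result[part["name"]] = requested_mem
        else:
            result[part["name"]] = default_mem_for(part, cluster_default)
    return result
```

A job submitted to `debug,batch` without a memory request would then be tested against each partition's own default.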
17 May, 2019 2 commits
- Fix NEWS from previous commit. · 438ffc1c
  Tim Wickberg authored May 16, 2019
```
This is select/cons_res, not select/cons_tres.
```
  438ffc1c
- Only allocate 1 CPU per node with the --overcommit and --nodelist options · 46197135
  Morris Jette authored May 10, 2019
```
Previous select/cons_res logic would allocate one CPU per task on the node

Bug 6981
```
  46197135
16 May, 2019 5 commits

Only allocate 1 CPU per node with the --overcommit option · dd7775ef
Morris Jette authored May 10, 2019
```
Previous select/cons_tres logic would allocate one CPU per task on the node

Bug 6981
```
dd7775ef

modify task layout with --overcommit · 42d7e312

Morris Jette authored May 10, 2019

Modify task layout with --overcommit option plus a heterogeneous job
allocation so that a cyclic task distribution can start happening before
all CPUs on all nodes are fully allocated. The number of tasks per node
will be unchanged from the previous algorithm, but tasks will be distributed
in a cyclic fashion first and then extra tasks placed on nodes with more
CPUs. Previously all CPUs would be fully allocated in a cyclic fashion,
then excess tasks distributed evenly across all allocated nodes.

Bug 6981

42d7e312

Store reservation flags in slurmdbd in a uint64_t. · 46d55dd4

Dominik Bartkiewicz authored May 16, 2019

Add warning to slurm.h.in that no new reservation flags can be
stored in slurmdbd in 19.05. (Although they could still be used by
slurmctld without issue.)

Note that the underlying RPC still uses uint32_t, but this will be
changed before 20.02 on master, and changing the column to uint32_t
in 19.05 just to change it again in 20.02 is best avoided.

Bug 6969.

46d55dd4

Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor. · 2038469f
Nathan Rini authored May 16, 2019
```
Free format_list, plugin_id_select_list, rpc_version_list in
 _free_cluster_cond_members().

Bug 7020.
```
2038469f

Fix archive loading events. · 0d0f9deb

Marshall Garey authored May 15, 2019

There was a syntax error in the MySQL query that inserts the event records
into the event table, introduced by commit 3d61b6aa. The syntax error was
a semicolon in the middle of the query, for example:

insert into "voyager_event_table" (time_start, time_end, node_name,
cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
'1539298628', 'v1', '', 'cold-start', '1017', '0',
'1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...

Bug 7025.

0d0f9deb
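The archive-loading bug above comes down to where the separators go in a multi-row insert: records are joined with commas and the statement is terminated once. The following Python sketch builds such a statement with placeholders; `build_multi_insert` is a hypothetical helper and not the slurmdbd code, and the `%s` placeholder style is an assumption about the database driver.

```python
def build_multi_insert(table, columns, rows):
    """Build one multi-row INSERT plus its flat parameter list.

    Rows are separated by ", " and a single ";" ends the statement,
    so no terminator can land in the middle of the query.
    """
    if not rows:
        raise ValueError("at least one row is required")
    column_list = ", ".join(columns)
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    all_rows = ", ".join([row_placeholder] * len(rows))
    sql = 'insert into "{}" ({}) values {};'.format(
        table, column_list, all_rows)
    params = [value for row in rows for value in row]
    return sql, params
```

The returned statement contains exactly one semicolon, at the end, regardless of how many records are loaded.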

15 May, 2019 1 commit

Avoid call to slurm_get_slurmd_user_id() in _step_connect() if not slurmd. · 0a4c5234

Tim Wickberg authored May 15, 2019

For a stray socket, this call would cause nss_slurm to deadlock: the call
leads to slurm_conf_lock(), which calls getpwuid(), which re-enters the
nss_slurm code, which ends up back here with the slurm_conf_lock already
held, at which point the process never continues.

For nss_slurm, this means a node rebooting with stale sockets will hang
in the middle of the init process, which is a rather unpleasant experience.

So - only handle the stray socket cleanup within the slurmd process itself.

Bug 7030

0a4c5234
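The re-entrancy problem described above can be reproduced in miniature with any non-reentrant lock. This Python sketch is only an analogy: `conf_lock`, `read_config_value`, `name_service_lookup`, and `load_config` are hypothetical stand-ins, and a non-blocking acquire is used so the sketch reports the problem instead of hanging.

```python
import threading

# Non-reentrant lock, standing in for the configuration lock.
conf_lock = threading.Lock()


def read_config_value():
    """Take the lock the way a re-entered caller would.

    A blocking acquire would hang forever if this thread already
    holds the lock; the non-blocking attempt makes that visible.
    """
    if not conf_lock.acquire(blocking=False):
        return "would deadlock"
    try:
        return "ok"
    finally:
        conf_lock.release()


def name_service_lookup():
    """Stand-in for a user lookup that re-enters config-reading code."""
    return read_config_value()


def load_config():
    """Hold the lock, then trigger a lookup that needs the same lock."""
    with conf_lock:
        return name_service_lookup()
```

Calling `read_config_value()` on its own succeeds, while `load_config()` reaches the same function with the lock already held, which is the situation the commit avoids by only doing the cleanup inside slurmd.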

13 May, 2019 1 commit
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
10 May, 2019 3 commits

Prevent leak of cluster_str in sacctmgr_list_runaway_jobs(). · bb9d5e79
Nate Rini authored May 06, 2019
```
Bug 6952.
```
bb9d5e79

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.

Bug 6033.

ddd49896
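Archiving in fixed-size chunks, as described above, can be sketched as a simple batching generator. This is an illustrative Python version with a hypothetical `batches` helper; only the 50k figure comes from the commit message.

```python
def batches(records, batch_size=50_000):
    """Yield successive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

Each yielded slice would be written to its own archive file, so no single file grows beyond the batch size.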

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already existed, we would overwrite
the old archive file with the assumption that these are duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.

Bug 6033.

1e234c3d
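The "append a number instead of overwriting" approach above can be sketched as follows. The Python helper `unique_archive_path` and the `.N` suffix format are assumptions for illustration; the commit message does not specify the exact naming scheme.

```python
import os


def unique_archive_path(path, exists=os.path.exists):
    """Return path if unused, else the first free 'path.N' variant.

    The exists callable is injectable so the logic can be exercised
    without touching the filesystem.
    """
    if not exists(path):
        return path
    suffix = 1
    while exists("{}.{}".format(path, suffix)):
        suffix += 1
    return "{}.{}".format(path, suffix)
```

An existing archive for the same time period is left untouched, and later purges for that period land in a new, numbered file.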

08 May, 2019 1 commit
- remove double occurrence of "the" in comments and docs · 1d31fe95
  Bas Nijholt authored May 07, 2019
  
  1d31fe95
07 May, 2019 3 commits
- Add missing locks around slurmctld_config.server_thread_count. · 94b81685
  Alejandro Sanchez authored May 07, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  94b81685
- Revert "Add missing locks protecting slurmctld_config.server_thread_count access." · b82e9531
  Alejandro Sanchez authored May 07, 2019
```
This reverts commit f3d678d4.
```
  b82e9531
- Add missing locks protecting slurmctld_config.server_thread_count access. · f3d678d4
  Alejandro Sanchez authored Jan 10, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  f3d678d4
06 May, 2019 1 commit

Fix seff memory display overflow · bab13dfd

Felip Moll authored Apr 15, 2019

When the tres_usage_in_max field is empty, it is recorded as '' in the database,
which leads find_tres_count_in_string() to return INFINITE64. Seff treats
INFINITE64 as a valid value. This patch fixes this issue.

Bug 6817

bab13dfd
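One possible way to guard against the sentinel described above is sketched here in Python. This is not the seff patch itself: the parser is a simplified stand-in for find_tres_count_in_string(), the numeric value of `INFINITE64` and the memory TRES id of 2 are assumptions, and `max_memory_used` is a hypothetical name.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # assumed sentinel value


def find_tres_count_in_string(tres_str, tres_id):
    """Simplified stand-in: return the count for tres_id in 'id=count,...'.

    Returns INFINITE64 when the id is absent, including for ''.
    """
    for item in filter(None, tres_str.split(",")):
        key, _, value = item.partition("=")
        if key == str(tres_id):
            return int(value)
    return INFINITE64


def max_memory_used(tres_usage_in_max, mem_tres_id=2):
    """Treat the 'not found' sentinel as no recorded usage, not a value."""
    count = find_tres_count_in_string(tres_usage_in_max, mem_tres_id)
    return 0 if count == INFINITE64 else count
```

With this guard an empty tres_usage_in_max string yields 0 rather than an enormous memory figure.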

03 May, 2019 3 commits
- Free memory before exiting in sacctmgr_list_runaway_jobs(). · f56fc717
  Nate Rini authored May 03, 2019
```
Bug 6880/6952.
```
  f56fc717
- Ignore screen part of local DISPLAY in x11_get_display(). · e87ed60e
  Dominik Bartkiewicz authored May 03, 2019
```
Bug 6959.
```
  e87ed60e
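An X11 DISPLAY value generally has the form `host:display.screen`. As a rough illustration of ignoring the screen part, here is a Python sketch; `parse_display` is a hypothetical helper and does not reproduce x11_get_display().

```python
def parse_display(display):
    """Split 'host:display[.screen]' into (host, display_number).

    The optional '.screen' suffix is dropped. A local display such
    as ':0.0' yields an empty host.
    """
    host, sep, rest = display.rpartition(":")
    if not sep:
        raise ValueError("DISPLAY has no ':' separator: {!r}".format(display))
    number = rest.split(".", 1)[0]
    return host, int(number)
```

Both `:0` and `:0.1` resolve to display 0 on the local host.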
- Add timers to site_factor plugin API calls to warn about slow plugins. · 2097c159
  Nate Rini authored May 02, 2019
```
Bug 6944.
```
  2097c159
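Timing a plugin call and warning when it is slow, as in the site_factor commit above, follows a common pattern. A minimal Python sketch is given here; `timed_call`, its parameters, and the one-second threshold are all hypothetical and unrelated to the actual Slurm timer macros.

```python
import logging
import time


def timed_call(func, *args, warn_after=1.0,
               warn=logging.warning, clock=time.monotonic):
    """Call func(*args) and warn if it takes longer than warn_after seconds.

    warn and clock are injectable so the behaviour can be tested
    without real delays.
    """
    start = clock()
    result = func(*args)
    elapsed = clock() - start
    if elapsed > warn_after:
        name = getattr(func, "__name__", "plugin call")
        warn("{} took {:.3f}s (threshold {:.3f}s)".format(
            name, elapsed, warn_after))
    return result
```

The wrapped call's return value is passed through unchanged, so the timer can be added around existing call sites without altering their results.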
02 May, 2019 6 commits

NVML - remove unneeded {}. · 36e2a615
Danny Auble authored May 02, 2019

36e2a615

NVML - Fix clang warning about unneeded variable initialization. · 7894dd83
Broderick Gardner authored Apr 08, 2019
```
This is the same because xstrdup returns null on null.

Bug 6812
```
7894dd83

NVML - Get rid of unneeded * when passing nvmlDevice_t to functions. · 55575ad3
Danny Auble authored May 02, 2019
```
No real code change.
```
55575ad3

Fix deprecated group by clause to use order by. · 4bc94a6d
Tim Wickberg authored May 02, 2019
```
It appears this is really what this was supposed to be anyway.

Bug 5950
```
4bc94a6d

Fix resubmit to sibling default on fed requeue · 822fe77e

Broderick Gardner authored Apr 18, 2019

On requeue, the origin cluster job record is copied to submit
to sibling clusters. If the job was originally submitted
to accept the cluster's default account, partition, etc., those fields
are now filled in on the origin. Here we add flags to indicate
that those fields need to be cleared on resubmission to siblings.

Bug 6064

822fe77e
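The flag-based approach described above can be sketched as follows. This Python version is only an illustration: the flag names, their bit values, and the dictionary job record are hypothetical, and only the account and partition fields named in the commit message are modelled.

```python
# Hypothetical flag bits recording which fields were left to the default.
USE_DEFAULT_ACCOUNT = 0x1
USE_DEFAULT_PARTITION = 0x2


def prepare_sibling_submission(origin_job):
    """Copy the origin job record, clearing fields that were defaulted.

    Cleared fields let each sibling cluster apply its own default
    instead of inheriting the value filled in on the origin.
    """
    job = dict(origin_job)
    flags = job.get("default_flags", 0)
    if flags & USE_DEFAULT_ACCOUNT:
        job["account"] = None
    if flags & USE_DEFAULT_PARTITION:
        job["partition"] = None
    return job
```

Fields the user set explicitly are copied as-is, and the origin record itself is left unmodified.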

Fix clearing federation cluster lock on requeue · 47909f8e

Broderick Gardner authored Mar 25, 2019

This is a holdover from when the fed job_info list was added.
The cluster lock has to be cleared from both the job_ptr and
the job_info.

Bug 6064

47909f8e

01 May, 2019 2 commits
- Start NEWS for v19.05.0rc2. · c50dbb6e
  Tim Wickberg authored Apr 30, 2019
  
  c50dbb6e
- Change NEWS header for 19.05.0rc1 · 260b7e09
  Tim Wickberg authored Apr 30, 2019
  
  260b7e09
30 Apr, 2019 4 commits
- Fix multi-cluster srun's with select/cray · 7c100f74
  Matt Ezell authored Oct 17, 2018
```
and other_cons_res.

continuation of previous commit.

Bug 5680
```
  7c100f74
- Fix memory leak in group_cache.c · 876bd712
  Danny Auble authored Apr 30, 2019
```
Blessed by Tim.
```
  876bd712
- Expanded usagefactor to match the documentation · 43ef4f75
  Jason Booth authored Apr 16, 2019
```
Usagefactor matches the documentation and now multiplies TRES time
limits and usage.

Bug 5435
```
  43ef4f75
- pmi2: add mutex around all API calls to ensure thread safety. · 541c6b50
  Dineshkumar RAJAGOPAL authored Jan 18, 2019
```
This is very coarse-grained locking, but as the initial implementation
did not anticipate concurrent access this is the safest approach for now.

Bug 5638.
```
  541c6b50