Commits · 0a4c5234d5048e02b47b33ed282b4325e618f24f · Manuel G. Marciani / ces_slurm_simulator

15 May, 2019 1 commit

Avoid call to slurm_get_slurmd_user_id() in _step_connect() if not slurmd. · 0a4c5234

Tim Wickberg authored May 15, 2019

For a stray socket, this call would cause nss_slurm to deadlock: the
calling path leads to slurm_conf_lock(), which calls getpwuid(), which
re-enters the nss_slurm code and ends up back here with the
slurm_conf_lock already held, at which point the process will never
continue.

For nss_slurm, this means a node rebooting with stale sockets will hang
in the middle of the init process, which is a rather unpleasant experience.

So - only handle the stray socket cleanup within the slurmd process itself.

Bug 7030

0a4c5234
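
The re-entry described in this commit message can be modelled in a few lines. This is only an illustrative Python sketch, not Slurm's C code: every function name below is a hypothetical stand-in for the call it is named after, and a non-blocking acquire is used so that the would-be deadlock shows up as an exception instead of a hang.

```python
import threading

# Plain (non-reentrant) lock standing in for the lock behind slurm_conf_lock().
conf_lock = threading.Lock()


class WouldDeadlock(Exception):
    """Raised where a real blocking acquire would hang forever."""


def slurmd_user_id():
    # Hypothetical stand-in for slurm_get_slurmd_user_id(): needs the config lock.
    if not conf_lock.acquire(blocking=False):
        raise WouldDeadlock("config lock already held further up the call chain")
    try:
        return 0
    finally:
        conf_lock.release()


def step_connect(is_slurmd, guarded):
    # Assume a stray socket was found and is a candidate for cleanup.
    if guarded and not is_slurmd:
        # The fix: outside slurmd, skip the cleanup and never touch the lock.
        return "stray socket left alone"
    slurmd_user_id()
    return "stray socket cleaned"


def lookup_user_via_nss(guarded):
    # Stand-in for getpwuid() re-entering the NSS plugin (never inside slurmd).
    return step_connect(is_slurmd=False, guarded=guarded)


def load_conf(guarded):
    # Stand-in for a path that takes the config lock, then resolves a user.
    with conf_lock:
        return lookup_user_via_nss(guarded)
```

With `guarded=False` the inner acquire fails because the same call chain already holds the lock; with `guarded=True` the non-slurmd path returns before reaching it.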

13 May, 2019 1 commit
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
10 May, 2019 3 commits

Prevent leak of cluster_str in sacctmgr_list_runaway_jobs(). · bb9d5e79
Nate Rini authored May 06, 2019
```
Bug 6952.
```
bb9d5e79

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.

Bug 6033.

ddd49896
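
As a rough illustration of the batching idea (not the actual accounting_storage code), a fixed-size chunking helper might look like the sketch below. The function name and the in-memory list are assumptions made for the example; only the 50k limit comes from the commit message.

```python
ARCHIVE_BATCH_SIZE = 50_000  # limit named in the commit message


def archive_batches(records, batch_size=ARCHIVE_BATCH_SIZE):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

Each yielded slice would then be written to its own archive file, which keeps any single file bounded in size.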

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already existed, we would overwrite
the old archive file on the assumption that these were duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.

Bug 6033.

1e234c3d
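
A minimal sketch of the "append a number instead of overwriting" approach follows. The `.N` suffix format, the function name, and the injectable `exists` predicate are assumptions for illustration; the commit message does not say how the number is attached to the file name.

```python
import os


def unique_archive_path(path, exists=os.path.exists):
    """Return path if it is unused, else path.1, path.2, ... (first free one)."""
    if not exists(path):
        return path
    suffix = 1
    while exists(f"{path}.{suffix}"):
        suffix += 1
    return f"{path}.{suffix}"
```

Passing `exists` as a parameter keeps the helper easy to exercise without touching the filesystem; in real use the default `os.path.exists` applies.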

08 May, 2019 1 commit
- remove double occurrence of "the" in comments and docs · 1d31fe95
  Bas Nijholt authored May 07, 2019
  
  1d31fe95
07 May, 2019 3 commits
- Add missing locks around slurmctld_config.server_thread_count. · 94b81685
  Alejandro Sanchez authored May 07, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  94b81685
- Revert "Add missing locks protecting slurmctld_config.server_thread_count access." · b82e9531
  Alejandro Sanchez authored May 07, 2019
```
This reverts commit f3d678d4.
```
  b82e9531
- Add missing locks protecting slurmctld_config.server_thread_count access. · f3d678d4
  Alejandro Sanchez authored Jan 10, 2019
```
Reported as conflicting thread load operations by valgrind --tool=drd.

Bugs 6189 and 4159.
```
  f3d678d4
06 May, 2019 1 commit

Fix seff memory display overflow · bab13dfd

Felip Moll authored Apr 15, 2019

When the tres_usage_in_max field is empty, it is recorded as '' in the
database, which leads find_tres_count_in_string() to return INFINITE64.
Seff treated INFINITE64 as a valid value. This patch fixes that.

Bug 6817

bab13dfd
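
The failure mode can be sketched as follows. This is illustrative Python rather than the actual patch: the helper names are hypothetical, INFINITE64 is assumed to be the all-ones 64-bit value, the TRES string is assumed to have the form `id=count,id=count`, and memory is assumed to have TRES id 2.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # assumed sentinel meaning "not found"
TRES_MEM = 2                     # assumed TRES id for memory


def find_tres_count(tres_str, tres_id):
    """Return the count for tres_id, or INFINITE64 if it is absent."""
    for item in tres_str.split(","):
        key, sep, value = item.partition("=")
        if sep and key == str(tres_id):
            return int(value)
    return INFINITE64


def max_memory_or_zero(tres_str):
    """Treat the sentinel as 'no data' instead of as a real byte count."""
    count = find_tres_count(tres_str, TRES_MEM)
    return 0 if count == INFINITE64 else count
```

Without the sentinel check, an empty field would be displayed as a memory value of roughly 1.8e19 bytes.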

03 May, 2019 3 commits
- Free memory before exiting in sacctmgr_list_runaway_jobs(). · f56fc717
  Nate Rini authored May 03, 2019
```
Bug 6880/6952.
```
  f56fc717
- Ignore screen part of local DISPLAY in x11_get_display(). · e87ed60e
  Dominik Bartkiewicz authored May 03, 2019
```
Bug 6959.
```
  e87ed60e
- Add timers to site_factor plugin API calls to warn about slow plugins. · 2097c159
  Nate Rini authored May 02, 2019
```
Bug 6944.
```
  2097c159
02 May, 2019 6 commits

NVML - remove unneeded {}. · 36e2a615
Danny Auble authored May 02, 2019

36e2a615
NVML - Fix clang warning about unneeded variable initialization. · 7894dd83
Broderick Gardner authored Apr 08, 2019
```
This is the same because xstrdup returns null on null.
Bug 6812
```
7894dd83
NVML - Get rid of unneeded * when passing nvmlDevice_t to functions. · 55575ad3
Danny Auble authored May 02, 2019
```
No real code change.
```
55575ad3
Fix deprecated group by clause to use order by. · 4bc94a6d
Tim Wickberg authored May 02, 2019
```
It appears this is really what this was supposed to be anyway.

Bug 5950
```
4bc94a6d

Fix resubmit to sibling default on fed requeue · 822fe77e

Broderick Gardner authored Apr 18, 2019

On requeue, the origin cluster's job record is copied for submission
to sibling clusters. If the job was originally submitted
to accept the cluster's default account, partition, etc., those fields
are by now filled in on the origin. Here we add flags to indicate
that those fields need to be cleared on resubmission to siblings.
Bug 6064

822fe77e

Fix clearing federation cluster lock on requeue · 47909f8e

Broderick Gardner authored Mar 25, 2019

This is a holdover from when the fed job_info list was added.
The cluster lock has to be cleared from both the job_ptr and
the job_info.
Bug 6064

47909f8e

01 May, 2019 2 commits
- Start NEWS for v19.05.0rc2. · c50dbb6e
  Tim Wickberg authored Apr 30, 2019
  
  c50dbb6e
- Change NEWS header for 19.05.0rc1 · 260b7e09
  Tim Wickberg authored Apr 30, 2019
  
  260b7e09
30 Apr, 2019 4 commits
- Fix multi-cluster srun's with select/cray · 7c100f74
  Matt Ezell authored Oct 17, 2018
```
and other_cons_res.

continuation of previous commit.

Bug 5680
```
  7c100f74
- Fix memory leak in group_cache.c · 876bd712
  Danny Auble authored Apr 30, 2019
```
Blessed by Tim.
```
  876bd712
- Expanded usagefactor to match the documentation · 43ef4f75
  Jason Booth authored Apr 16, 2019
```
Usagefactor matches the documentation and now multiplies TRES time
limits and usage.

Bug 5435
```
  43ef4f75
- pmi2: add mutex around all API calls to ensure thread safety. · 541c6b50
  Dineshkumar RAJAGOPAL authored Jan 18, 2019
```
This is very coarse-grained locking, but as the initial implementation
did not anticipate concurrent access this is the safest approach for now.

Bug 5638.
```
  541c6b50
29 Apr, 2019 11 commits
- Add NEWS and RELEASE_NOTES for slurm.spec changes on Cray Aries systems. · 9a761238
  Tim Wickberg authored Apr 29, 2019
```
Bug 6632.
```
  9a761238
- Add NEWS for previous two commits · 00a8e724
  Brian Christiansen authored Apr 25, 2019
```
Bug 6513
```
  00a8e724
- Fix printing duplicate error messages of lua rejected jobs · 297a6880
  Nate Rini authored Apr 22, 2019
```
Regression from 70b4e06d.

Bug 6892.
```
  297a6880
- Fix segfault when loading/unloading lua job submit plugin multiple times · 8920863a
  Nate Rini authored Apr 22, 2019
```
Bug 6895.
```
  8920863a
- Allow submit plugins to be turned on and off with scontrol reconfig · a0e14237
  Brian Christiansen authored Apr 25, 2019
```
Bug 6895
```
  a0e14237
- Fix unnecessary reloading of submit plugins · b50ac244
  Brian Christiansen authored Apr 24, 2019
```
Bug 6895
```
  b50ac244
- mpi/pmix: replace the PMIX_VAL_SET macro with PMIX_INFO_LOAD · f13369b0
  Boris Karasev authored Feb 28, 2019
```
PMIX_VAL_SET will not be supported in PMIx v4 or later. This commit changes
the use of the old (and non-standard) PMIX_VAL_SET macro to the standardized
PMIX_INFO_LOAD (which is used within a new internal PMIXP_KVP_ADD macro).

Bug 6624.
```
  f13369b0
- mpi/pmix: use the Tree-based collective type for empty fence data. · fa5620df
  Boris Karasev authored Mar 04, 2019
```
This commit changes the logic for selecting the collective type. The
Tree-based algorithm is now selected for a fence with an empty data
contribution, which allows for improved fence performance.

Bug 6637.
```
  fa5620df
- Add NEWS and RELEASE_NOTES entry for pam_slurm_adopt change. · 9763be13
  Dominik Bartkiewicz authored Apr 29, 2019
```
Bug 6411.
```
  9763be13
- Make alloc_booting_nodes the default behavior. · a626691b
  Brian Christiansen authored Apr 29, 2019
```
Continuation of 36c30487

Bug 6782
```
  a626691b
- Add initial documentation for the cli_filter plugin. · c175a538
  Doug Jacobsen authored Mar 28, 2019
  
  c175a538
26 Apr, 2019 3 commits

Limit records per single SQL statement when loading archived data. · 34e9d41b

Nate Rini authored Apr 26, 2019



Otherwise, we could send communication packets bigger than max_allowed_packet.

Bug 6832.

Co-authored-by: Tim Wickberg <tim@schedmd.com>

34e9d41b
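
One way to picture the per-statement limit is a generator that emits several multi-row INSERT statements instead of a single huge one. This is a hedged Python sketch, not the accounting_storage/mysql code: the function name, the `max_rows` default, and the `%s` placeholder style are all assumptions for illustration.

```python
def insert_statements(table, columns, rows, max_rows=1000):
    """Yield (sql, params) pairs, each covering at most max_rows rows."""
    column_list = ", ".join(columns)
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    for start in range(0, len(rows), max_rows):
        chunk = rows[start:start + max_rows]
        sql = (f"INSERT INTO {table} ({column_list}) VALUES "
               + ", ".join([one_row] * len(chunk)))
        # Flatten the chunk so the values line up with the placeholders.
        params = [value for row in chunk for value in row]
        yield sql, params
```

Capping the row count bounds the size of each statement, which is what keeps a single packet from growing past max_allowed_packet.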

accounting_storage/mysql - fix memory leak in the archive load logic. · e8567e06
Alejandro Sanchez authored Apr 26, 2019
```
Regression introduced in 8d643e79.

Bug 6832.
```
e8567e06

accounting_storage/mysql - fix SIGABRT in the archive load logic. · e174e135

Nate Rini authored Apr 26, 2019

The problem was freeing an interior pointer to buffer contents before
the call to FREE_NULL_BUFFER. The issue was only triggered when loading
archived data with protocol version < 17.11.

Regression introduced in 8d643e79.

Bug 6832.

e174e135

25 Apr, 2019 1 commit
- Allow Het Jobs to work on a Cray. · a4b6130a
  Danny Auble authored Apr 25, 2019
  
  a4b6130a