Commits · 6db8d02d85d18ff565d45cc136c971242328fdc6 · Manuel G. Marciani / ces_slurm_simulator

16 May, 2019 3 commits

Merge branch 'slurm-18.08' into slurm-19.05 · 6db8d02d
Alejandro Sanchez authored May 16, 2019

6db8d02d

Marshall Garey authored May 15, 2019

There was a syntax error in the MySQL query that inserts event records
into the event table, caused by commit 3d61b6aa. The syntax error was
a semicolon in the middle of the query, for example:

insert into "voyager_event_table" (time_start, time_end, node_name,
cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
'1539298628', 'v1', '', 'cold-start', '1017', '0',
'1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...

Bug 7025.

0d0f9deb
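
The failure mode above can be illustrated with a minimal Python sketch. It is not the Slurm C code: `build_multi_insert` and `insert_events` are hypothetical helpers, `sqlite3` stands in for MySQL, and only three of the event table's columns are used. The point is that rows in a multi-row insert are separated by commas only, and the statement is terminated exactly once.

```python
import sqlite3


def build_multi_insert(table, columns, row_count):
    """Return one INSERT statement with row_count placeholder rows."""
    if row_count < 1:
        raise ValueError("need at least one row")
    row = "(" + ", ".join("?" for _ in columns) + ")"
    # Rows are joined with commas only; no semicolon between records.
    values = ", ".join(row for _ in range(row_count))
    # The single terminating semicolon goes at the very end.
    return 'insert into "{}" ({}) values {};'.format(
        table, ", ".join(columns), values
    )


def insert_events(conn, table, columns, rows):
    """Insert all rows with one statement and return the SQL that was used."""
    sql = build_multi_insert(table, columns, len(rows))
    flat = [value for record in rows for value in record]
    conn.execute(sql, flat)
    return sql
```

Building the statement from a list of rows and joining once keeps the separator logic in a single place, which makes a stray terminator between records harder to introduce.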

Fix regression caused by . · c77d7895

Marshall Garey authored May 16, 2019

This commit caused loading usage table archive files to fail.
Specifically, wckey and assoc hourly/daily/monthly usage tables and the
cluster usage tables archive files would all fail to load.

Bug 7025.

c77d7895

15 May, 2019 5 commits

Avoid call to slurm_get_slurmd_user_id() in _step_connect() if not slurmd. · 0a4c5234

Tim Wickberg authored May 15, 2019

For a stray socket, this call would cause nss_slurm to deadlock: any
calling path that leads to slurm_conf_lock() will call getpwuid(), which
re-enters the nss_slurm code and ends up back here with the
slurm_conf_lock already held, at which point the process will never
continue.

For nss_slurm, this means a node rebooting with stale sockets will hang
in the middle of the init process, which is a rather unpleasant experience.

So - only handle the stray socket cleanup within the slurmd process itself.

Bug 7030

0a4c5234
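
The general mechanism behind this hang, taking a non-reentrant lock a second time on the same call path, can be sketched in Python. This is an illustration only: `conf_lock` is a hypothetical stand-in for slurm_conf_lock, and a timeout replaces the blocking acquire so the sketch terminates instead of hanging.

```python
import threading


def reacquire_while_held(lock, timeout=0.05):
    """Take the lock, then try to take it again on the same call path.

    Returns True if the second acquire succeeded. With a real blocking
    acquire and a non-reentrant lock, the second attempt never returns.
    """
    with lock:
        acquired = lock.acquire(timeout=timeout)
        if acquired:
            lock.release()
        return acquired


conf_lock = threading.Lock()  # hypothetical stand-in for slurm_conf_lock
blocked = not reacquire_while_held(conf_lock)
```

The commit avoids the problem by not entering the locking path at all outside slurmd, rather than by changing the lock.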

Merge branch 'slurm-18.08' into slurm-19.05 · e6a06c3d
Alejandro Sanchez authored May 15, 2019

e6a06c3d

Replace stat() syscall with access(). · 3c75856d

Alejandro Sanchez authored May 15, 2019

access() is more suitable for checking whether a file exists, and it
avoids the unnecessary struct stat variable, since we don't care about
the file information.

Continuation of 1e234c3d.

Bug 6033.

3c75856d
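
As a rough Python analogue of this change (the commit itself is in C), an existence check through `os.access` with `os.F_OK` answers only "is the path there?" without returning file metadata. `file_exists` is a hypothetical helper name.

```python
import os


def file_exists(path):
    """Existence check only: no stat result is built or returned."""
    return os.access(path, os.F_OK)
```

By contrast, `os.stat(path)` returns a full metadata record that the caller would then discard.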

Remove strncpy and snprintf from _make_archive_name. · 1871fd31

Marshall Garey authored May 15, 2019

Replace strncpy with xstrdup and snprintf with xstrfmtcat respectively
in _make_archive_name. This also fixes Coverity error CID 198462.

Continuation of 1e234c3d.

Bug 6033.

1871fd31

Modify gres/gpu logic for multiple socket use · 636e45a8
Morris Jette authored May 14, 2019

636e45a8

14 May, 2019 3 commits
- Alter tests to use new helper functions · 2032520f
  Danny Auble authored May 14, 2019
```
Continuation of 3beabdb1
```
  2032520f
- Add helper functions to determine which real select plugin we are using. · 24aa8365
  Danny Auble authored May 14, 2019
```
Continuation of 3beabdb1
```
  24aa8365
- Fixes for gres/gpu tests · 6b33e4a7
  Morris Jette authored May 13, 2019
```
These test changes are designed to support gres/gpu configurations
where only some sockets actually have GPUs. The tests will not work
with all possible configurations, but this change will result in the
tests working in more cases.
```
  6b33e4a7
13 May, 2019 4 commits
- Change select name checks from cray to cray_aries · 3beabdb1
  Morris Jette authored May 13, 2019
```
select/cray replaced by select/cray_aries in tests
```
  3beabdb1
- Merge branch 'slurm-18.08' into slurm-19.05 · 2f020f50
  Tim Wickberg authored May 13, 2019
  
  2f020f50
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
- Fix typo in scontrol man page. · 6c2a493a
  Chad Vizino authored May 13, 2019
```
Bug 6902.
```
  6c2a493a
11 May, 2019 3 commits
- Ensure that all test jobs complete · d1b472a1
  Morris Jette authored May 10, 2019
```
If they do not, then explicitly cancel them
```
  d1b472a1
- Add time limit to test jobs to avoid possible vestigial jobs on failure · f13d76e5
  Morris Jette authored May 10, 2019
  
  f13d76e5
- Modify test's memory requirement · a70d4e34
  Morris Jette authored May 10, 2019
```
Change to work on CPU-rich / memory-poor nodes.
```
  a70d4e34
10 May, 2019 12 commits

Prevent leak of cluster_str in sacctmgr_list_runaway_jobs(). · bb9d5e79
Nate Rini authored May 06, 2019
```
Bug 6952.
```
bb9d5e79
Add cleanup to _get_runaway_jobs() · 18c045d7
Nate Rini authored May 06, 2019
```
Fix leaks of cluster_list and db_jobs_list.

Bug 6952.
```
18c045d7
Remove extra whitespace in clus_jobs declaration. · 7bbb6220
Nate Rini authored May 06, 2019
```
No functional change.

Bug 6952.
```
7bbb6220

Cleanup runaway jobs list to avoid leaking memory. · 6f60c6ca

Nate Rini authored May 06, 2019

Call _purge_known_jobs() from _get_runaway_jobs() to purge
known jobs (to slurmctld) from the list.

Removed secondary list runaway_jobs as it was no longer needed.
This also avoids leaking all the runaway_jobs.

Bug 6952.

6f60c6ca

Merge branch 'slurm-18.08' into slurm-19.05 · 62dc419e
Alejandro Sanchez authored May 10, 2019

62dc419e
Document behavior of duplicate archive file names. · 7e7fd1bc
Marshall Garey authored Apr 25, 2019
```
Bug 6033.
```
7e7fd1bc

Prevent infinite loop if 0 records are archived. · df5f748d

Marshall Garey authored Apr 25, 2019

If _get_oldest_record() finds a record to archive/purge, then archive
should always archive at least one record. If for whatever reason it
fails to archive any records (_archive_table() returns a 0), then we
don't want to call continue, but want to return an error. Calling continue
to go back to the beginning of the while loop would result in an
infinite loop.

Bug 6033.

df5f748d
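
The control flow described above can be sketched as follows. This is a simplified Python illustration, not the Slurm implementation: `archive_all`, `ArchiveError`, and the two callables are hypothetical stand-ins for the archive loop, its error return, `_get_oldest_record()`, and `_archive_table()`.

```python
class ArchiveError(Exception):
    """Raised when a record is eligible but nothing could be archived."""


def archive_all(get_oldest_record, archive_batch):
    """Archive until no eligible record remains; return the total count."""
    total = 0
    while get_oldest_record() is not None:
        archived = archive_batch()
        if archived == 0:
            # Looping again would find the same oldest record forever,
            # so report an error instead of continuing.
            raise ArchiveError("eligible record found but 0 records archived")
        total += archived
    return total
```

The guard relies on the invariant stated in the commit message: if an eligible record exists, a successful pass must archive at least one record.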

Make archive job sql query consistent with purge. · 90471db8
Marshall Garey authored Apr 25, 2019
```
Bug 6033.
```
90471db8

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.

Bug 6033.

ddd49896
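
A minimal Python sketch of the batching idea, assuming the records are already in memory (the real code works against database tables, so this only illustrates the size cap). `batches` is a hypothetical helper; the 50,000 limit is the figure from the commit message.

```python
def batches(records, limit=50000):
    """Yield consecutive slices of records, each at most limit long."""
    if limit < 1:
        raise ValueError("limit must be positive")
    for start in range(0, len(records), limit):
        yield records[start:start + limit]
```

Each slice would then be written to its own archive file, which bounds the size of any single file.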

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already exists, we would overwrite
the old archive file with the assumption that these are duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.

Bug 6033.

1e234c3d
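
The naming scheme can be sketched in Python as below. The exact suffix format is an assumption (a dot followed by a counter); the commit message only says that a number is appended. `unique_archive_name` is a hypothetical helper, not the `_make_archive_name` function from the Slurm source.

```python
import os


def unique_archive_name(path):
    """Return path if unused, otherwise path with the first free numeric suffix."""
    if not os.path.exists(path):
        return path
    suffix = 1
    # Assumed format: "<name>.<n>"; the real suffix format may differ.
    while os.path.exists("{}.{}".format(path, suffix)):
        suffix += 1
    return "{}.{}".format(path, suffix)
```

Because an existing file is never chosen, records written for the same time period at a later date land in a new file instead of replacing the old one.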

Remove unused static variable high_buffer_size. · 3ffb4b4c
Marshall Garey authored Apr 24, 2019
```
It was set but never read.

Bug 6033.
```
3ffb4b4c

Use correct signed/unsigned types. · 4a26e486

Marshall Garey authored Apr 23, 2019

Change a few variables in archiving to use the correct signed or
unsigned type to avoid implicit casting.

Bug 6033.

4a26e486

09 May, 2019 6 commits
- Cosmetic changes to documentation · 030e9157
  Morris Jette authored May 09, 2019
```
Replace "hetjob" with "heterogeneous job" for better clarity.
```
  030e9157
- Improve logging if cons_tres bitmap size is wrong · 0b05f04f
  Morris Jette authored May 09, 2019
```
This just adds additional debugging information to an error message.

bug 6990
```
  0b05f04f
- make cons_tres work like cons_res for slurmd config parsing · 7d4ec21f
  Morris Jette authored May 09, 2019
```
Otherwise, with CR_CPU and threads defined, slurmd will report
the core count as the CPU count and mess up scheduling.

bug 6990
```
  7d4ec21f
- doc/html/team.shtml - add Marcin to Slurm Team page. · 2522a7e5
  Marcin Stolarek authored May 09, 2019
```
Bug 6966.
```
  2522a7e5
- mpi/pmix - remove unused _pmixp_pp_iter_count variable. · 68b1f5aa
  Broderick Gardner authored May 09, 2019
```
Bug 6799.
```
  68b1f5aa
- Fix typo in sreport output · 7982a86a
  Chad Vizino authored Apr 12, 2019
```
Bug 6854
```
  7982a86a
08 May, 2019 4 commits
- Merge branch 'slurm-18.08' into slurm-19.05 · 65b74886
  Tim Wickberg authored May 08, 2019
  
  65b74886
- Renumber newly added flags to avoid a conflict in 19.05. · 26ccbec1
  Tim Wickberg authored May 08, 2019
```
These conflict with JOB_MEM_SET/JOB_RESIZED in 19.05. Since 19.05rc1
has shipped - but no 18.08 maintenance releases have shipped with these
new flags - it is safer to renumber them here to avoid the merge conflict
going into 19.05.

Bug 6064.
```
  26ccbec1
- Add Bas Nijholt to contributor list · 900169c8
  Morris Jette authored May 07, 2019
  
  900169c8
- remove double occurrence of "the" in comments and docs · 1d31fe95
  Bas Nijholt authored May 07, 2019
  
  1d31fe95