Commits · ee9db31c0ce7e6aa66c91b3e5645f71f68a1e49d · Manuel G. Marciani / ces_slurm_simulator

23 May, 2019 10 commits
- Fix reboot failure message getting to event database · ee9db31c
  Brian Christiansen authored May 16, 2019
```
The reason was being set after the message was sent to the db. Also
clear the drain and reboot states before the message is sent so that
the event state will show DOWN.

Bug 6964
```
  ee9db31c
- Create node reboot event in database · f84f60d0
  Brian Christiansen authored May 16, 2019
```
Bug 6964
```
  f84f60d0
- On reboot ASAP clear node from avail_node_bitmap · 7fc23eff
  Brian Christiansen authored May 16, 2019
```
so that new jobs can't get on the node.

Bug 6964
```
  7fc23eff
- Prevent slurmctld from potential segfault after job_start_data() called · 3238e02d
  Dominik Bartkiewicz authored Apr 30, 2019
```
for completing job.

Bug 6927
```
  3238e02d
- Update documentation for bf_max_time · a7f149be
  Ben Roberts authored May 23, 2019
```
Bug 6980
```
  a7f149be
- Update slurm.conf man page with default value of AccountingStoragePort · a20a0017
  Ben Roberts authored May 02, 2019
```
Bug 6945
```
  a20a0017
- Fix sacctmgr --parsable2 output for reservations and tres · 45bfc4dc
  Dominik Bartkiewicz authored Apr 30, 2019
```
Bug 6926
```
  45bfc4dc
- Make it so dependent jobs reset the AccrueTime and do not count against any AccrueTime limits. · c2bc255c
  Alejandro Sanchez authored May 22, 2019
```
Continuation of 89b791bf.

Bug 7045.
```
  c2bc255c
- Add new job bit_flags of JOB_DEPENDENT. · f591f0c9
  Alejandro Sanchez authored May 21, 2019
```
To indicate that a job is dependent or has an invalid dependency.
Not used for now, just added and removed according to its meaning.

Bug 7045.
```
  f591f0c9
- Fix issue with a 17.11 sbcast call to a 18.08 daemon. · 1d66b395
  Albert Gil authored May 23, 2019
```
Bug 7080
```
  1d66b395
22 May, 2019 6 commits

Clarify how SLURM_SUBMIT_DIR is set in salloc/sbatch/srun man pages. · 69d5d94b
Ben Roberts authored May 22, 2019
```
Bug 7092.
```
69d5d94b

Update error message to be more descriptive if port selection fails. · 13230170

Tim Wickberg authored May 22, 2019

Can happen if SrunPortRange has been set too small, especially on shared
login nodes launching multiple large-scale srun processes.

13230170

Update sacct man page · 1a563823
Ben Roberts authored Apr 30, 2019
```
Bug 6916
```
1a563823
Update Elastic Computing docs with TCPTimeout info · c06b1c27
Ben Roberts authored May 10, 2019
```
Bug 6995
```
c06b1c27

Use correct rank for cloud stepd's. · e7d4d593

Marshall Garey authored Apr 18, 2019

Job steps that run on cloud nodes and use the alias_list - in other
words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
directly back to the slurmctld. To make that happen, we set the parent
rank of each stepd to -1. However, we also set the rank of each stepd to
0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
the slurmctld, they would tell slurmctld to clean up node 0 in the step
allocation. So, multi-node step allocations weren't cleaning up after
the steps completed and would cause subsequent job steps to hang. The
step allocations would only clean up properly at the end of the job.

Ensure that each stepd uses the correct rank so that job steps are
properly cleaned up after each step completes.

Bug 6467.

e7d4d593
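
The cleanup problem described in e7d4d593 can be illustrated with a toy model. This is a minimal sketch, not Slurm's code: `completed_nodes` and `step_fully_cleaned` are hypothetical names, and the model assumes the controller simply marks the node index named by each reported rank as complete.

```python
def completed_nodes(step_node_count, ranks_reported):
    """Return the node indices a controller would mark complete.

    Hypothetical model: each stepd reports one rank, and the controller
    cleans up the node at that index in the step allocation.
    """
    done = set()
    for rank in ranks_reported:
        if not 0 <= rank < step_node_count:
            raise ValueError(f"rank {rank} is outside the step allocation")
        done.add(rank)
    return done


def step_fully_cleaned(step_node_count, ranks_reported):
    """True only if every node in the step allocation was cleaned up."""
    return completed_nodes(step_node_count, ranks_reported) == set(
        range(step_node_count)
    )


# Before the fix: every stepd on a 3-node step reports rank 0.
buggy_reports = [0, 0, 0]
# After the fix: each stepd reports its own rank.
fixed_reports = [0, 1, 2]
```

With `buggy_reports`, only node 0 is ever marked complete, which matches the described symptom of multi-node step allocations not cleaning up until the job ends.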

Move two NEWS entries to appropriate maintenance release. · 09a7da34
Alejandro Sanchez authored May 22, 2019
```
They were associated with these two commits:

b4d7de48
6871185a

Bug 5562.
```
09a7da34

21 May, 2019 6 commits

Prevent use of uninitialized variable · 1244dc98
Morris Jette authored Apr 25, 2019
```
Error reported by CLANG

Cherry pick to 18.08.

Bug 6996.
```
1244dc98

Correctly set unlimited sched_job_limit · 69621444

Dominik Bartkiewicz authored May 06, 2019



An unlimited value could get overwritten with the default queue depth,
preventing the whole queue from being looked at, especially in a
high-throughput environment.

Bug 6822

Co-authored-by: Morris Jette <jette@schedmd.com>

69621444

cons_res/job_test - fix to consider a node's current allocated memory. · b4d7de48

Alejandro Sanchez authored Apr 11, 2019

Node memory overallocation wouldn't be properly detected since we would
just be interpreting the available memory as RealMemory - MemSpecLimit,
ignoring other jobs' memory usage.

Bug 5562.

b4d7de48
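
The difference between the two memory checks (6871185a, then b4d7de48) can be sketched as follows. This is an illustration under stated assumptions, not the select/cons_res implementation: the `Node` class, its field names, and the `count_allocated` switch are hypothetical, and all values are in MB.

```python
from dataclasses import dataclass


@dataclass
class Node:
    real_memory: int     # RealMemory, in MB
    mem_spec_limit: int  # MemSpecLimit, in MB
    alloc_memory: int    # memory already allocated to other jobs (hypothetical field)


def available_memory(node, count_allocated=True):
    """Memory a new job may use on this node.

    count_allocated=False models the earlier check, which interprets
    available memory as RealMemory - MemSpecLimit only.
    count_allocated=True also subtracts memory used by other jobs.
    """
    avail = node.real_memory - node.mem_spec_limit
    if count_allocated:
        avail -= node.alloc_memory
    return max(avail, 0)


def job_fits(node, job_mem, count_allocated=True):
    """True if the job's memory request fits on the node."""
    return job_mem <= available_memory(node, count_allocated)


busy_node = Node(real_memory=64000, mem_spec_limit=4000, alloc_memory=50000)
```

On `busy_node`, a 20000 MB request passes the earlier check but is rejected once allocated memory is counted, which is the overallocation case the commit message describes.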

cons_res/job_test - prevent a job from overallocating a node's memory. · 6871185a

Alejandro Sanchez authored Apr 11, 2019

This compares a job memory request against each selected node's available
memory, interpreting the latter for now as RealMemory - MemSpecLimit.

Bug 5562.

6871185a

cons_res/job_test - non-functional code restructuring. · 406f343a

Alejandro Sanchez authored Apr 11, 2019

Place all three memory cases (per cpu, per node and all node memory) in
a single loop, since all three cases need to traverse all job_resources
selected nodes. Preparation for a follow-up commit that contains the
real fix.

Bug 5562.

406f343a

slurm.spec-legacy - package two additional plugins. · 496358f9
Tim Wickberg authored Apr 29, 2019
```
Add handling for acct_gather_energy/xcc and acct_gather_profile/influxdb.

Bug 6829.
```
496358f9

17 May, 2019 2 commits
- Fix NEWS from previous commit. · 438ffc1c
  Tim Wickberg authored May 16, 2019
```
This is select/cons_res, not select/cons_tres.
```
  438ffc1c
- Only allocate 1 CPU per node with the --overcommit and --nodelist options · 46197135
  Morris Jette authored May 10, 2019
```
Previous select/cons_res logic would allocate one CPU per task on the node.

Bug 6981
```
  46197135
16 May, 2019 2 commits

Fix archive loading events. · 0d0f9deb

Marshall Garey authored May 15, 2019

There was a syntax error in the MySQL query for inserting the event records
into the event table caused by commit 3d61b6aa. The syntax error was
a semicolon in the middle of the query, for example:

insert into "voyager_event_table" (time_start, time_end, node_name,
cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
'1539298628', 'v1', '', 'cold-start', '1017', '0',
'1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...

Bug 7025.

0d0f9deb
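
One way to avoid a stray semicolon between value groups is to join the groups first and terminate the statement once. The sketch below is illustrative only: `build_multi_insert` is a hypothetical helper, not the slurmdbd code, and the hand-rolled quoting is for demonstration (real code should prefer the database driver's parameter binding).

```python
def build_multi_insert(table, columns, rows):
    """Build one multi-row INSERT statement with a single trailing semicolon."""
    if not rows:
        raise ValueError("at least one row is required")
    col_list = ", ".join(columns)
    value_groups = []
    for row in rows:
        # Quote every value as a string, doubling embedded single quotes.
        quoted = ", ".join("'" + str(v).replace("'", "''") + "'" for v in row)
        value_groups.append(f"({quoted})")
    # Groups are separated by commas only; the semicolon ends the statement.
    return f'insert into "{table}" ({col_list}) values ' + ", ".join(value_groups) + ";"


query = build_multi_insert(
    "voyager_event_table",
    ["time_start", "node_name"],
    [(1538669453, "v1"), (1539298628, "v2")],
)
```

Because the terminator is appended after the join, the `);, (` pattern from the failing query cannot occur.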

Fix regression caused by . · c77d7895

Marshall Garey authored May 16, 2019

This commit caused loading usage table archive files to fail.
Specifically, wckey and assoc hourly/daily/monthly usage tables and the
cluster usage tables archive files would all fail to load.

Bug 7025.

c77d7895

15 May, 2019 2 commits

Replace stat() syscall with access(). · 3c75856d

Alejandro Sanchez authored May 15, 2019

It's more suitable for the purpose of checking if a file exists, plus
avoids the unnecessary struct stat variable since we don't care about
the file information.

Continuation of 1e234c3d.

Bug 6033.

3c75856d
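
The same trade-off exists in other languages. As a rough Python analogue (the commit itself changes C code), `os.stat` returns file metadata that an existence check then throws away, while `os.access` with `os.F_OK` only answers whether the path exists. The function names below are hypothetical.

```python
import os


def exists_via_stat(path):
    """Existence check that fetches (and discards) full file metadata."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def exists_via_access(path):
    """Existence check only; no metadata structure is needed."""
    return os.access(path, os.F_OK)
```

Both functions agree on ordinary paths; the second simply states the intent more directly.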

Remove strncpy and snprintf from _make_archive_name. · 1871fd31

Marshall Garey authored May 15, 2019

Replace strncpy with xstrdup and snprintf with xstrfmtcat respectively
in _make_archive_name. This also fixes Coverity error CID 198462.

Continuation of 1e234c3d.

Bug 6033.

1871fd31

13 May, 2019 1 commit
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
10 May, 2019 7 commits

Document behavior of duplicate archive file names. · 7e7fd1bc
Marshall Garey authored Apr 25, 2019
```
Bug 6033.
```
7e7fd1bc

Prevent infinite loop if 0 records are archived. · df5f748d

Marshall Garey authored Apr 25, 2019

If _get_oldest_record() finds a record to archive/purge, then archive
should always archive at least one record. If for whatever reason it
fails to archive any records (_archive_table() returns a 0), then we
don't want to call continue, but want to return an error. Calling continue
to go back to the beginning of the while loop would result in an
infinite loop.

Bug 6033.

df5f748d
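
The guard described in df5f748d can be sketched with an in-memory stand-in for the table. This is a minimal illustration, not the slurmdbd code: `FakeTable`, `archive_all`, and the `broken` switch are hypothetical.

```python
class FakeTable:
    """In-memory stand-in for a database table (hypothetical)."""

    def __init__(self, records, broken=False):
        self.records = list(records)
        self.broken = broken  # simulate an archive step that archives nothing

    def oldest(self):
        """Return the oldest record, or None if nothing is left."""
        return self.records[0] if self.records else None

    def archive(self, limit):
        """Archive up to `limit` records and return how many were archived."""
        if self.broken:
            return 0
        n = min(limit, len(self.records))
        del self.records[:n]
        return n


def archive_all(table, limit):
    """Archive until the table is empty; fail instead of spinning forever."""
    total = 0
    while table.oldest() is not None:
        archived = table.archive(limit)
        if archived == 0:
            # A record exists but nothing was archived: looping again
            # would never make progress, so report an error instead.
            raise RuntimeError("found a record to archive but archived none")
        total += archived
    return total
```

Raising here plays the role of returning an error in the commit; a bare `continue` at that point would re-enter the loop with the same oldest record indefinitely.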

Make archive job sql query consistent with purge. · 90471db8
Marshall Garey authored Apr 25, 2019
```
Bug 6033.
```
90471db8

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.

Bug 6033.

ddd49896
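
Archiving in bounded batches can be sketched as below. The 50k limit comes from the commit message; the `batches` helper and the constant name are hypothetical, and the real code may apply the limit in the SQL query rather than in memory.

```python
MAX_ARCHIVE_RECORDS = 50_000  # batch size named in the commit message


def batches(records, size=MAX_ARCHIVE_RECORDS):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# 120,000 records are split into three archive batches.
batch_sizes = [len(b) for b in batches(list(range(120_000)))]
```

Each batch would then be written to its own archive file, keeping every file small enough to be read back.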

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already exists, we would overwrite
the old archive file with the assumption that these are duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.

Bug 6033.

1e234c3d
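
Appending a number instead of overwriting can be sketched as follows. The exact naming scheme Slurm uses is not stated above, so the `.N` suffix here is an assumption, and `unique_archive_name` is a hypothetical helper. The existence check is injectable so the logic can be exercised without touching the file system.

```python
import os


def unique_archive_name(path, exists=os.path.exists):
    """Return `path` if unused, else the first free `path.N` (N = 1, 2, ...).

    Assumption: a numeric suffix separated by a dot; the real format may differ.
    Note that checking and then creating is racy if several writers run at once.
    """
    if not exists(path):
        return path
    suffix = 1
    while exists(f"{path}.{suffix}"):
        suffix += 1
    return f"{path}.{suffix}"


# Simulated directory contents for illustration.
taken = {"job_archive", "job_archive.1"}
next_name = unique_archive_name("job_archive", exists=taken.__contains__)
```

Because earlier files are never replaced, records archived later for the same time period land in a new file instead of destroying the old one.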

Remove unused static variable high_buffer_size. · 3ffb4b4c
Marshall Garey authored Apr 24, 2019
```
It was set but never read.

Bug 6033.
```
3ffb4b4c

Use correct signed/unsigned types. · 4a26e486

Marshall Garey authored Apr 23, 2019

Change a few variables in archiving to use the correct signed or
unsigned type to avoid implicit casting.

Bug 6033.

4a26e486

09 May, 2019 1 commit
- mpi/pmix - remove unused _pmixp_pp_iter_count variable. · 68b1f5aa
  Broderick Gardner authored May 09, 2019
```
Bug 6799.
```
  68b1f5aa
08 May, 2019 1 commit

Renumber newly added flags to avoid a conflict in 19.05. · 26ccbec1

Tim Wickberg authored May 08, 2019

These conflict with JOB_MEM_SET/JOB_RESIZED in 19.05. Since 19.05rc1
has shipped - but no 18.08 maintenance releases have shipped with these
new flags - it is safer to renumber them here to avoid the merge conflict
going into 19.05.

Bug 6064.

26ccbec1

06 May, 2019 1 commit

Fix seff memory display overflow · bab13dfd

Felip Moll authored Apr 15, 2019

When tres_usage_in_max field is empty it is recorded as '' in the database
which leads find_tres_count_in_string() to return an INFINITE64. Seff treats
INFINITE64 as a valid value. This patch fixes this issue.

Bug 6817

bab13dfd
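
The overflow can be reproduced in a small model. This is a sketch, not the seff patch (seff itself is not written in Python): `find_tres_count` is a simplified stand-in for find_tres_count_in_string(), `max_memory_or_none` is hypothetical, and the sentinel value and the use of TRES id 2 for memory are assumptions about Slurm's conventions.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # assumed value of Slurm's 64-bit sentinel


def find_tres_count(tres_str, tres_id):
    """Look up a TRES id in a string such as '1=8,2=4000'.

    Returns INFINITE64 when the id is absent, including for an empty string.
    """
    for item in tres_str.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        if key == str(tres_id):
            return int(value)
    return INFINITE64


def max_memory_or_none(tres_usage_in_max, mem_tres_id=2):
    """Treat the sentinel as 'no data' instead of a real memory value."""
    count = find_tres_count(tres_usage_in_max, mem_tres_id)
    return None if count == INFINITE64 else count
```

Without the sentinel check, an empty `tres_usage_in_max` would be displayed as an enormous memory figure.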

03 May, 2019 1 commit
- Free memory before exiting in sacctmgr_list_runaway_jobs(). · f56fc717
  Nate Rini authored May 03, 2019
```
Bug 6880/6952.
```
  f56fc717