- 22 May, 2019 3 commits
-
-
Ben Roberts authored
Bug 6995
-
Marshall Garey authored
Job steps that run on cloud nodes and use the alias_list - in other words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk directly back to the slurmctld. To make that happen, we set the parent rank of each stepd to -1. However, we also set the rank of each stepd to 0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to the slurmctld, it would tell slurmctld to clean up node 0 in the step allocation. So multi-node step allocations weren't cleaned up after the steps completed, which would cause subsequent job steps to hang; the step allocations would only clean up properly at the end of the job. Ensure that each stepd uses the correct rank so that job steps are properly cleaned up after each step completes. Bug 6467.
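For illustration only, a minimal sketch (hypothetical data structures, not Slurm's actual code) of why every stepd reporting rank 0 leaves all but node 0 uncleaned:

#include <stdbool.h>
#include <stdio.h>

#define NODE_CNT 3

/* Hypothetical bookkeeping: one "cleaned up" flag per node in the step. */
static bool node_done[NODE_CNT];

/* Stand-in for handling a REQUEST_STEP_COMPLETE RPC from one stepd. */
static void step_complete(int reported_rank)
{
        if (reported_rank >= 0 && reported_rank < NODE_CNT)
                node_done[reported_rank] = true;
}

int main(void)
{
        /* The buggy behavior described above: every stepd reports rank 0,
         * so only node 0 is ever marked complete. */
        for (int rank = 0; rank < NODE_CNT; rank++)
                step_complete(0);       /* should be step_complete(rank) */

        for (int i = 0; i < NODE_CNT; i++)
                printf("node %d cleaned up: %s\n", i,
                       node_done[i] ? "yes" : "no");
        return 0;
}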
-
Alejandro Sanchez authored
They were associated with these two commits: b4d7de48 6871185a. Bug 5562.
-
- 21 May, 2019 6 commits
-
-
Morris Jette authored
Error reported by CLANG. Cherry-pick to 18.08. Bug 6996.
-
Dominik Bartkiewicz authored
An unlimited queue depth could get overwritten with the default queue depth, preventing the whole queue from being looked at -- especially in a high-throughput environment. Bug 6822. Co-authored-by: Morris Jette <jette@schedmd.com>
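As a rough sketch of the intended behavior (the macro and variable names here are invented; the real logic lives in the scheduler code), a configured depth of 0 meaning "unlimited" must not be replaced by the default depth:

#include <stdint.h>
#include <stdio.h>

#define DEFAULT_QUEUE_DEPTH 100 /* hypothetical default */

/* Return the number of queued jobs to examine; 0 means "unlimited". */
static uint32_t queue_depth(uint32_t configured_depth, uint32_t queue_len)
{
        if (configured_depth == 0)      /* unlimited: look at everything */
                return queue_len;
        return configured_depth;        /* never overwrite with the default */
}

int main(void)
{
        printf("depth used: %u\n", queue_depth(0, 50000));  /* whole queue */
        printf("depth used: %u\n", queue_depth(DEFAULT_QUEUE_DEPTH, 50000));
        return 0;
}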
-
Alejandro Sanchez authored
Node memory overallocation wouldn't be properly detected, since we were just interpreting the available memory as RealMemory - MemSpecLimit, ignoring other jobs' memory usage. Bug 5562.
-
Alejandro Sanchez authored
This compares a job's memory request against each selected node's available memory, interpreting the latter for now as RealMemory - MemSpecLimit. Bug 5562.
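A minimal sketch of that comparison, with invented struct and field names standing in for the real node data:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct node_info {
        uint64_t real_memory;           /* RealMemory, in MB */
        uint64_t mem_spec_limit;        /* MemSpecLimit, in MB */
};

/* Available memory interpreted, for now, as RealMemory - MemSpecLimit. */
static bool mem_fits(uint64_t job_mem_mb, const struct node_info *node)
{
        uint64_t avail = node->real_memory - node->mem_spec_limit;
        return job_mem_mb <= avail;
}

int main(void)
{
        struct node_info n = { .real_memory = 64000, .mem_spec_limit = 2000 };

        printf("60000 MB fits: %s\n", mem_fits(60000, &n) ? "yes" : "no");
        printf("63000 MB fits: %s\n", mem_fits(63000, &n) ? "yes" : "no");
        return 0;
}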
-
Alejandro Sanchez authored
Place all three memory cases (per-CPU, per-node and all-node memory) in a single loop, since all three cases need to traverse all nodes selected in job_resources. Preparation for a follow-up commit that contains the real fix. Bug 5562.
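A sketch of the single-loop structure described above (request types and fields are illustrative, not the actual Slurm code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum mem_request { MEM_PER_CPU, MEM_PER_NODE, MEM_ALL_NODE };

struct sel_node {
        uint64_t avail_mem;     /* MB available on this node */
        uint32_t cpus;          /* CPUs allocated on this node */
};

/* One loop handles per-CPU, per-node and "all node memory" requests. */
static bool check_mem(enum mem_request type, uint64_t req_mb,
                      const struct sel_node *nodes, int node_cnt)
{
        for (int i = 0; i < node_cnt; i++) {
                uint64_t need;

                switch (type) {
                case MEM_PER_CPU:
                        need = req_mb * nodes[i].cpus;
                        break;
                case MEM_PER_NODE:
                        need = req_mb;
                        break;
                default:        /* MEM_ALL_NODE */
                        need = nodes[i].avail_mem;
                        break;
                }
                if (need > nodes[i].avail_mem)
                        return false;
        }
        return true;
}

int main(void)
{
        struct sel_node nodes[] = { { 64000, 16 }, { 32000, 8 } };

        printf("2000 MB/cpu ok: %s\n",
               check_mem(MEM_PER_CPU, 2000, nodes, 2) ? "yes" : "no");
        printf("30000 MB/node ok: %s\n",
               check_mem(MEM_PER_NODE, 30000, nodes, 2) ? "yes" : "no");
        return 0;
}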
-
Tim Wickberg authored
Add handling for acct_gather_energy/xcc and acct_gather_profile/influxdb. Bug 6829.
-
- 17 May, 2019 2 commits
-
-
Tim Wickberg authored
This is select/cons_res, not select/cons_tres.
-
Morris Jette authored
Previous select/cons_res logic would allocate one CPU per task on the node. Bug 6981.
-
- 16 May, 2019 2 commits
-
-
Marshall Garey authored
There was a syntax error in the MySQL query for inserting the event records into the event table, caused by commit 3d61b6aa. The syntax error was a semicolon in the middle of the query, for example: insert into "voyager_event_table" (time_start, time_end, node_name, cluster_nodes, reason, reason_uid, state, tres) values ('1538669453', '1539298628', 'v1', '', 'cold-start', '1017', '0', '1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ... Bug 7025.
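For reference, a correct multi-row INSERT joins the value tuples with commas and ends with a single semicolon. A small stand-alone sketch of that separator handling (plain libc string building, not the actual slurmdbd code; table and column names trimmed for brevity):

#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *rows[] = {
                "('1538669453', 'v1', 'cold-start')",
                "('1538669500', 'v2', 'cold-start')",
        };
        char query[512] = "insert into event_table "
                          "(time_start, node_name, reason) values ";

        for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
                if (i)
                        strcat(query, ", ");    /* separator between rows */
                strcat(query, rows[i]);
        }
        strcat(query, ";");                     /* single terminator */

        puts(query);
        return 0;
}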
-
Marshall Garey authored
This commit caused loading of usage table archive files to fail. Specifically, the wckey and assoc hourly/daily/monthly usage table and cluster usage table archive files would all fail to load. Bug 7025.
-
- 15 May, 2019 2 commits
-
-
Alejandro Sanchez authored
It's better suited to checking whether a file exists, and it avoids the unnecessary struct stat variable since we don't care about the file information. Continuation of 1e234c3d. Bug 6033.
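The commit does not name the call, but presumably the existence check now uses access() rather than stat(); a minimal example of that pattern:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "/etc/slurm/slurm.conf";

        /* access(F_OK) only answers "does it exist?", so no struct stat
         * variable is needed when the file metadata is irrelevant. */
        if (access(path, F_OK) == 0)
                printf("%s exists\n", path);
        else
                printf("%s does not exist\n", path);
        return 0;
}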
-
Marshall Garey authored
Replace strncpy with xstrdup and snprintf with xstrfmtcat, respectively, in _make_archive_name(). This also fixes Coverity error CID 198462. Continuation of 1e234c3d. Bug 6033.
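xstrdup()/xstrfmtcat() are Slurm's dynamically growing string helpers, so they cannot be shown stand-alone here; below is a plain-libc analogue of the same "grow to fit instead of truncating into a fixed buffer" idea (asprintf() is a glibc/BSD extension, and the path pieces are made up):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        const char *dir = "/var/spool/slurm/archive";
        const char *table = "event_table";
        char *name = NULL;

        /* Allocate exactly as much as the formatted name needs. */
        if (asprintf(&name, "%s/%s_archive_%ld_%ld",
                     dir, table, 1538669453L, 1539298628L) < 0)
                return 1;

        puts(name);
        free(name);
        return 0;
}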
-
- 13 May, 2019 1 commit
-
-
Tim Wickberg authored
-
- 10 May, 2019 7 commits
-
-
Marshall Garey authored
Bug 6033.
-
Marshall Garey authored
If _get_oldest_record() finds a record to archive/purge, then archive should always archive at least one record. If for whatever reason it fails to archive any records (_archive_table() returns 0), we don't want to call continue; we want to return an error. Calling continue to go back to the beginning of the while loop would result in an infinite loop. Bug 6033.
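A stand-alone sketch of that control flow, with stub functions standing in for _get_oldest_record() and _archive_table():

#include <stdio.h>

/* Stubs for the functions named above. */
static int get_oldest_record(void) { return 1; }  /* record found */
static int archive_table(void)     { return 0; }  /* archived 0 records */

int main(void)
{
        while (get_oldest_record()) {
                int archived = archive_table();

                if (archived == 0) {
                        /* A "continue" here would loop forever, because the
                         * same oldest record would be found again. */
                        fprintf(stderr, "error: nothing archived\n");
                        return 1;
                }
                /* ... purge the archived records, then look again ... */
        }
        return 0;
}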
-
Marshall Garey authored
Bug 6033.
-
Marshall Garey authored
Trying to archive too many records at once can result in archive files that are too big to read or even too big to be written. Only archive 50k records at a time, like we only purge 50k records at a time. Bug 6033.
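An illustrative batching loop; the 50k constant comes from the message above, while the record-counting helper is invented:

#include <stdint.h>
#include <stdio.h>

#define MAX_ARCHIVE_RECORDS 50000       /* archive in 50k-record batches */

/* Hypothetical: pretend this many purgeable records exist in total. */
static uint64_t records_left = 123456;

static uint64_t archive_batch(uint64_t limit)
{
        uint64_t n = (records_left < limit) ? records_left : limit;

        records_left -= n;
        return n;
}

int main(void)
{
        uint64_t n;

        while ((n = archive_batch(MAX_ARCHIVE_RECORDS)) > 0)
                printf("archived %llu records\n", (unsigned long long) n);
        return 0;
}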
-
Marshall Garey authored
The time period of the archive file currently depends on submit or start time and whether the purge period is in hours, days, or months. Previously, if the archive file name already exists, we would overwrite the old archive file with the assumption that these are duplicate records being archived after an archive load. However, that could result in lost records in a couple of ways:
* If there were runaway jobs that were part of an old archive file's time period and are later fixed and then purged, the old file would be overwritten.
* If jobs or steps are purged but there are still jobs or steps in that time period that are pending or running, the pending or running jobs and steps won't be purged. When they finish and are purged, the old file would be overwritten.
Instead of overwriting the old file, we append a number to the file name to create a new file. This will also be important in an upcoming commit. Bug 6033.
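A sketch of the naming scheme (the exact suffix format and paths are assumptions; only the "never clobber an existing file" behavior is taken from the message):

#include <stdio.h>
#include <unistd.h>

/* Pick an archive file name that does not clobber an existing file by
 * appending .1, .2, ... when the base name is already taken. */
static void pick_archive_name(const char *base, char *out, size_t len)
{
        snprintf(out, len, "%s", base);
        for (int i = 1; access(out, F_OK) == 0; i++)
                snprintf(out, len, "%s.%d", base, i);
}

int main(void)
{
        char name[4096];

        pick_archive_name("/tmp/event_archive_2018-10", name, sizeof(name));
        printf("writing archive to %s\n", name);
        return 0;
}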
-
Marshall Garey authored
It was set but never read. Bug 6033.
-
Marshall Garey authored
Change a few variables in archiving to use the correct signed or unsigned type to avoid implicit casting. Bug 6033.
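An isolated example of the implicit-conversion hazard this kind of cleanup avoids (the values are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        int32_t rc = -1;        /* e.g. an error return code */
        uint32_t purged;

        /* Storing a signed error code in an unsigned count silently turns
         * -1 into 4294967295. */
        purged = (uint32_t) rc;
        printf("purged = %u\n", purged);

        /* Keeping counts unsigned and return codes signed, and not mixing
         * them, avoids the implicit cast entirely. */
        return 0;
}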
-
- 09 May, 2019 1 commit
-
-
Broderick Gardner authored
Bug 6799.
-
- 08 May, 2019 1 commit
-
-
Tim Wickberg authored
These conflict with JOB_MEM_SET/JOB_RESIZED in 19.05. Since 19.05rc1 has shipped - but no 18.08 maintenance releases have shipped with these new flags - it is safer to renumber them here to avoid the merge conflict going into 19.05. Bug 6064.
-
- 06 May, 2019 1 commit
-
-
Felip Moll authored
When the tres_usage_in_max field is empty it is recorded as '' in the database, which leads find_tres_count_in_string() to return INFINITE64. seff treats INFINITE64 as a valid value. This patch fixes that. Bug 6817
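The gist of the fix, shown here in C for illustration even though seff itself is a Perl script; the INFINITE64 value below matches Slurm's definition:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define INFINITE64 0xffffffffffffffffULL        /* Slurm's "no value" marker */

int main(void)
{
        /* An empty tres_usage_in_max string yields INFINITE64 rather than a
         * real byte count, so it must be treated as "no data". */
        uint64_t max_rss = INFINITE64;

        if (max_rss == INFINITE64)
                puts("memory usage not available");
        else
                printf("max RSS: %" PRIu64 " bytes\n", max_rss);
        return 0;
}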
-
- 03 May, 2019 1 commit
-
-
Nate Rini authored
Bug 6880/6952.
-
- 02 May, 2019 3 commits
-
-
Broderick Gardner authored
Bug 6064
-
Broderick Gardner authored
On requeue, the origin cluster's job record is copied to submit to sibling clusters. If the job was originally submitted to accept the cluster default account, partition, etc., those fields have since been filled in on the origin. Here we add flags to indicate that those fields need to be cleared on resubmission to siblings. Bug 6064
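Conceptually (the flag and field names below are invented for illustration, not the actual fed_mgr code):

#include <stdint.h>
#include <stdio.h>

/* Invented flags: mark which fields were filled in from cluster defaults
 * and must be cleared before resubmitting to sibling clusters. */
#define RESET_ACCOUNT   0x01
#define RESET_PARTITION 0x02

struct job_desc {
        const char *account;
        const char *partition;
        uint32_t reset_flags;
};

static void prep_sibling_submit(struct job_desc *j)
{
        if (j->reset_flags & RESET_ACCOUNT)
                j->account = NULL;      /* let the sibling apply its default */
        if (j->reset_flags & RESET_PARTITION)
                j->partition = NULL;
}

int main(void)
{
        struct job_desc j = { "acct_default", "part_default",
                              RESET_ACCOUNT | RESET_PARTITION };

        prep_sibling_submit(&j);
        printf("account=%s partition=%s\n",
               j.account ? j.account : "(default)",
               j.partition ? j.partition : "(default)");
        return 0;
}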
-
Broderick Gardner authored
This is a holdover from when the fed job_info list was added. The cluster lock has to be cleared from both the job_ptr and the job_info. Bug 6064
-
- 30 Apr, 2019 1 commit
-
-
Danny Auble authored
Blessed by Tim.
-
- 29 Apr, 2019 9 commits
-
-
Brian Christiansen authored
when one offset passes and the other fails. Bug 6892
-
Nate Rini authored
Bug 6513.
-
Brian Christiansen authored
Bug 6513
-
Brian Christiansen authored
Bug 6513. First offset is good but second is bad -- didn't request task count.

$ cat etc/job_submit.lua
function slurm_job_submit(job_desc, part_list, submit_uid)
        slurm.log_user("submit1\nstuff")
        slurm.log_user("submit2")
        slurm.log_user("submit3")
        -- slurm.log_user("case 0")
        if job_desc.num_tasks == slurm.NO_VAL or job_desc.num_tasks == nil then
                slurm.log_user("Batch submit error: Must specify either number of nodes or number of tasks!")
                -- reject the job
                return slurm.ERROR
        end
        return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        slurm.log_user("modify1")
        slurm.log_user("modify2")
        slurm.log_user("modify3")
        return slurm.SUCCESS
end

slurm.log_user("initialized")
return slurm.SUCCESS

$ sbatch -Ablah2 -n1 --wrap="hostname" : -J asdfl
sbatch: error: 0: initialized
sbatch: error: 0: submit1
sbatch: error: 0: stuff
sbatch: error: 0: submit2
sbatch: error: 0: submit3
sbatch: error: submit1
sbatch: error: stuff
sbatch: error: submit2
sbatch: error: submit3
sbatch: error: Batch submit error: Must specify either number of nodes or number of tasks!
sbatch: error: Batch job submission failed: Unspecified error

$ sbatch -Ablah2 -n1 --wrap="hostname" : -J asdfl
sbatch: error: 0: initialized
sbatch: error: 0: submit1
sbatch: error: 0: stuff
sbatch: error: 0: submit2
sbatch: error: 0: submit3
sbatch: error: 1: submit1
sbatch: error: 1: stuff
sbatch: error: 1: submit2
sbatch: error: 1: submit3
sbatch: error: 1: Batch submit error: Must specify either number of nodes or number of tasks!
sbatch: error: Batch job submission failed: Unspecified error

srun already handles this.
-
Nate Rini authored
Was dumping this:

$ srun -A test7.21-account.1 --qos test7.21-qos.1 -n5 : -n3 : -n1 /bin/true
srun: error: 0: submit1
srun: error: submit2
srun: error: submit3
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Will now dump this:

$ srun -A test7.21-account.1 --qos test7.21-qos.1 -n5 : -n3 : -n1 /bin/true
srun: error: 0: initialized
srun: error: 0: submit1
srun: error: 0: submit2
srun: error: 0: submit3
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Bug 6513.
-
Nate Rini authored
Bug 6895.
-
Brian Christiansen authored
Bug 6895
-
Brian Christiansen authored
Bug 6895
-