Commits · 7e7fd1bc649691516a357f012aa1415308d6f54e · Manuel G. Marciani / ces_slurm_simulator

10 May, 2019 7 commits

Document behavior of duplicate archive file names. · 7e7fd1bc
Marshall Garey authored Apr 25, 2019
```
Bug 6033.
```

Prevent infinite loop if 0 records are archived. · df5f748d

Marshall Garey authored Apr 25, 2019

If _get_oldest_record() finds a record to archive/purge, then the archive
step should always archive at least one record. If for whatever reason it
fails to archive any records (_archive_table() returns 0), then we don't
want to call continue; we want to return an error instead. Calling continue
to go back to the beginning of the while loop would result in an
infinite loop.

Bug 6033.
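
The hazard described in this commit message can be modeled in a few lines. This is a minimal Python sketch, not Slurm's C code: `purge_loop`, `ArchiveError`, and the `archive_table` callback are hypothetical stand-ins for the loop around `_get_oldest_record()` and `_archive_table()`, and a plain list stands in for the database table.

```python
class ArchiveError(Exception):
    """Raised when records are pending but a pass archives none of them."""


def purge_loop(records, cutoff, archive_table):
    """Archive and drop every record older than `cutoff`.

    `archive_table(batch)` is a hypothetical callback that returns how
    many records from `batch` it managed to archive.
    """
    total = 0
    while True:
        # Stand-in for _get_oldest_record(): is anything left to purge?
        pending = [r for r in records if r < cutoff]
        if not pending:
            return total

        archived = archive_table(pending)
        if archived == 0:
            # A `continue` here would find the same pending records on
            # the next pass and spin forever, so report an error instead.
            raise ArchiveError("records pending but none were archived")

        # Purge only what was actually archived.
        for record in pending[:archived]:
            records.remove(record)
        total += archived
```

With a callback that always returns 0, the loop raises on the first pass instead of retrying the same records indefinitely.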

Make archive job sql query consistent with purge. · 90471db8
Marshall Garey authored Apr 25, 2019
```
Bug 6033.
```

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.

Bug 6033.
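
As an illustration of the batching idea, here is a minimal Python sketch. It assumes the records are already available as a list and that `write_archive` is a hypothetical callback that writes one archive file per call; only the 50k figure comes from the commit message.

```python
BATCH_SIZE = 50_000  # the per-pass limit named in the commit message


def archive_in_batches(record_ids, write_archive, batch_size=BATCH_SIZE):
    """Call `write_archive` once per chunk of at most `batch_size` records.

    Returns the number of archive files written.
    """
    files_written = 0
    for start in range(0, len(record_ids), batch_size):
        write_archive(record_ids[start:start + batch_size])
        files_written += 1
    return files_written
```

Capping the chunk size bounds the size of each archive file regardless of how many records have accumulated.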

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already exists, we would overwrite
the old archive file with the assumption that these are duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.

Bug 6033.
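
The "append a number" approach might look roughly like the Python sketch below. The `.N` suffix format and the helper name `unique_archive_path` are assumptions made for illustration; the commit message does not say how the number is attached to the file name.

```python
import os


def unique_archive_path(path):
    """Return `path` if it is unused, else the first free `path.N`.

    N counts up from 1. The ".N" suffix is an assumed format.
    """
    if not os.path.exists(path):
        return path
    n = 1
    while os.path.exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"
```

Note that checking for existence and then creating the file is not atomic; a production version would likely want exclusive file creation to close that gap.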

Remove unused static variable high_buffer_size. · 3ffb4b4c
Marshall Garey authored Apr 24, 2019
```
It was set but never read.

Bug 6033.
```

Use correct signed/unsigned types. · 4a26e486

Marshall Garey authored Apr 23, 2019

Change a few variables in archiving to use the correct signed or
unsigned type to avoid implicit casting.

Bug 6033.

09 May, 2019 1 commit
- mpi/pmix - remove unused _pmixp_pp_iter_count variable. · 68b1f5aa
  Broderick Gardner authored May 09, 2019
```
Bug 6799.
```
08 May, 2019 1 commit

Renumber newly added flags to avoid a conflict in 19.05. · 26ccbec1

Tim Wickberg authored May 08, 2019

These conflict with JOB_MEM_SET/JOB_RESIZED in 19.05. Since 19.05rc1
has shipped - but no 18.08 maintenance releases have shipped with these
new flags - it is safer to renumber them here to avoid the merge conflict
going into 19.05.

Bug 6064.

06 May, 2019 1 commit

Fix seff memory display overflow · bab13dfd

Felip Moll authored Apr 15, 2019

When the tres_usage_in_max field is empty, it is recorded as '' in the
database, which leads find_tres_count_in_string() to return INFINITE64. Seff
treated INFINITE64 as a valid value. This patch fixes this issue.

Bug 6817
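
To make the failure mode concrete, here is a small Python model of a sentinel-returning lookup and a caller that checks for it. This is a sketch only: seff itself is not written in Python, `find_tres_count` and `max_memory` are hypothetical names, the `id=count,id=count` string layout and the use of TRES id 2 for memory are assumptions, and mapping the sentinel to 0 is one plausible handling rather than what the patch necessarily does.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # sentinel meaning "TRES id not found"


def find_tres_count(tres_str, tres_id):
    """Return the count for `tres_id` in an 'id=count,id=count' string.

    Returns INFINITE64 when the id is absent, including for ''.
    """
    for item in tres_str.split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip() == str(tres_id):
            return int(value)
    return INFINITE64


def max_memory(tres_usage_in_max, mem_tres_id=2):
    """Return the recorded memory maximum, or 0 when none was recorded."""
    count = find_tres_count(tres_usage_in_max, mem_tres_id)
    # Without this check the sentinel would be displayed as a huge value.
    return 0 if count == INFINITE64 else count
```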

03 May, 2019 1 commit
- Free memory before exiting in sacctmgr_list_runaway_jobs(). · f56fc717
  Nate Rini authored May 03, 2019
```
Bug 6880/6952.
```
02 May, 2019 3 commits

Copy job_ptr->bit_flags to job_desc->bitflags · 2628d3dc
Broderick Gardner authored Apr 25, 2019
```
Bug 6064
```

Fix resubmit to sibling default on fed requeue · 822fe77e

Broderick Gardner authored Apr 18, 2019

On requeue, the origin cluster's job record is copied for submission
to sibling clusters. If the job was originally submitted
to accept the cluster defaults for account, partition, etc., those fields
are now filled in on the origin. Here we add flags to indicate
that those fields need to be cleared on resubmission to siblings.

Bug 6064
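
The flag-and-clear idea can be sketched as follows. This is a Python model under stated assumptions, not the Slurm implementation: the flag names, their bit values, and the dictionary layout of the job record are all hypothetical.

```python
# Hypothetical flag bits recording which fields the origin cluster
# filled in from its own defaults at submit time.
ACCOUNT_DEFAULTED = 0x1
PARTITION_DEFAULTED = 0x2

_FLAG_TO_FIELD = (
    (ACCOUNT_DEFAULTED, "account"),
    (PARTITION_DEFAULTED, "partition"),
)


def build_sibling_submission(job):
    """Copy the job's fields, clearing those that were only defaults.

    A cleared field lets each sibling cluster apply its own default.
    """
    desc = dict(job["fields"])
    for flag, field in _FLAG_TO_FIELD:
        if job["defaulted"] & flag:
            desc[field] = None
    return desc
```

A field the user set explicitly carries no flag and is passed to the siblings unchanged.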

Fix clearing federation cluster lock on requeue · 47909f8e

Broderick Gardner authored Mar 25, 2019

This is a holdover from when the fed job_info list was added.
The cluster lock has to be cleared from both the job_ptr and
the job_info.

Bug 6064

30 Apr, 2019 1 commit
- Fix memory leak in group_cache.c · 876bd712
  Danny Auble authored Apr 30, 2019
```
Blessed by Tim.
```
29 Apr, 2019 10 commits

Update test7.20 to catch passing/failing het jobs · 8c4fdffe
Brian Christiansen authored Apr 29, 2019
```
when one offset passes and the other fails.

Bug 6892
```
Add test7.20 · 1460a6b5
Nate Rini authored Mar 20, 2019
```
Bug 6513.
```
Add NEWS for previous two commits · 00a8e724
Brian Christiansen authored Apr 25, 2019
```
Bug 6513
```

Fix bad sbatch het offset output · 4657ab94

Brian Christiansen authored Apr 24, 2019

Bug 6513

First offset is good but second is bad -- didn't request task count.

$ cat etc/job_submit.lua
function slurm_job_submit(job_desc, part_list, submit_uid)
    slurm.log_user("submit1\nstuff")
    slurm.log_user("submit2")
    slurm.log_user("submit3")

    -- slurm.log_user("case 0")
    if job_desc.num_tasks == slurm.NO_VAL or job_desc.num_tasks == nil then
        slurm.log_user("Batch submit error:  Must specify either number of nodes or number of tasks!")
        -- reject the job
        return slurm.ERROR
    end

    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    slurm.log_user("modify1")
    slurm.log_user("modify2")
    slurm.log_user("modify3")
    return slurm.SUCCESS
end

slurm.log_user("initialized")
return slurm.SUCCESS

$ sbatch -Ablah2 -n1 --wrap="hostname" : -J asdfl
sbatch: error: 0: initialized
sbatch: error: 0: submit1
sbatch: error: 0: stuff
sbatch: error: 0: submit2
sbatch: error: 0: submit3
sbatch: error: submit1
sbatch: error: stuff
sbatch: error: submit2
sbatch: error: submit3
sbatch: error: Batch submit error:  Must specify either number of nodes or number of tasks!
sbatch: error: Batch job submission failed: Unspecified error

With the fix applied:
$ sbatch -Ablah2 -n1 --wrap="hostname" : -J asdfl
sbatch: error: 0: initialized
sbatch: error: 0: submit1
sbatch: error: 0: stuff
sbatch: error: 0: submit2
sbatch: error: 0: submit3
sbatch: error: 1: submit1
sbatch: error: 1: stuff
sbatch: error: 1: submit2
sbatch: error: 1: submit3
sbatch: error: 1: Batch submit error:  Must specify either number of nodes or number of tasks!
sbatch: error: Batch job submission failed: Unspecified error

srun already handles this

Break up packed job user messages to prepend index. · a415b8f6

Nate Rini authored Apr 22, 2019

Was dumping this:
$ srun -A test7.21-account.1 --qos test7.21-qos.1 -n5 : -n3 : -n1 /bin/true
srun: error: 0: submit1
srun: error: submit2
srun: error: submit3
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Will now dump this:
$ srun -A test7.21-account.1 --qos test7.21-qos.1 -n5 : -n3 : -n1 /bin/true
srun: error: 0: initialized
srun: error: 0: submit1
srun: error: 0: submit2
srun: error: 0: submit3
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Bug 6513.
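
The per-line prefixing shown in the output above can be sketched in a few lines of Python. The helper name `prefix_user_message` is hypothetical; the `N: ` prefix format is taken from the sample output.

```python
def prefix_user_message(msg, offset):
    """Prefix every line of a multi-line user message with 'offset: '."""
    return "\n".join(f"{offset}: {line}" for line in msg.splitlines())
```

For example, `prefix_user_message("submit1\nstuff", 1)` yields the two lines `1: submit1` and `1: stuff`, so each line stays attributable to its component even when a single message spans several lines.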

Fix printing duplicate error messages of lua rejected jobs · 297a6880
Nate Rini authored Apr 22, 2019
```
Regression from 70b4e06d.

Bug 6892.
```
Fix segfault when loading/unloading lua job submit plugin multiple times · 8920863a
Nate Rini authored Apr 22, 2019
```
Bug 6895.
```
Allow submit plugins to be turned on and off with scontrol reconfig · a0e14237
Brian Christiansen authored Apr 25, 2019
```
Bug 6895
```
Fix unnecessary reloading of submit plugins · b50ac244
Brian Christiansen authored Apr 24, 2019
```
Bug 6895
```
Run autogen.sh with new automake · 7469e9c7
Danny Auble authored Apr 29, 2019

26 Apr, 2019 9 commits
- expect test fixes -- race conditions · 8b1c7775
  Marshall Garey authored Apr 26, 2019
```
Bug 6215
```
- Docs - clarify Slurm versioning schema in quickstart_admin.html. · b38526b9
  Marshall Garey authored Apr 25, 2019
```
Change references to the "micro" release in rpc.html and troubleshoot.html
as well; SchedMD refers to the last part of the version number as the
"maintenance" release.

Bug 6833.
```
- Hide archive load query log message also behind DB_QUERY. · 056e0bb8
  Alejandro Sanchez authored Apr 26, 2019
```
Bug 6832.
```
- Add debug logging at the end of an archive load attempt. · 380e935e
  Nate Rini authored Apr 26, 2019
```
Bug 6832.
```
- Assert that archive loading is always done with autocommit=0. · a10fc71c
  Nate Rini authored Apr 26, 2019
```
Bug 6832.
```
- Use goto cleanup instead of repeating cleanup code. · 650a91a9
  Nate Rini authored Apr 26, 2019
```
No functional change.

Bug 6832.
```
- Limit records per single SQL statement when loading archived data. · 34e9d41b
  Nate Rini authored Apr 26, 2019
```
Otherwise, we could send communication packets bigger than max_allowed_packet.

Bug 6832.

Co-authored-by: Tim Wickberg <tim@schedmd.com>
```
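
A rough Python illustration of capping the rows per statement follows. It assumes the rows arrive as pre-escaped SQL value lists and that a fixed row cap is used; the function name, table name, and cap are hypothetical, and the actual limit chosen in the commit is not stated here.

```python
def batched_inserts(table, value_tuples, max_rows):
    """Yield INSERT statements carrying at most `max_rows` rows each.

    `value_tuples` are pre-escaped SQL value lists such as "(1, 'a')".
    """
    for start in range(0, len(value_tuples), max_rows):
        chunk = value_tuples[start:start + max_rows]
        yield f"INSERT INTO {table} VALUES {', '.join(chunk)};"
```

A row cap only bounds the packet size indirectly, since the bytes per row vary; a stricter variant could also track the accumulated statement length.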
- accounting_storage/mysql - fix memory leak in the archive load logic. · e8567e06
  Alejandro Sanchez authored Apr 26, 2019
```
Regression introduced in 8d643e79.

Bug 6832.
```
- accounting_storage/mysql - fix SIGABRT in the archive load logic. · e174e135
  Nate Rini authored Apr 26, 2019
```
The problem was freeing an interior pointer to buffer contents before
the call to FREE_NULL_BUFFER. The issue was only triggered when loading
archived data with protocol version < 17.11.

Regression introduced in 8d643e79.

Bug 6832.
```
24 Apr, 2019 4 commits
- Fix issue with backfill scheduler scheduling tasks of an array · 70d12f07
  Moe Jette authored Apr 24, 2019
```
when not the head job.

Bug 6837

For a more in-depth explanation, see comment 24.
```
- Improve test12.10 for testing non-eligible jobs · 6015bffe
  Albert Gil authored Apr 22, 2019
```
Bug 6873
```
- Fix non-eligible jobs with sacct -j and not -s. · b3e46057
  Albert Gil authored Apr 22, 2019
```
When specifying -j and not -s then non-eligible jobs will be shown
by sacct.
Time windows can also be used with -S and -E.
If --state is also used then non-eligible jobs won't be shown,
because non-eligible jobs are not actually PD.

Bug 6873

# Conflicts:
#	NEWS
```
- Fixed issue with jobs not appearing in sacct after dependency satisfied. · 5e23ae2b
  Ben Roberts authored Apr 23, 2019
```
Bug 6805
```
23 Apr, 2019 2 commits
- Fix potential deadlock with backup slurmctld. · c04a488d
  Danny Auble authored Apr 22, 2019
```
Bug 6898
```
- Add missing information for test12.10 · 316f0305
  Danny Auble authored Apr 23, 2019
```
Continuation of commit cc153e03
```