1. 23 May, 2019 4 commits
  2. 22 May, 2019 2 commits
    • Use correct rank for cloud stepd's. · e7d4d593
      Marshall Garey authored
      Job steps that run on cloud nodes and use the alias_list - in other
      words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
      directly back to the slurmctld. To make that happen, we set the parent
      rank of each stepd to -1. However, we also set the rank of each stepd to
      0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
      the slurmctld, they would tell slurmctld to clean up node 0 in the step
      allocation. So, multi-node step allocations weren't cleaning up after
      the steps completed and would cause subsequent job steps to hang. The
      step allocations would only clean up properly at the end of the job.
      
      Ensure that each stepd uses the correct rank so that job steps are
      properly cleaned up after each step completes.
      
      Bug 6467.
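      A minimal sketch of the idea, with hypothetical names rather than the
      real slurmstepd structures: when cloud stepds bypass the message tree,
      the parent rank stays -1, but each stepd should report its own node
      index instead of a hard-coded 0.
      
      /* Illustration only - hypothetical fields, not the real slurmstepd code. */
      struct stepd_ctx {
          int parent_rank;  /* -1 means: no tree, talk directly to slurmctld */
          int rank;         /* this stepd's node index within the step */
      };
      
      /* Give each cloud stepd its own node index as the rank, so the
       * REQUEST_STEP_COMPLETE it sends cleans up the right node in the
       * step allocation instead of always node 0. */
      static void init_cloud_stepd(struct stepd_ctx *ctx, int nodeid)
      {
          ctx->parent_rank = -1;
          ctx->rank = nodeid;   /* previously hard-coded to 0 */
      }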
    • Move two NEWS entries to appropriate maintenance release. · 09a7da34
      Alejandro Sanchez authored
      They were associated with these two commits:
      
      b4d7de48
      6871185a
      
      Bug 5562.
  3. 21 May, 2019 3 commits
  4. 17 May, 2019 2 commits
  5. 16 May, 2019 1 commit
    • Fix archive loading events. · 0d0f9deb
      Marshall Garey authored
      There was a syntax error in the MySQL query for inserting the event
      records into the event table, caused by commit 3d61b6aa. The syntax error was
      a semicolon in the middle of the query, for example:
      
      insert into "voyager_event_table" (time_start, time_end, node_name,
      cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
      '1539298628', 'v1', '', 'cold-start', '1017', '0',
      '1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...
      
      Bug 7025.
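      A hedged sketch in C, with hypothetical helper names, of how a
      multi-row INSERT can be assembled so that value tuples are joined by
      commas and the statement is terminated by a single semicolon, instead
      of the mid-query semicolons shown above.
      
      #include <stdio.h>
      
      /* Illustration only: build one INSERT with comma-separated value
       * tuples; a ";" is appended once after the last tuple, never between. */
      static void build_event_insert(char *buf, size_t len,
                                     const char **tuples, int ntuples)
      {
          size_t off = 0;
          off += snprintf(buf + off, len - off,
                          "insert into \"voyager_event_table\" "
                          "(time_start, time_end, node_name, cluster_nodes, "
                          "reason, reason_uid, state, tres) values ");
          for (int i = 0; i < ntuples; i++)
              off += snprintf(buf + off, len - off, "%s%s",
                              i ? ", " : "", tuples[i]);
          snprintf(buf + off, len - off, ";");
      }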
  6. 13 May, 2019 1 commit
  7. 10 May, 2019 2 commits
    • Only archive 50k records at a time. · ddd49896
      Marshall Garey authored
      Trying to archive too many records at once can result in archive files
      that are too big to read or even too big to be written. Only archive 50k
      records at a time, like we only purge 50k records at a time.
      
      Bug 6033.
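      A rough sketch of the batching idea, assuming a hypothetical helper
      that writes out records in fixed-size chunks rather than all at once:
      
      /* Hypothetical helper: writes up to 'limit' records to the current
       * archive file and returns how many it actually wrote. */
      extern int archive_next_batch(int limit);
      
      #define MAX_ARCHIVE_RECORDS 50000   /* mirror the 50k purge batch size */
      
      /* Keep archiving in fixed-size batches until a partial batch signals
       * that no more records remain, so no single file grows unbounded. */
      static void archive_in_batches(void)
      {
          int n;
          do {
              n = archive_next_batch(MAX_ARCHIVE_RECORDS);
          } while (n == MAX_ARCHIVE_RECORDS);
      }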
    • Handle duplicate archive file names. · 1e234c3d
      Marshall Garey authored
      The time period of the archive file currently depends on submit or start
      time and whether the purge period is in hours, days, or months.
      Previously, if the archive file name already existed, we would overwrite
      the old archive file with the assumption that these are duplicate
      records being archived after an archive load. However, that could result
      in lost records in a couple of ways:
      
        * If there were runaway jobs that were part of an old archive file's
        time period and are later fixed and then purged, the old file would
        be overwritten.
        * If jobs or steps are purged but there are still jobs or steps in
        that time period that are pending or running, the pending or running
        jobs and steps won't be purged. When they finish and are purged, the
        old file would be overwritten.
      
      Instead of overwriting the old file, we append a number to the file name
      to create a new file. This will also be important in an upcoming commit.
      
      Bug 6033.
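      A minimal sketch, with hypothetical names, of appending a numeric
      suffix instead of overwriting an existing archive file:
      
      #include <stdio.h>
      #include <unistd.h>
      
      /* If 'name' is taken, try "name.1", "name.2", ... until a free file
       * name is found, so an existing archive file is never overwritten. */
      static void pick_archive_name(char *out, size_t len, const char *name)
      {
          snprintf(out, len, "%s", name);
          for (int i = 1; access(out, F_OK) == 0; i++)
              snprintf(out, len, "%s.%d", name, i);
      }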
  8. 06 May, 2019 1 commit
    • Fix seff memory display overflow · bab13dfd
      Felip Moll authored
      When the tres_usage_in_max field is empty it is recorded as '' in the
      database, which leads find_tres_count_in_string() to return INFINITE64.
      Seff treats INFINITE64 as a valid value. This patch fixes that issue.
      
      Bug 6817
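      seff itself is a Perl tool; the following is only a C-flavored sketch
      of the check, assuming INFINITE64 is the sentinel returned when a TRES
      value is absent:
      
      #include <stdint.h>
      
      #define INFINITE64 0xffffffffffffffffULL   /* assumed "no value" sentinel */
      
      /* Treat INFINITE64 as "nothing recorded" instead of a real byte count,
       * so an empty tres_usage_in_max no longer inflates the memory figure. */
      static uint64_t mem_or_zero(uint64_t tres_count)
      {
          return (tres_count == INFINITE64) ? 0 : tres_count;
      }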
  9. 03 May, 2019 1 commit
  10. 02 May, 2019 2 commits
    • Fix resubmit to sibling default on fed requeue · 822fe77e
      Broderick Gardner authored
      On requeue, the origin cluster's job record is copied in order to
      resubmit the job to sibling clusters. If the job was originally
      submitted to accept the cluster's default account, partition, etc.,
      those fields have since been filled in on the origin. Here we add
      flags to indicate that those fields need to be cleared on
      resubmission to siblings.
      Bug 6064
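      A hedged sketch, using hypothetical flag and field names rather than
      the actual Slurm symbols, of marking default-filled fields so they can
      be cleared before the job is resubmitted to siblings:
      
      #include <stdint.h>
      #include <stdlib.h>
      
      /* Hypothetical flags recording which fields came from cluster defaults. */
      #define USE_DEF_ACCOUNT   0x01
      #define USE_DEF_PARTITION 0x02
      
      struct fed_job_desc {
          uint32_t default_flags;
          char *account;
          char *partition;
      };
      
      /* Before resubmitting to a sibling, clear anything the origin cluster
       * filled in from its own defaults so the sibling applies its own. */
      static void clear_origin_defaults(struct fed_job_desc *desc)
      {
          if (desc->default_flags & USE_DEF_ACCOUNT) {
              free(desc->account);
              desc->account = NULL;
          }
          if (desc->default_flags & USE_DEF_PARTITION) {
              free(desc->partition);
              desc->partition = NULL;
          }
      }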
    • Fix clearing federation cluster lock on requeue · 47909f8e
      Broderick Gardner authored
      This is a holdover from when the fed job_info list was added.
      The cluster lock has to be cleared from both the job_ptr and
      the job_info.
      Bug 6064
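      A small sketch of the point being made, with hypothetical structures
      standing in for the federated job bookkeeping: the lock has to be
      cleared in both places, not just on the job record.
      
      #include <stdint.h>
      #include <stddef.h>
      
      /* Hypothetical stand-ins, not the actual Slurm structures. */
      struct fed_job_info { uint32_t cluster_lock; };
      struct job_record   { uint32_t fed_cluster_lock;
                            struct fed_job_info *fed_info; };
      
      /* Clear the federation cluster lock in both the job record and the
       * fed job_info entry; clearing only one leaves a stale lock behind. */
      static void clear_fed_cluster_lock(struct job_record *job_ptr)
      {
          job_ptr->fed_cluster_lock = 0;
          if (job_ptr->fed_info)
              job_ptr->fed_info->cluster_lock = 0;
      }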
  11. 30 Apr, 2019 1 commit
  12. 29 Apr, 2019 5 commits
  13. 26 Apr, 2019 3 commits
  14. 24 Apr, 2019 3 commits
  15. 23 Apr, 2019 2 commits
  16. 22 Apr, 2019 1 commit
  17. 18 Apr, 2019 4 commits
  18. 16 Apr, 2019 2 commits