Commits · c11eed5c8cc519b2ce28585c0864457f19716a78 · Manuel G. Marciani / ces_slurm_simulator

28 Jun, 2019 1 commit

Cast reservation flags before sending to MySQL. · c11eed5c

Dominik Bartkiewicz authored May 08, 2019

Flags are stored in a smallint, which can hold only the first 16 of the
32 bits of flags currently in use.

MySQL's overflow rules will treat any value > 0xffff as 0xffff, rather than
dropping the higher-order bits (flags), which means the stored value not
only loses the higher-order bits but corrupts the lower-order as well.

The 19.05 release extends the column to bigint (64 bit).
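
The difference between clamping and masking can be illustrated with a short Python sketch. The helper names and the sample flag word are hypothetical, and the clamp only models the overflow behaviour described in this message; it does not talk to MySQL.

```python
SMALLINT_MAX = 0xFFFF  # largest value that fits in 16 bits


def stored_after_clamp(flags: int) -> int:
    """Model the overflow rule described above: values > 0xffff become 0xffff."""
    return min(flags, SMALLINT_MAX)


def stored_after_cast(flags: int) -> int:
    """Keep only the low 16 bits, which is the effect of casting before the insert."""
    return flags & SMALLINT_MAX


# Hypothetical flag word: one low-order flag (bit 0) plus one flag above bit 15.
flags = 0x0001 | 0x10000

clamped = stored_after_clamp(flags)  # every low-order bit ends up set
cast = stored_after_cast(flags)      # only the original low-order flag survives
print(hex(clamped), hex(cast))
```

With the clamp, the stored value claims that all sixteen low-order flags are set; with the cast, the high-order flag is still lost but the low-order flags stay intact.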

Bug 6969.

c11eed5c

07 Jun, 2019 2 commits
- Correct hetjob job count limit enforcement · fce37327
  Morris Jette authored Jun 06, 2019
```
For heterogeneous jobs, do not count each component against the QOS or
association job limit multiple times.

bug 7190
```
  fce37327
- Add NEWS entry · a2605337
  Albert Gil authored Jun 05, 2019
```
Bug 6847
```
  a2605337
27 May, 2019 1 commit
- Fix seff human readable memory string for values below a megabyte. · 03165950
  Ross Dickson authored May 27, 2019
```
Bug 6466.
```
  03165950
25 May, 2019 1 commit

Fix error messages in _convert_to_name(). · 805aa366

Felip Moll authored May 23, 2019

The name variable hasn't been set yet, so this is always NULL. Print the
uid/gid instead. While here, treat uid/gid as uint32_t, and use strtoul()
rather than atoi() to avoid issues with high-number uid/gid values.
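
Why the width and signedness matter can be shown with a small Python sketch. It is not the C code from this commit: it uses `ctypes` only to model how a high uid looks when squeezed into a signed 32-bit integer versus an unsigned one, and the sample uid is hypothetical.

```python
import ctypes


def as_signed_32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed integer."""
    return ctypes.c_int32(value).value


def as_unsigned_32(value: int) -> int:
    """Reinterpret the low 32 bits of value as an unsigned integer."""
    return ctypes.c_uint32(value).value


high_uid = 4_000_000_000  # hypothetical uid above 2**31 - 1

print(as_signed_32(high_uid))    # negative: the uid no longer identifies the user
print(as_unsigned_32(high_uid))  # unchanged
```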

Fixes GCC 9 warning.

Bug 7101.

805aa366

24 May, 2019 2 commits
- Avoid flooding slurmctld and logging prolog complete RPC errors. · d5131780
  Nate Rini authored May 07, 2019
```
Use RETRY_DELAY to mirror the job complete delay, but without a max retry
count for the time being.

Bug 6970.
```
  d5131780
- Fix minor memory leak when clearing runaway jobs. · e293204d
  Danny Auble authored May 23, 2019
```
Signed-off-by: Brian Christiansen <brian@schedmd.com>
```
  e293204d
23 May, 2019 9 commits
- Don't write "(null)" to event table · 63ac136a
  Brian Christiansen authored May 16, 2019
```
Bug 6964
```
  63ac136a
- Fix reboot failure message getting to event database · ee9db31c
  Brian Christiansen authored May 16, 2019
```
The reason was being set after the message was sent to the db. Also
clear the drain and reboot states before the message is sent so that
the event state will show DOWN.

Bug 6964
```
  ee9db31c
- Create node reboot event in database · f84f60d0
  Brian Christiansen authored May 16, 2019
```
Bug 6964
```
  f84f60d0
- On reboot ASAP clear node from avail_node_bitmap · 7fc23eff
  Brian Christiansen authored May 16, 2019
```
so that new jobs can't get on the node.

Bug 6964
```
  7fc23eff
- Prevent slurmctld from potential segfault after job_start_data() called · 3238e02d
  Dominik Bartkiewicz authored Apr 30, 2019
```
for completing job.

Bug 6927
```
  3238e02d
- Fix sacctmgr --parsable2 output for reservations and tres · 45bfc4dc
  Dominik Bartkiewicz authored Apr 30, 2019
```
Bug 6926
```
  45bfc4dc
- Make it so dependent jobs reset the AccrueTime and do not count against any AccrueTime limits. · c2bc255c
  Alejandro Sanchez authored May 22, 2019
```
Continuation of 89b791bf.

Bug 7045.
```
  c2bc255c
- Add new job bit_flags of JOB_DEPENDENT. · f591f0c9
  Alejandro Sanchez authored May 21, 2019
```
To indicate that a job is dependent or has an invalid dependency.
Not used for now, just added and removed according to its meaning.

Bug 7045.
```
  f591f0c9
- Fix issue with a 17.11 sbcast call to a 18.08 daemon. · 1d66b395
  Albert Gil authored May 23, 2019
```
Bug 7080
```
  1d66b395
22 May, 2019 2 commits

Use correct rank for cloud stepd's. · e7d4d593

Marshall Garey authored Apr 18, 2019

Job steps that run on cloud nodes and use the alias_list - in other
words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
directly back to the slurmctld. To make that happen, we set the parent
rank of each stepd to -1. However, we also set the rank of each stepd to
0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
the slurmctld, they would tell slurmctld to clean up node 0 in the step
allocation. So, multi-node step allocations weren't cleaning up after
the steps completed and would cause subsequent job steps to hang. The
step allocations would only clean up properly at the end of the job.

Ensure that each stepd uses the correct rank so that job steps are
properly cleaned up after each step completes.
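
The effect of every stepd reporting the same rank can be modelled with a toy Python sketch. This is a simplified simulation with hypothetical names, not slurmctld's bookkeeping: each completion message simply releases the node index it names.

```python
from typing import List, Set


def nodes_left_after_step(node_count: int, reported_ranks: List[int]) -> Set[int]:
    """Return the node indices still held by the step after all completion messages."""
    busy = set(range(node_count))
    for rank in reported_ranks:
        # Each message releases only the node that matches the reported rank.
        busy.discard(rank)
    return busy


# Three-node step where every stepd claims to be rank 0.
buggy = nodes_left_after_step(3, [0, 0, 0])
# Same step where each stepd reports its own rank.
fixed = nodes_left_after_step(3, [0, 1, 2])
print(buggy, fixed)
```

In the first case nodes 1 and 2 are never released, which matches the hang described above for multi-node step allocations.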

Bug 6467.

e7d4d593

Move two NEWS entries to appropriate maintenance release. · 09a7da34
Alejandro Sanchez authored May 22, 2019
```
They were associated with these two commits:

b4d7de48
6871185a

Bug 5562.
```
09a7da34

21 May, 2019 3 commits

Correctly set unlimited sched_job_limit · 69621444

Dominik Bartkiewicz authored May 06, 2019



Unlimited could get overwritten with the default queue depth, preventing the
whole queue from being looked at -- especially in a high-throughput
environment.

Bug 6822

Co-authored-by: Morris Jette <jette@schedmd.com>

69621444

cons_res/job_test - fix to consider a node's current allocated memory. · b4d7de48

Alejandro Sanchez authored Apr 11, 2019

Node memory overallocation wouldn't be properly detected since we would
just be interpreting the available memory as RealMemory - MemSpecLimit,
ignoring other jobs' memory usage.
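
The arithmetic behind this fix can be sketched in Python. The function and parameter names are hypothetical and the numbers are made up; the sketch only contrasts the check that ignores allocated memory with one that subtracts it.

```python
def available_memory_mb(real_memory: int, mem_spec_limit: int, allocated: int = 0) -> int:
    """Memory a new job may still use on the node, in megabytes."""
    return real_memory - mem_spec_limit - allocated


def job_fits(request_mb: int, real_memory: int, mem_spec_limit: int, allocated: int = 0) -> bool:
    """True if the request does not exceed the node's available memory."""
    return request_mb <= available_memory_mb(real_memory, mem_spec_limit, allocated)


# Hypothetical node: 64000 MB RealMemory, 4000 MB MemSpecLimit,
# 50000 MB already allocated to other jobs.
ignoring_usage = job_fits(20000, 64000, 4000)             # allocated memory not considered
counting_usage = job_fits(20000, 64000, 4000, 50000)      # allocated memory subtracted
print(ignoring_usage, counting_usage)
```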

Bug 5562.

b4d7de48

cons_res/job_test - prevent a job from overallocating a node memory. · 6871185a

Alejandro Sanchez authored Apr 11, 2019

This compares a job's memory request against each selected node's available
memory, interpreting the latter for now as RealMemory - MemSpecLimit.

Bug 5562.

6871185a

17 May, 2019 2 commits
- Fix NEWS from previous commit. · 438ffc1c
  Tim Wickberg authored May 16, 2019
```
This is select/cons_res, not select/cons_tres.
```
  438ffc1c
- Only allocate 1 CPU per node with the --overcommit and --nodelist options · 46197135
  Morris Jette authored May 10, 2019
```
Previous select/cons_res logic would allocate one CPU per task on the node.

Bug 6981
```
  46197135
16 May, 2019 1 commit

Fix archive loading events. · 0d0f9deb

Marshall Garey authored May 15, 2019

There was a syntax error in the mysql for inserting the event records
into the event table caused by commit 3d61b6aa. The syntax error was
a semicolon in the middle of the query, for example:

insert into "voyager_event_table" (time_start, time_end, node_name,
cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
'1539298628', 'v1', '', 'cold-start', '1017', '0',
'1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...
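
For contrast, a well-formed multi-row insert separates the value tuples with commas and carries a single terminating semicolon. The Python sketch below only illustrates that statement shape; the helper is hypothetical, the second record is made up, and real code should use parameterized queries rather than string formatting.

```python
from typing import Iterable, Sequence


def build_multi_row_insert(table: str, columns: Sequence[str],
                           rows: Iterable[Sequence[str]]) -> str:
    """Join the value tuples with commas and end the statement with one semicolon."""
    tuples = ", ".join(
        "(" + ", ".join("'%s'" % value for value in row) + ")" for row in rows
    )
    return 'insert into "%s" (%s) values %s;' % (table, ", ".join(columns), tuples)


columns = ["time_start", "time_end", "node_name", "cluster_nodes",
           "reason", "reason_uid", "state", "tres"]
rows = [
    # First record taken from the example above.
    ["1538669453", "1539298628", "v1", "", "cold-start", "1017", "0",
     "1=8,2=4000,5=8,1001=4,1002=1"],
    # Second record is made up for illustration.
    ["1539298628", "1539300000", "v2", "", "cold-start", "1017", "0",
     "1=8,2=4000,5=8,1001=4,1002=1"],
]
query = build_multi_row_insert("voyager_event_table", columns, rows)
print(query)
```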

Bug 7025.

0d0f9deb

13 May, 2019 1 commit
- Remove stray newlines in SPANK error messages. · 3c68e645
  Tim Wickberg authored May 13, 2019
  
  3c68e645
10 May, 2019 2 commits

Only archive 50k records at a time. · ddd49896

Marshall Garey authored Apr 24, 2019

Trying to archive too many records at once can result in archive files
that are too big to read or even too big to be written. Only archive 50k
records at a time, like we only purge 50k records at a time.
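
The batching idea can be sketched in a few lines of Python. The constant and helper names are hypothetical, and the sketch slices an in-memory list, whereas the real code works against database tables.

```python
from typing import Iterator, List, Sequence

ARCHIVE_BATCH_SIZE = 50_000  # batch size named in the commit message


def batches(records: Sequence[int], size: int = ARCHIVE_BATCH_SIZE) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# 120,000 stand-in records end up in three archive batches.
sizes: List[int] = [len(batch) for batch in batches(list(range(120_000)))]
print(sizes)
```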

Bug 6033.

ddd49896

Handle duplicate archive file names. · 1e234c3d

Marshall Garey authored Apr 24, 2019

The time period of the archive file currently depends on submit or start
time and whether the purge period is in hours, days, or months.
Previously, if the archive file name already exists, we would overwrite
the old archive file with the assumption that these are duplicate
records being archived after an archive load. However, that could result
in lost records in a couple of ways:

  * If there were runaway jobs that were part of an old archive file's
  time period and are later fixed and then purged, the old file would
  be overwritten.
  * If jobs or steps are purged but there are still jobs or steps in
  that time period that are pending or running, the pending or running
  jobs and steps won't be purged. When they finish and are purged, the
  old file would be overwritten.

Instead of overwriting the old file, we append a number to the file name
to create a new file. This will also be important in an upcoming commit.
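
One way to pick a non-clashing name is sketched below in Python. The `.N` suffix format and the function name are assumptions made for illustration; the commit message does not say how the number is attached to the file name.

```python
import os


def next_free_archive_path(path: str) -> str:
    """Return `path` if unused, otherwise the first unused `path.N` (N = 1, 2, ...)."""
    if not os.path.exists(path):
        return path
    suffix = 1
    while os.path.exists("%s.%d" % (path, suffix)):
        suffix += 1
    return "%s.%d" % (path, suffix)
```

Because an existing file is never reused, records written in an earlier archive pass for the same time period are preserved.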

Bug 6033.

1e234c3d

06 May, 2019 1 commit

Fix seff memory display overflow · bab13dfd

Felip Moll authored Apr 15, 2019

When the tres_usage_in_max field is empty, it is recorded as '' in the database,
which leads find_tres_count_in_string() to return INFINITE64. Seff treats
INFINITE64 as a valid value. This patch fixes this issue.
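
The sentinel problem can be reproduced with a small Python sketch. It is not seff's code: the parser is a simplified stand-in for find_tres_count_in_string(), the value of `INFINITE64` and the use of TRES id 2 for memory are assumptions, and returning 0 for a missing value is just one possible way to handle the sentinel.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # assumed sentinel meaning "not found"


def find_tres_count(tres_str: str, tres_id: int) -> int:
    """Return the count for tres_id in an 'id=count,...' string, or INFINITE64."""
    for item in tres_str.split(","):
        key, sep, count = item.partition("=")
        if sep and key == str(tres_id):
            return int(count)
    return INFINITE64


def max_memory_or_zero(tres_str: str, mem_id: int = 2) -> int:
    """Treat the sentinel as 'no data' instead of as a real memory value."""
    count = find_tres_count(tres_str, mem_id)
    return 0 if count == INFINITE64 else count


print(find_tres_count("", 2))        # sentinel, not a usable memory figure
print(max_memory_or_zero(""))
print(max_memory_or_zero("1=8,2=4000"))
```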

Bug 6817

bab13dfd

03 May, 2019 1 commit
- Free memory before exiting in sacctmgr_list_runaway_jobs(). · f56fc717
  Nate Rini authored May 03, 2019
```
Bug 6880/6952.
```
  f56fc717
02 May, 2019 2 commits

Fix resubmit to sibling default on fed requeue · 822fe77e

Broderick Gardner authored Apr 18, 2019

On requeue, the origin cluster job record is copied to submit
to sibling clusters. If the job was originally submitted
to accept cluster default account, partition, etc, those fields
are now filled in on the origin. Here we add flags to indicate
that those fields need to be cleared on resubmission to siblings.
Bug 6064

822fe77e

Fix clearing federation cluster lock on requeue · 47909f8e

Broderick Gardner authored Mar 25, 2019

This is a holdover from when the fed job_info list was added.
The cluster lock has to be cleared from both the job_ptr and
the job_info.
Bug 6064

47909f8e

30 Apr, 2019 1 commit
- Fix memory leak in group_cache.c · 876bd712
  Danny Auble authored Apr 30, 2019
```
Blessed by Tim.
```
  876bd712
29 Apr, 2019 5 commits
- Add NEWS for previous two commits · 00a8e724
  Brian Christiansen authored Apr 25, 2019
```
Bug 6513
```
  00a8e724
- Fix printing duplicate error messages of lua rejected jobs · 297a6880
  Nate Rini authored Apr 22, 2019
```
Regression from 70b4e06d.

Bug 6892.
```
  297a6880
- Fix segfault when loading/unloading lua job submit plugin multiple times · 8920863a
  Nate Rini authored Apr 22, 2019
```
Bug 6895.
```
  8920863a
- Allow submit plugins to be turned on and off with scontrol reconfig · a0e14237
  Brian Christiansen authored Apr 25, 2019
```
Bug 6895
```
  a0e14237
- Fix unnecessary reloading of submit plugins · b50ac244
  Brian Christiansen authored Apr 24, 2019
```
Bug 6895
```
  b50ac244
26 Apr, 2019 3 commits

Limit records per single SQL statement when loading archived data. · 34e9d41b

Nate Rini authored Apr 26, 2019



Otherwise, we could send communication packets bigger than max_allowed_packet.

Bug 6832.

Co-authored-by: Tim Wickberg <tim@schedmd.com>

34e9d41b

accounting_storage/mysql - fix memory leak in the archive load logic. · e8567e06
Alejandro Sanchez authored Apr 26, 2019
```
Regression introduced in 8d643e79.

Bug 6832.
```
e8567e06

accounting_storage/mysql - fix SIGABRT in the archive load logic. · e174e135

Nate Rini authored Apr 26, 2019

The problem was freeing an interior pointer to buffer contents before
the call to FREE_NULL_BUFFER. The issue was only triggered when loading
archived data with protocol version < 17.11.

Regression introduced in 8d643e79.

Bug 6832.

e174e135