Commits · 6ad81fcd6b1c03cf704644c7c3fed88b2d939b1d · Manuel G. Marciani / ces_slurm_simulator

16 Jan, 2014 1 commit
- Fix slurmstepd lock when job terminates inside the infiniband · 6ad81fcd
  David Bigagli authored Jan 16, 2014
```
network traffic accounting plugin.
```
  6ad81fcd
15 Jan, 2014 1 commit
- sview - Fix regression where the Node tab wasn't able to · 977ba133
  Danny Auble authored Jan 15, 2014
```
add/remove columns.

caused by commit 68f0f5db
```
  977ba133
13 Jan, 2014 2 commits
- Do not reset priority of job if explicitly set · 470fcc51
  Morris Jette authored Jan 13, 2014
```
Do not reset a job's priority when the slurmctld restarts if previously
set to some specific value.
bug 561
```
  470fcc51
- Increase the PW_BUF_SIZE so the getgrnam_r() can process large user · 2fa28eb6
  John Morrissey authored Jan 12, 2014
```
groups.
```
  2fa28eb6
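
The commit above enlarges a fixed buffer. As a related illustration (a different technique, not what the commit does), here is a minimal sketch that retries `getgrnam_r()` with a larger buffer whenever it reports `ERANGE`, which is how POSIX signals that the group entry did not fit. The helper name `lookup_gid` and both size constants are assumptions made for this example and are not taken from Slurm.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper, not part of Slurm: look up a group's GID by name,
 * doubling the scratch buffer whenever getgrnam_r() reports ERANGE.
 * Returns 0 on success, -1 if the group is missing or on any other error. */
static int lookup_gid(const char *name, gid_t *gid_out)
{
	size_t len = 1024;			/* illustrative starting size */
	const size_t max_len = 1u << 20;	/* illustrative upper bound */

	assert(name && gid_out);
	while (len <= max_len) {
		struct group grp, *result = NULL;
		char *buf = malloc(len);
		int rc;

		if (!buf)
			return -1;
		rc = getgrnam_r(name, &grp, buf, len, &result);
		if (rc == 0 && result) {
			*gid_out = result->gr_gid;
			free(buf);
			return 0;
		}
		free(buf);
		if (rc != ERANGE)
			return -1;	/* not found, or a real error */
		len *= 2;		/* entry did not fit: retry larger */
	}
	return -1;
}
```

A group with a very long member list simply triggers more iterations of the loop instead of a failed lookup.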
08 Jan, 2014 3 commits
- Update sshare.1 man page making it consistent with sacctmgr.1. · 047f1e0c
  David Bigagli authored Jan 08, 2014
  
  047f1e0c
- Revert "Update the sshare.1 man page making it consistent with sacctmgr.1." · 578271e8
  David Bigagli authored Jan 08, 2014
```
This reverts commit 3464295e.
```
  578271e8
- Update the sshare.1 man page making it consistent with sacctmgr.1. · 3464295e
  David Bigagli authored Jan 08, 2014
  
  3464295e
07 Jan, 2014 2 commits
- Remove vestigial note. · dcb67076
  Danny Auble authored Jan 07, 2014
  
  dcb67076
- If FastSchedule=0 just log low mem or disk · 648735c5
  Morris Jette authored Jan 07, 2014
```
Do not mark the node DOWN if its memory or tmp disk space is lower
than configured, just log it using debug message type
```
  648735c5
06 Jan, 2014 2 commits

Reset job priority on manual resume · 65d9196c

Morris Jette authored Jan 06, 2014

If a job is explicitly suspended, its priority is set to zero.
This resets the priority when requeued and also documents that
if the job is requeued (e.g. due to a node failure), then it
is placed in a held state.

65d9196c

Correct job RunTime if requeued from suspend state · bc3d8828

Morris Jette authored Jan 06, 2014

Without this patch, the job's RunTime includes its RunTime from
before its prior suspend (i.e. the job's full RunTime rather than
just the RunTime of the requeued job).

bc3d8828

27 Dec, 2013 1 commit

Fix sched/backfill bug that could starve jobs · 2bae8bd6

Filip Skalski authored Dec 27, 2013

Hello,

I think I found another bug in the code (I'm using 2.6.3 but I checked the 2.6.5 and 14.03 versions and it's the same there).

In file sched/backfill/backfill.c:

1)
`_add_reservation` function, from line 1172:

```
if (placed == true) {
        j = node_space[j].next;
        if (j && (end_reserve < node_space[j].end_time)) {
                /* insert end entry record */
                i = *node_space_recs;
                node_space[i].begin_time = end_reserve;
                node_space[i].end_time = node_space[j].end_time;
                node_space[j].end_time = end_reserve;
                node_space[i].avail_bitmap =
                        bit_copy(node_space[j].avail_bitmap);
                node_space[i].next = node_space[j].next;
                node_space[j].next = i;
                (*node_space_recs)++;
        }
        break;
}
```
I drew a picture of the `node_space` state after 2 iterations (see attachment).

If the new reservation lies fully inside another reservation,
then everything is OK.
But if the new reservation spans multiple existing reservations, then the `end entry record` is not created.
This is because only the newly created `start entry record` is checked.

An easy fix would be to change the if into a loop, for example:

```
if (placed == true) {
        while ((j = node_space[j].next) > 0) {
                if (end_reserve < node_space[j].end_time) {
                        /* same as above: insert end entry record */
                        break;
                }
        }
        break;
}
```

2)
You could also change line 612:
```
        node_space = xmalloc(sizeof(node_space_map_t) *
                             (max_backfill_job_cnt + 3));
```
to `(max_backfill_job_cnt * 2 + 1)`, since each reservation can add at most two entries (the check at line 982 should then never execute). At the moment, in the worst case, only half of max_backfill_job_cnt jobs are checked.

NOTE: All of this assumes that the current behavior is not deliberate, i.e. that it does not intentionally trade some accuracy for faster calculations (especially point 2).

Best regards,
Filip Skalski

2bae8bd6
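
The slot-splitting idea described in the message above can be sketched in isolation. This is a simplified illustration, not the code in `backfill.c`: it uses a plain `unsigned` bit mask in place of Slurm's bitmaps, and the names `slot`, `split_at` and `add_reservation` are hypothetical. The point it shows is that the end boundary is searched for along the whole list, so a reservation spanning several existing slots still gets its end record.

```c
#include <assert.h>

/* Simplified stand-in for one node_space record (hypothetical layout). */
struct slot {
	long begin;	/* slot start time */
	long end;	/* slot end time (exclusive) */
	unsigned avail;	/* bit i set: node i is free during this slot */
	int next;	/* index of the next slot, 0 ends the list */
};

/* Make sure a slot boundary exists at time t by splitting the slot that
 * strictly contains t. Walks the whole list starting at record 0. */
static void split_at(struct slot *ns, int *recs, long t)
{
	int j = 0;

	do {
		if (ns[j].begin < t && t < ns[j].end) {
			int i = (*recs)++;	/* new record: [t, old end) */
			ns[i] = ns[j];
			ns[i].begin = t;
			ns[j].end = t;
			ns[j].next = i;
			return;
		}
		j = ns[j].next;
	} while (j != 0);
}

/* Reserve the nodes in `reserved` for [start, end_reserve).
 * The caller must size ns[] for two extra records per reservation. */
static void add_reservation(struct slot *ns, int *recs, long start,
			    long end_reserve, unsigned reserved)
{
	int j = 0;

	assert(start < end_reserve);
	split_at(ns, recs, start);		/* start entry record */
	split_at(ns, recs, end_reserve);	/* end entry record */
	do {
		if (ns[j].begin >= start && ns[j].end <= end_reserve)
			ns[j].avail &= ~reserved;
		j = ns[j].next;
	} while (j != 0);
}
```

Each call adds at most two records, which matches the `max_backfill_job_cnt * 2 + 1` sizing suggested in point 2 of the message.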

23 Dec, 2013 2 commits
- Update news to start 2.6.6 · b281bb01
  Morris Jette authored Dec 23, 2013
  
  b281bb01
- Update AllocNodes paragraph in slurm.conf.5. · ad58005e
  David Bigagli authored Dec 23, 2013
  
  ad58005e
20 Dec, 2013 2 commits
- BGQ - Add midplane to the total_cnodes used in the runjob_mux plugin · 626d44e0
  Danny Auble authored Dec 20, 2013
```
for better debug
```
  626d44e0
- BGQ - Fix issue if user runs multiple sub-block jobs inside a multiple · c2af01cb
  Danny Auble authored Dec 20, 2013
```
midplane block that starts on a higher coordinate than it ends (i.e. if a
block has midplanes [0010,0013] 0013 is the start even though it is
listed second in the hostlist).
```
  c2af01cb
19 Dec, 2013 1 commit

scontrol show job - Correct NumNodes value · b31e2176

Morris Jette authored Dec 19, 2013

It has been changed to improve the calculated value for pending
jobs and use the actual node count value for jobs that have been
started (including suspended, completed, etc.)
bug 549

b31e2176

18 Dec, 2013 1 commit
- BGQ - Fix logic to better handle when midplanes come back on line after · 6946b82e
  Danny Auble authored Dec 18, 2013
```
being in error.
```
  6946b82e
17 Dec, 2013 2 commits
- Validate return code from calls to slurm_get_peer_addr · f9ea996b
  Danny Auble authored Dec 16, 2013
  
  f9ea996b
- If a connection is reset while trying to talk to it slurm_get_peer_addr · d3aec398
  Danny Auble authored Dec 16, 2013
```
will return ENOTCONN and not initialize the addr_str causing valgrind
errors.
```
  d3aec398
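
The pattern behind the two commits above (check the return code, and never leave the address string uninitialized) can be sketched with the plain socket API. This is an illustration under assumptions: it uses POSIX `getpeername()` rather than Slurm's `slurm_get_peer_addr`, handles IPv4 only, and the helper name `peer_addr_str` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper, not a Slurm function: write the peer's IPv4
 * address into addr_str. The buffer gets a defined value first, so it is
 * safe to log even when getpeername() fails (e.g. with ENOTCONN).
 * Returns 0 on success, -1 on failure. */
static int peer_addr_str(int fd, char *addr_str, size_t len)
{
	struct sockaddr_in addr;
	socklen_t addr_len = sizeof(addr);

	assert(addr_str && len > 0);
	snprintf(addr_str, len, "unknown");	/* defined fallback value */
	memset(&addr, 0, sizeof(addr));

	if (getpeername(fd, (struct sockaddr *) &addr, &addr_len) != 0)
		return -1;		/* callers must check this */
	if (addr.sin_family != AF_INET)
		return -1;		/* sketch handles IPv4 only */
	if (!inet_ntop(AF_INET, &addr.sin_addr, addr_str, (socklen_t) len)) {
		snprintf(addr_str, len, "unknown");
		return -1;
	}
	return 0;
}
```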
16 Dec, 2013 1 commit
- scontrol to operate on multiple jobs · ed3398f0
  Hughes, Doug authored Dec 16, 2013
```
This allows hold, uhold, resume, suspend, release, etc. to operate on multiple job IDs.
```
  ed3398f0
14 Dec, 2013 1 commit
- Fix minor memory leak · b4015f90
  Danny Auble authored Dec 13, 2013
  
  b4015f90
13 Dec, 2013 2 commits

Fix erroneous error messages when running gang scheduling. · 206dc223
Danny Auble authored Dec 13, 2013

206dc223

Fix slurmstepd race condition causing abort · be703c47

Morris Jette authored Dec 13, 2013

Fix slurmstepd race condition when separate threads are reading and
modifying the job's environment, which can result in the slurmstepd failing
with an invalid memory reference. Observed at shutdown when trying
to run the task epilog and trying to read the env var:
SLURM_STEP_KILLED_MSG_NODE_ID

be703c47

12 Dec, 2013 1 commit

slurmstepd variable initialization · 06b41cdc

Morris Jette authored Dec 11, 2013

Without this patch, free() is called on a random memory location
(i.e. whatever is on the stack), which can result in slurmstepd
dying and a completed job not being purged in a timely fashion.

06b41cdc

11 Dec, 2013 2 commits
- HDF5 - Fix minor memory leak. · 960dd5cd
  Danny Auble authored Dec 11, 2013
  
  960dd5cd
- Fix race condition in auth cred create · 925bcfb6
  Morris Jette authored Dec 11, 2013
```
Fix race condition in authentication credential creation that could corrupt
memory. (NOTE: This race condition has existed since 2003 and would be
exceedingly rare.)
```
  925bcfb6
09 Dec, 2013 2 commits

Modify squeue to support longer job ID values · 17f27007
Morris Jette authored Dec 09, 2013
```
This is needed for job arrays with discontiguous task ID values
(e.g. "123_[1,3,5,...99999]")
```
17f27007

Improve sview support for job arrays · d998640f

Morris Jette authored Dec 09, 2013

Previously job arrays were only listed with their native job ID
(e.g. 123_0 listed as 123, 123_1 as 124, etc). Now lists the job ID
using both formats (e.g. "123_1 (124)"). The same format is used
for job step IDs (e.g. "123_1.2 (124.2)").

d998640f
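
The two display formats quoted above can be reproduced with `snprintf()`. This is only a formatting sketch. The function and parameter names are hypothetical and do not come from sview.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical formatter: "<array_job_id>_<task_id> (<job_id>)". */
static int format_array_job_id(char *buf, size_t len, uint32_t array_job_id,
			       uint32_t task_id, uint32_t job_id)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "%" PRIu32 "_%" PRIu32 " (%" PRIu32 ")",
			array_job_id, task_id, job_id);
}

/* Hypothetical formatter for job steps:
 * "<array_job_id>_<task_id>.<step_id> (<job_id>.<step_id>)". */
static int format_array_step_id(char *buf, size_t len, uint32_t array_job_id,
				uint32_t task_id, uint32_t job_id,
				uint32_t step_id)
{
	assert(buf && len > 0);
	return snprintf(buf, len,
			"%" PRIu32 "_%" PRIu32 ".%" PRIu32
			" (%" PRIu32 ".%" PRIu32 ")",
			array_job_id, task_id, step_id, job_id, step_id);
}
```

Called with array job 123, task 1, job ID 124 and step 2, these produce the strings shown in the commit message.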

08 Dec, 2013 1 commit
- Describe previous commit in NEWS · b19bd476
  jette authored Dec 07, 2013
  
  b19bd476
07 Dec, 2013 2 commits
- Fix srun hang when IO fails to start at launch. · f98413b1
  Danny Auble authored Dec 06, 2013
  
  f98413b1
- Handle in the API if parts of the node structure are NULL. · 5d3b1d4e
  Philip D. Eckert authored Dec 06, 2013
  
  5d3b1d4e
06 Dec, 2013 2 commits
- Added ApbasilTimeout parameter to the cray.conf · 270f696e
  Trofinoff Stephen authored Dec 06, 2013
```
This adds a mechanism to kill a hung apbasil command
```
  270f696e
- Fix for hwloc returning zero core count · ec4df3bf
  Jason Bacon authored Dec 06, 2013
  
  ec4df3bf
05 Dec, 2013 1 commit
- Remove notice of CVE with very old/deprecated versions of Slurm in · 723fb063
  Danny Auble authored Dec 05, 2013
```
news.html.
```
  723fb063
04 Dec, 2013 1 commit
- jobcomp/filetxt - Reopen the file on SIGHUP · fd7c9360
  Morris Jette authored Dec 03, 2013
```
Previous logic never reopened the file, preventing proper functioning
of logrotate.
```
  fd7c9360
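
A common way to make a log writer cooperate with logrotate is to set a flag in the `SIGHUP` handler and reopen the file on the next write. The sketch below illustrates that pattern in general terms. It is not the jobcomp/filetxt code, and the names `log_init`, `log_write` and `reopen_requested` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t reopen_requested = 0;
static FILE *log_fp = NULL;

/* Signal handler: only set a flag, which is async-signal-safe. */
static void on_sighup(int signo)
{
	(void) signo;
	reopen_requested = 1;
}

/* Install the SIGHUP handler. Returns 0 on success. */
static int log_init(void)
{
	struct sigaction sa;

	sa.sa_handler = on_sighup;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	return sigaction(SIGHUP, &sa, NULL);
}

/* Append one record to path. If SIGHUP arrived since the last write, the
 * old stream is closed first so the file is reopened by name, which picks
 * up the fresh file left behind by a rotation. Returns 0 on success. */
static int log_write(const char *path, const char *record)
{
	assert(path && record);
	if (reopen_requested) {
		reopen_requested = 0;
		if (log_fp) {
			fclose(log_fp);
			log_fp = NULL;
		}
	}
	if (!log_fp && !(log_fp = fopen(path, "a")))
		return -1;
	if (fprintf(log_fp, "%s\n", record) < 0)
		return -1;
	return fflush(log_fp);
}
```

Without the reopen step the writer keeps appending to the renamed file, which is the symptom the commit message describes.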
03 Dec, 2013 3 commits

Improve REQUEST_JOB_INFO_SINGLE RPC performance · 80d3b343
Morris Jette authored Dec 03, 2013
```
Use hash function to locate job records for improved performance.
```
80d3b343
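
The idea of locating a job record through a hash instead of a list scan can be sketched as follows. This is a minimal chained hash table written for illustration. The structure layout, table size and function names are assumptions and do not reflect slurmctld's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOB_HASH_SIZE 64	/* illustrative table size */

/* Hypothetical, pared-down job record. */
struct job_rec {
	uint32_t job_id;
	struct job_rec *hash_next;	/* next record in the same bucket */
};

static struct job_rec *job_hash[JOB_HASH_SIZE];

static unsigned job_hash_inx(uint32_t job_id)
{
	return job_id % JOB_HASH_SIZE;
}

/* Insert a record at the head of its bucket's chain. */
static void job_hash_add(struct job_rec *job)
{
	unsigned inx;

	assert(job);
	inx = job_hash_inx(job->job_id);
	job->hash_next = job_hash[inx];
	job_hash[inx] = job;
}

/* Return the record for job_id, or NULL if it is not in the table.
 * Only the records in one bucket are visited. */
static struct job_rec *job_hash_find(uint32_t job_id)
{
	struct job_rec *job = job_hash[job_hash_inx(job_id)];

	while (job && job->job_id != job_id)
		job = job->hash_next;
	return job;
}
```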

Improve REQUEST_JOB_INFO_SINGLE RPC performance · 14bcfe58

Morris Jette authored Dec 03, 2013

Change partition write lock to a read lock as we use a different
mechanism for hidden partitions in getting individual jobs.

14bcfe58

Correct job dependency string · 08265c03

Morris Jette authored Dec 02, 2013

Correct logic returning remaining job dependencies in job information
reported by scontrol and squeue. Eliminates vestigial descriptors with
no job ID values (e.g. "afterany"). As dependencies are removed, the
job ID values were removed from the strings, but not the descriptors.
This eliminates both. It also checks the full job ID to make sure we do
not remove "afterany:1234" when job "123" completes.

08265c03
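
The two rules in the message above (match whole job IDs, and drop descriptors left with no IDs) can be sketched as a small string rewrite. This is an illustration only, assuming the `type:id[:id...]` comma-separated dependency syntax. The helper names `append` and `dep_remove_job` and the 128-byte clause buffer are hypothetical and not taken from slurmctld.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Append s to the NUL-terminated buf of capacity len; -1 if it won't fit. */
static int append(char *buf, size_t len, const char *s)
{
	size_t used = strlen(buf);

	if (used + strlen(s) + 1 > len)
		return -1;
	strcpy(buf + used, s);
	return 0;
}

/* Hypothetical helper: copy dependency string dep into out, leaving out
 * the completed job done_id. IDs are compared as whole tokens, so "123"
 * never matches "1234". Returns 0 on success, -1 on overflow or ENOMEM. */
static int dep_remove_job(const char *dep, const char *done_id,
			  char *out, size_t out_len)
{
	char *copy, *clause, *save_clause = NULL;
	int rc = 0;

	assert(dep && done_id && out && out_len > 0);
	out[0] = '\0';
	if (!(copy = strdup(dep)))
		return -1;

	for (clause = strtok_r(copy, ",", &save_clause); clause && !rc;
	     clause = strtok_r(NULL, ",", &save_clause)) {
		char kept[128] = "";
		char *save_id = NULL, *id;
		char *type = strtok_r(clause, ":", &save_id);
		int ids_left = 0;

		if (!type)
			continue;
		rc = append(kept, sizeof(kept), type);
		while (!rc && (id = strtok_r(NULL, ":", &save_id))) {
			if (strcmp(id, done_id) == 0)
				continue;	/* whole-ID match only */
			ids_left++;
			if (append(kept, sizeof(kept), ":") ||
			    append(kept, sizeof(kept), id))
				rc = -1;
		}
		if (rc || !ids_left)
			continue;	/* drop descriptor with no IDs left */
		if (out[0] && append(out, out_len, ","))
			rc = -1;
		else if (append(out, out_len, kept))
			rc = -1;
	}
	free(copy);
	return rc;
}
```

For example, removing job "123" from "afterany:1234" leaves the string unchanged, while removing it from "afterany:123" leaves an empty string rather than a bare "afterany".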

02 Dec, 2013 1 commit

fix race condition in batch exit code · 6d1d932b

Morris Jette authored Dec 02, 2013

Fix race condition on batch job termination that could result in a job exit
code of 0xfffffffe if the slurmd on node zero registers its active jobs at
the same time that slurmstepd is recording the job's exit code.
bug 535

6d1d932b