Commits · 648735c5161e25cbe86cb5441d1d4c9f1beed7d8 · Manuel G. Marciani / ces_slurm_simulator

07 Jan, 2014 1 commit

If FastSchedule=0 just log low mem or disk · 648735c5

Morris Jette authored Jan 07, 2014

Do not mark the node DOWN if its memory or tmp disk space is lower
than configured, just log it using debug message type

648735c5

06 Jan, 2014 2 commits

Reset job priority on manual resume · 65d9196c

Morris Jette authored Jan 06, 2014

If a job is explicitly suspended, its priority is set to zero.
This resets the priority when requeued and also documents that
if the job is requeued (e.g. due to a node failure), then it
is placed in a held state.

65d9196c

Correct job RunTime if requeued from suspend state · bc3d8828

Morris Jette authored Jan 06, 2014

Without this patch, the job's RunTime includes its RunTime from
before its prior suspend (i.e. the job's full RunTime rather than
just the RunTime of the requeued job).

bc3d8828

27 Dec, 2013 1 commit

Fix sched/backfill bug that could starve jobs · 2bae8bd6

Filip Skalski authored Dec 27, 2013

Hello,

I think I found another bug in the code (I'm using 2.6.3 but I checked the 2.6.5 and 14.03 versions and it's the same there).

In file sched/backfill/backfill.c:

1)
_add_reservation function, from lines 1172:

if (placed == true) {
        j = node_space[j].next;
        if (j && (end_reserve < node_space[j].end_time)) {
                /* insert end entry record */
                i = *node_space_recs;
                node_space[i].begin_time = end_reserve;
                node_space[i].end_time = node_space[j].end_time;
                node_space[j].end_time = end_reserve;
                node_space[i].avail_bitmap =
                        bit_copy(node_space[j].avail_bitmap);
                node_space[i].next = node_space[j].next;
                node_space[j].next = i;
                (*node_space_recs)++;
        }
        break;
}
I drew a picture of the `node_space` state after 2 iterations (see attachment).

In case where the new reservation i...

2bae8bd6
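
As a reading aid for the snippet quoted in 2bae8bd6, the record split it performs can be sketched in isolation. This is a simplified illustration only: `struct space_rec` and `split_record` are hypothetical names, the `avail_bitmap` copy is omitted, and because the message is truncated the sketch shows neither the reported bug nor its fix.

```c
#include <time.h>

/* Hypothetical, reduced version of a node_space entry: a time window
 * plus the index of the next window (0 terminates the chain). */
struct space_rec {
	time_t begin_time;
	time_t end_time;
	int next;
};

/* Split record j at split_at, appending the tail as a new record.
 * The caller must guarantee the array has room for one more entry. */
static void split_record(struct space_rec *recs, int *rec_cnt,
			 int j, time_t split_at)
{
	int i = *rec_cnt;

	/* Nothing to do unless split_at falls strictly inside record j. */
	if (split_at <= recs[j].begin_time || split_at >= recs[j].end_time)
		return;

	recs[i].begin_time = split_at;
	recs[i].end_time = recs[j].end_time;
	recs[j].end_time = split_at;
	recs[i].next = recs[j].next;	/* new tail inherits j's successor */
	recs[j].next = i;		/* j now points at the new tail */
	(*rec_cnt)++;
}
```

Splitting a single window `[0, 100)` at 40 leaves record 0 as `[0, 40)` linked to a new record 1 covering `[40, 100)`.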

23 Dec, 2013 2 commits
- Update news to start 2.6.6 · b281bb01
  Morris Jette authored Dec 23, 2013
  
  b281bb01
- Update AllocNodes paragraph in slurm.conf.5. · ad58005e
  David Bigagli authored Dec 23, 2013
  
  ad58005e
20 Dec, 2013 2 commits
- BGQ - Add midplane to the total_cnodes used in the runjob_mux plugin · 626d44e0
  Danny Auble authored Dec 20, 2013
```
for better debug
```
  626d44e0
- BGQ - Fix issue if user runs multiple sub-block jobs inside a multiple · c2af01cb
  Danny Auble authored Dec 20, 2013
```
midplane block that starts on a higher coordinate than it ends (i.e. if a
block has midplanes [0010,0013] 0013 is the start even though it is
listed second in the hostlist).
```
  c2af01cb
19 Dec, 2013 1 commit

scontrol show job - Correct NumNodes value · b31e2176

Morris Jette authored Dec 19, 2013

It has been changed to improve the calculated value for pending
jobs and use the actual node count value for jobs that have been
started (including suspended, completed, etc.)
bug 549

b31e2176

18 Dec, 2013 1 commit
- BGQ - Fix logic to better handle when midplanes come back on line after · 6946b82e
  Danny Auble authored Dec 18, 2013
```
being in error.
```
  6946b82e
17 Dec, 2013 2 commits
- Validate return code from calls to slurm_get_peer_addr · f9ea996b
  Danny Auble authored Dec 16, 2013
  
  f9ea996b
- If a connection is reset while trying to talk to it slurm_get_peer_addr · d3aec398
  Danny Auble authored Dec 16, 2013
```
will return ENOTCONN and not initialize the addr_str causing valgrind
errors.
```
  d3aec398
16 Dec, 2013 1 commit
- scontrol to operate on multiple jobs · ed3398f0
  Hughes, Doug authored Dec 16, 2013
```
This allows hold, uhold, resume, suspend, release, etc. to take multiple job IDs.
```
  ed3398f0
14 Dec, 2013 1 commit
- Fix minor memory leak · b4015f90
  Danny Auble authored Dec 13, 2013
  
  b4015f90
13 Dec, 2013 2 commits

Fix erroneous error messages when running gang scheduling. · 206dc223
Danny Auble authored Dec 13, 2013

206dc223

Fix slurmstepd race condition causing abort · be703c47

Morris Jette authored Dec 13, 2013

Fix slurmstepd race condition when separate threads are reading and
modifying the job's environment, which can result in the slurmstepd failing
with an invalid memory reference. Observed at shutdown when trying
to run the task epilog and trying to read the env var:
SLURM_STEP_KILLED_MSG_NODE_ID

be703c47

12 Dec, 2013 1 commit

slurmstepd variable initialization · 06b41cdc

Morris Jette authored Dec 11, 2013

Without this patch, free() is called on a random memory location
(i.e. whatever is on the stack), which can result in slurmstepd
dying and a completed job not being purged in a timely fashion.

06b41cdc
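
The class of bug described in 06b41cdc can be illustrated with a minimal sketch. The function below is hypothetical and is not the slurmstepd code; it only shows why a pointer that may later be passed to free() should start out as NULL.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies src into a temporary heap buffer when src
 * is non-NULL and returns the number of bytes copied (0 otherwise). */
static size_t copy_if_present(const char *src)
{
	char *buf = NULL;	/* initialized, so free(buf) is always safe */
	size_t len = 0;

	if (src) {
		len = strlen(src);
		buf = malloc(len + 1);
		if (buf)
			memcpy(buf, src, len + 1);
		else
			len = 0;
	}

	/* Without the "= NULL" initializer, this call would hand an
	 * indeterminate stack value to free() whenever src is NULL. */
	free(buf);
	return len;
}
```

Calling free() on NULL is defined to do nothing, which is what makes the initializer a sufficient guard here.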

11 Dec, 2013 2 commits
- HDF5 - Fix minor memory leak. · 960dd5cd
  Danny Auble authored Dec 11, 2013
  
  960dd5cd
- Fix race condition in auth cred create · 925bcfb6
  Morris Jette authored Dec 11, 2013
```
Fix race condition in authentication credential creation that could corrupt
memory. (NOTE: This race condition has existed since 2003 and would be
exceedingly rare.)
```
  925bcfb6
09 Dec, 2013 2 commits

Modify squeue to support longer job ID values · 17f27007
Morris Jette authored Dec 09, 2013
```
This is needed for job arrays with discontiguous task ID values
(e.g. "123_[1,3,5,...99999]")
```
17f27007

Improve sview support for job arrays · d998640f

Morris Jette authored Dec 09, 2013

Previously job arrays were only listed with their native job ID
(e.g. 123_0 listed as 123, 123_1 as 124, etc). Now lists the job ID
using both formats (e.g. "123_1 (124)"). The same format is used
for job step IDs (e.g. "123_1.2 (124.2)").

d998640f
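
The display format described in d998640f can be sketched as follows. The function names and parameters are hypothetical and are not taken from sview; only the output layout comes from the commit message.

```c
#include <stdio.h>

/* Format an array task as "<array_job>_<task> (<native_job>)". */
static int format_array_job(char *buf, size_t size, unsigned array_job_id,
			    unsigned task_id, unsigned job_id)
{
	return snprintf(buf, size, "%u_%u (%u)",
			array_job_id, task_id, job_id);
}

/* Format a step of an array task the same way, with the step ID
 * appended to both forms. */
static int format_array_step(char *buf, size_t size, unsigned array_job_id,
			     unsigned task_id, unsigned job_id,
			     unsigned step_id)
{
	return snprintf(buf, size, "%u_%u.%u (%u.%u)",
			array_job_id, task_id, step_id, job_id, step_id);
}
```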

08 Dec, 2013 1 commit
- Describe previous commit in NEWS · b19bd476
  jette authored Dec 07, 2013
  
  b19bd476
07 Dec, 2013 2 commits
- Fix srun hang when IO fails to start at launch. · f98413b1
  Danny Auble authored Dec 06, 2013
  
  f98413b1
- Handle in the API if parts of the node structure are NULL. · 5d3b1d4e
  Philip D. Eckert authored Dec 06, 2013
  
  5d3b1d4e
06 Dec, 2013 2 commits
- Added ApbasilTimeout parameter to the cray.conf · 270f696e
  Trofinoff Stephen authored Dec 06, 2013
```
This adds a mechanism to kill a hung apbasil command
```
  270f696e
- Fix for hwloc returning zero core count · ec4df3bf
  Jason Bacon authored Dec 06, 2013
  
  ec4df3bf
05 Dec, 2013 1 commit
- Remove notice of CVE with very old/deprecated versions of Slurm in · 723fb063
  Danny Auble authored Dec 05, 2013
```
news.html.
```
  723fb063
04 Dec, 2013 1 commit
- jobcomp/filetxt - Reopen the file on SIGHUP · fd7c9360
  Morris Jette authored Dec 03, 2013
```
Previous logic never reopened the file, preventing proper functioning
of logrotate.
```
  fd7c9360
03 Dec, 2013 3 commits

Improve REQUEST_JOB_INFO_SINGLE RPC performance · 80d3b343
Morris Jette authored Dec 03, 2013
```
Use hash function to locate job records for improved performance.
```
80d3b343
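
A hash lookup of the kind 80d3b343 mentions might look like the sketch below. The table size, the modulo hash, and the names `job_rec`, `job_hash_add`, and `job_hash_find` are assumptions chosen for illustration, not the actual slurmctld data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define JOB_HASH_SIZE 16	/* hypothetical; kept small for illustration */

struct job_rec {
	uint32_t job_id;
	struct job_rec *hash_next;	/* chain of jobs in the same bucket */
};

static struct job_rec *job_hash[JOB_HASH_SIZE];

/* Insert a job at the head of its bucket's chain. */
static void job_hash_add(struct job_rec *job)
{
	unsigned inx = job->job_id % JOB_HASH_SIZE;

	job->hash_next = job_hash[inx];
	job_hash[inx] = job;
}

/* Walk only one bucket instead of scanning every job record. */
static struct job_rec *job_hash_find(uint32_t job_id)
{
	struct job_rec *job = job_hash[job_id % JOB_HASH_SIZE];

	while (job && job->job_id != job_id)
		job = job->hash_next;
	return job;
}
```

With chaining, two job IDs that land in the same bucket (for example 123 and 139 with 16 buckets) are both still found, at the cost of a short list walk.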

Improve REQUEST_JOB_INFO_SINGLE RPC performance · 14bcfe58

Morris Jette authored Dec 03, 2013

Change partition write lock to a read lock as we use a different
mechanism for hidden partitions in getting individual jobs.

14bcfe58

Correct job dependency string · 08265c03

Morris Jette authored Dec 02, 2013

Correct logic returning remaining job dependencies in job information
reported by scontrol and squeue. Eliminates vestigial descriptors with
no job ID values (e.g. "afterany"). As dependencies are removed, the
job ID values were removed from the strings, but not the descriptors.
This eliminates both. It also checks the full job ID to make sure we do
not remove "afterany:1234" when job "123" completes.

08265c03
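
The matching rule described in 08265c03 can be sketched as below. This is an illustration under simplifying assumptions, not the slurmctld logic: the dependency string is assumed to be a comma-separated list of `type:id` entries with one job ID each, and `drop_dependency` is a hypothetical name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy deps into out, dropping every entry whose job ID equals done_id
 * and every entry that has a descriptor but no job ID. */
static void drop_dependency(const char *deps, unsigned long done_id,
			    char *out, size_t out_size)
{
	size_t used = 0;

	if (out_size == 0)
		return;
	out[0] = '\0';

	while (*deps) {
		size_t len = strcspn(deps, ",");	/* length of this entry */
		const char *colon = memchr(deps, ':', len);
		int keep = 0;

		if (colon && colon + 1 < deps + len) {
			char *end;
			unsigned long id = strtoul(colon + 1, &end, 10);

			/* Compare the complete ID: the parse must reach the
			 * end of the entry, so "1234" never matches 123. */
			keep = !(end == deps + len && id == done_id);
		}
		if (keep) {
			int n = snprintf(out + used, out_size - used, "%s%.*s",
					 used ? "," : "", (int)len, deps);

			if (n < 0 || (size_t)n >= out_size - used) {
				out[used] = '\0';	/* out of room: stop */
				return;
			}
			used += (size_t)n;
		}
		deps += len;
		if (*deps == ',')
			deps++;
	}
}
```

When job 123 completes, `"afterany:123,afterok:1234"` becomes `"afterok:1234"`: the finished job's descriptor and ID are removed together, and the longer ID is left alone.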

02 Dec, 2013 2 commits

fix race condition in batch exit code · 6d1d932b

Morris Jette authored Dec 02, 2013

Fix race condition on batch job termination that could result in a job exit
code of 0xfffffffe if the slurmd on node zero registers its active jobs at
the same time that slurmstepd is recording the job's exit code.
bug 535

6d1d932b

Fixed sh5util loop when there are no node-step files. · fff922e9
David Bigagli authored Dec 02, 2013

fff922e9

29 Nov, 2013 2 commits

proctrack/cgroup - Add lock preventing race condition · 3f6d9e36

Morris Jette authored Nov 29, 2013

proctrack/cgroup - Add locking to prevent race condition where one job step
is ending for a user or job at the same time another job step is starting
and the user or job container is deleted from under the starting job step.
bug 447

3f6d9e36

Performance improvement for shared resources · 74d1a4b4

David Bigagli authored Nov 29, 2013

Substantial performance improvement for systems with Shared=YES or FORCE
and large numbers of running jobs (replace bubble sort with quick sort).
Bug 525

74d1a4b4
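
The kind of change 74d1a4b4 describes, swapping a quadratic sort for a library sort, can be sketched with the C standard library's qsort(). The record type and the sort key are hypothetical, and qsort() is not required to be a quicksort internally, although common implementations run in roughly O(n log n) time.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-job record; the real sort key is not given
 * in the commit message. */
struct job_usage {
	uint32_t job_id;
	uint32_t cpus;
};

/* Order by CPU count, largest first. */
static int cmp_cpus_desc(const void *a, const void *b)
{
	const struct job_usage *x = a, *y = b;

	return (x->cpus < y->cpus) - (x->cpus > y->cpus);
}

static void sort_jobs(struct job_usage *jobs, size_t n)
{
	qsort(jobs, n, sizeof(jobs[0]), cmp_cpus_desc);
}
```

The comparator avoids subtracting the two unsigned values directly, which could wrap around and report the wrong order.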

27 Nov, 2013 1 commit

Correct a job's GRES allocation data in accounting · 1ae427dd

Morris Jette authored Nov 27, 2013

Original code worked only for Cray systems. For other systems it
set gres_alloc to the total number of each GRES allocated on each
node to any job

1ae427dd

26 Nov, 2013 1 commit
- Add cpu_load to the node info available via Perl API · 6830d2ac
  Chris Scheller authored Nov 26, 2013
  
  6830d2ac
14 Nov, 2013 1 commit
- job_submit/lua - add cpus_per_task field · f7777cd5
  Morris Jette authored Nov 14, 2013
```
bug 511
```
  f7777cd5
13 Nov, 2013 1 commit

Corrections to advanced reservation logic with overlapping jobs. · d6954b77

Morris Jette authored Nov 13, 2013

This might have worked fine for core reservations or when there
are sufficient idle nodes to use, but the select_g_resv_test()
function clears the node bitmap for nodes that it can not use
and the reservation create logic did not restore that bitmap
after a failed resource selection attempt. This logic restores
the node bitmap on a failed call to select_g_resv_test() so we
can add nodes to the bitmap of available nodes rather than having
it repeatedly cleared.
The logic also adds some performance enhancements that I will
add to in the next commit.

d6954b77
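
The save-and-restore pattern described in d6954b77 can be sketched as follows. A 64-bit mask stands in for the node bitmap, and `select_test` is a hypothetical stand-in that mimics only the destructive behaviour the message attributes to select_g_resv_test(); neither function is Slurm code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical selection test: clears bits for nodes it cannot use and
 * succeeds only if at least `need` usable nodes remain. */
static bool select_test(uint64_t *avail, uint64_t usable, int need)
{
	int cnt = 0;

	*avail &= usable;	/* destructive: unusable nodes are cleared */
	for (uint64_t b = *avail; b; b &= b - 1)
		cnt++;		/* count the bits that are still set */
	return cnt >= need;
}

/* Wrapper that restores the caller's bitmap when selection fails. */
static bool try_select(uint64_t *avail, uint64_t usable, int need)
{
	uint64_t saved = *avail;

	if (select_test(avail, usable, need))
		return true;
	*avail = saved;		/* restore so nodes can be added for a retry */
	return false;
}
```

After a failed attempt the caller still holds its original set of candidate nodes, so it can add more nodes and call `try_select` again instead of starting from a cleared bitmap.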

08 Nov, 2013 1 commit
- Add news about oom check for task/cgroup and minor formatting · 51862f56
  Danny Auble authored Nov 07, 2013
  
  51862f56