- 31 Jan, 2014 3 commits
-
-
David Bigagli authored
-
Danny Auble authored
i.e. salloc -n32 doesn't request a number of nodes. With the previous code, if this request used 4 nodes and only 1 was left in GrpNodes, the job would run with no issue because the limits were checked before the node count was selected. The check now happens afterwards, so the limits on nodes, CPUs, and memory are always enforced against what the job will actually use.
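The ordering issue described above can be illustrated with a minimal sketch. The struct and function names here are hypothetical and are not Slurm's actual accounting code; the only point is that the limit is compared against the node count known after selection.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical usage record for a group; not Slurm's real structure. */
typedef struct {
    uint32_t grp_nodes_limit; /* maximum nodes the group may use at once */
    uint32_t grp_nodes_used;  /* nodes currently in use by the group */
} grp_usage_t;

/*
 * Check the group node limit AFTER node selection, when the real node
 * count of the job is known. Returns true if the job may start.
 */
static bool grp_nodes_ok(const grp_usage_t *u, uint32_t selected_nodes)
{
    if (u->grp_nodes_used > u->grp_nodes_limit)
        return false; /* already over the limit */
    return selected_nodes <= u->grp_nodes_limit - u->grp_nodes_used;
}
```

With a limit of 8 nodes and 7 in use, a job that was placed on 4 nodes is refused, while a job placed on 1 node is allowed.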
-
Morris Jette authored
Fix step allocation when some CPUs are not available due to memory limits. This happens when one step is active and using memory that blocks the scheduling of another step on a portion of the CPUs needed. The new step is now delayed rather than aborting with "Requested node configuration is not available". bug 577
-
- 28 Jan, 2014 1 commit
-
-
Danny Auble authored
based on ionode count correctly on slurmctld restart.
-
- 23 Jan, 2014 2 commits
-
-
Danny Auble authored
connect in a loop instead of producing a fatal.
-
Danny Auble authored
-
- 21 Jan, 2014 2 commits
-
-
David Bigagli authored
-
David Bigagli authored
This reverts commit 2fa28eb6. Conflicts: NEWS
-
- 18 Jan, 2014 1 commit
-
-
David Bigagli authored
data correctly accumulating differences between sampling intervals. Fix the data structure mismatch between acct_gather_filesystem_lustre.c and slurm_jobacct_gather.h which caused the hdf5 plugin to log incorrect data.
-
- 16 Jan, 2014 2 commits
-
-
David Bigagli authored
the srun help.
-
David Bigagli authored
network traffic accounting plugin.
-
- 15 Jan, 2014 1 commit
-
-
Danny Auble authored
add/remove columns. Caused by commit 68f0f5db.
-
- 13 Jan, 2014 2 commits
-
-
Morris Jette authored
Do not reset a job's priority when the slurmctld restarts if previously set to some specific value. bug 561
-
John Morrissey authored
groups.
-
- 08 Jan, 2014 3 commits
-
-
David Bigagli authored
-
David Bigagli authored
This reverts commit 3464295e.
-
David Bigagli authored
-
- 07 Jan, 2014 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
Do not mark the node DOWN if its memory or tmp disk space is lower than configured; just log it using the debug message type.
-
- 06 Jan, 2014 2 commits
-
-
Morris Jette authored
If a job is explicitly suspended, its priority is set to zero. This resets the priority when requeued and also documents that if the job is requeued (e.g. due to a node failure), then it is placed in a held state.
-
Morris Jette authored
Without this patch, the job's RunTime includes its RunTime from before its prior suspend (i.e. the job's full RunTime rather than just the RunTime of the requeued job).
-
- 27 Dec, 2013 1 commit
-
-
Filip Skalski authored
Hello, I think I found another bug in the code (I'm using 2.6.3, but I checked the 2.6.5 and 14.03 versions and it's the same there).

In file sched/backfill/backfill.c:

1) _add_reservation function, from line 1172:

    if (placed == true) {
        j = node_space[j].next;
        if (j && (end_reserve < node_space[j].end_time)) {
            /* insert end entry record */
            i = *node_space_recs;
            node_space[i].begin_time = end_reserve;
            node_space[i].end_time = node_space[j].end_time;
            node_space[j].end_time = end_reserve;
            node_space[i].avail_bitmap = bit_copy(node_space[j].avail_bitmap);
            node_space[i].next = node_space[j].next;
            node_space[j].next = i;
            (*node_space_recs)++;
        }
        break;
    }

I drew a picture of the `node_space` state after 2 iterations (see attachment). In case where the new reservation i...
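To make the quoted block easier to follow, here is a self-contained sketch of the record split it performs. The struct is a simplified stand-in (a 64-bit mask replaces the real bitmap and `bit_copy`), and `split_record` is a hypothetical name. It shows only what the quoted lines do, not the bug reported in the truncated message.

```c
#include <stdint.h>
#include <time.h>

/* Simplified stand-in for a backfill node_space record (assumption). */
typedef struct {
    time_t   begin_time;
    time_t   end_time;
    uint64_t avail_bitmap; /* one bit per node; the real code uses a bitmap type */
    int      next;         /* index of the next record, 0 ends the list */
} node_space_sketch_t;

/*
 * Split record j at end_reserve: j keeps [begin_time, end_reserve) and a
 * new record appended at index *recs covers [end_reserve, old end_time).
 * The caller must ensure the array has room for one more record.
 * Returns the index of the new record, or -1 if no split is needed.
 */
static int split_record(node_space_sketch_t *ns, int *recs, int j,
                        time_t end_reserve)
{
    if (end_reserve >= ns[j].end_time)
        return -1;

    int i = *recs;
    ns[i].begin_time   = end_reserve;
    ns[i].end_time     = ns[j].end_time;
    ns[j].end_time     = end_reserve;
    ns[i].avail_bitmap = ns[j].avail_bitmap; /* copy, as bit_copy() does */
    ns[i].next         = ns[j].next;
    ns[j].next         = i;
    (*recs)++;
    return i;
}
```

Splitting a single record covering times 0 to 100 at 60 leaves record 0 covering 0 to 60 and linked to a new record 1 covering 60 to 100 with the same availability mask.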
-
- 23 Dec, 2013 2 commits
-
-
Morris Jette authored
-
David Bigagli authored
-
- 20 Dec, 2013 2 commits
-
-
Danny Auble authored
for better debug
-
Danny Auble authored
midplane block that starts on a higher coordinate than it ends (i.e. if a block has midplanes [0010,0013], 0013 is the start even though it is listed second in the hostlist).
-
- 19 Dec, 2013 1 commit
-
-
Morris Jette authored
It has been changed to improve the calculated value for pending jobs and use the actual node count value for jobs that have been started (including suspended, completed, etc.). bug 549
-
- 18 Dec, 2013 1 commit
-
-
Danny Auble authored
being in error.
-
- 17 Dec, 2013 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
will return ENOTCONN and not initialize the addr_str causing valgrind errors.
-
- 16 Dec, 2013 1 commit
-
-
Hughes, Doug authored
This allows multiple job IDs to be passed to hold, uhold, resume, suspend, release, etc.
-
- 14 Dec, 2013 1 commit
-
-
Danny Auble authored
-
- 13 Dec, 2013 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
Fix slurmstepd race condition when separate threads are reading and modifying the job's environment, which can result in the slurmstepd failing with an invalid memory reference. Observed at shutdown when trying to run the task epilog and trying to read the env var: SLURM_STEP_KILLED_MSG_NODE_ID
-
- 12 Dec, 2013 1 commit
-
-
Morris Jette authored
Without this patch, free() is called on a random memory location (i.e. whatever is on the stack), which can result in slurmstepd dying and a completed job not being purged in a timely fashion.
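The class of bug described above (calling free() on a pointer that was never assigned) is commonly avoided by initializing the pointer to NULL before any early exit. The sketch below is a generic illustration with a hypothetical helper, not the patched slurmstepd code.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: duplicate src into *out, returning 0 on success
 * and -1 on failure. buf starts as NULL so the shared cleanup path is
 * always safe, because free(NULL) is defined to do nothing.
 */
static int copy_message(const char *src, char **out)
{
    char *buf = NULL; /* without this, an early goto would free stack garbage */
    int rc = -1;

    if (src == NULL)
        goto cleanup;

    buf = malloc(strlen(src) + 1);
    if (buf == NULL)
        goto cleanup;
    strcpy(buf, src);

    *out = buf;
    buf = NULL; /* ownership moved to the caller */
    rc = 0;

cleanup:
    free(buf);
    return rc;
}
```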
-
- 11 Dec, 2013 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
Fix race condition in authentication credential creation that could corrupt memory. (NOTE: This race condition has existed since 2003 and would be exceedingly rare.)
-
- 09 Dec, 2013 2 commits
-
-
Morris Jette authored
This is needed for job arrays with discontiguous task ID values (e.g. "123_[1,3,5,...99999]")
-
Morris Jette authored
Previously, job arrays were only listed with their native job ID (e.g. 123_0 listed as 123, 123_1 as 124, etc.). Jobs are now listed using both formats (e.g. "123_1 (124)"). The same format is used for job step IDs (e.g. "123_1.2 (124.2)").
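The combined format from the examples above can be produced with a small formatting routine. This is a sketch only: the function name, the parameter list, and the `NO_STEP` sentinel are assumptions for illustration and are not taken from the Slurm source.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NO_STEP UINT32_MAX /* hypothetical sentinel: no step ID to print */

/*
 * Write "<array_job_id>_<task_id> (<job_id>)" into buf, or the step
 * variant "<array_job_id>_<task_id>.<step_id> (<job_id>.<step_id>)".
 * Returns the value returned by snprintf().
 */
static int format_array_job_id(char *buf, size_t len, uint32_t array_job_id,
                               uint32_t task_id, uint32_t job_id,
                               uint32_t step_id)
{
    if (step_id == NO_STEP)
        return snprintf(buf, len, "%" PRIu32 "_%" PRIu32 " (%" PRIu32 ")",
                        array_job_id, task_id, job_id);

    return snprintf(buf, len,
                    "%" PRIu32 "_%" PRIu32 ".%" PRIu32
                    " (%" PRIu32 ".%" PRIu32 ")",
                    array_job_id, task_id, step_id, job_id, step_id);
}
```

Called with array job 123, task 1, and native job ID 124, this yields `123_1 (124)`; adding step 2 yields `123_1.2 (124.2)`.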
-
- 08 Dec, 2013 1 commit
-
-
jette authored
-