- 03 Jun, 2013 1 commit
-
-
Hongjia Cao authored
We're having some trouble getting our slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5.

I'm submitting the job using sbatch:

    $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

    $ scontrol checkpoint create 137
    .... time passes ....
    $ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:

    $ scontrol checkpoint restart 137
    scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

    [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
    [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
    [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
    [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 30 May, 2013 1 commit
-
-
Morris Jette authored
Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
-
- 29 May, 2013 1 commit
-
-
jette authored
The most notable problem case is on a Cray, where a job step specifically requests one or more nodes that are not the first nodes in the job allocation.
-
- 23 May, 2013 4 commits
-
-
Morris Jette authored
The problem we have observed is that the backfill scheduler temporarily gives up its locks (for one second) but then reclaims them before the backlog of work completes, which keeps the backfill scheduler running for a very long time under heavy load. Bug 297.
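
To illustrate the behavior described above, here is a toy model, not Slurm's actual code and not necessarily the change made in this commit. It compares a fixed one-second yield with a hypothetical yield that keeps the locks released until the backlog drains or a cap is reached. The function names, the `drained_per_second` rate, and the `max_seconds` cap are all assumptions made for illustration.

```python
def fixed_yield(backlog, drained_per_second, yield_seconds=1):
    """Release the locks for a fixed time.

    Returns the amount of work still queued when the locks are reclaimed.
    """
    return max(0, backlog - drained_per_second * yield_seconds)


def adaptive_yield(backlog, drained_per_second, yield_seconds=1, max_seconds=30):
    """Keep the locks released, one slice at a time, until the backlog is
    empty or max_seconds has elapsed (hypothetical policy).

    Returns a tuple (seconds_waited, remaining_backlog).
    """
    waited = 0
    while True:
        # One slice passes with the locks released; other work drains.
        backlog = max(0, backlog - drained_per_second * yield_seconds)
        waited += yield_seconds
        if backlog == 0 or waited >= max_seconds:
            return waited, backlog


if __name__ == "__main__":
    # 50 queued requests, 10 handled per second while the locks are free.
    print(fixed_yield(50, 10))     # work left after a one-second yield
    print(adaptive_yield(50, 10))  # time waited and work left
```

With a fixed yield, most of the backlog is still pending when the scheduler takes the locks back, which matches the symptom reported here. The cap in the adaptive variant keeps the scheduler from being starved indefinitely.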
-
Morris Jette authored
-
Morris Jette authored
Defers (rather than forgets) a reboot request when a job is running on the node within a reservation.
-
Danny Auble authored
-
- 22 May, 2013 2 commits
-
-
Danny Auble authored
-
jette authored
-
- 18 May, 2013 1 commit
-
-
Danny Auble authored
all preemptable jobs on the midplane instead of just the ones it needed to.
-
- 16 May, 2013 2 commits
-
-
Morris Jette authored
This bug was introduced in commit f1cf6d2d fix for bug 290
-
Danny Auble authored
-
- 14 May, 2013 1 commit
-
-
Morris Jette authored
-
- 13 May, 2013 1 commit
-
-
Morris Jette authored
Downing the node will kill all jobs allocated to it, which is very bad on something like a BlueGene system.
-
- 08 May, 2013 2 commits
-
-
David Bigagli authored
-
Danny Auble authored
the node tab and we didn't notice.
-
- 02 May, 2013 2 commits
- 01 May, 2013 4 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Morris Jette authored
also "-euidevice sn_single".
-
Morris Jette authored
-
- 30 Apr, 2013 1 commit
-
-
Danny Auble authored
-
- 29 Apr, 2013 3 commits
-
-
Morris Jette authored
Avoid placing pending jobs in AdminHold state due to backfill scheduler interactions with advanced reservations. Specifically, if the backfill scheduler determined that a pending job could only be scheduled after its advanced reservation ends, the job was assigned a priority of zero (AdminHold).
-
Danny Auble authored
-
Danny Auble authored
undefined variable.
-
- 26 Apr, 2013 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
requested and allocated.
-
Phil Sharfstein authored
-
- 25 Apr, 2013 1 commit
-
-
Danny Auble authored
-
- 23 Apr, 2013 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
allocation as taking up the entire node instead of just part of the node allocated. And always enforce exclusive on a step request.
-
- 19 Apr, 2013 1 commit
-
-
Danny Auble authored
to attempt to signal tasks on the frontend node.
-
- 18 Apr, 2013 1 commit
-
-
Danny Auble authored
deny the job instead of holding it.
-
- 17 Apr, 2013 3 commits
-
-
Morris Jette authored
Fix for bug 268
-
Danny Auble authored
to implicitly create full system block.
-
Danny Auble authored
cpu count would be reflected correctly.
-
- 16 Apr, 2013 1 commit
-
-
Danny Auble authored
-
- 12 Apr, 2013 1 commit
-
-
Danny Auble authored
-
- 11 Apr, 2013 1 commit
-
-
Danny Auble authored
APRUN_DEFAULT_MEMORY env var for aprun. This scenario will not display the option when used with --launch-cmd.
-