- 04 Jun, 2013 1 commit
-
-
Morris Jette authored
Still needs more work
-
- 03 Jun, 2013 11 commits
-
-
Morris Jette authored
Conflicts: META NEWS
-
Morris Jette authored
-
Morris Jette authored
-
Rod Schultz authored
-
Morris Jette authored
-
jette authored
Previously, if the required node had no available CPUs left, other nodes in the job allocation would be used.
-
Danny Auble authored
-
David Bigagli authored
-
Nathan Yee authored
test1.70 Validates that srun standard input and output work with binary files.
test1.71 Validates that srun exit code matches that of a test program.
-
Morris Jette authored
-
Hongjia Cao authored
We're having some trouble getting our slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5.

I'm submitting the job using sbatch:

    $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

    $ scontrol checkpoint create 137
    .... time passes ....
    $ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:

    $ scontrol checkpoint restart 137
    scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

    [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
    [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
    [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
    [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 31 May, 2013 11 commits
-
-
jette authored
-
Danny Auble authored
-
Danny Auble authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Rename slurm_step_ctx_params_t field from "mem_per_cpu" to "pn_min_memory". A job step now accepts a memory specification on either a per-CPU or a per-node basis.
-
Danny Auble authored
-
Martin Perry authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 30 May, 2013 17 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Yiannis Georgiou authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
calls only took ~1500 usec to complete. Since this is out of band, it shouldn't be that big of a deal.
-
Danny Auble authored
-
David Bigagli authored
-