- 03 Jun, 2013 5 commits
-
-
Danny Auble authored
-
David Bigagli authored
-
Nathan Yee authored
test1.70: Validates that srun standard input and output work with binary files.
test1.71: Validates that srun exit code matches that of a test program.
-
Morris Jette authored
-
Hongjia Cao authored
We're having some trouble getting our Slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, BLCR is 0.8.5.

I'm submitting the job using sbatch:

  $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

  $ scontrol checkpoint create 137
  .... time passes ....
  $ scontrol vacate 137

At that point, I see the checkpoint file from BLCR in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:

  $ scontrol checkpoint restart 137
  scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

  [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
  [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
  [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
  [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 31 May, 2013 11 commits
-
-
jette authored
-
Danny Auble authored
-
Danny Auble authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Rename slurm_step_ctx_params_t field from "mem_per_cpu" to "pn_min_memory". A job step now accepts its memory specification on either a per-CPU or per-node basis.
-
Danny Auble authored
-
Martin Perry authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 30 May, 2013 17 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Yiannis Georgiou authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Calls only took ~1500 usec to complete. Since this is out of band, this shouldn't be that big of a deal.
-
Danny Auble authored
-
David Bigagli authored
-
- 29 May, 2013 5 commits
-
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
The most notable problem case is on a Cray, where a job step specifically requests one or more nodes that are not the first nodes in the job allocation.
-
Nathan Yee authored
-
- 28 May, 2013 2 commits
-
-
Morris Jette authored
If node_name2bitmap() is called with best_effort=false, then do not attempt to match names with NodeHostName. Without this change, a partition that contains a NodeHostName rather than a NodeName would be configured with the first one found. On a front-end system, this would result in the partition's node_bitmap being out of sync with the actual node positions.

To reproduce the problem, configure with --enable-multiple-slurmd, then in slurm.conf define something like this:

  NodeName=foo[1-8] NodeHostName=bar ...
  PartitionName=debug Nodes=bar,foo[1-8] ...
-
Danny Auble authored
-