- 05 Jun, 2013 18 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Conflicts: META NEWS src/plugins/priority/multifactor2/priority_multifactor2.c
-
Morris Jette authored
-
Nathan Yee authored
-
Janne Blomqvist authored
Andy Wettstein (University of Chicago) reported privately to me that slurmctld 2.5.4 crashed with a division-by-zero error after he enabled the priority/multifactor2 plugin. I was able to reproduce the crash by creating an account hierarchy where all the accounts and users had zero shares. See bug 315.
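The failure mode in the commit message above can be sketched in a few lines. This is a hypothetical illustration and not the priority/multifactor2 code: the function `normalized_shares` and its parameters are invented names, and returning 0.0 when all shares are zero is an assumed policy chosen only for the example.

```python
def normalized_shares(shares: int, sibling_shares_total: int) -> float:
    """Return an association's fraction of the shares held by its siblings."""
    # If every account and user has zero shares, the total is zero and an
    # unguarded division would fail (ZeroDivisionError here, a crash in C).
    if sibling_shares_total == 0:
        return 0.0  # assumed fallback, for illustration only
    return shares / sibling_shares_total


print(normalized_shares(0, 0))
print(normalized_shares(1, 4))
```

The point is only that the divisor must be checked before use. The value the real plugin assigns in that case is not stated in the commit message.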
-
Danny Auble authored
-
Danny Auble authored
-
David Bigagli authored
Revert premature change of META
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
Without this change, POE appears to ignore the -procs argument, resulting in a job step request with multiple host names but only one ntask required.
-
Danny Auble authored
-
Danny Auble authored
-
David Bigagli authored
-
- 04 Jun, 2013 6 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
For example "host1*2" is equivalent to "host1,host1".
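The repetition syntax in the example above can be illustrated with a small sketch. It is not Slurm's parser: `expand_host_repeats` is a hypothetical helper that handles only comma-separated plain host names with an optional `*N` suffix.

```python
def expand_host_repeats(spec: str) -> str:
    """Expand 'name*N' entries into N copies of 'name' (illustrative only)."""
    expanded = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty entries such as a trailing comma
        name, star, count = entry.partition("*")
        # Without a '*N' suffix, the host appears exactly once.
        repeats = int(count) if star else 1
        expanded.extend([name] * repeats)
    return ",".join(expanded)


print(expand_host_repeats("host1*2"))        # host1,host1
print(expand_host_repeats("host1*2,host2"))  # host1,host1,host2
```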
-
jette authored
Without this change, POE appears to ignore the -procs argument, resulting in a job step request with multiple host names but only one ntask required.
-
Morris Jette authored
Still needs more work
-
- 03 Jun, 2013 11 commits
-
-
Morris Jette authored
Conflicts: META NEWS
-
Morris Jette authored
-
Morris Jette authored
-
Rod Schultz authored
-
Morris Jette authored
-
jette authored
Previously, if the required node had no available CPUs left, other nodes in the job allocation would be used.
-
Danny Auble authored
-
David Bigagli authored
-
Nathan Yee authored
test1.70 validates that srun standard input and output work with binary files.
test1.71 validates that srun exit code matches that of a test program.
-
Morris Jette authored
-
Hongjia Cao authored
We're having some trouble getting our slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5.

I'm submitting the job using sbatch:

    $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

    $ scontrol checkpoint create 137
    .... time passes ....
    $ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:

    $ scontrol checkpoint restart 137
    scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

    [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
    [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
    [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
    [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 31 May, 2013 5 commits
-
-
jette authored
-
Danny Auble authored
-
Danny Auble authored
-
jette authored
-
Morris Jette authored
-