    restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
    Hongjia Cao authored
    We're having some trouble getting our Slurm jobs to restart
    successfully after a checkpoint.  For this test, I'm using a simple,
    single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
    I'm submitting the job using sbatch:
    
    $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
    
    I am able to create the checkpoint and vacate the node:
    
    $ scontrol checkpoint create 137
    .... time passes ....
    $ scontrol vacate 137
    
    At that point, I see the checkpoint file from BLCR in the current
    directory and the checkpoint file from Slurm
    in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
    restart the job:
    
    $ scontrol checkpoint restart 137
    scontrol_checkpoint error: Node count specification invalid
    
    In slurmctld's log (at level 7) I see:
    
    [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
    [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
    [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
    [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
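
    The log points at a max_nodes value of 0 in the job descriptor
    restored from the checkpoint, and the commit title says the fix is to
    restore max_nodes to NO_VAL when the job is checkpointed.  The sketch
    below only illustrates that idea and is not the actual patch: the
    struct, the function names, and the SKETCH_NO_VAL macro are
    hypothetical stand-ins, and the validity check is inferred from the
    "_job_create: max_nodes == 0" log line rather than copied from the
    Slurm source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match NO_VAL in slurm.h; redefined so the sketch stands alone. */
#define SKETCH_NO_VAL 0xfffffffeU

/* Hypothetical, cut-down stand-in for the job descriptor. */
struct sketch_job_desc {
    uint32_t min_nodes;
    uint32_t max_nodes;
};

/* Models the rejection seen in the log: a max_nodes of 0 is treated as an
 * invalid node count, while "no value" is accepted as "no upper limit". */
bool sketch_node_count_valid(const struct sketch_job_desc *desc)
{
    return desc->max_nodes != 0;
}

/* Before the descriptor is saved with the checkpoint, turn an unset
 * max_nodes (stored as 0) back into the "no value" sentinel. */
void sketch_prepare_for_checkpoint(struct sketch_job_desc *desc)
{
    if (desc->max_nodes == 0)
        desc->max_nodes = SKETCH_NO_VAL;
}
```

    Under these assumptions, a descriptor saved with max_nodes == 0 fails
    the check on restart, while one passed through
    sketch_prepare_for_checkpoint() first passes it.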