1. 03 Jun, 2013 1 commit
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having trouble getting our Slurm jobs to restart successfully
      after a checkpoint.  For this test, I'm using a simple,
      single-threaded executable with Slurm 2.5.4 and BLCR 0.8.5.
      I submit the job with sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
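
The log above shows the restart being rejected because the restored job description carries max_nodes == 0, and the commit title says the fix is to restore max_nodes to NO_VAL when checkpointing. The following is a minimal sketch of that idea, not the actual patch: the struct and both function names are hypothetical, and the NO_VAL value is assumed to match the "not set" sentinel from slurm.h.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed to match Slurm's "value not set" sentinel from slurm.h. */
#define NO_VAL 0xfffffffeU

/* Hypothetical, reduced stand-in for a job description record. */
struct job_desc_sketch {
    uint32_t min_nodes;
    uint32_t max_nodes;
};

/* Hypothetical check modeled on the "max_nodes == 0" message in the log:
 * an explicit zero is rejected, while NO_VAL means "no upper limit given". */
int validate_node_count(const struct job_desc_sketch *desc)
{
    if (desc->max_nodes == 0) {
        fprintf(stderr, "max_nodes == 0\n");
        return -1;
    }
    if (desc->max_nodes != NO_VAL && desc->max_nodes < desc->min_nodes)
        return -1;
    return 0;
}

/* Hypothetical helper: before the description is saved in the checkpoint,
 * turn a zero max_nodes back into NO_VAL so a later restart passes the check. */
void restore_unset_max_nodes(struct job_desc_sketch *desc)
{
    if (desc->max_nodes == 0)
        desc->max_nodes = NO_VAL;
}
```

In this sketch, a description with min_nodes = 1 and max_nodes = 0 fails validate_node_count, and passes it once restore_unset_max_nodes has been applied.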
  2. 30 May, 2013 1 commit
  3. 29 May, 2013 1 commit
  4. 23 May, 2013 8 commits
  5. 22 May, 2013 5 commits
  6. 21 May, 2013 1 commit
  7. 18 May, 2013 2 commits
  8. 16 May, 2013 2 commits
  9. 14 May, 2013 2 commits
  10. 13 May, 2013 2 commits
  11. 11 May, 2013 1 commit
  12. 10 May, 2013 1 commit
    • correctly set alloc state of node in select/linear · 0ef764b5
      Hongjia Cao authored
      Fixes the following problem:
      if a node is excised from a job and a reconfiguration (e.g., a
      partition update) is done while the job is still running, the node is
      left in the idle state but remains unavailable until the next
      reconfiguration or restart of slurmctld after the job has finished.
  13. 08 May, 2013 3 commits
  14. 07 May, 2013 1 commit
  15. 04 May, 2013 1 commit
  16. 03 May, 2013 1 commit
    • Make test more robust · 2592eb5e
      jette authored
      Make the test work if the current working directory is not in the search path
      Check for appropriate task rank on POE based systems
      Disable the entire test on POE systems
  17. 02 May, 2013 4 commits
  18. 01 May, 2013 3 commits