1. 05 Jun, 2013 1 commit
    • priority/multifactor2 - Prevent possible divide by zero. · fc3997f9
      Janne Blomqvist authored
      Andy Wettstein (University of Chicago) reported privately to me that slurmctld
      2.5.4 crashed with a division by zero error after he enabled the
      priority/multifactor2 plugin.
      
      I was able to reproduce the crash by creating an account hierarchy where all
      the accounts and users had zero shares.
      See bug 315
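The failure mode described above can be illustrated with a small sketch. This is not the plugin's code: the function names `normalized_shares` and `level_fractions` are hypothetical, and returning 0.0 for an all-zero level is an assumed policy chosen only to show the guard.

```python
def normalized_shares(shares: int, sibling_total: int) -> float:
    """Return one association's fraction of the shares at its level.

    If every sibling has zero shares, the total is zero and a plain
    division would fail, so the guard returns 0.0 instead (assumed policy).
    """
    if sibling_total == 0:
        return 0.0
    return shares / sibling_total


def level_fractions(level: dict) -> dict:
    """Compute the share fraction of each account or user in one level."""
    total = sum(level.values())
    return {name: normalized_shares(s, total) for name, s in level.items()}


# A level where all accounts have zero shares no longer raises an error.
print(level_fractions({"acct_a": 0, "acct_b": 0}))
```

Checking the denominator once, at the point of division, keeps the guard independent of how the hierarchy was built.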
  2. 04 Jun, 2013 2 commits
  3. 03 Jun, 2013 1 commit
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to successfully
      restart after a checkpoint.  For this test, I'm using a simple,
      single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
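The log above suggests that the saved job description carried `max_nodes == 0`, which the restart path rejects. The following is a minimal sketch of the idea in the commit title, not the actual patch: `JobDesc`, `validate_node_count`, and `prepare_for_checkpoint` are hypothetical names, and the numeric value given to `NO_VAL` is an assumption used only as an "unspecified" sentinel.

```python
from dataclasses import dataclass, replace

# Sentinel meaning "not specified"; the exact value is an assumption here.
NO_VAL = 0xFFFFFFFE


@dataclass(frozen=True)
class JobDesc:
    """Hypothetical stand-in for a job description record."""
    min_nodes: int = NO_VAL
    max_nodes: int = NO_VAL


def validate_node_count(desc: JobDesc) -> bool:
    """Accept an unspecified max_nodes, reject an explicit value of 0."""
    if desc.max_nodes == NO_VAL:
        return True
    return desc.max_nodes > 0


def prepare_for_checkpoint(desc: JobDesc) -> JobDesc:
    """Return a copy whose max_nodes of 0 is restored to NO_VAL."""
    if desc.max_nodes == 0:
        return replace(desc, max_nodes=NO_VAL)
    return desc


saved = JobDesc(min_nodes=1, max_nodes=0)
print(validate_node_count(saved))                          # rejected as saved
print(validate_node_count(prepare_for_checkpoint(saved)))  # accepted after restore
```

Normalizing the field before the description is saved means the restart path can reuse the ordinary job-creation checks unchanged.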
  4. 30 May, 2013 1 commit
  5. 29 May, 2013 1 commit
  6. 23 May, 2013 4 commits
  7. 22 May, 2013 2 commits
  8. 18 May, 2013 1 commit
  9. 16 May, 2013 2 commits
  10. 14 May, 2013 1 commit
  11. 13 May, 2013 1 commit
  12. 08 May, 2013 2 commits
  13. 02 May, 2013 2 commits
  14. 01 May, 2013 4 commits
  15. 30 Apr, 2013 1 commit
  16. 29 Apr, 2013 3 commits
  17. 26 Apr, 2013 3 commits
  18. 25 Apr, 2013 1 commit
  19. 23 Apr, 2013 2 commits
  20. 19 Apr, 2013 1 commit
  21. 18 Apr, 2013 1 commit
  22. 17 Apr, 2013 3 commits