1. 05 Jun, 2013 1 commit
    • priority/multifactor2 - Prevent possible divide by zero. · fc3997f9
      Janne Blomqvist authored
      Andy Wettstein (University of Chicago) reported privately to me that slurmctld
      2.5.4 crashed with a division by zero error after he enabled the
      priority/multifactor2 plugin.
      
      I was able to reproduce the crash by creating an account hierarchy where all
      the accounts and users had zero shares.
      See bug 315.
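      The failure mode described above can be illustrated with a minimal
      sketch. It is not the plugin's actual C code: `normalized_shares` and
      its arguments are hypothetical names, and the choice to return 0.0 for
      an all-zero hierarchy is an assumption made for illustration.

```python
def normalized_shares(shares, sibling_shares):
    """Return this association's fraction of its siblings' total shares.

    Hypothetical helper: a zero total yields 0.0 instead of dividing by zero.
    """
    total = sum(sibling_shares)
    if total == 0:
        # Every sibling (including this one) has zero shares, so there is
        # nothing to divide by; treat the fraction as zero.
        return 0.0
    return shares / total


# A hierarchy where every account has zero shares, as in the reproduction case.
print(normalized_shares(0, [0, 0, 0]))
# A hierarchy with non-zero shares is unaffected by the guard.
print(normalized_shares(2, [2, 6]))
```

      The guard sits before the division, so a hierarchy with non-zero
      shares takes the same path as before.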
  2. 04 Jun, 2013 3 commits
  3. 03 Jun, 2013 2 commits
    • Fix for job step allocation with required hostlist and exclusive option · 523b1992
      jette authored
      Previously, if the required node had no available CPUs left, other
      nodes in the job allocation would be used instead.
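      A toy model of the intended behaviour, assuming the step should be
      refused rather than silently moved to other nodes. The function
      `pick_step_nodes` and its data layout are hypothetical; the real logic
      lives in slurmctld's C code and may defer the step instead of failing.

```python
def pick_step_nodes(required, avail_cpus):
    """Select nodes for an exclusive job step with a required host list.

    required:   list of node names the step must run on
    avail_cpus: dict mapping each node in the job allocation to its free CPUs
    """
    for node in required:
        if avail_cpus.get(node, 0) == 0:
            # Do not fall back to other nodes in the allocation.
            raise RuntimeError(f"required node {node} has no available CPUs")
    return list(required)


# n1 is fully used, so a step that requires n1 is refused even though n2 is free.
try:
    pick_step_nodes(["n1"], {"n1": 0, "n2": 4})
except RuntimeError as err:
    print(err)
```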
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our slurm jobs to successfully
      restart after a checkpoint.  For this test, I'm using sbatch and a
      simple, single-threaded executable.  Slurm is 2.5.4, blcr is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
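      The idea in the commit title can be sketched as follows. This is an
      illustration only: the dictionary-based descriptor and
      `restore_max_nodes` are hypothetical, and the assumption that an unset
      limit is saved as 0 is inferred from the "max_nodes == 0" log line
      above. `NO_VAL` uses the value Slurm defines for its 32-bit "not set"
      sentinel.

```python
NO_VAL = 0xFFFFFFFE  # Slurm's "no value set" sentinel for 32-bit fields


def restore_max_nodes(desc):
    """Return a copy of a job descriptor with an unset max_nodes restored.

    Assumption: the saved descriptor stores "unset" as 0, which the job
    creation path rejects as an invalid node count.
    """
    restored = dict(desc)
    if restored.get("max_nodes", 0) == 0:
        # Mark the limit as "not specified" instead of "zero nodes".
        restored["max_nodes"] = NO_VAL
    return restored


saved = {"job_id": 137, "min_nodes": 1, "max_nodes": 0}
print(restore_max_nodes(saved)["max_nodes"] == NO_VAL)
```

      An explicit limit, such as max_nodes set to 4, passes through
      unchanged, so only the unset case is rewritten.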
  4. 30 May, 2013 1 commit
  5. 29 May, 2013 1 commit
  6. 23 May, 2013 8 commits
  7. 22 May, 2013 5 commits
  8. 21 May, 2013 1 commit
  9. 18 May, 2013 2 commits
  10. 16 May, 2013 2 commits
  11. 14 May, 2013 2 commits
  12. 13 May, 2013 2 commits
  13. 11 May, 2013 1 commit
  14. 10 May, 2013 1 commit
    • correctly set alloc state of node in select/linear · 0ef764b5
      Hongjia Cao authored
      Fix for the following problem:
      if a node is excised from a job and a reconfiguration (e.g., updating a
      partition) is performed while the job is still running, the node is left
      in the idle state but is no longer available until the next
      reconfiguration or restart of slurmctld after the job has finished.
  15. 08 May, 2013 3 commits
  16. 07 May, 2013 1 commit
  17. 04 May, 2013 1 commit
  18. 03 May, 2013 1 commit
    • Make test more robust · 2592eb5e
      jette authored
      Make test work if the current working directory is not in the search path
      Check for appropriate task rank on POE-based systems
      Disable the entire test on POE systems
  19. 02 May, 2013 2 commits