1. 02 Oct, 2013 1 commit
  2. 23 Sep, 2013 1 commit
  3. 13 Aug, 2013 1 commit
      select/cons_res - Avoid extraneous "oversubscribe" error messages · 302d8b3f
      jette authored
      This problem was reported by Harvard University and could be
      reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id".
      With other job types, the error message could be logged many times
      for each job. This change logs the error once per job and only if
      the job request does not include the -O/--overcommit option.
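
The behavior described above (log once per job, and never for jobs that request overcommit) can be sketched as follows. This is a minimal illustration in Python, not the select/cons_res code itself: the function name, the `already_warned` set, and the message text are all hypothetical.

```python
import logging

logger = logging.getLogger("cons_res_sketch")  # hypothetical logger name


def warn_oversubscribe_once(job_id, overcommit, already_warned):
    """Log the oversubscribe error at most once per job.

    Jobs submitted with -O/--overcommit never trigger the message.
    Returns True if a message was logged, False otherwise.
    """
    if overcommit:
        # The user explicitly asked for overcommit, so this is not an error.
        return False
    if job_id in already_warned:
        # Already reported for this job; stay quiet on later passes.
        return False
    already_warned.add(job_id)
    # Message text is illustrative only, not the real Slurm log line.
    logger.error("job %s: oversubscribe", job_id)
    return True
```

Keeping the "already warned" state per job is what prevents repeated messages when the same job is evaluated many times.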
  4. 05 Jul, 2013 1 commit
  5. 28 Jun, 2013 1 commit
  6. 25 Jun, 2013 1 commit
  7. 21 Jun, 2013 2 commits
  8. 12 Jun, 2013 1 commit
  9. 10 Jun, 2013 1 commit
  10. 06 Jun, 2013 1 commit
  11. 05 Jun, 2013 3 commits
  12. 04 Jun, 2013 2 commits
  13. 03 Jun, 2013 1 commit
      restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our slurm jobs to successfully
      restart after a checkpoint.  For this test, I'm using sbatch and a
      simple, single-threaded executable.  Slurm is 2.5.4, blcr is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
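
The log above shows the restart being rejected because the saved job descriptor carried `max_nodes == 0`, and the commit title says the fix restores `max_nodes` to `NO_VAL` when the job is checkpointed. The patch itself is not shown here, so the following Python sketch only illustrates the idea under that assumption. The dictionary layout and both function names are hypothetical; `NO_VAL` uses the value of Slurm's "not set" sentinel for 32-bit fields.

```python
NO_VAL = 0xFFFFFFFE  # Slurm's "value not set" sentinel for 32-bit fields


def normalize_max_nodes(job_desc):
    """Return a copy of the descriptor with an unset max_nodes mapped to NO_VAL.

    Assumption: a stored value of 0 means "no maximum was requested".
    """
    fixed = dict(job_desc)
    if fixed.get("max_nodes", 0) == 0:
        fixed["max_nodes"] = NO_VAL
    return fixed


def validate_node_count(job_desc):
    """Mimic the check behind "_job_create: max_nodes == 0"."""
    return job_desc["max_nodes"] != 0
```

With this normalization applied before the descriptor is saved, a job submitted without an explicit node maximum would pass the node count check on restart instead of failing with "Node count specification invalid".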
  14. 30 May, 2013 1 commit
  15. 29 May, 2013 1 commit
  16. 23 May, 2013 4 commits
  17. 22 May, 2013 2 commits
  18. 18 May, 2013 1 commit
  19. 16 May, 2013 2 commits
  20. 14 May, 2013 1 commit
  21. 13 May, 2013 1 commit
  22. 08 May, 2013 2 commits
  23. 02 May, 2013 2 commits
  24. 01 May, 2013 4 commits
  25. 30 Apr, 2013 1 commit
  26. 29 Apr, 2013 1 commit
      Fix AdminHold bug with reservations · f7c388ba
      Morris Jette authored
      Avoid placing pending jobs in AdminHold state due to backfill scheduler
      interactions with advanced reservations.
      Specifically, if the backfill scheduler tested whether a pending job
      could be scheduled after its advanced reservation ended, the job was
      assigned a priority of zero (AdminHold).