1. 09 Jul, 2013 3 commits
  2. 08 Jul, 2013 1 commit
  3. 06 Jul, 2013 2 commits
  4. 05 Jul, 2013 2 commits
      Correction to memory limit calculation for mem per cpu with threads · 0d537d32
      John Thiltges authored
      When using ThreadsPerCore > 1, it appears that DefMemPerCPU is being
      scaled by slurmctld, but not by slurmd/slurmstepd.
      
      For example, we set ThreadsPerCore=2 and DefMemPerCPU=100. Running a
      single core job, we would expect two threads to be allocated and AllocMem
      on the assigned node to increase by 200MB. scontrol reports that AllocMem
      increased by 200MB, but the task/cgroup plugin only sees 100MB of RAM.
      
      It looks like the problem may lie in common/slurm_cred.c:format_core_allocs().
      The function counts the job/step cores and multiplies the mem_limit values,
      but it does not scale the CPU count like in slurmd/slurmd/req.c:_check_job_credential().
      See bug 309
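
The arithmetic in the example above can be sketched as follows. This is only an illustration of the reported mismatch, not the Slurm code: the function names are hypothetical, and the two formulas are assumptions that reproduce the 200MB and 100MB figures from the report.

```python
def scaled_mem_limit_mb(cores, threads_per_core, def_mem_per_cpu_mb):
    """Limit when every hardware thread counts as a CPU (what scontrol reported)."""
    cpus = cores * threads_per_core
    return cpus * def_mem_per_cpu_mb


def unscaled_mem_limit_mb(cores, def_mem_per_cpu_mb):
    """Limit when only cores are counted (what the task/cgroup plugin saw)."""
    return cores * def_mem_per_cpu_mb


# Values from the report: ThreadsPerCore=2, DefMemPerCPU=100, single-core job.
controller_view = scaled_mem_limit_mb(cores=1, threads_per_core=2, def_mem_per_cpu_mb=100)
node_view = unscaled_mem_limit_mb(cores=1, def_mem_per_cpu_mb=100)
print(controller_view, node_view)
```

The two views differ by exactly the ThreadsPerCore factor, which is consistent with the suspicion that one code path skips the CPU-count scaling.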
  5. 28 Jun, 2013 3 commits
  6. 26 Jun, 2013 2 commits
  7. 25 Jun, 2013 3 commits
  8. 24 Jun, 2013 1 commit
      Modify slurmctld locking to improve performance · ba58d59c
      jette authored
      Under very heavy load with many thousands of batch job submissions
      or job signals, the write lock can be held for very long periods of
      time preventing job scheduling, squeue response, etc. This code
      inserts a timing break to permit other functions to get the locks.
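
The idea of a timing break can be sketched roughly as below. This is a minimal illustration, not the slurmctld change itself: a plain mutex stands in for slurmctld's locks, and `process_with_breaks`, `handle`, and `max_hold_s` are hypothetical names chosen for the example.

```python
import threading
import time


def process_with_breaks(items, lock, handle, max_hold_s=0.05):
    """Handle items under `lock`, releasing it briefly once it has been
    held for at least `max_hold_s` seconds so waiters can get in.
    Returns the number of breaks taken."""
    breaks = 0
    lock.acquire()
    held_since = time.monotonic()
    try:
        for item in items:
            handle(item)
            if time.monotonic() - held_since >= max_hold_s:
                lock.release()
                breaks += 1
                time.sleep(0)  # yield so another thread can take the lock
                lock.acquire()
                held_since = time.monotonic()
    finally:
        lock.release()
    return breaks


lock = threading.Lock()
handled = []
# A hold time of 0 forces a break after every item, which makes the effect visible.
breaks = process_with_breaks(["job1", "job2", "job3"], lock, handled.append, max_hold_s=0.0)
print(breaks, handled)
```

The trade-off is that a long batch takes slightly longer overall, while other requests no longer have to wait for the whole batch to finish.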
  9. 21 Jun, 2013 4 commits
  10. 18 Jun, 2013 2 commits
  11. 12 Jun, 2013 1 commit
  12. 10 Jun, 2013 1 commit
  13. 07 Jun, 2013 1 commit
  14. 06 Jun, 2013 1 commit
  15. 05 Jun, 2013 5 commits
  16. 04 Jun, 2013 3 commits
  17. 03 Jun, 2013 2 commits
      Start NEWS for v2.5.8 · c795724d
      Morris Jette authored
      restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to restart
      successfully after a checkpoint. For this test, I'm using a
      simple, single-threaded executable. Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from blcr in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
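
The log suggests that the saved job description reached job creation with max_nodes set to 0 rather than to the "not set" sentinel. The toy model below shows why mapping 0 back to NO_VAL would let the restart pass validation. It is a sketch under assumptions, not the actual change in f82e0fb8: the dictionary layout and both function names are hypothetical, and only the NO_VAL value (0xfffffffe) comes from Slurm's headers.

```python
NO_VAL = 0xFFFFFFFE  # Slurm's "value not set" sentinel for 32-bit fields


def validate_node_counts(min_nodes, max_nodes):
    """Toy check: NO_VAL means no upper limit; an explicit 0 is invalid."""
    if max_nodes == NO_VAL:
        return True
    return max_nodes != 0 and max_nodes >= min_nodes


def restore_unset_max_nodes(desc):
    """Return a copy of `desc` with max_nodes == 0 mapped back to NO_VAL."""
    fixed = dict(desc)
    if fixed.get("max_nodes", NO_VAL) == 0:
        fixed["max_nodes"] = NO_VAL
    return fixed


saved = {"min_nodes": 1, "max_nodes": 0}  # as hinted by "max_nodes == 0" in the log
restored = restore_unset_max_nodes(saved)
print(validate_node_counts(saved["min_nodes"], saved["max_nodes"]))
print(validate_node_counts(restored["min_nodes"], restored["max_nodes"]))
```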
  18. 31 May, 2013 1 commit
  19. 30 May, 2013 1 commit
  20. 29 May, 2013 1 commit