1. 12 Jun, 2015 2 commits
  2. 11 Jun, 2015 1 commit
  3. 10 Jun, 2015 1 commit
  4. 09 Jun, 2015 2 commits
      Search for user in all groups · 93ead71a
      David Bigagli authored
      Fix scheduling inconsistency with GRES · e1a00772
      Morris Jette authored
      1. I submit a first job that uses 1 GPU:
      $ srun --gres gpu:1 --pty bash
      $ echo $CUDA_VISIBLE_DEVICES
      0
      
      2. while the first one is still running, a 2-GPU job asking for 1 task per node
      waits (and I don't really understand why):
      $ srun --ntasks-per-node=1 --gres=gpu:2 --pty bash
      srun: job 2390816 queued and waiting for resources
      
      3. whereas a 2-GPU job requesting 1 core per socket (so just 1 socket) actually
      gets GPUs allocated from two different sockets!
      $ srun -n 1  --cores-per-socket=1 --gres=gpu:2 -p testk --pty bash
      $ echo $CUDA_VISIBLE_DEVICES
      1,2
      
      With this change #2 works the same way as #3.
      bug 1725
  5. 05 Jun, 2015 1 commit
  6. 04 Jun, 2015 2 commits
  7. 03 Jun, 2015 1 commit
      switch/cray: Refine PMI_CRAY_NO_SMP_ENV set · ef66b2eb
      Morris Jette authored
      switch/cray: Refine logic to set PMI_CRAY_NO_SMP_ENV environment variable.
      Rather than testing for the task distribution option, test the actual
      task IDs to see if they are monotonically increasing across all nodes.
      Based upon idea from Brian Gilmer (Cray).
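The check described in this commit message can be illustrated with a minimal Python sketch. This is not the switch/cray plugin code (Slurm itself is written in C); the function name `task_ids_monotonic` and the list-of-lists layout are assumptions made for illustration only. It walks the nodes in order and tests whether the task IDs keep increasing, which is what a block-style layout produces and a cyclic layout does not. What the plugin then does with the result when setting PMI_CRAY_NO_SMP_ENV is not stated in the message, so it is left out here.

```python
from typing import Optional, Sequence


def task_ids_monotonic(tids_by_node: Sequence[Sequence[int]]) -> bool:
    """Return True if task IDs strictly increase when nodes are walked in order.

    tids_by_node is a hypothetical layout: one sequence of task IDs per node,
    in node order. It is not Slurm's actual data structure.
    """
    previous: Optional[int] = None
    for node_tids in tids_by_node:
        for tid in node_tids:
            # Any ID that does not exceed the last one seen breaks monotonicity.
            if previous is not None and tid <= previous:
                return False
            previous = tid
    return True


# Block-style layout: tasks 0,1 on the first node, 2,3 on the second.
block_layout = [[0, 1], [2, 3]]
# Cyclic-style layout: tasks alternate between the two nodes.
cyclic_layout = [[0, 2], [1, 3]]

print(task_ids_monotonic(block_layout))   # True
print(task_ids_monotonic(cyclic_layout))  # False
```

Testing the task IDs directly, rather than the distribution option the user requested, covers any layout that happens to produce the same ordering regardless of how it was specified.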
  8. 02 Jun, 2015 3 commits
  9. 01 Jun, 2015 1 commit
  10. 30 May, 2015 1 commit
  11. 29 May, 2015 5 commits
  12. 28 May, 2015 1 commit
  13. 27 May, 2015 1 commit
      Map job --mem-per-cpu=0 to --mem=0. · 33c77302
      Morris Jette authored
      However, --mem=0 now reflects the appropriate amount of memory in the
      system; --mem-per-cpu=0 hasn't changed.  This allows all the memory to
      be allocated in a cgroup without being "consumed", so it remains
      available for other jobs running on the same host.
      Eric Martin, Washington University School of Medicine
  14. 26 May, 2015 1 commit
  15. 22 May, 2015 1 commit
  16. 21 May, 2015 1 commit
  17. 20 May, 2015 2 commits
  18. 19 May, 2015 1 commit
  19. 16 May, 2015 1 commit
  20. 15 May, 2015 2 commits
  21. 14 May, 2015 2 commits
  22. 13 May, 2015 3 commits
  23. 12 May, 2015 1 commit
  24. 11 May, 2015 1 commit
      Purge old step data on job requeue · beecc7b0
      Morris Jette authored
      Make sure that old step data is purged when a job is requeued.
      Without this logic, if a job terminates abnormally then old step
      data may be left in slurmctld. If the job is then requeued and
      started on a different node, referencing that old job step data
      can result in abnormal events. One specific failure mode is if
      the job is requeued on a node with a different number of cores,
      and the step-terminated RPC arrives later, the job and step
      bitmaps of allocated cores can differ in size, triggering an
      abort.
      bug 1660
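The failure mode described in this commit message can be modeled with a small Python sketch. It is a toy model, not slurmctld code: the `Job` and `Step` classes, the `requeue` and `step_terminated` functions, and the `purge_steps` switch are all hypothetical names introduced for illustration. A job is requeued onto a node with a different core count, and a late step-terminated message then arrives. With purging, the stale step is no longer found and the message is ignored; without it, the step and job bitmaps differ in size.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    step_id: int
    core_bitmap: List[bool]  # one flag per core of the node the step ran on


@dataclass
class Job:
    job_id: int
    core_bitmap: List[bool]
    steps: Dict[int, Step] = field(default_factory=dict)


def requeue(job: Job, new_core_count: int, purge_steps: bool = True) -> None:
    """Requeue the job onto a node with new_core_count cores (toy model)."""
    if purge_steps:
        # Drop step records left over from the previous run of the job.
        job.steps.clear()
    job.core_bitmap = [False] * new_core_count


def step_terminated(job: Job, step_id: int) -> bool:
    """Handle a step-terminated message; return False if the step is unknown."""
    step = job.steps.pop(step_id, None)
    if step is None:
        return False  # stale message for a purged step: nothing to release
    if len(step.core_bitmap) != len(job.core_bitmap):
        # Stands in for the abort described in the commit message.
        raise ValueError("job and step core bitmaps differ in size")
    # Release the cores the step was using.
    job.core_bitmap = [j and not s
                       for j, s in zip(job.core_bitmap, step.core_bitmap)]
    return True


job = Job(job_id=1, core_bitmap=[True] * 4)
job.steps[0] = Step(step_id=0, core_bitmap=[True] * 4)
requeue(job, new_core_count=8)          # old step data is purged
print(step_terminated(job, step_id=0))  # False: late message is ignored
```

Calling `requeue` with `purge_steps=False` in the same scenario leaves the 4-core step record in place, and the later `step_terminated` call raises `ValueError` because the job bitmap now has 8 entries.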
  25. 08 May, 2015 2 commits