1. 03 Jun, 2015 1 commit
    • Add srun --accel-bind option · 4d3726b2
      Morris Jette authored
      Add srun --accel-bind option to control how tasks are bound to GPU and NIC
          Generic RESources (GRES).
      Based in part upon work by Matthieu Ospici (ATOS).
      gres/nic plugin modified to set OMPI_MCA_btl_openib_if_include environment
          variable based upon allocated devices (usable with OpenMPI and Mellanox).
      Reset GRES env vars after task affinity set
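The environment handling described in this commit can be illustrated with a minimal sketch. This is not the gres/nic plugin code (which is written in C); the function name `nic_env_for_task` and the device names in the usage note are hypothetical, and the only detail taken from the commit message is that `OMPI_MCA_btl_openib_if_include` is derived from the allocated devices. Joining the names with commas follows the usual Open MPI convention for this parameter.

```python
import os


def nic_env_for_task(allocated_devices, env=None):
    """Return a copy of `env` with the Open MPI NIC filter set.

    `allocated_devices` is a list of NIC device names assumed to have
    been allocated to the task (hypothetical input for this sketch).
    """
    # Work on a copy so the caller's environment is left untouched.
    env = dict(os.environ if env is None else env)
    if allocated_devices:
        # Comma-separated list of devices Open MPI may use.
        env["OMPI_MCA_btl_openib_if_include"] = ",".join(allocated_devices)
    return env
```

For example, `nic_env_for_task(["mlx4_0", "mlx4_1"], {})` yields an environment in which `OMPI_MCA_btl_openib_if_include` is `mlx4_0,mlx4_1`. When no devices are passed, the environment is returned unchanged.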
  2. 01 Jun, 2015 2 commits
  3. 30 May, 2015 1 commit
  4. 29 May, 2015 5 commits
  5. 28 May, 2015 2 commits
  6. 27 May, 2015 1 commit
    • Map job --mem-per-cpu=0 to --mem=0. · 33c77302
      Morris Jette authored
      However, --mem=0 now reflects the appropriate amount of memory in the
      system, while --mem-per-cpu=0 hasn't changed.  This allows all the memory
      to be allocated in a cgroup without being "consumed", so it remains
      available for other jobs running on the same host.
      Eric Martin, Washington University School of Medicine
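The mapping named in the commit title can be sketched as a small normalization step. This is an illustration only, not the slurmctld logic; the function name `normalize_memory_request` and the dictionary layout are hypothetical, and all it encodes is the title's rule that a `--mem-per-cpu=0` request is treated as `--mem=0`.

```python
def normalize_memory_request(mem_mb=None, mem_per_cpu_mb=None):
    """Map a per-CPU memory request of 0 onto a per-node request of 0.

    Both arguments are in megabytes; None means "not specified".
    """
    if mem_per_cpu_mb == 0:
        # Treat --mem-per-cpu=0 exactly like --mem=0.
        return {"mem_mb": 0, "mem_per_cpu_mb": None}
    # Any other request passes through unchanged.
    return {"mem_mb": mem_mb, "mem_per_cpu_mb": mem_per_cpu_mb}
```

With this sketch, `normalize_memory_request(mem_per_cpu_mb=0)` and `normalize_memory_request(mem_mb=0)` produce the same result, while a nonzero per-CPU request is left as given.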
  7. 26 May, 2015 3 commits
  8. 22 May, 2015 3 commits
  9. 21 May, 2015 2 commits
  10. 20 May, 2015 2 commits
  11. 19 May, 2015 1 commit
  12. 16 May, 2015 1 commit
  13. 15 May, 2015 2 commits
  14. 14 May, 2015 3 commits
  15. 13 May, 2015 4 commits
  16. 12 May, 2015 2 commits
  17. 11 May, 2015 1 commit
    • Purge old step data on job requeue · beecc7b0
      Morris Jette authored
      Make sure that old step data is purged when a job is requeued.
      Without this logic, if a job terminates abnormally then old step
      data may be left in slurmctld. If the job is then requeued and
      started on a different node, referencing that old job step data
      can result in abnormal events. One specific failure mode: if
      the job is requeued on a node with a different number of cores
      and the step-terminated RPC arrives later, the job and step
      bitmaps of allocated cores can differ in size, generating an
      abort.
      bug 1660
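The failure mode and the fix described in this commit can be modeled with a toy example. This is a sketch under stated assumptions, not slurmctld code: the `Job` class, its method names, and the use of Python lists as bitmaps are all hypothetical. It shows only the idea from the message: purging step records on requeue means a late step-terminated message finds no stale bitmap to compare against the resized job bitmap.

```python
class Job:
    """Toy job record holding a core bitmap and per-step core bitmaps."""

    def __init__(self, job_id, core_count):
        self.job_id = job_id
        self.core_bitmap = [False] * core_count
        self.steps = {}  # step_id -> bitmap of cores used by that step

    def start_step(self, step_id, cores):
        # The step bitmap is sized to match the job's current node.
        bitmap = [False] * len(self.core_bitmap)
        for core in cores:
            bitmap[core] = True
            self.core_bitmap[core] = True
        self.steps[step_id] = bitmap

    def requeue(self, new_core_count):
        # Purge old step data so nothing sized for the old node survives.
        self.steps.clear()
        self.core_bitmap = [False] * new_core_count

    def step_terminated(self, step_id):
        """Handle a (possibly late) step-terminated message."""
        step_bitmap = self.steps.pop(step_id, None)
        if step_bitmap is None:
            return False  # step was already purged; ignore the message
        # Without the purge in requeue(), these sizes could differ.
        assert len(step_bitmap) == len(self.core_bitmap)
        for index, used in enumerate(step_bitmap):
            if used:
                self.core_bitmap[index] = False
        return True
```

In this model, a job started on a 16-core node, requeued onto an 8-core node, and then sent a late termination message for its old step simply ignores that message. If `requeue()` did not clear `self.steps`, the size check in `step_terminated()` would fail, which mirrors the abort described above.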
  18. 08 May, 2015 4 commits