1. 05 May, 2014 1 commit
  2. 02 May, 2014 2 commits
  3. 30 Apr, 2014 1 commit
    • switch/nrt - CAU and RDMA tracking correction · 6f66fdef
      Morris Jette authored
      switch/nrt - Properly track usage of CAU and RDMA resources with multiple
      tasks per compute node. The previous logic allocated resources once per
      task but deallocated them only once per node, leaking CAU and RDMA
      resources and preventing their use by future jobs.
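The accounting mismatch described above can be illustrated with a toy model. This is only a sketch, not the switch/nrt code: the `WindowTracker` class, its method names, and the per-node limit are all hypothetical, chosen to show why releasing once per node after allocating once per task leaves resources stranded.

```python
from collections import Counter


class WindowTracker:
    """Toy per-node counter for a limited switch resource (hypothetical)."""

    def __init__(self, per_node_limit):
        self.per_node_limit = per_node_limit
        self.in_use = Counter()

    def allocate(self, node, tasks):
        # One unit is taken for every task placed on the node.
        if self.in_use[node] + tasks > self.per_node_limit:
            raise RuntimeError(f"no free resources on {node}")
        self.in_use[node] += tasks

    def release_buggy(self, node, tasks):
        # Mirrors the old behavior: only one unit is returned per node.
        self.in_use[node] -= 1

    def release_fixed(self, node, tasks):
        # Symmetric with allocate(): one unit is returned per task.
        self.in_use[node] -= tasks


def leaked_after_job(release_name, tasks_per_node=4):
    """Run one job on one node and report how many units stay marked in use."""
    tracker = WindowTracker(per_node_limit=8)
    tracker.allocate("node1", tasks_per_node)
    getattr(tracker, release_name)("node1", tasks_per_node)
    return tracker.in_use["node1"]
```

With four tasks on a node, `leaked_after_job("release_buggy")` leaves three units marked in use, while `leaked_after_job("release_fixed")` returns the count to zero.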
  4. 18 Apr, 2014 1 commit
    • switch/nrt - free partial allocation · a197a1da
      Morris Jette authored
      On switch resource allocation failure, free the partial allocation.
      In the failure mode, CAU could be allocated on some nodes but not
      others, and the CAU allocated on nodes and switches up to the failure
      point was never released.
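The all-or-nothing behavior described above can be sketched as a rollback loop. This is an illustration under assumptions, not the actual switch/nrt implementation: the `free_units` dictionary and the function name are hypothetical.

```python
def allocate_on_all_nodes(free_units, nodes, units_needed):
    """Allocate units_needed on every node, or on none of them.

    free_units maps a node name to its free unit count (hypothetical model).
    Returns True on success and False after rolling back a partial allocation.
    """
    granted = []
    for node in nodes:
        if free_units.get(node, 0) < units_needed:
            # Failure: give back everything granted before this node.
            for done in granted:
                free_units[done] += units_needed
            return False
        free_units[node] -= units_needed
        granted.append(node)
    return True
```

Recording each successful grant in `granted` is what makes the cleanup possible: on failure, only the nodes touched so far are restored.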
  5. 08 Apr, 2014 5 commits
  6. 07 Apr, 2014 7 commits
  7. 05 Apr, 2014 6 commits
  8. 04 Apr, 2014 8 commits
  9. 03 Apr, 2014 2 commits
  10. 02 Apr, 2014 2 commits
    • Minor tweak to scheduler cycle timing · 8fb863f9
      Morris Jette authored
      Decrease maximum scheduler main loop run time from 10 secs to
      4 secs for improved performance.
      If running with sched/backfill, do not run through all jobs on the
      periodic scheduling loop, but only the default depth. The
      backfill scheduler can go through more jobs anyway due to its
      ability to relinquish and recover locks.
      See bug 616
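The two limits mentioned above (a run-time cap and a depth cap) can be sketched as follows. This is a simplified illustration, not slurmctld's scheduler: `schedule_pass`, `try_start`, and the injectable `clock` are hypothetical names, and only the 4-second figure comes from the commit message.

```python
import time


def schedule_pass(pending_jobs, try_start, max_seconds=4.0, depth=None,
                  clock=time.monotonic):
    """Try to start jobs in order, bounded by a time budget and a depth.

    depth=None means every pending job is considered; an integer limits the
    pass to the first `depth` jobs in the queue.
    """
    started = []
    deadline = clock() + max_seconds
    candidates = pending_jobs if depth is None else pending_jobs[:depth]
    for job in candidates:
        if clock() >= deadline:
            # Time budget exhausted: stop so other work is not blocked.
            break
        if try_start(job):
            started.append(job)
    return started
```

Passing a `depth` models the periodic loop that stops at the default depth, leaving deeper queue inspection to a separate backfill pass.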
    • launch/poe - fix network value · ad7100b8
      Morris Jette authored
      If a job step's network value is set by poe, either by directly
      executing poe or by srun launching poe, that value was not being
      propagated to the job step creation RPC, so the network was not
      being set up for the proper protocol (e.g. mpi, lapi, pami, etc.).
      The previous logic would only work if the srun execute line
      explicitly set the protocol using the --network option.
  11. 31 Mar, 2014 2 commits
  12. 26 Mar, 2014 1 commit
  13. 25 Mar, 2014 1 commit
  14. 24 Mar, 2014 1 commit
    • job array dependency recovery fix · fca71890
      Morris Jette authored
      When slurmctld restarted, it would not recover dependencies on
      job array elements and would just discard the dependency. This
      corrects the parsing problem to recover the dependency. The old code
      would print a message like this and discard it:
      slurmctld: error: Invalid dependencies discarded for job 51: afterany:47_*
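To make the dependency form in that log line concrete, here is a minimal parsing sketch. It is not slurmctld's parser: the function name, the returned tuple shape, and the restriction to a single `type:jobid[_task]` term are assumptions for illustration.

```python
import re

# One dependency term: a type, a job ID, and an optional array task suffix,
# where "_*" refers to every element of the job array.
_DEP_RE = re.compile(r"^(?P<kind>[a-z]+):(?P<job>\d+)(?:_(?P<task>\*|\d+))?$")


def parse_dependency(spec):
    """Return (kind, job_id, array_task) for a single dependency term.

    array_task is None for a plain job, "*" for all array elements, or an
    int for one specific element.
    """
    match = _DEP_RE.match(spec)
    if match is None:
        raise ValueError(f"invalid dependency: {spec!r}")
    task = match.group("task")
    if task is None:
        array_task = None
    elif task == "*":
        array_task = "*"
    else:
        array_task = int(task)
    return match.group("kind"), int(match.group("job")), array_task
```

A parser that accepts only digits after the colon would reject `afterany:47_*`, which is the kind of discard the commit message describes.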