1. 31 Oct, 2014 - 3 commits
  2. 27 Oct, 2014 - 2 commits
  3. 24 Oct, 2014 - 1 commit
  4. 23 Oct, 2014 - 2 commits
  5. 21 Oct, 2014 - 1 commit
    • Fix job gres info clear on slurmctld restart · 1209a664
      Morris Jette authored
      Fix a bug that prevented preservation of a job's GRES bitmap on slurmctld
      restart or reconfigure (the bug was introduced in 14.03.5, "Clear record of a
      job's gres when requeued", and only applies when GRES are mapped to specific
      files).
      bug 1192
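The fix above amounts to restoring the saved GRES bitmap when job state is read back in, instead of clearing it. A minimal sketch of that idea, using a hypothetical dictionary-based job record rather than Slurm's actual data structures:

```python
# Sketch (hypothetical model, not Slurm's internals): preserve a job's
# per-file GRES bitmap across a controller restart by packing it into
# saved state and restoring it, rather than clearing it.

def pack_job_state(job: dict) -> dict:
    """Capture the fields that must survive a restart/reconfigure."""
    return {"gres_bitmap": job["gres_bitmap"]}

def unpack_job_state(saved: dict) -> dict:
    """Rebuild the job record on restart.  The fix restores the saved
    bitmap; the buggy version effectively reset it to 0."""
    return {"gres_bitmap": saved["gres_bitmap"]}

job = {"gres_bitmap": 0b101}           # holds GRES files 0 and 2
saved = pack_job_state(job)
job = unpack_job_state(saved)          # simulate slurmctld restart
assert job["gres_bitmap"] == 0b101     # file mapping survives
```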
  6. 17 Oct, 2014 - 1 commit
    • Correct license count for suspended jobs · 77a0bb65
      Morris Jette authored
      Correct tracking of licenses for suspended jobs on slurmctld reconfigure or
      restart. Previously, licenses for suspended jobs were not counted, so
      the license count could be exceeded when those jobs were resumed.
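The essence of this fix is that a restart-time recount of in-use licenses must include suspended jobs, since they still hold their licenses. A sketch with hypothetical job records (not Slurm's internals):

```python
# Sketch (hypothetical job records): rebuild the in-use license count on
# restart, counting suspended jobs too, so resuming them cannot
# oversubscribe the license pool.

def recount_licenses(jobs):
    # The fix: suspended jobs still hold their licenses; the old code
    # effectively counted only running jobs.
    return sum(j["licenses"] for j in jobs
               if j["state"] in ("RUNNING", "SUSPENDED"))

jobs = [
    {"state": "RUNNING",   "licenses": 2},
    {"state": "SUSPENDED", "licenses": 3},  # missed by the old code
]
assert recount_licenses(jobs) == 5
```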
  7. 15 Oct, 2014 - 3 commits
  8. 14 Oct, 2014 - 2 commits
  9. 11 Oct, 2014 - 2 commits
  10. 10 Oct, 2014 - 1 commit
    • Job step memory allocation logic fix · 0dd12469
      Dorian Krause authored
      This commit fixes a bug we observed when combining select/linear with
      gres. If an allocation was requested with a --gres argument, an srun
      execution within that allocation would stall indefinitely:
      
      -bash-4.1$ salloc -N 1 --gres=gpfs:100
      salloc: Granted job allocation 384049
      bash-4.1$ srun -w j3c017 -n 1 hostname
      srun: Job step creation temporarily disabled, retrying
      
      The slurmctld log showed:
      
      debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
      debug3:    cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
      debug3:    host=j3l02 port=33608 name=hostname network=(null) exclusive=0
      debug3:    checkpoint-dir=/home/user checkpoint_int=0
      debug3:    mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
      debug3:    overcommit=0 time_limit=0 gres=(null) constraints=(null)
      debug:  Configuration for job 384049 complete
      _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
      _slurm_rpc_job_step_create for job 384049: Requested nodes are busy
      
      If srun --exclusive had been used instead, everything would have worked
      fine. The reason is that in exclusive mode the code properly checks whether
      memory is a reserved resource in the _pick_step_nodes() function.
      This commit modifies the alternate code path to do the same.
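The decision being fixed can be sketched as follows, with a deliberately simplified data model (this is not _pick_step_nodes() itself): a node's step memory usage should only block step placement when memory is actually tracked as a reserved resource.

```python
# Simplified sketch of the step-placement memory check (hypothetical
# model, not Slurm's _pick_step_nodes()): only treat a node's step
# memory usage as blocking when memory is a reserved resource.

def node_usable(mem_alloc, mem_used, step_mem, mem_is_reserved):
    if not mem_is_reserved:
        # The buggy non-exclusive path skipped this test and rejected
        # the node anyway, so the step retried forever.
        return True
    return mem_used + step_mem <= mem_alloc

# Memory not tracked as a consumable resource: node must stay usable.
assert node_usable(62720, 62720, 1024, mem_is_reserved=False) is True
# Memory reserved and exhausted: the step must wait.
assert node_usable(62720, 62720, 1024, mem_is_reserved=True) is False
```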
  11. 09 Oct, 2014 - 1 commit
  12. 07 Oct, 2014 - 2 commits
  13. 04 Oct, 2014 - 2 commits
    • Resuming powered down node clears down/drain flag · 04cc5643
      Morris Jette authored
      Do not cause the node to be rebooted (powered up) when its
      down/drain flag is cleared.
    • Fix for resuming powered down node · 2894b779
      Morris Jette authored
      This permits a system administrator to power down a node that should
      already be powered down, while avoiding setting the NO_RESPOND bit in
      the node state. Setting that bit under some conditions prevented the
      node from being scheduled. The downside is that the node could possibly
      be allocated when it is not actually ready for use.
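The two node-state commits above boil down to flag manipulation on resume: clear the down/drain bits, but do not set NO_RESPOND. A sketch with hypothetical flag values (Slurm's actual node_state encoding differs):

```python
# Sketch with hypothetical flag values (not Slurm's node_state
# encoding): resuming a powered-down node clears DOWN/DRAIN without
# setting NO_RESPOND, so the scheduler can still consider the node.

NODE_STATE_DOWN         = 0x01
NODE_STATE_DRAIN        = 0x02
NODE_STATE_NO_RESPOND   = 0x04
NODE_STATE_POWERED_DOWN = 0x08

def resume_node(state: int) -> int:
    state &= ~(NODE_STATE_DOWN | NODE_STATE_DRAIN)   # clear down/drain
    # Fix: do NOT set NODE_STATE_NO_RESPOND here; under some conditions
    # that bit kept the node from being scheduled.
    return state

s = resume_node(NODE_STATE_POWERED_DOWN | NODE_STATE_DOWN)
assert not s & NODE_STATE_DOWN
assert not s & NODE_STATE_NO_RESPOND
```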
  14. 03 Oct, 2014 - 5 commits
  15. 30 Sep, 2014 - 2 commits
  16. 29 Sep, 2014 - 1 commit
  17. 22 Sep, 2014 - 2 commits
  18. 19 Sep, 2014 - 1 commit
  19. 17 Sep, 2014 - 1 commit
    • Add more job submit validity checks · 84807b11
      Morris Jette authored
      Test 3.11 was failing in some configurations without this because the
      CPU count in the RPC was lower than the number of nodes in the
      required node list.
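The added check described above can be sketched as a simple submit-time predicate (simplified; the real validation covers more fields): a request whose CPU count is lower than its required node count can never be satisfied, so it is rejected at submission rather than failing later.

```python
# Sketch of a submit-time sanity check (hypothetical helper, not
# Slurm's actual validation code).

def job_request_valid(min_cpus: int, required_node_cnt: int) -> bool:
    # every required node needs at least one CPU
    return min_cpus >= required_node_cnt

assert job_request_valid(4, 4) is True
assert job_request_valid(2, 4) is False   # the case test 3.11 hit
```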
  20. 16 Sep, 2014 - 3 commits
  21. 11 Sep, 2014 - 1 commit
  22. 09 Sep, 2014 - 1 commit
    • Fix MaxJobCount enforcement race condition · a768e5f9
      Morris Jette authored
      Eliminate a race condition in enforcement of the MaxJobCount limit for
      job arrays. The job count limit was checked for a job array before the
      slurmctld job locks were taken. If new jobs were submitted between that
      check and the job array's creation, creating the array could exceed
      MaxJobCount and trigger a fatal error.
      bug 1091