  1. 29 Oct, 2014 1 commit
  2. 27 Oct, 2014 4 commits
  3. 24 Oct, 2014 2 commits
  4. 23 Oct, 2014 1 commit
  5. 21 Oct, 2014 4 commits
  6. 20 Oct, 2014 5 commits
  7. 18 Oct, 2014 2 commits
  8. 17 Oct, 2014 6 commits
  9. 16 Oct, 2014 3 commits
    • sched/backfill: don't clear running job start_time · e6290537
      Morris Jette authored
      bug 1178
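The one-line summary is terse, but the gist is that the backfill scheduler computes expected start times for pending jobs and must not overwrite the recorded start time of jobs that are already running. A minimal Python sketch of that guard (names are illustrative; Slurm's actual implementation is C):

```python
import time

# Illustrative job states; these names are assumptions, not Slurm's own.
PENDING, RUNNING = "PENDING", "RUNNING"

class Job:
    def __init__(self, job_id, state, start_time=None):
        self.job_id = job_id
        self.state = state
        self.start_time = start_time  # actual start time for running jobs

def backfill_pass(jobs, now=None):
    """Assign *expected* start times to pending jobs only.

    The bug being fixed: an unconditional `job.start_time = ...` here
    would clobber the real start time of jobs already running.
    """
    now = now or time.time()
    for job in jobs:
        if job.state != RUNNING:   # the guard the fix adds
            job.start_time = now   # simplistic placeholder estimate
    return jobs
```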
    • e1c42895 · Brian Christiansen
    • Change Cray mpi_fini failure logic · 5f89223f
      Morris Jette authored
      Treat a Cray MPI job calling exit() without mpi_fini() as a fatal
      error for that specific task and let srun handle all timeout logic.
      The previous logic would cancel the entire job step, and the srun
      options for wait time and kill-on-exit were ignored. The new logic
      gives users the following type of response:
      
      $ srun -n3 -K0 -N3 --wait=60 ./tmp
      Task:0 Cycle:1
      Task:2 Cycle:1
      Task:1 Cycle:1
      Task:0 Cycle:2
      Task:2 Cycle:2
      slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
      srun: error: tux2: task 1: Killed
      Task:0 Cycle:3
      Task:2 Cycle:3
      Task:0 Cycle:4
      ...
      
      bug 1171
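A rough Python model of the behavioral change described above (purely illustrative; the real logic lives in slurmstepd and srun, in C): the old path cancelled the whole step as soon as one task exited without mpi_fini(), while the new path marks only that task as failed and leaves the -K/--wait handling to srun.

```python
def handle_task_exit(tasks, failed_task, kill_on_bad_exit):
    """Model of the old vs. new failure handling.

    tasks:            dict of task_id -> state ("running" or "killed")
    failed_task:      task that called exit() without mpi_fini()
    kill_on_bad_exit: models srun's -K option; with -K0 (False) the
                      remaining tasks keep running, as in the transcript
    """
    tasks[failed_task] = "killed"
    if kill_on_bad_exit:            # old behavior effectively forced this
        for tid in tasks:
            tasks[tid] = "killed"
    return tasks
```

With kill_on_bad_exit=False this reproduces the transcript above: task 1 is killed while tasks 0 and 2 continue cycling.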
  10. 15 Oct, 2014 5 commits
  11. 14 Oct, 2014 3 commits
  12. 10 Oct, 2014 4 commits
    • 6bf40ed9 · Danny Auble
    • 5d6a2dc2 · Brian Christiansen
    • Job step memory allocation logic fix · f288e4eb
      Dorian Krause authored
      This commit fixes a bug we observed when combining select/linear with
      gres. If an allocation was requested with a --gres argument, an srun
      execution within that allocation would stall indefinitely:
      
      -bash-4.1$ salloc -N 1 --gres=gpfs:100
      salloc: Granted job allocation 384049
      bash-4.1$ srun -w j3c017 -n 1 hostname
      srun: Job step creation temporarily disabled, retrying
      
      The slurmctld log showed:
      
      debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
      debug3:    cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
      debug3:    host=j3l02 port=33608 name=hostname network=(null) exclusive=0
      debug3:    checkpoint-dir=/home/user checkpoint_int=0
      debug3:    mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
      debug3:    overcommit=0 time_limit=0 gres=(null) constraints=(null)
      debug:  Configuration for job 384049 complete
      _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
      _slurm_rpc_job_step_create for job 384049: Requested nodes are busy
      
      Had srun --exclusive been used instead, everything would have worked
      fine. The reason is that in exclusive mode the code properly checks
      whether memory is a reserved resource in the _pick_step_nodes()
      function. This commit modifies the alternate code path to do the same.
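To illustrate the fix, here is a hedged Python sketch (Slurm's actual implementation is the C function _pick_step_nodes() in slurmctld; all names below are illustrative): a node should only be rejected for "memory used by other steps" when memory is actually a reserved, tracked resource for the job. With select/linear plus gres it was not, yet the non-exclusive path still applied the check, so the step could never start.

```python
def pick_step_nodes(requested, mem_in_use, mem_is_reserved_resource):
    """Return the nodes usable for a new job step.

    requested:                node names the step asked for
    mem_in_use:               set of nodes with memory held by other steps
    mem_is_reserved_resource: whether memory is tracked for this job
                              (False in the bug report's select/linear +
                              gres configuration)
    """
    usable = []
    for node in requested:
        # The fix: apply the memory check only when memory is actually
        # a reserved resource, as the exclusive path already did.
        if mem_is_reserved_resource and node in mem_in_use:
            continue  # "still have memory used by other steps"
        usable.append(node)
    return usable
```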
    • Job step memory allocation logic fix · 0dd12469
      Dorian Krause authored