1. 03 Mar, 2015 1 commit
      Abort I/O for debugged app launch fail · 49770e20
      Morris Jette authored
      For a job running under a debugger, if the exec of the task fails, then
      cancel its I/O and abort immediately rather than waiting 60 seconds for
      I/O timeout.
  2. 02 Mar, 2015 4 commits
  3. 27 Feb, 2015 5 commits
  4. 26 Feb, 2015 2 commits
  5. 25 Feb, 2015 4 commits
  6. 24 Feb, 2015 5 commits
  7. 20 Feb, 2015 1 commit
      Fix to GRES NoConsume logic · 33c48ac5
      Dorian Krause authored
      We came across the following error message in the slurmctld logs when
      using non-consumable resources:
      
      error: gres/potion: job 39 dealloc of node node1 bad node_offset 0 count is 0
      
      The error comes from _job_dealloc():
      
      node_gres_data=0x7f8a18000b70, node_offset=0, gres_name=0x1999e00
      "potion", job_id=46,
          node_name=0x1987ab0 "node1") at gres.c:3980
      (job_gres_list=0x199b7c0, node_gres_list=0x199bc38, node_offset=0,
      job_id=46,
          node_name=0x1987ab0 "node1") at gres.c:4190
      job_ptr=0x19e9d50, pre_err=0x7f8a31353cb0 "_will_run_test", remove_all=true)
          at select_linear.c:2091
      bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, max_share=1, req_nodes=1,
          preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40) at
      select_linear.c:3176
      bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
          preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
      exc_core_bitmap=0x0) at select_linear.c:3390
      bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
          preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
      exc_core_bitmap=0x0) at node_select.c:588
      avail_bitmap=0x7f8a2f910d38, min_nodes=1, max_nodes=1, req_nodes=1,
      exc_core_bitmap=0x0)
          at backfill.c:367
      
      The cause of this problem is that _node_state_dup() in gres.c does not
      duplicate the no_consume flag.
      The cr_ptr passed to _rm_job_from_nodes() is created with _dup_cr()
      which calls _node_state_dup().
      
      Below is a simple patch to fix the problem. A "future-proof" alternative
      might be to memcpy() from gres_ptr to new_gres and
      only handle pointers separately.
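
The "future-proof" alternative mentioned above can be sketched as follows. This is a minimal illustration, not the actual patch: `demo_gres_state_t`, its fields, and both functions are hypothetical stand-ins for the real structure handled by _node_state_dup() in gres.c. The idea is that a whole-struct memcpy() carries over every scalar field, including flags such as no_consume that are added later, so only owned pointers need explicit handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the per-node GRES state. */
typedef struct {
    bool      no_consume;     /* resource is tracked but never consumed */
    uint32_t  gres_cnt_avail;
    uint32_t  gres_cnt_alloc;
    uint32_t  topo_cnt;       /* number of entries in topo_gres_cnt */
    uint32_t *topo_gres_cnt;  /* owned heap array, needs a deep copy */
} demo_gres_state_t;

/* Release a state object created by demo_gres_state_dup(). */
void demo_gres_state_free(demo_gres_state_t *state)
{
    if (state == NULL)
        return;
    free(state->topo_gres_cnt);
    free(state);
}

/*
 * Duplicate a state object. memcpy() copies all scalar fields at once,
 * so a newly added flag cannot be forgotten; the owned pointer is then
 * replaced with a private copy so the two objects never share memory.
 */
demo_gres_state_t *demo_gres_state_dup(const demo_gres_state_t *src)
{
    assert(src != NULL);

    demo_gres_state_t *dst = malloc(sizeof(*dst));
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, sizeof(*dst));

    /* Pointers are handled separately: never keep the source's pointer. */
    dst->topo_gres_cnt = NULL;
    if (src->topo_cnt > 0 && src->topo_gres_cnt != NULL) {
        size_t bytes = (size_t)src->topo_cnt * sizeof(*src->topo_gres_cnt);
        dst->topo_gres_cnt = malloc(bytes);
        if (dst->topo_gres_cnt == NULL) {
            free(dst);
            return NULL;
        }
        memcpy(dst->topo_gres_cnt, src->topo_gres_cnt, bytes);
    }
    return dst;
}
```

The trade-off in this sketch is that every pointer member must be reset after the memcpy(); forgetting one would leave two objects sharing (and later double-freeing) the same allocation, which is arguably a worse failure than a missing flag.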
  8. 19 Feb, 2015 3 commits
  9. 18 Feb, 2015 2 commits
  10. 17 Feb, 2015 4 commits
  11. 13 Feb, 2015 2 commits
      Fix squeue. · c13e8540
      David Bigagli authored
      Avoid triggering accounting if node state unchanged · 23f84ace
      Morris Jette authored
      If a call was made to change a node's state to the state it
      was already in and to set its reason to the value it already
      had, an accounting record was still generated. If a script, say
      NodeHealthCheck, repeatedly sets a node state (say DRAIN),
      it could generate a huge number of redundant accounting records.
      This change eliminates those redundant records.
      Related to bug 1437.
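
The check described in this commit message can be illustrated roughly as below. This is a sketch under stated assumptions, not the code from the commit: `demo_node_t`, `demo_same_reason()`, and `demo_update_node()` are hypothetical names, and the real node record carries far more state. The function reports whether anything actually changed, so a caller would write an accounting record only when it returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified node record. */
typedef struct {
    uint32_t    state;   /* e.g. a DRAIN flag value */
    const char *reason;  /* caller-owned string, may be NULL */
} demo_node_t;

/* Compare two reason strings, treating NULL as equal only to NULL. */
static bool demo_same_reason(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    return strcmp(a, b) == 0;
}

/*
 * Apply a state/reason update. Returns true if the node changed (an
 * accounting record is warranted), false if the request was a no-op.
 */
bool demo_update_node(demo_node_t *node, uint32_t new_state,
                      const char *new_reason)
{
    assert(node != NULL);

    if (node->state == new_state &&
        demo_same_reason(node->reason, new_reason))
        return false;  /* nothing changed: skip the accounting record */

    node->state = new_state;
    node->reason = new_reason;
    return true;
}
```

With this shape, a health-check script that sets the same state and reason every few minutes produces one record on the first call and none afterwards.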
  12. 12 Feb, 2015 4 commits
  13. 11 Feb, 2015 1 commit
  14. 10 Feb, 2015 2 commits
      Additional fix to 50e0c84f. · 50b43afd
      Brian Christiansen authored
      UIDs are 0 when associations are loaded.
      Backfill scheduler bug on job's partition change · a0d12d0c
      Morris Jette authored
      The backfill scheduler builds a queue of eligible job/partition
      information and then proceeds to determine when and where those
      jobs will start. The backfill scheduler can be configured to
      periodically release locks in order to let other operations
      take place. If the partition(s) associated with one of those
      jobs changes during one of those periods, the job will still
      be considered for scheduling in the old partition until the
      backfill scheduler starts over with a new job/partition list.
      This change to the backfill scheduler validates each job's
      partition from the list against current information
      (considering any partition changes).
      See bug 1436
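
The revalidation step described above might look roughly like this. It is a sketch only: the types `demo_part_t`, `demo_job_t`, and `demo_queue_rec_t` and the function `demo_queue_rec_valid()` are hypothetical and do not reflect the actual data structures in backfill.c. Each queue entry remembers the partition it was built for, and before the entry is scheduled it is checked against the job's current partition list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PARTS 4

/* Hypothetical partition record. */
typedef struct {
    const char *name;
} demo_part_t;

/* Hypothetical job record: the partitions the job may currently run in. */
typedef struct {
    int                id;
    const demo_part_t *parts[DEMO_MAX_PARTS];
    size_t             part_cnt;
} demo_job_t;

/* One job/partition pair captured when the queue was built. */
typedef struct {
    demo_job_t        *job;
    const demo_part_t *part;  /* partition at queue-build time */
} demo_queue_rec_t;

/*
 * Return true if the queued partition is still one of the job's
 * partitions. Intended to be called after locks are re-acquired,
 * since the job may have been moved while they were released.
 */
bool demo_queue_rec_valid(const demo_queue_rec_t *rec)
{
    assert(rec != NULL && rec->job != NULL);

    for (size_t i = 0; i < rec->job->part_cnt; i++) {
        if (rec->job->parts[i] == rec->part)
            return true;
    }
    return false;  /* stale entry: skip it in this scheduling pass */
}
```

An entry that fails the check is simply skipped; the job is picked up with its new partition when the queue is next rebuilt.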