1. 09 May, 2018 9 commits
  2. 08 May, 2018 3 commits
    • Prevent slurmd from launching steps if prolog fail · 3b029021
      Brian Christiansen authored
      Bug 5146
    • Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
      Tim Wickberg authored
      Caused by a corrupted protocol_version field value being received
      by the slurmstepd, as we cannot safely write/read a uint16_t across
      the pipe as if it was an int.
      
      Regression caused by commit 90b116c2.
      
      Bug 5133.
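
To illustrate why a size mismatch across a pipe is only visible on some architectures, here is a minimal sketch. It is not the Slurm code: the function names and the value 8192 are hypothetical, and it assumes one plausible form of the mismatch (the writer sends `sizeof(int)` bytes, the reader consumes `sizeof(uint16_t)` bytes). Byte order is simulated with `struct`, so the sketch behaves the same on any host.

```python
import struct

def send_as_int(value: int, byte_order: str) -> bytes:
    """Serialize value as a 4-byte int, as a writer using sizeof(int) would."""
    return struct.pack(byte_order + "i", value)

def read_as_uint16(buf: bytes, byte_order: str) -> int:
    """Read only the first 2 bytes, as a reader using sizeof(uint16_t) would."""
    (value,) = struct.unpack(byte_order + "H", buf[:2])
    return value

# Hypothetical value, chosen only for illustration.
PROTOCOL_VERSION = 8192

# "<" simulates a little-endian host, ">" a big-endian host such as ppc64.
little = read_as_uint16(send_as_int(PROTOCOL_VERSION, "<"), "<")
big = read_as_uint16(send_as_int(PROTOCOL_VERSION, ">"), ">")

print(little)  # 8192: the low-order bytes come first, so the value survives
print(big)     # 0: the high-order bytes come first, so the value is lost
```

On a little-endian host the mismatch goes unnoticed, which is consistent with the problem surfacing only on big-endian ppc64. Writing and reading the same fixed-width type on both ends of the pipe avoids it.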
    • Fix checkpointing requeued jobs in a bad state · f9f395af
      Brian Christiansen authored
      Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
      in. The issue is that if job_set_alloc_tres gets called while in the
      PENDING|COMPLETING state, the job's tres_alloc_str will be freed. If
      this job then gets checkpointed in this state (PENDING|COMPLETING + no
      tres_alloc_str), then on startup the controller would crash because it
      expected the job to have a tres_alloc_str/cnt when in the COMPLETING
      state. This could be triggered if starting the controller without the
      dbd up. When the dbd comes up, the assoc_cache_mgr calls
      _update_job_tres() which calls job_set_alloc_tres. It could also be
      triggered by adding new tres.
      
      This most likely started happening in 17.11.5 because of commit
      865b672f which introduced calling _update_job_tres() on each job
      after the dbd comes up.
      
      Bugs 5137,4522
  3. 04 May, 2018 1 commit
  4. 03 May, 2018 3 commits
  5. 02 May, 2018 6 commits
  6. 01 May, 2018 1 commit
    • Fix total TRES Billing on partitions. · 3686dd9c
      Danny Auble authored
      Turns out the partition's billing TRES was working off the sum of
      the node_ptrs, which contain the max of all partitions they are in.
      
      This isn't correct since each partition's billing can be different.
      
      Set it correctly here.
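
The effect described above can be shown with a simplified model. This is a sketch under stated assumptions, not Slurm's data structures: the partition names, node names, CPU counts, and per-CPU weights are all hypothetical, and billing is reduced to a single CPU weight per partition.

```python
# Hypothetical per-partition billing weight applied to each CPU.
partition_cpu_weight = {"debug": 1.0, "prod": 2.0}

# Hypothetical nodes: CPU count and the partitions each node belongs to.
nodes = {
    "n1": {"cpus": 4, "partitions": ["debug", "prod"]},
    "n2": {"cpus": 4, "partitions": ["debug", "prod"]},
}

def node_billing(node):
    """Per-node value: the max billing over all partitions the node is in."""
    return max(node["cpus"] * partition_cpu_weight[p] for p in node["partitions"])

def partition_billing_from_nodes(part):
    """Approach described as incorrect: sum the per-node (max) values."""
    return sum(node_billing(n) for n in nodes.values() if part in n["partitions"])

def partition_billing_own_weights(part):
    """Use the partition's own weight for each member node."""
    return sum(
        n["cpus"] * partition_cpu_weight[part]
        for n in nodes.values()
        if part in n["partitions"]
    )

print(partition_billing_from_nodes("debug"))   # 16.0, inflated by prod's weight
print(partition_billing_own_weights("debug"))  # 8.0
```

In this model the two approaches agree only for the partition with the highest weight, which is why the total can look right on some partitions and wrong on others.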
  7. 30 Apr, 2018 3 commits
  8. 28 Apr, 2018 2 commits
  9. 23 Apr, 2018 2 commits
  10. 19 Apr, 2018 2 commits
  11. 17 Apr, 2018 1 commit
    • Make UnavailableNodes value in job reason be correct for each job · fc4e5ac9
      Morris Jette authored
      1. Identifies nodes which are unavailable to a specific job by
         adding a call to filter_by_node_owner() in select_nodes(),
         where the node list is generated.
      2. Removes the "unavail_node_str" argument to select_nodes(),
         as it is no longer useful. This string was originally
         generated once at the start of the job scheduling logic for all jobs,
         but each job can have a different set of unavailable nodes
         (dedicated to user, group, etc.), so the same string for all
         jobs can be misleading.
      
      Bug 4932.
  12. 16 Apr, 2018 3 commits
  13. 11 Apr, 2018 4 commits