1. 09 May, 2018 14 commits
  2. 08 May, 2018 3 commits
    • Brian Christiansen's avatar
      Prevent slurmd from launching steps if prolog fail · 3b029021
      Brian Christiansen authored
      Bug 5146
      3b029021
    • Tim Wickberg's avatar
      Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
      Tim Wickberg authored
      Caused by a corrupted protocol_version field value being received
      by the slurmstepd, as we cannot safely write/read a uint16_t across
      the pipe as if it was an int.
      
      Regression caused by commit 90b116c2.
      
      Bug 5133.
      77d65f4f
    • Brian Christiansen's avatar
      Fix checkpointing requeued jobs in a bad state · f9f395af
      Brian Christiansen authored
      Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
      in. The issue is that if job_set_alloc_tres gets called while in the
      PENDING|COMPLETING state, the job's alloc_tres_str will be free'd. If
      this job then gets checkpointed in this state (PENDING|COMPLETING + no
      tres_alloc_str) on startup the controller would crash because it
      expected the job to have a tres_alloc_str/cnt when in the COMPLETING
      state. This could be triggered if starting the controller without the
      dbd up. When the dbd comes up, the assoc_cache_mgr calls
      _update_job_tres() which calls job_set_alloc_tres. It could also be
      triggered by adding new tres.
      
      This most likely started happening in 17.11.5 because of commit
      865b672f which introduced calling _update_job_tres() on each job
      after the dbd comes up.
      
      Bugs 5137,4522
      f9f395af
  3. 04 May, 2018 2 commits
  4. 03 May, 2018 6 commits
  5. 02 May, 2018 6 commits
  6. 01 May, 2018 2 commits
  7. 30 Apr, 2018 7 commits