1. 09 May, 2018 16 commits
  2. 08 May, 2018 3 commits
    • Prevent slurmd from launching steps if prolog fail · 3b029021
      Brian Christiansen authored
      Bug 5146
    • Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
      Tim Wickberg authored
      Caused by a corrupted protocol_version field value being received
      by the slurmstepd, as we cannot safely write/read a uint16_t across
      the pipe as if it was an int.
      
      Regression caused by commit 90b116c2.
      
      Bug 5133.
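The width mismatch described in this commit message can be illustrated with a minimal sketch. It is not the Slurm code: the helper names (`send_u16`, `recv_u16`, `roundtrip_u16`, `mismatched_read`) and the sample value are hypothetical, and `mismatched_read` shows only one possible form of the problem (two bytes landing in an int-sized buffer), which may differ from the exact read/write pattern that the fix changed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write exactly sizeof(uint16_t) bytes. Returns 0 on success, -1 on error. */
static int send_u16(int fd, uint16_t v)
{
    return write(fd, &v, sizeof(v)) == (ssize_t)sizeof(v) ? 0 : -1;
}

/* Read exactly sizeof(uint16_t) bytes into a uint16_t. */
static int recv_u16(int fd, uint16_t *out)
{
    return read(fd, out, sizeof(*out)) == (ssize_t)sizeof(*out) ? 0 : -1;
}

/*
 * Round trip through a pipe with matching widths on both ends.
 * Returns the value read back, or -1 on failure.
 */
static int roundtrip_u16(uint16_t v)
{
    int fds[2];
    uint16_t got = 0;
    int rc = -1;

    if (pipe(fds) != 0)
        return -1;
    if (send_u16(fds[1], v) == 0 && recv_u16(fds[0], &got) == 0)
        rc = got;
    close(fds[0]);
    close(fds[1]);
    return rc;
}

/*
 * Hypothetical mismatch: the two bytes of a uint16_t are stored at the
 * start of a zeroed int. On a little-endian host the int equals v; on a
 * big-endian host (such as ppc64) the bytes occupy the high half, so the
 * int holds v shifted left by 16 bits (assuming a 4-byte int).
 */
static int mismatched_read(uint16_t v)
{
    int wide = 0;

    memcpy(&wide, &v, sizeof(v));
    return wide;
}

#ifdef PIPE_DEMO_MAIN /* build with -DPIPE_DEMO_MAIN to run the demo */
int main(void)
{
    uint16_t version = 0x2100; /* arbitrary sample value */

    assert(roundtrip_u16(version) == version);
    printf("matched: 0x%x, mismatched: 0x%x\n",
           (unsigned)roundtrip_u16(version),
           (unsigned)mismatched_read(version));
    return 0;
}
#endif
```

Keeping the transfer size tied to `sizeof` of the actual field type on both ends of the pipe avoids any dependence on host byte order.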
    • Fix checkpointing requeued jobs in a bad state · f9f395af
      Brian Christiansen authored
      Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
      in. If job_set_alloc_tres gets called while the job is in the
      PENDING|COMPLETING state, the job's tres_alloc_str will be freed. If
      the job then gets checkpointed in this state (PENDING|COMPLETING + no
      tres_alloc_str), the controller would crash on startup because it
      expected the job to have a tres_alloc_str/cnt when in the COMPLETING
      state. This could be triggered by starting the controller without the
      dbd up: when the dbd comes up, the assoc_cache_mgr calls
      _update_job_tres(), which calls job_set_alloc_tres. It could also be
      triggered by adding new TRES.
      
      This most likely started happening in 17.11.5 because of commit
      865b672f which introduced calling _update_job_tres() on each job
      after the dbd comes up.
      
      Bugs 5137,4522
  3. 04 May, 2018 2 commits
  4. 03 May, 2018 6 commits
  5. 02 May, 2018 6 commits
  6. 01 May, 2018 2 commits
  7. 30 Apr, 2018 5 commits
    • Remove _task_sleep() from slurm_jobacct_gather.c. · 3be9e1ee
      Tim Wickberg authored
      The use in _watch_tasks() needs to be removed, as the switch from
      pthread_cancel() to pthread_cond_signal() means the sleep will not get
      interrupted and would keep the step alive for at least a second,
      potentially harming throughput.
      Since the call to _poll_data() happens after the first timer expires,
      this delay turns out to be unnecessary, so we won't be replacing it with
      a pthread_cond_timedwait() construct.
      
      The use in jobacct_gather_stat_task() is unnecessary, since the two
      locations where this can happen take place after _fork_all_tasks() has
      set up the tasks, so the delay should not be necessary.
      
      Bug 5103.
    • Remove unsafe use of pthread_cancel() in slurmstepd. · a7c8964e
      Tim Wickberg authored
      These functions are not async-cancel-safe, and cannot safely be cancelled.
      This leads to potential deadlock, either between our own locks or deep
      inside glibc if the thread held a malloc arena lock when it was cancelled.
      
      Replace with pthread_cond_signal() on the appropriate condition variable
      to wake threads up at the appropriate time instead.
      
      Bug 5103.
    • Make a global for each of the accounting gather profile timers. · 1675ada0
      Danny Auble authored
      This will make it easier in a future commit to avoid the
      async pthread_cancel.
      
      Bug 5103
    • doc/html/faq - add missing braces to example. · fd9b143a
      Alejandro Sanchez authored
      Bug 5110.
    • Testsuite - fix issues with test1.103 when MaxTime is set on the partition. · 3138a98e
      Marshall Garey authored
      Remove partition MaxTime limit at the beginning of the test,
      run the rest of the test, then restore the partition configuration
      with scontrol reconfigure.
      
      Bug 4994.