1. 09 May, 2018 18 commits
  2. 08 May, 2018 3 commits
    • Prevent slurmd from launching steps if prolog fail · 3b029021
      Brian Christiansen authored
      Bug 5146
    • Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
      Tim Wickberg authored
      Caused by a corrupted protocol_version field value being received
      by the slurmstepd, as we cannot safely write/read a uint16_t across
      the pipe as if it was an int.
      
      Regression caused by commit 90b116c2.
      
      Bug 5133.
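
The hazard described in this commit message can be illustrated with a minimal sketch. This is not the code from the commit; the helper names (`write_all`, `read_all`, `roundtrip_u16`, `first_bytes_of_int`) are hypothetical, and a 32-bit `int` is assumed. The point is that both ends of a pipe must agree on the exact type and size being transferred.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write exactly n bytes, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t n)
{
	const char *p = buf;

	while (n > 0) {
		ssize_t w = write(fd, p, n);
		if (w <= 0)
			return -1;	/* a fuller version would retry on EINTR */
		p += w;
		n -= (size_t)w;
	}
	return 0;
}

/* Hypothetical helper: read exactly n bytes, retrying on short reads. */
static int read_all(int fd, void *buf, size_t n)
{
	char *p = buf;

	while (n > 0) {
		ssize_t r = read(fd, p, n);
		if (r <= 0)
			return -1;
		p += r;
		n -= (size_t)r;
	}
	return 0;
}

/* Safe pattern: writer and reader both use sizeof(uint16_t). */
int roundtrip_u16(uint16_t value, uint16_t *out)
{
	int fds[2];
	int rc;

	if (pipe(fds) != 0)
		return -1;
	rc = write_all(fds[1], &value, sizeof(value));
	if (rc == 0)
		rc = read_all(fds[0], out, sizeof(*out));
	close(fds[0]);
	close(fds[1]);
	return rc;
}

/*
 * Unsafe pattern: treat the first sizeof(uint16_t) bytes of an int as a
 * uint16_t. On little-endian hosts this yields the low 16 bits; on
 * big-endian hosts it yields the high 16 bits instead, which is 0 for
 * small values.
 */
uint16_t first_bytes_of_int(int value)
{
	uint16_t v;

	memcpy(&v, &value, sizeof(v));
	return v;
}
```

A mismatch of this kind tends to go unnoticed on little-endian machines, which is consistent with the problem being reported on ppc64.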
    • Fix checkpointing requeued jobs in a bad state · f9f395af
      Brian Christiansen authored
      Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
      in. The issue is that if job_set_alloc_tres gets called while in the
      PENDING|COMPLETING state, the job's tres_alloc_str will be freed. If
      this job then gets checkpointed in this state (PENDING|COMPLETING + no
      tres_alloc_str) on startup the controller would crash because it
      expected the job to have a tres_alloc_str/cnt when in the COMPLETING
      state. This could be triggered if starting the controller without the
      dbd up. When the dbd comes up, the assoc_cache_mgr calls
      _update_job_tres() which calls job_set_alloc_tres. It could also be
      triggered by adding new tres.
      
      This most likely started happening in 17.11.5 because of commit
      865b672f which introduced calling _update_job_tres() on each job
      after the dbd comes up.
      
      Bugs 5137,4522
  3. 04 May, 2018 2 commits
  4. 03 May, 2018 6 commits
  5. 02 May, 2018 6 commits
  6. 01 May, 2018 2 commits
  7. 30 Apr, 2018 3 commits
    • Remove _task_sleep() from slurm_jobacct_gather.c. · 3be9e1ee
      Tim Wickberg authored
      The use in _watch_tasks needs to be removed, as the switch from
      pthread_cancel to pthread_cond_signal means this will not get interrupted
      and would keep the step alive for at least a second, potentially harming
      throughput. Since the call to _poll_data() happens after the first timer
      expires, this delay turns out to be unnecessary, so we won't be replacing
      it with a pthread_cond_timedwait() construct.
      
      The use in jobacct_gather_stat_task() is unnecessary since the two
      locations where this can happen take place after _fork_all_tasks() has
      set up the tasks, thus the delay should not be necessary.
      
      Bug 5103.
    • Remove unsafe use of pthread_cancel() in slurmstepd. · a7c8964e
      Tim Wickberg authored
      These functions are not async-cancel-safe, and cannot safely be cancelled.
      This leads to potential deadlock, either between our own locks, or deep
      inside glibc when the thread held a malloc arena lock when canceled.
      
      Replace with pthread_cond_signal() on the appropriate cond to
      wake threads up at the appropriate time instead.
      
      Bug 5103.
    • Make a global for each of the accounting gather profile timers. · 1675ada0
      Danny Auble authored
      This will make it easier in a future commit to avoid the
      async pthread_cancel.
      
      Bug 5103