1. 29 Jan, 2018 2 commits
  2. 25 Jan, 2018 3 commits
  3. 24 Jan, 2018 2 commits
  4. 23 Jan, 2018 1 commit
    • task/cgroup - add support to detect OOM_KILL cgroup events. · 943c4a13
      Alejandro Sanchez authored
      Commit 818a09e8 introduced a new state JOB_OOM and a new state reason
      FAIL_OOM (OutOfMemory). The problem was that it based the decision upon
      the value of the different memory.[*].failcnt being > 0.
      
      That led to "false positive" situations in which the usage hit the limit
      but the kernel was able to reclaim pages and the process managed to finish
      successfully. When this happens, there might not necessarily be any
      OOM_KILL events.
      
      This patch makes it so the JOB_OOM state is set based upon OOM_KILL events
      detected instead of usage hitting the limit. The usage hit will still
      be logged as an info() message, and further work will be needed in the
      master branch to better discern both types of events, maybe changing
      the API and getting rid of the current SIG_OOM and a potential new
      SIG_OOM_KILL.
      
      The OOM_KILL event is detected using the eventfd notification mechanism
      on the cgroup v1 control/event files:
      https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
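
      The registration steps described in the kernel document above can be
      sketched as follows. This is a minimal illustration, not the code in this
      commit: it assumes the notification is registered on memory.oom_control,
      the helper names are hypothetical, and os.eventfd needs Python 3.10+ on
      Linux.

```python
import os
import struct


def event_control_line(event_fd: int, target_fd: int) -> str:
    """Build the line written to cgroup.event_control: "<event_fd> <target_fd>"."""
    return f"{event_fd} {target_fd}"


def decode_eventfd_counter(raw: bytes) -> int:
    """A read on an eventfd returns an 8-byte unsigned counter in host byte order."""
    return struct.unpack("=Q", raw)[0]


def wait_for_oom(cgroup_dir: str) -> int:
    """Block until the kernel signals an OOM event for a cgroup v1 memory cgroup.

    cgroup_dir is a hypothetical path such as a job's directory under the
    memory controller mount. Returns the accumulated notification count.
    """
    efd = os.eventfd(0)  # Python 3.10+, Linux only
    oom_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    ctl_fd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
    try:
        # Register: the kernel will signal efd when an OOM happens in this cgroup.
        os.write(ctl_fd, event_control_line(efd, oom_fd).encode())
        # Blocks until at least one notification has been delivered.
        return decode_eventfd_counter(os.read(efd, 8))
    finally:
        for fd in (ctl_fd, oom_fd, efd):
            os.close(fd)
```

      A real daemon would poll the eventfd alongside its other descriptors
      rather than block in a dedicated read.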
      
      If we plan to support cgroup v2, we should monitor the 'memory.events' file
      for modification events. A notification would mean that at least one of the
      available entries changed its value.
      Entries include: low, high, max, oom, oom_kill:
      https://www.kernel.org/doc/Documentation/cgroup-v2.txt
      https://patchwork.kernel.org/patch/9737381
      However, since this is a fairly recent change, many sites might be running
      kernels that do not yet support this feature.
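
      To illustrate why the oom_kill entry matters, here is a small sketch that
      parses two snapshots of a cgroup v2 memory.events file and reports how
      many new OOM kills occurred between them. The function names and sample
      values are hypothetical; only the "<key> <value>" line layout and the
      entry names come from the documentation linked above.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the flat "<key> <value>" lines of a cgroup v2 memory.events file."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def new_oom_kills(before: dict[str, int], after: dict[str, int]) -> int:
    """Number of OOM kills that happened between two snapshots."""
    return after.get("oom_kill", 0) - before.get("oom_kill", 0)


# Hypothetical snapshots: usage hit the limit ("max" grew) but nothing was killed.
snapshot_before = parse_memory_events("low 0\nhigh 0\nmax 2\noom 0\noom_kill 0\n")
snapshot_after = parse_memory_events("low 0\nhigh 0\nmax 5\noom 0\noom_kill 0\n")
limit_was_hit = snapshot_after["max"] > snapshot_before["max"]
job_was_oom_killed = new_oom_kills(snapshot_before, snapshot_after) > 0
```

      In this sample the limit was hit but no process was killed, which is the
      "false positive" case the patch stops reporting as JOB_OOM.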
      
      Bug 3820.
  5. 22 Jan, 2018 4 commits
    • Revert "Fix uid check when requesting a jobid from a pid." · 5b1d77fb
      Danny Auble authored
      This reverts commit d3141dc9.
      
      Bug 4655
      
      It turns out there are many ways to get this information directly from
      the slurmstepd. Since you can already get this information from ps, we
      decided to just revert to the old non-authenticated way of doing
      things.
      
      If we do need this in the future, we need to patch the stepd as well as
      the slurmd here in all the RPCs that try to grab this.
      
      A user could easily run scontrol (or their own home-baked thing)
      on the node, which will give them direct contact with the slurmstepd.
    • Revert "Revert "Fix uid check when requesting a jobid from a pid."" · 4a0f4796
      Danny Auble authored
      This reverts commit c4fb9bc3.
    • Revert "Fix uid check when requesting a jobid from a pid." · c4fb9bc3
      Danny Auble authored
      This reverts commit d3141dc9.
      
      Bug 4655
      
      It turns out there are many ways to get this information directly from
      the slurmstepd. Since you can already get this information from ps, we
      decided to just revert to the old non-authenticated way of doing
      things.
      
      If we do need this in the future, we need to patch the stepd as well as
      the slurmd here in all the RPCs that try to grab this.
      
      A user could easily run scontrol (or their own home-baked thing)
      on the node, which will give them direct contact with the slurmstepd.
    • Fix issues when starting the backup slurmdbd. · 8bb58a31
      Danny Auble authored
      Bug 4656
  6. 19 Jan, 2018 1 commit
  7. 18 Jan, 2018 6 commits
  8. 16 Jan, 2018 3 commits
  9. 12 Jan, 2018 6 commits
  10. 11 Jan, 2018 4 commits
  11. 10 Jan, 2018 3 commits
  12. 08 Jan, 2018 3 commits
    • Dominik Bartkiewicz's avatar
    • Improve logic when summarizing job arrays mail notifications. · 5f1cc8a8
      Alejandro Sanchez authored
      When the --mail-type option isn't requested with ARRAY_TASKS, we need to
      somehow summarize the different states in which the tasks of the array
      finished. We've added a new ARRAY_TASK_REQUEUED flag to the array_flags to
      indicate that at least one task was requeued. The logic now also detects
      whether at least one task failed or whether all tasks finished successfully.
      
      The patch also removes the RunTime from the e-mail when summarizing the
      whole array, since it doesn't make sense to report the RunTime
      of just one of the tasks in this case.
      
      The patch also fixes the case when ARRAY_TASKS is specified: previously the
      mail notification for the master job task record included a range of
      ExitCodes for all the tasks. Since this option is not for summarizing,
      the patch makes it so the range is shown only when the option isn't
      specified.
      
      Bug 4539.
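
      The summarizing idea described in this message can be sketched roughly as
      below. This is an illustration only and not Slurm's implementation: the
      function name, the state strings, and the returned keys are hypothetical
      stand-ins for the array_flags logic.

```python
def summarize_array(task_states: list[str]) -> dict[str, bool]:
    """Collapse per-task end states into whole-array summary flags.

    The state names used here ("COMPLETED", "FAILED", "REQUEUED") are
    placeholders chosen for the example.
    """
    return {
        # Mirrors the idea of ARRAY_TASK_REQUEUED: at least one task was requeued.
        "any_requeued": any(state == "REQUEUED" for state in task_states),
        # At least one task failed.
        "any_failed": any(state == "FAILED" for state in task_states),
        # Every task finished successfully (false for an empty array).
        "all_completed": bool(task_states)
        and all(state == "COMPLETED" for state in task_states),
    }
```

      A summary mail could then be worded from these flags instead of from the
      state of a single task.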
    • Fix for changing node features · 09e8b368
      Morris Jette authored
      Scheduling fix for changing node features without any NodeFeatures plugins.
      Bug 4577
  13. 05 Jan, 2018 2 commits