1. 13 Feb, 2018 1 commit
  2. 12 Feb, 2018 1 commit
  3. 08 Feb, 2018 1 commit
  4. 07 Feb, 2018 9 commits
  5. 06 Feb, 2018 3 commits
  6. 05 Feb, 2018 1 commit
  7. 01 Feb, 2018 3 commits
  8. 30 Jan, 2018 9 commits
  9. 29 Jan, 2018 3 commits
  10. 25 Jan, 2018 3 commits
  11. 24 Jan, 2018 2 commits
  12. 23 Jan, 2018 1 commit
    • task/cgroup - add support to detect OOM_KILL cgroup events. · 943c4a13
      Alejandro Sanchez authored
      Commit 818a09e8 introduced a new state JOB_OOM and a new state reason
      FAIL_OOM (OutOfMemory). The problem was that it based the decision upon
      the value of the different memory.[*].failcnt being > 0.
      
      That led to "false positive" situations when the usage hit the limit
      but the kernel was able to reclaim pages and the process managed to finish
      successfully. When this happens there might not necessarily be any
      OOM_KILL events.
      
      This patch makes it so the JOB_OOM state is set based upon OOM_KILL events
      detected instead of usage hitting the limit. The usage hit will still
      be logged as an info() message, and further work will be needed in the
      master branch to better discern both types of events, maybe changing
      the API and getting rid of the current SIG_OOM and a potential new
      SIG_OOM_KILL.
      
      OOM_KILL event is detected using the eventfd notification mechanism
      on the cgroup v1 control/event files:
      https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
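
      The registration step described in the cgroup v1 documentation can be
      sketched roughly as follows. This is a minimal illustration, not the
      code from this commit: the names oom_listener,
      format_event_registration, register_oom_listener and wait_for_oom_event
      are hypothetical, and the sketch assumes the notification is armed on
      memory.oom_control through cgroup.event_control.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical container for the descriptors kept open while listening. */
struct oom_listener {
    int event_fd;   /* eventfd signalled by the kernel */
    int control_fd; /* open handle on memory.oom_control */
};

/* Build the "<event_fd> <control_fd>" line written to cgroup.event_control.
 * Returns the string length, or -1 if the buffer is too small. */
int format_event_registration(char *buf, size_t len, int event_fd,
                              int control_fd)
{
    int n = snprintf(buf, len, "%d %d", event_fd, control_fd);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Arm an OOM notification for the cgroup v1 memory cgroup at cgroup_dir.
 * Returns 0 on success and fills *out, or -1 on any failure. */
int register_oom_listener(const char *cgroup_dir, struct oom_listener *out)
{
    char path[4096];
    char line[64];
    int control_fd = -1, event_ctl_fd = -1, event_fd = -1;
    int len;

    snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
    control_fd = open(path, O_RDONLY);
    if (control_fd < 0)
        goto fail;

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
    event_ctl_fd = open(path, O_WRONLY);
    if (event_ctl_fd < 0)
        goto fail;

    event_fd = eventfd(0, 0);
    if (event_fd < 0)
        goto fail;

    len = format_event_registration(line, sizeof(line), event_fd, control_fd);
    if (len < 0 || write(event_ctl_fd, line, (size_t)len) != (ssize_t)len)
        goto fail;

    close(event_ctl_fd); /* only needed for the registration write */
    out->event_fd = event_fd;
    out->control_fd = control_fd;
    return 0;

fail:
    if (event_fd >= 0)
        close(event_fd);
    if (event_ctl_fd >= 0)
        close(event_ctl_fd);
    if (control_fd >= 0)
        close(control_fd);
    return -1;
}

/* Block until the kernel signals the eventfd; returns the number of
 * notifications accumulated since the last read, or -1 on error. */
int64_t wait_for_oom_event(int event_fd)
{
    uint64_t count = 0;
    if (read(event_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

      A caller would typically run wait_for_oom_event() in a dedicated thread
      and record the result when the step ends, so that the blocking read does
      not stall the rest of the daemon.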
      
      If we plan to support cgroup v2, we should monitor the 'memory.events'
      file for modification events. A notification means that at least one of
      the available entries changed its value.
      Entries include: low, high, max, oom, oom_kill:
      https://www.kernel.org/doc/Documentation/cgroup-v2.txt
      https://patchwork.kernel.org/patch/9737381
      However, since this is a fairly recent change, many sites might still be
      running kernels that do not support this feature.
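
      For the cgroup v2 case, a modification notification only says that some
      entry changed, so the listener would have to re-read the file and compare
      counters. A rough sketch of that comparison, assuming the file holds one
      "name value" pair per line as in the cgroup v2 documentation (the helper
      names memory_events_lookup and oom_kill_increased are hypothetical):

```c
#include <stdio.h>
#include <string.h>

/* Look up one counter (e.g. "oom_kill") in the text of a memory.events file.
 * Returns 0 and stores the counter in *value, or -1 if the key is absent. */
int memory_events_lookup(const char *text, const char *key,
                         unsigned long long *value)
{
    const char *p = text;

    while (p && *p) {
        char name[32];
        unsigned long long v;

        /* Each line has the form "<name> <count>"; compare the full name so
         * that "oom" does not match "oom_kill". */
        if (sscanf(p, "%31s %llu", name, &v) == 2 && strcmp(name, key) == 0) {
            *value = v;
            return 0;
        }
        p = strchr(p, '\n');
        if (p)
            p++; /* step past the newline to the next line */
    }
    return -1;
}

/* Compare two snapshots of memory.events. Returns 1 if oom_kill grew,
 * 0 if it did not, and -1 if either snapshot lacks the entry. */
int oom_kill_increased(const char *before, const char *after)
{
    unsigned long long old_count, new_count;

    if (memory_events_lookup(before, "oom_kill", &old_count) < 0 ||
        memory_events_lookup(after, "oom_kill", &new_count) < 0)
        return -1;
    return new_count > old_count;
}
```

      Returning -1 when the oom_kill entry is missing lets a caller treat a
      kernel without this counter separately from a job that was simply never
      killed.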
      
      Bug 3820.
  13. 22 Jan, 2018 3 commits
    • Revert "Fix uid check when requesting a jobid from a pid." · 5b1d77fb
      Danny Auble authored
      This reverts commit d3141dc9.
      
      Bug 4655
      
      It turns out there are many ways to get this information directly from
      the slurmstepd.  As you can already get this information from ps, we
      decided to just revert to the old non-authenticated way of doing
      things.
      
      If we do need this in the future, we need to patch the stepd as well as
      the slurmd here in all the RPCs that try to grab this.
      
      A user could easily run scontrol (or their own home-baked tool)
      on the node, which gives them direct contact with the slurmstepd.
    • Revert "Revert "Fix uid check when requesting a jobid from a pid."" · 4a0f4796
      Danny Auble authored
      This reverts commit c4fb9bc3.
    • Revert "Fix uid check when requesting a jobid from a pid." · c4fb9bc3
      Danny Auble authored
      This reverts commit d3141dc9.
      
      Bug 4655
      
      It turns out there are many ways to get this information directly from
      the slurmstepd.  As you can already get this information from ps, we
      decided to just revert to the old non-authenticated way of doing
      things.
      
      If we do need this in the future, we need to patch the stepd as well as
      the slurmd here in all the RPCs that try to grab this.
      
      A user could easily run scontrol (or their own home-baked tool)
      on the node, which gives them direct contact with the slurmstepd.