- 23 Jan, 2018 7 commits
-
-
Morris Jette authored
Much of the logic does work, but it is better to prevent users from trying this and failing in some way.
-
Morris Jette authored
-
Morris Jette authored
-
Alejandro Sanchez authored
Reported by Jenkins after 17.11 merge 93e67a13.
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
Commit 818a09e8 introduced a new state JOB_OOM and a new state reason FAIL_OOM (OutOfMemory). The problem was that it based the decision on the value of the different memory.[*].failcnt being > 0. That led to "false positive" situations where the usage hit the limit but the kernel was able to reclaim pages and the process managed to finish successfully. When this happens, there are not necessarily any OOM_KILL events.

This patch makes it so the JOB_OOM state is set based on detected OOM_KILL events instead of usage hitting the limit. The usage hit will still be logged as an info() message, and further work will be needed in the master branch to better discern both types of events, maybe changing the API and getting rid of the current SIG_OOM and a potential new SIG_OOM_KILL.

The OOM_KILL event is detected using the eventfd notification mechanism on the cgroup v1 control/event files: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If we plan to support cgroup v2, we should monitor 'memory.events' file modified events. A notification would mean that any of the available entries changed its value. Entries include low, high, max, oom and oom_kill: https://www.kernel.org/doc/Documentation/cgroup-v2.txt https://patchwork.kernel.org/patch/9737381 Since this is a fairly recent change, many sites might still be running kernels that do not support this feature.

Bug 3820.
-
Brian Christiansen authored
-
- 22 Jan, 2018 7 commits
-
-
Danny Auble authored
-
Morris Jette authored
Correct SLURM_NTASKS and SLURM_NPROCS environment variables for heterogeneous job step. Report values representing full allocation.
-
Alejandro Sanchez authored
Bug 4656.
-
Danny Auble authored
This reverts commit d3141dc9. Bug 4655. It turns out there are many ways to get this information directly from the slurmstepd. Since this information is already available from ps, we decided to revert to the old non-authenticated way of doing things. If we do need this in the future, we need to patch the stepd as well as the slurmd in all the RPCs that try to grab this. A user could easily run scontrol (or their own home-baked tool) on the node, which would give them direct contact with the slurmstepd.
-
Danny Auble authored
This reverts commit c4fb9bc3.
-
Danny Auble authored
This reverts commit d3141dc9. Bug 4655. It turns out there are many ways to get this information directly from the slurmstepd. Since this information is already available from ps, we decided to revert to the old non-authenticated way of doing things. If we do need this in the future, we need to patch the stepd as well as the slurmd in all the RPCs that try to grab this. A user could easily run scontrol (or their own home-baked tool) on the node, which would give them direct contact with the slurmstepd.
-
Danny Auble authored
Bug 4656.
-
- 19 Jan, 2018 26 commits
-
-
Morris Jette authored
Introduced in commit f2efbb60. Coverity CID 182262, 182263 and 182264.
-
Morris Jette authored
Bug 4607.
-
Brian Christiansen authored
From 5eff1498
-
Alejandro Sanchez authored
-
Danny Auble authored
-
Alejandro Sanchez authored
Bugs 2693 and 3795.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
This reverts commit 33596bc782f254ae56ab0f61fd736308b6980edf. On an allocation error for the new size, slurm_xrealloc internally does: log_oom(file, line, func); abort(); so there is no need to handle it manually. Bug 2693.
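The reasoning in this revert can be illustrated with a small abort-on-failure wrapper. This is only a sketch under stated assumptions: `demo_xrealloc` and `DEMO_XREALLOC` are hypothetical stand-ins, not Slurm's slurm_xrealloc, and the fprintf() call merely takes the place of the log_oom() call quoted in the message.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical abort-on-failure realloc wrapper. It never returns
 * NULL for a non-zero size, so callers need no error branch.
 */
static void *demo_xrealloc(void *ptr, size_t size,
			   const char *file, int line, const char *func)
{
	void *p = realloc(ptr, size);

	if (!p && size) {
		/* Stand-in for log_oom(file, line, func) */
		fprintf(stderr, "%s:%d %s: out of memory\n", file, line, func);
		abort();
	}
	return p;
}

/* Convenience macro that records the call site and updates the pointer. */
#define DEMO_XREALLOC(p, sz) \
	((p) = demo_xrealloc((p), (sz), __FILE__, __LINE__, __func__))
```

With a wrapper of this kind, a check such as `if (!ptr) return error;` after the call is dead code, which is the reason the manual handling could be dropped.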
-
Alejandro Sanchez authored
Using curl_easy_getinfo with CURLINFO_RESPONSE_CODE is a cleaner and safer way to retrieve the response code, since we don't have to manually parse the header with strtok. This allows disabling the CURLOPT_HEADER option, so chunk.message will directly contain the JSON response body without the header. The body can also be logged directly if more verbosity is needed to know why a non-2xx response code was received (database not found, ...). Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
- Move var declarations to beginning.
- Add some goto cleanup* statements.
Bug 2693.
-
Alejandro Sanchez authored
==6281== 36 bytes in 2 blocks are definitely lost in loss record 123 of 292
==6281==    at 0x4C2FB45: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6281==    by 0x502B72F: slurm_xmalloc (xmalloc.c:84)
==6281==    by 0x502CA2A: xstrdup (xstring.c:350)
==6281==    by 0x7F9552C: ???
==6281==    by 0x4F3419D: acct_gather_profile_g_create_dataset (slurm_acct_gather_profile.c:648)
...
==6281== 160 bytes in 2 blocks are definitely lost in loss record 234 of 292
==6281==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6281==    by 0x502B8AB: slurm_xrealloc (xmalloc.c:143)
==6281==    by 0x7F955C8: ???
==6281==    by 0x4F3419D: acct_gather_profile_g_create_dataset (slurm_acct_gather_profile.c:648)
...
==6281== 552 (160 direct, 392 indirect) bytes in 2 blocks are definitely lost in loss record 257 of 292
==6281==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6281==    by 0x502B8AB: slurm_xrealloc (xmalloc.c:143)
==6281==    by 0x7F95581: ???
==6281==    by 0x4F3419D: acct_gather_profile_g_create_dataset (slurm_acct_gather_profile.c:648)
Bug 2693.
-
Alejandro Sanchez authored
This option will define the database retention policy that should be used when writing data to influxdb. Bug 2693.
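As a rough illustration of where a retention policy ends up, the InfluxDB 1.x HTTP write endpoint accepts it as the `rp` query parameter next to `db`. The sketch below only shows that URL construction; `build_write_url` and its arguments are hypothetical names, the plugin's actual code may build the request differently, and no URL escaping is performed.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build an InfluxDB 1.x write URL of the form
 *   <host>/write?db=<db>[&rp=<rp>]
 * The rp parameter is omitted when no retention policy is given, in
 * which case the server uses the database's default policy.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int build_write_url(char *buf, size_t size, const char *host,
			   const char *db, const char *rp)
{
	int n;

	if (rp && rp[0])
		n = snprintf(buf, size, "%s/write?db=%s&rp=%s", host, db, rp);
	else
		n = snprintf(buf, size, "%s/write?db=%s", host, db);

	return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```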
-
Alejandro Sanchez authored
Previously the function was called just once at init() and its result was stored. Since the value can be changed dynamically with 'scontrol setdebugflags', we should not reuse the initial value. Bug 2693.
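The difference between the two behaviours can be shown with a small self-contained sketch. Everything here is hypothetical: `demo_config_flags` stands in for the daemon's live configuration, `demo_get_debug_flags` for the accessor the message refers to, and `DEMO_FLAG_PROFILE` for whichever debug flag the plugin checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FLAG_PROFILE 0x1ULL	/* hypothetical debug flag bit */

/* Stands in for the live configuration, changeable at runtime. */
static uint64_t demo_config_flags;

/* Hypothetical accessor returning the current debug flags. */
static uint64_t demo_get_debug_flags(void)
{
	return demo_config_flags;
}

/* Old behaviour: value captured once at init time and reused. */
static uint64_t cached_flags;

static void demo_init(void)
{
	cached_flags = demo_get_debug_flags();
}

static bool log_enabled_cached(void)
{
	return (cached_flags & DEMO_FLAG_PROFILE) != 0;
}

/* New behaviour: query the flags on every check. */
static bool log_enabled_live(void)
{
	return (demo_get_debug_flags() & DEMO_FLAG_PROFILE) != 0;
}
```

If the flag is turned on after `demo_init()` has run, `log_enabled_cached()` keeps reporting false while `log_enabled_live()` picks up the change, which mirrors the problem this commit addresses.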
-
Alejandro Sanchez authored
Bug 2693.
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
Original source: https://github.com/cfenoy/influxdb-slurm-monitoring
-
Morris Jette authored
Bug 4619.
-