- 23 Jan, 2018 2 commits
-
-
Alejandro Sanchez authored
Commit 818a09e8 introduced a new state JOB_OOM and a new state reason FAIL_OOM (OutOfMemory). The problem was that it based the decision on the value of the different memory.[*].failcnt counters being > 0. That led to "false positive" situations when the usage hit the limit but the kernel was able to reclaim pages and the process managed to finish successfully. When this happens, there might not necessarily be any OOM_KILL events. This patch sets the JOB_OOM state based on detected OOM_KILL events instead of on usage hitting the limit. The usage hit will still be logged as an info() message, and further work will be needed in the master branch to better distinguish both types of events, maybe changing the API, getting rid of the current SIG_OOM and potentially adding a new SIG_OOM_KILL. The OOM_KILL event is detected using the eventfd notification mechanism on the cgroup v1 control/event files: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt If we plan to support cgroup v2, we should monitor file-modified events on 'memory.events'. A notification would mean that one of the available entries changed its value. Entries include: low, high, max, oom, oom_kill: https://www.kernel.org/doc/Documentation/cgroup-v2.txt https://patchwork.kernel.org/patch/9737381 but since this is a fairly recent change, many sites might still be running kernels that do not support this feature. Bug 3820.
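The distinction the message draws (usage hitting the limit versus an actual OOM kill) can be sketched for the cgroup v2 case it mentions. This is only an illustration, not the code from this commit, which uses eventfd on cgroup v1. The helper name `parse_oom_kill_count` is hypothetical, and the sketch assumes the file holds one `<key> <value>` pair per line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Slurm): return the oom_kill counter
 * found in the text of a cgroup v2 memory.events file, or -1 if the key
 * is absent, as it would be on kernels that predate the oom_kill entry.
 */
long parse_oom_kill_count(const char *events_text)
{
	const char *line = events_text;

	assert(events_text != NULL);
	while (*line) {
		char key[32];
		long value;

		/* Compare the whole key so "oom" is not taken for "oom_kill". */
		if (sscanf(line, "%31s %ld", key, &value) == 2 &&
		    strcmp(key, "oom_kill") == 0)
			return value;

		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return -1;
}
```

Under this assumption, a caller would flag a job as OOM only when this counter increases between two reads, while a change in `max` alone would merely be logged.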
-
Brian Christiansen authored
-
- 22 Jan, 2018 6 commits
-
-
Danny Auble authored
-
Alejandro Sanchez authored
Bug 4656.
-
Danny Auble authored
This reverts commit d3141dc9. Bug 4655. It turns out there are many ways to get this information directly from the slurmstepd. Since you can already get this information from ps, we decided to revert to the old non-authenticated way of doing things. If we do need this in the future, we need to patch the stepd as well as the slurmd in all the RPCs that try to grab this. A user could easily run scontrol (or their own home-baked thing) on the node, which would give them direct contact with the slurmstepd.
-
Danny Auble authored
This reverts commit c4fb9bc3.
-
Danny Auble authored
This reverts commit d3141dc9. Bug 4655. It turns out there are many ways to get this information directly from the slurmstepd. Since you can already get this information from ps, we decided to revert to the old non-authenticated way of doing things. If we do need this in the future, we need to patch the stepd as well as the slurmd in all the RPCs that try to grab this. A user could easily run scontrol (or their own home-baked thing) on the node, which would give them direct contact with the slurmstepd.
-
Danny Auble authored
Bug 4656
-
- 19 Jan, 2018 1 commit
-
-
Morris Jette authored
Bug 4619.
-
- 18 Jan, 2018 7 commits
-
-
Morris Jette authored
switch count. Bug 4381
-
Felip Moll authored
Bug 4620
-
Danny Auble authored
Bug 4620
-
Danny Auble authored
find jobs that ran on specific nodes. Bug 4602
-
Danny Auble authored
needed if you are ever working with multi-dimensional systems. Bug 4602
-
Danny Auble authored
This isn't a real problem, but older compilers will complain about it. Newer compilers know that, to reach the place where it could be used, 'have_count' would have to be set, and if that is set then feature_list would also be set.
-
Danny Auble authored
itself. Bug 4638
-
- 17 Jan, 2018 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
Test 3.11, file inc3.11.9, only runs for some configurations, but it assumed no leading zeros in the node name suffix. When run with nodes named "nid[00001-00005]", the test converted the last number to its numeric form and made requests for "nid[00001-5]", which would fail.
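The failure described above comes from dropping the zero padding when a suffix is converted to a number. Here is a minimal C sketch of one way to keep the padding. The helper `format_node_range` is hypothetical and is not how the test itself was fixed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the test suite): build "prefix[first-last]"
 * while keeping the zero padding of the original suffix strings, for
 * example "00001" and "00005".
 */
int format_node_range(char *out, size_t out_len, const char *prefix,
		      const char *first_suffix, const char *last_suffix)
{
	int width = (int) strlen(first_suffix);
	long first = strtol(first_suffix, NULL, 10);
	long last = strtol(last_suffix, NULL, 10);

	assert(out != NULL && out_len > 0);
	/* "%0*ld" pads with zeros up to the width of the original suffix. */
	return snprintf(out, out_len, "%s[%0*ld-%0*ld]", prefix,
			width, first, width, last);
}
```

Called with the prefix "nid" and the suffixes "00001" and "00005", this yields "nid[00001-00005]" rather than "nid[00001-5]".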
-
Morris Jette authored
-
- 16 Jan, 2018 5 commits
-
-
Morris Jette authored
Fix output file containing "%t" (task ID) for heterogeneous job step to be based upon global task ID rather than task ID for that component of the heterogeneous job step.
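To illustrate the difference between the two task IDs, the sketch below assumes global IDs are assigned by numbering the tasks of each component consecutively, in component order. That numbering scheme and the helper `global_task_id` are assumptions made for illustration. They are not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: map a task ID local to one component of a
 * heterogeneous job step to a global task ID, ASSUMING components are
 * numbered consecutively in order. task_counts[i] is the number of tasks
 * in component i.
 */
int global_task_id(const int *task_counts, int component, int local_id)
{
	int offset = 0;

	assert(task_counts != NULL && component >= 0);
	assert(local_id >= 0 && local_id < task_counts[component]);
	/* Sum the task counts of all earlier components. */
	for (int i = 0; i < component; i++)
		offset += task_counts[i];
	return offset + local_id;
}
```

With components of 2 and 3 tasks, local task 0 of the second component maps to global task 2. If the output file name were based on the local ID instead, that task and the first task of the first component would both expand "%t" to 0.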
-
Morris Jette authored
This expands some comments, explicitly sets some pointers to NULL after memcpy (these are redundant, but add clarity), and moves a memcpy to avoid modifying the wrong values.
-
Morris Jette authored
-
Danny Auble authored
-
Morris Jette authored
Fix for possible memory corruption in srun when running heterogeneous job steps. Bug 4626
-
- 12 Jan, 2018 9 commits
-
-
Dominik Bartkiewicz authored
This partially reverts 89bcd975 and aac6bd39. It turns out you can't use a list_for_each and lock something inside the list_for_each function that does a lock without the write lock. Bug 4611
-
Danny Auble authored
This reverts commit ff3e77f4.
-
Danny Auble authored
This partially reverts 89bcd975 and aac6bd39. It turns out you can't use a list_for_each and lock something inside the list_for_each function that does a lock without the write lock. Bug 4611
-
Tim Wickberg authored
Otherwise undocumented, and does not do anything. Will be removed in 18.08.
-
Morris Jette authored
-
Morris Jette authored
-
Dominik Bartkiewicz authored
Fix job array dependency with the "aftercorr" option when some task array elements in the first job fail. This fix lets all task array elements that can run proceed rather than stopping all subsequent task array elements. Bug 4590
-
Felip Moll authored
Creating a copy of the actual environment in env->env defines a new pointer, so the next calls to setup_env and setenvf define variables in this new copy rather than in the global environment. Bug 4615
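The effect described here can be reproduced in isolation: a variable added to a copied environment array is not visible through the process environment. The helper `env_copy_with` below is hypothetical and only mirrors the general idea. It is not the Slurm code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a NULL-terminated copy of env with one extra
 * "NAME=value" entry appended. The entry pointers are shared with the
 * caller; only the returned array itself must be freed.
 */
char **env_copy_with(char **env, char *entry)
{
	size_t n = 0;
	char **copy;

	assert(env != NULL && entry != NULL);
	while (env[n])
		n++;
	copy = malloc((n + 2) * sizeof(char *));
	if (!copy)
		return NULL;
	memcpy(copy, env, n * sizeof(char *));
	copy[n] = entry;	/* lands in the copy only */
	copy[n + 1] = NULL;
	return copy;
}
```

After such a call the original array is unchanged, and `getenv()` still does not find the new variable, because the process environment was never touched.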
-
Morris Jette authored
This fixes the problem introduced by commit 777a45f9 and maintains proper PMIx operation. Bugs 4132 and 4615
-
- 11 Jan, 2018 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
node_feature/knl_cray - Fix memory leak that can occur during normal operation. The leak occurs whenever an update request for a specific node is processed.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
If CnselectPath and/or SyscfgPath are defined in the knl_cray.conf file and slurmctld is reconfigured, the original values of those parameters would be overwritten and their memory leaked.
-
- 10 Jan, 2018 1 commit
-
-
Felip Moll authored
Use FREE_NULL_BUFFER instead; otherwise we could attempt to free_buffer a second time if we jump to the rwfail label. Bug 4491
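The free-and-clear idiom behind this fix can be sketched as follows. All names prefixed with `sketch_` or `SKETCH_` are hypothetical stand-ins written for this example. They are not Slurm's buffer type or its actual macro definition.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer type. */
typedef struct {
	char *data;
} sketch_buf_t;

/* Counts destructor calls so the effect of the macro can be observed. */
int sketch_free_calls = 0;

void sketch_free_buf(sketch_buf_t *b)
{
	sketch_free_calls++;
	free(b->data);
	free(b);
}

/* Free the buffer and clear the pointer, so a later cleanup is a no-op. */
#define SKETCH_FREE_NULL_BUFFER(b) \
	do { if (b) sketch_free_buf(b); (b) = NULL; } while (0)

/* Frees the buffer early, then optionally jumps to the cleanup label. */
int unpack_sketch(int fail_after_free)
{
	sketch_buf_t *buf = calloc(1, sizeof(*buf));

	if (!buf)
		return -1;
	buf->data = malloc(16);

	SKETCH_FREE_NULL_BUFFER(buf);
	assert(buf == NULL);
	if (fail_after_free)
		goto rwfail;
	return 0;

rwfail:
	/* Safe: buf is NULL here, so nothing is freed a second time. */
	SKETCH_FREE_NULL_BUFFER(buf);
	return -1;
}
```

Had the early free left `buf` pointing at released memory, the cleanup under `rwfail` would have freed it again.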
-