- 08 Dec, 2016 9 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
Revise commit e6ef1f0c. The root cause of the failure is a bug in HWLOC that has since been fixed in HWLOC version 1.11.5.
-
Tim Wickberg authored
If the second call to getgrouplist() finds additional groups, ngroups is overwritten with the new, larger value while the gids list is truncated (ngroups is a value-result argument). _gids_cache_lookup() then returns the wrong number of groups, including invalid parts of memory that are likely to contain some zeros. Those zeros could then reach the setgroups() call and give the user access to the root group, especially since setgroups() succeeds as long as the array does not contain -1 as a gid. Bug 3320.
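The value-result behavior described above can be sketched as follows. This is a minimal illustration, not Slurm's actual fix: the helper name `fetch_groups`, the starting capacity, and the retry limit are all assumptions made for the example. It assumes the glibc behavior in which getgrouplist() returns -1 and stores the required count in `ngroups` when the buffer is too small. The point is that the count is only trusted after a call that succeeded against a buffer of at least that size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not taken from the Slurm source): return a malloc'd
 * array holding every group of `user`, and store its length in *ngroups_out.
 * Returns NULL on allocation failure or if the list keeps growing.
 */
gid_t *fetch_groups(const char *user, gid_t primary, int *ngroups_out)
{
	int capacity = 8;	/* assumed starting size for the example */
	gid_t *gids = NULL;

	assert(user != NULL && ngroups_out != NULL);

	for (int attempt = 0; attempt < 8; attempt++) {
		gid_t *tmp = realloc(gids, (size_t)capacity * sizeof(gid_t));
		if (tmp == NULL)
			break;
		gids = tmp;

		/* ngroups is value-result: in = buffer size, out = group count */
		int ngroups = capacity;
		if (getgrouplist(user, primary, gids, &ngroups) != -1) {
			/* Success: the count matches what was written to gids */
			*ngroups_out = ngroups;
			return gids;
		}

		/*
		 * Buffer too small: ngroups now holds the required size.
		 * Grow the buffer and retry rather than keeping the larger
		 * count alongside the truncated list.
		 */
		capacity = (ngroups > capacity) ? ngroups : capacity * 2;
	}

	free(gids);
	return NULL;
}
```

The caller owns the returned array and must free() it. Because the count is written only on the success path, a list that grows between calls triggers another retry instead of a count that exceeds the buffer.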
-
Tim Wickberg authored
-
Danny Auble authored
-
Morris Jette authored
task/cgroup - Change the error message emitted when CPU binding cannot take place to better identify the root cause of the problem. Specifically, if the hwloc_get_obj_below_by_type() function call fails completely, that is likely because task/affinity is not configured, so cpusets are not configured. The previous message was "task/cgroup: task[%u] infinite loop broken while trying to provision compute elements using %s (bitmap:%s)". The new message is "task/cgroup: hwloc_get_obj_below_by_type() failing, task/affinity plugin also required".
-
Dominik Bartkiewicz authored
uint32_t needs %u on 32-bit platforms. Noticed by clang/travisci.
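As an illustration of the format-specifier issue, here is a small sketch. The helper name `format_job_id` and the "job_id=" prefix are hypothetical, not from the commit. The commit itself uses %u. The standard PRIu32 macro from <inttypes.h> is shown here as an assumption-free alternative that expands to the right conversion for uint32_t on any platform.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 32-bit identifier into buf.
 * PRIu32 expands to the conversion specifier matching uint32_t, so the
 * format string stays correct on both 32-bit and 64-bit platforms.
 * Returns the number of characters snprintf() would have written.
 */
int format_job_id(char *buf, size_t len, uint32_t job_id)
{
	assert(buf != NULL);
	return snprintf(buf, len, "job_id=%" PRIu32, job_id);
}
```

Passing a uint32_t to a mismatched specifier such as %lu is the kind of mistake clang's -Wformat warning reports.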
-
- 07 Dec, 2016 3 commits
-
-
Danny Auble authored
Bug 3258
-
Danny Auble authored
This reverts commit 55cb7973.
-
Danny Auble authored
This reverts commit 817c2ca4. # Conflicts: # NEWS
-
- 06 Dec, 2016 14 commits
-
-
Danny Auble authored
a slurmctld restart or reconfig, as they aren't really error messages. Bug 3258
-
Danny Auble authored
Bug 3258
-
Danny Auble authored
-
Morris Jette authored
Done just to run "git push" again after an internal GitHub error on the previous push:
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
remote: Unexpected system error after push was received.
remote: These changes may not be reflected on github.com!
remote: Your unique error code: bdecb7b0f321368fe1f037a81a6e9c2c
-
Morris Jette authored
This restores the socket count check at node registration for non-KNL systems (at least systems without a NodeFeaturesPlugins type that includes "knl"). This is a refinement of commit 1ce9a7c4.
-
Tim Wickberg authored
Note that this does not protect against all possible problems here. On Linux at least, the setgroups() call will accept any gid_t value except -1, so calls will not always fail on corrupted group lists. Bug 3320.
-
Tim Wickberg authored
Remove uncached _get_grouplist() call which was only used here. Bug 3315.
-
Morris Jette authored
test12.2 was consistently failing on smd# cluster with tiny differences in the disk read and written. This change permits those tiny discrepancies to exist without failing the test. Here are the numbers, which are consistent:
sacct --noheader -p --job=763.0 --format MaxDiskWrite,AveDiskWrite,MaxDiskRead,AveDiskRead
10.00M|10.00M|10.03M|10.03M|
(i.e. 0.3% discrepancy, up to 0.5% allowed with current code)
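The tolerance check described above can be sketched as a relative comparison. This is only an illustration in C, not the regression test's own code: the function name `within_tolerance` and its parameters are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: report whether `measured` lies within
 * `tolerance_pct` percent of `expected`. `expected` must be positive.
 */
bool within_tolerance(double expected, double measured, double tolerance_pct)
{
	assert(expected > 0.0 && tolerance_pct >= 0.0);

	double diff = measured - expected;
	if (diff < 0.0)
		diff = -diff;

	/* Compare diff/expected*100 <= tolerance_pct without dividing */
	return diff * 100.0 <= tolerance_pct * expected;
}
```

With the numbers from the message, `within_tolerance(10.00, 10.03, 0.5)` accepts the 0.3% discrepancy, while a 0.6% difference such as 10.06 would be rejected.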
-
Morris Jette authored
Test was failing due to hitting the memory limit (at least on smd1).
-
Morris Jette authored
There were already several configurations that could cause this test to fail. I just added another.
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
Fix parsing in regression test1.92 for some prompts. bug 2792
-
Morris Jette authored
Recognize a KNL's proper NUMA count (rather than setting it to the value in slurm.conf) when using FastSchedule=0. Previous logic would change the NUMA count on the node to match what was in slurm.conf, which would mess up task layout with respect to the sockets. bug 3306
-
- 05 Dec, 2016 4 commits
-
-
Morris Jette authored
HWLOC is required to properly determine topology
-
Danny Auble authored
from the slurm.conf when using FastSchedule=0.
-
Morris Jette authored
cray/burst_buffer - If the slurmctld daemon restarts with a pending job and a burst buffer having unknown file stage-in status, tear down the buffer, defer the job, and start stage-in over again. bug 3295
-
Morris Jette authored
Add more detail to log message and change from error to debug2 with an explanation of how this happens
-
- 02 Dec, 2016 3 commits
-
-
Danny Auble authored
bug 3314
-
Danny Auble authored
-
Danny Auble authored
-
- 01 Dec, 2016 5 commits
-
-
Dominik Bartkiewicz authored
-
Dominik Bartkiewicz authored
limits after the node selection to make sure it doesn't violate those limits and, if it does, change the reason for waiting so we don't reserve resources for jobs violating accounting limits. Bug 3029
-
Morris Jette authored
-
Nicolas Joly authored
Bug 3301.
-
Morris Jette authored
node_features/knl_cray - Fix possible race condition when changing node state that could result in an old KNL mode being listed as an active feature. bug 3235
-
- 30 Nov, 2016 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
No change in logic
-