- 13 Dec, 2016 1 commit
-
-
Tim Wickberg authored
Reverts most of commit 84023f27. Searching the PATH in slurmd can fail due to root_squash'd NFS filesystems, leading to the "wrong" program being launched. If you'd like the performance benefit of avoiding this lookup during each separate task launch, set SLURM_TEST_EXEC=1 instead. This performs the lookup once within srun, which ensures it happens under the user's own environment and not that of slurmd. Bug 2992.
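As a rough illustration of resolving a program once against the user's own PATH, the Python sketch below uses `shutil.which` with an explicit search path. It is not how srun or slurmd perform the lookup; `resolve_once` and the `env` dictionary are hypothetical names chosen for this example.

```python
import os
import shutil


def resolve_once(command, env):
    """Resolve `command` against the PATH found in `env`.

    Returns the full path of the first matching executable, or None
    if nothing on that PATH matches. `env` stands in for the
    submitting user's environment (an assumption of this sketch).
    """
    search_path = env.get("PATH", os.defpath)
    return shutil.which(command, path=search_path)
```

The resolved path could then be reused for every task launch, so the search is done once, with the permissions and PATH of the user who submitted the job.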
-
- 09 Dec, 2016 1 commit
-
-
Danny Auble authored
level.
-
- 08 Dec, 2016 6 commits
-
-
Danny Auble authored
-
Tim Wickberg authored
If the second call to getgrouplist() found additional groups, ngroups will be overwritten with this new larger value, while the gids list would be truncated. (ngroups is a value-result arg.) This will then lead to _gids_cache_lookup() returning the wrong number of groups, including invalid parts of memory, which are likely to include some zeros. Those zeros could then make it to the setgroups() call and thus give the user access to the root group, especially as setgroups() will succeed as long as the array does not contain -1 as a gid. Bug 3320.
-
Tim Wickberg authored
-
Danny Auble authored
-
Morris Jette authored
task/cgroup - Change error message if CPU binding cannot take place to better identify the root cause of the problem. Specifically, if the hwloc_get_obj_below_by_type() function call completely fails, that is likely due to task/affinity not being configured, so cpusets are not configured. The previous message was "task/cgroup: task[%u] infinite loop broken while trying to provision compute elements using %s (bitmap:%s)". The new message is "task/cgroup: hwloc_get_obj_below_by_type() failing, task/affinity plugin also required".
-
Dominik Bartkiewicz authored
uint32_t needs %u on 32-bit platforms. Noticed by clang/travisci.
-
- 07 Dec, 2016 2 commits
-
-
Danny Auble authored
Bug 3258
-
Danny Auble authored
This reverts commit 817c2ca4. # Conflicts: # NEWS
-
- 06 Dec, 2016 7 commits
-
-
Danny Auble authored
a slurmctld restart or reconfig, as they aren't really error messages. Bug 3258
-
Danny Auble authored
Bug 3258
-
Morris Jette authored
Done just to run "git push" again after internal github error on previous push: remote: Resolving deltas: 100% (4/4), completed with 4 local objects. remote: Unexpected system error after push was received. remote: These changes may not be reflected on github.com! remote: Your unique error code: bdecb7b0f321368fe1f037a81a6e9c2c
-
Tim Wickberg authored
Note that this does not protect against all possible problems here. The setgroups() call in Linux at least is willing to set any gid_t value except -1 on a group, so calls will not always fail on corrupted group lists. Bug 3320.
-
Tim Wickberg authored
Remove uncached _get_grouplist() call which was only used here. Bug 3315.
-
Morris Jette authored
Fix parsing in regression test1.92 for some prompts. bug 2792
-
Morris Jette authored
Recognize a KNL's proper NUMA count (rather than setting it to the value in slurm.conf) when using FastSchedule=0. Previous logic would change the NUMA count on the node to match what was in slurm.conf, which would mess up task layout with respect to the sockets. bug 3306
-
- 05 Dec, 2016 2 commits
-
-
Danny Auble authored
from the slurm.conf when using FastSchedule=0.
-
Morris Jette authored
cray/burst_buffer - If slurmctld daemon restarts with pending job and burst buffer having unknown file stage-in status, teardown the buffer, defer the job, and start stage-in over again. bug 3295
-
- 02 Dec, 2016 3 commits
-
-
Danny Auble authored
bug 3314
-
Danny Auble authored
-
Danny Auble authored
-
- 01 Dec, 2016 2 commits
-
-
Dominik Bartkiewicz authored
limits after the node selection to make sure it doesn't violate those limits and, if it does, change the reason for waiting so we don't reserve resources on jobs violating accounting limits. Bug 3029
-
Morris Jette authored
node_features/knl_cray - Fix possible race condition when changing node state that could result in an old KNL mode being listed as an active feature. bug 3235
-
- 30 Nov, 2016 2 commits
-
-
Morris Jette authored
cray/burst_buffer - Increase time to synchronize operations between threads from 5 to 60 seconds ("setup" operation time observed over 17 seconds). This should fix a race condition between a thread performing a buffer creation (setup) and a thread looking for unexpected buffers. If a buffer is found during the time window allowed for creation, its space will be counted twice: first by the status checking thread and second by the thread doing the creation. The deallocation only happens once, so the used space information can be left with an invalid value. bug 3295
-
Tim Wickberg authored
static variable means multiple active decompression streams will corrupt zlib's internal state, which can lead to a segfault. Bug 3299.
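The hazard described here is shared decompression state. As a loose analogy in Python (the Slurm code itself is C and uses zlib directly), the sketch below gives each stream its own `zlib.decompressobj()`, so chunks from two streams can be interleaved without one corrupting the other. `StreamDecoder` is a hypothetical helper for this example, not a Slurm type.

```python
import zlib


class StreamDecoder:
    """Owns exactly one decompressor, so state is never shared."""

    def __init__(self):
        # Per-instance state, the opposite of a single static object.
        self._z = zlib.decompressobj()
        self._out = bytearray()

    def feed(self, chunk):
        """Decompress one chunk belonging to this stream only."""
        self._out += self._z.decompress(chunk)

    def result(self):
        """Flush any buffered output and return everything decoded."""
        self._out += self._z.flush()
        return bytes(self._out)
```

Had both streams been fed through a single shared decompressor instead, the second stream's bytes would be interpreted as a continuation of the first.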
-
- 29 Nov, 2016 1 commit
-
-
Alejandro Sanchez authored
On a reconfig, the exc_node_bitmap is cleared but then it was not built again since last_work_scan was declared as a local static variable in _do_power_work(). The fix is to make it global within the plugin and reinitialize it to 0 on _init_power_config(). Bug 3078.
-
- 28 Nov, 2016 3 commits
-
-
Alejandro Sanchez authored
-
Dominik Bartkiewicz authored
Bug 3267.
-
Dominik Bartkiewicz authored
Termination can race against step creation if, e.g., ill-behaved SPANK plugins are in use. Bug 3248.
-
- 22 Nov, 2016 5 commits
-
-
Morris Jette authored
sched/backfill plugin: Make malloc match data type (defined as uint32_t and allocated as int). No failures observed, but if type "int" is smaller than "uint32_t", it could result in an invalid memory reference.
-
Sergey Meirovich authored
Fix API call: slurm_job_cpus_allocated_str_on_node_id() and in turn slurm_job_cpus_allocated_str_on_node() to return correct results for anything but the first node. This was caused by missing logic to calculate the first bit that belongs to a particular node. The lookup was always starting from bit 0. Bug 3266.
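To make the "first bit" idea concrete, here is a simplified Python sketch. It assumes a flat bitmap in which each node owns a contiguous run of bits and a plain list of per-node core counts; Slurm's real job resource structures are more involved, and `first_bit_for_node` and `allocated_on_node` are hypothetical names.

```python
def first_bit_for_node(cores_per_node, node_id):
    """Offset of the first bit owned by node `node_id`.

    This is the sum of the core counts of all preceding nodes.
    Always returning 0 here reproduces the kind of bug described.
    """
    return sum(cores_per_node[:node_id])


def allocated_on_node(core_bitmap, cores_per_node, node_id):
    """Return node-local indices of the allocated cores on one node."""
    start = first_bit_for_node(cores_per_node, node_id)
    end = start + cores_per_node[node_id]
    return [i - start for i in range(start, end) if core_bitmap[i]]
```

With two 4-core nodes and the bitmap `[1, 1, 0, 0, 0, 1, 1, 0]`, node 0 reports cores `[0, 1]` and node 1 reports cores `[1, 2]`. Starting from bit 0 for node 1 would wrongly repeat node 0's answer.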
-
Morris Jette authored
After one second of wall time, simulate the termination of all remaining running jobs in order to respond in a reasonable time frame. bug 3275
-
Morris Jette authored
Modify backfill algorithm to improve performance with large numbers of running jobs. Group running jobs that end in a "similar" time frame using a time window that grows exponentially rather than linearly. The original window sizes were 0, 1, 2, 3, 4, 5, 6, 7, ... minutes. The new window sizes are 0.5, 1, 2, 4, 8, 16, 32, ... minutes. This can dramatically reduce the number of instances where the very time-consuming "can the pending job run now" operation is executed, especially if there are 1000+ running jobs. bug 3275
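The effect of the two window sequences can be sketched in Python. This is only an illustration under the assumption that consecutive windows are laid end to end and every non-empty window forms one group; the actual grouping logic in the backfill plugin is not described in this message, and all function names here are hypothetical.

```python
from itertools import count


def linear_widths():
    """Old scheme: window sizes 0, 1, 2, 3, ... minutes."""
    for i in count(0):
        yield float(i)


def exponential_widths():
    """New scheme: window sizes 0.5, 1, 2, 4, ... minutes."""
    width = 0.5
    while True:
        yield width
        width *= 2


def count_groups(end_minutes, widths):
    """Count non-empty windows needed to cover all job end times.

    `end_minutes` holds minutes from now until each running job ends.
    """
    pending = sorted(end_minutes)
    groups = 0
    boundary = 0.0
    i = 0
    for width in widths:
        if i >= len(pending):
            break
        boundary += width
        start = i
        while i < len(pending) and pending[i] <= boundary:
            i += 1
        if i > start:
            groups += 1
    return groups
```

For 1000 jobs ending one minute apart (`range(1, 1001)`), this sketch produces 45 groups with the linear sequence and 10 with the exponential one, which is the kind of reduction in expensive scheduling checks the change aims for.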
-
Nicolas Joly authored
-
- 14 Nov, 2016 1 commit
-
-
Morris Jette authored
If a node is booting for some job, don't allocate additional jobs to the node until the boot completes. bug 3256
-
- 13 Nov, 2016 1 commit
-
-
Alejandro Sanchez authored
Found with valgrind. Bug 2846.
-
- 11 Nov, 2016 3 commits
-
-
Morris Jette authored
Move where we set the configuration table bitmaps in order to support the backup slurmctld starting and recovering previously saved KNL mode information (which can necessitate rebuilding the node configuration table). bug 3241
-
Tim Wickberg authored
Bug 3255.
-
David Gloe authored
Bug 3253.
-