- 10 May, 2016 24 commits
-
Tim Wickberg authored
Remove local mutex_* macros. Require <pthread.h> and fixup includes.
-
Tim Wickberg authored
Remove list_mutex_* macros. Require <pthread.h> and remove WITH_PTHREAD
-
Tim Wickberg authored
Remove cbuf_mutex_* macros. Require <pthread.h> and remove WITH_PTHREAD
-
Tim Wickberg authored
Replace all __CURRENT_FUNC__ references with C99 __func__. Remove NULL and bool definitions; use <stddef.h> and <stdbool.h> instead.
-
Tim Wickberg authored
-
Tim Wickberg authored
AF_SLURM => AF_INET; AF_INET is already commonly used. The SLURM_INADDR_ANY macro was never used.
-
Tim Wickberg authored
No functional change in theory. Clean up headers to remove configure-era checks that have been mostly useless for some time now.
1) Remove common/{getopt.[ch],getopt1.c} files and remove from build. Use C99-required functions from <unistd.h> and <getopt.h>.
2) <inttypes.h> is required by C99 and has been required for some time. Remove #ifdef blocks and replace some older <stdint.h> includes.
3) <pthread.h> isn't optional at this point. PTHREAD_MUTEX_INITIALIZER is required throughout.
4) Use <limits.h> instead of <values.h> or <float.h>.
5) <string.h> is required by C99. Remove long-deprecated <strings.h> includes.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Alejandro Sanchez authored
-
Tim Wickberg authored
-
Danny Auble authored
# Conflicts:
#	src/plugins/select/cray/select_cray.c
#	testsuite/expect/test1.84
-
Brian Christiansen authored
-
Marlys Kohnke authored
for better robustness. This cray/select plugin code has been modified to remove a possible timing window where two aeld pthreads could exist, interfering with each other through the global aeld_running variable. An additional validity check has been added to the data provided to aeld through an alpsc_ev_set_application_info() call. If an error is returned from that call, only certain errors require closing the current socket connection to aeld and establishing a new one. Other error returns will log an error message and keep the current session with aeld established.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Morris Jette authored
This might possibly be related to bug 2334, but it's a long shot.
-
Danny Auble authored
slurm.conf instead of all. If looking for specific addresses, use the TopologyParam options No*InAddrAny. This was broken in 15.08 with the advent of the referenced TopologyParams; commits 9378f195 and c5312f52 are no longer needed. Bug 2696.
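As a hedged illustration of the options referenced above, such a setting would appear in slurm.conf roughly as follows (option names per the slurm.conf documentation; verify the exact spelling against your Slurm version):

```
# slurm.conf: avoid binding communications to INADDR_ANY
TopologyParam=NoInAddrAny
```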
-
Brian Christiansen authored
Thread names can only be 16 characters long, and we already know that the threads are from the slurmctld.
-
Brian Christiansen authored
-
- 09 May, 2016 6 commits
-
Danny Auble authored
-
Moe Jette authored
at the same time. Bug 2683. It turns out that making a variable static inside a function makes it unsafe when dealing with threads.
-
Morris Jette authored
Used for development and testing purposes
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray - Add support for rounding up the size of a buffer request if the DataWarp configuration "equalize_fragments" is used. That option can significantly increase the size of a burst buffer allocation in order to create equal-sized buffers on all server nodes for performance reasons. Support is currently only provided by Cray for job buffers, not persistent burst buffers.
-
Brian Christiansen authored
-
- 06 May, 2016 8 commits
-
Morris Jette authored
If the node_features/knl_cray plugin is configured and a GresType of "hbm" is not defined, then add it to the GRES tables. Without this, references to a GRES of "hbm" (either by a user or Slurm's internal logic) will generate error messages. Bug 2708.
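For illustration, the manual slurm.conf declaration this commit makes unnecessary on KNL systems would look roughly like the following (node names are hypothetical; syntax per the slurm.conf documentation):

```
# slurm.conf: declare the hbm GRES so references to it are valid
GresTypes=hbm
NodeName=nid00000 Gres=hbm
```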
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
John Thiltges authored
With slurm-15.08.10, we're seeing occasional segfaults in slurmstepd. The logs point to the following line:

    slurm-15.08.10/src/slurmd/slurmstepd/mgr.c:2612

On that line, _get_primary_group() is accessing the result of getpwnam_r():

    *gid = pwd0->pw_gid;

If getpwnam_r() cannot find a matching password record, it will set the result (pwd0) to NULL, but still return 0. When the pointer is accessed, it will cause a segfault. Checking the result variable (pwd0) to determine success should fix the issue.
-
Morris Jette authored
Note that Slurm cannot support heterogeneous core counts across NUMA nodes. Bug 2704.
-
Marco Ehlert authored
I would like to mention a problem which seems to be a calculation bug of used_cores in slurm version 15.08.7.

If a node is divided into 2 partitions using MaxCPUsPerNode, like this slurm.conf configuration:

    NodeName=n1 CPUs=20
    PartitionName=cpu NodeName=n1 MaxCPUsPerNode=16
    PartitionName=gpu NodeName=n1 MaxCPUsPerNode=4

I run into a strange scheduling situation. The situation occurs after a fresh restart of the slurmctld daemon. I start jobs one by one:

case 1:

    systemctl restart slurmctld.service
    sbatch -n 16 -p cpu cpu.sh
    sbatch -n 1 -p gpu gpu.sh
    sbatch -n 1 -p gpu gpu.sh
    sbatch -n 1 -p gpu gpu.sh
    sbatch -n 1 -p gpu gpu.sh

=> Problem now: The gpu jobs are kept in PENDING state.

This picture changes if I start the jobs this way:

case 2:

    systemctl restart slurmctld.service
    sbatch -n 1 -p gpu gpu.sh
    scancel <gpu job_id>
    sbatch -n 16 -p cpu cpu.sh
    sbatch -n 1 -p gpu gpu.sh
    sbatch -n 1 -p gpu gpu.sh
    sbatch -n 1 -p gpu gpu.sh
    sbatch -n 1 -p gpu gpu.sh

and all jobs are running fine.

By looking into the code I figured out a wrong calculation of 'used_cores' in function _allocate_sc() in plugins/select/cons_res/job_test.c:

    _allocate_sc(...)
    ...
    for (c = core_begin; c < core_end; c++) {
            i = (uint16_t) (c - core_begin) / cores_per_socket;
            if (bit_test(core_map, c)) {
                    free_cores[i]++;
                    free_core_count++;
            } else {
                    used_cores[i]++;
            }
            if (part_core_map && bit_test(part_core_map, c))
                    used_cpu_array[i]++;

This part of the code seems to work only if the part_core_map exists for a partition or on a completely free node. But in case 1 there is no part_core_map for gpu created yet. When a gpu job starts, the core_map contains the 4 cores left over from the cpu job. Now all non-free cores of the cpu partition are counted as used cores in the gpu partition, and this condition will match in the next code parts:

    free_cpu_count + used_cpu_count > job_ptr->part_ptr->max_cpus_per_node

which is definitely wrong.

As soon as a part_core_map appears, meaning a gpu job was started on a free node (case 2), there is no problem at all. To get case 1 to work I changed the above code to the following, and all works fine:

    for (c = core_begin; c < core_end; c++) {
            i = (uint16_t) (c - core_begin) / cores_per_socket;
            if (bit_test(core_map, c)) {
                    free_cores[i]++;
                    free_core_count++;
            } else {
                    if (part_core_map && bit_test(part_core_map, c)) {
                            used_cpu_array[i]++;
                            used_cores[i]++;
                    }
            }

I am not sure this code change is really good, but it fixes my problem.
-
Brian Christiansen authored
-
- 05 May, 2016 2 commits
-
Morris Jette authored
-
Tim Wickberg authored
-