- 02 Oct, 2017 2 commits
-
-
Dominik Bartkiewicz authored
Move the check up a bit more where it'll do some good. Bug 4184.
-
Dominik Bartkiewicz authored
Bug 4146.
-
- 29 Sep, 2017 2 commits
-
-
Danny Auble authored
Bug 3467
-
Danny Auble authored
Bug 3567
-
- 27 Sep, 2017 2 commits
-
-
Danny Auble authored
gres listed in your slurm.conf but some in gres.conf. Bug 3974
-
Danny Auble authored
'type' but no file defined.
-
- 19 Sep, 2017 3 commits
-
-
Danny Auble authored
plugin when constraining devices.
-
Danny Auble authored
-
Danny Auble authored
correctly in sacct.
-
- 14 Sep, 2017 1 commit
-
-
Tim Wickberg authored
A second PMI2_Init() within the same step is invalid, and cannot succeed. Return an error code back to the client end, and close the fd to force the step to terminate immediately. Due to a bug in our libpmi code, just returning a cmd=response_to_init with an appropriate rc number will not tear down the connection properly, so send back something else that will trigger the error path. Bug 3520.
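The rejection logic described above can be sketched roughly as follows. This is a minimal illustration, not Slurm's PMI2 server code: `struct pmi_client`, `handle_init()` and the result enum are hypothetical names, and the real fix also has to choose a reply that triggers the client's error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client state; not Slurm's actual data structures. */
struct pmi_client {
    bool initialized;  /* true once the first init request was accepted */
    bool fd_open;      /* stands in for the client's socket being open */
};

enum init_result { INIT_ACCEPTED, INIT_REJECTED };

/* Accept the first init request; reject any later one and mark the
 * connection closed so the step can be torn down immediately. */
enum init_result handle_init(struct pmi_client *c)
{
    assert(c != NULL);
    if (c->initialized) {
        c->fd_open = false;  /* a real server would close(fd) here */
        return INIT_REJECTED;
    }
    c->initialized = true;
    return INIT_ACCEPTED;
}
```

Calling `handle_init()` twice on the same client accepts the first request and rejects the second while marking the connection closed.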
-
- 13 Sep, 2017 1 commit
-
-
Josh Samuelson authored
Bug 4154.
-
- 12 Sep, 2017 3 commits
-
-
Danny Auble authored
default path. This makes it so you don't always have to put AllowedDevicesFile in your cgroup.conf file if your etc dir is anything other than /etc/slurm.
-
Tim Wickberg authored
Adding a newline prevents this error: conftest.c:154:8: error: if statement has empty body [-Werror,-Wempty-body]
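For context, clang's -Wempty-body fires when the semicolon that forms an empty `if` body sits on the same line as the condition, and clang's own note suggests moving the semicolon to a separate line. The function below is a hypothetical illustration of the accepted layout, not the generated conftest.c.

```c
#include <assert.h>

/* Hypothetical example: an intentionally empty if body.
 * Writing "if (flag);" on one line triggers -Wempty-body under clang;
 * placing the semicolon on its own line does not. */
int count_calls(int flag)
{
    int calls = 0;

    assert(flag == 0 || flag == 1);
    if (flag)
        ;  /* empty body, semicolon on a separate line */
    calls++;
    return calls;
}
```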
-
Alejandro Sanchez authored
remote cluster correctly determine the select type. Bug 2329
-
- 08 Sep, 2017 2 commits
-
-
Dominik Bartkiewicz authored
If /proc was inaccessible, proc_name would leak. Put an explicit length cap in sprintf to avoid a warning. The size is checked immediately before here, so this is just making the 10-char limit explicit. Bug 4062.
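A minimal sketch of the "explicit length cap" idea, assuming a path of the form /proc/<pid>/stat. The helper name `build_proc_path()` and the path layout are assumptions for illustration, and the sketch uses snprintf where the commit mentions sprintf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not Slurm's code: writes "/proc/<pid>/stat" to out.
 * Returns 0 on success, -1 if pid_str is too long or out is too small. */
int build_proc_path(const char *pid_str, char *out, size_t out_len)
{
    assert(pid_str != NULL && out != NULL);

    /* The size check comes first: at most 10 characters are allowed. */
    if (strlen(pid_str) > 10)
        return -1;

    /* "%.10s" restates the limit in the format string, so the compiler
     * can bound the output length without tracking the check above. */
    int n = snprintf(out, out_len, "/proc/%.10s/stat", pid_str);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    return 0;
}
```

The precision in `%.10s` does not change behavior for inputs that already passed the length check; it only makes the bound visible at the call site.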
-
Dominik Bartkiewicz authored
Bug 4062.
-
- 07 Sep, 2017 2 commits
-
-
Dominik Bartkiewicz authored
Bug 3824
-
Morris Jette authored
Do not run the Node Health Check on termination of the external step as this happens when the job allocation ends and the job NHC will be executed anyway. Bug 4074
-
- 01 Sep, 2017 2 commits
-
-
Danny Auble authored
checked on submit. This only mattered when submitting a job to multiple partitions. Bug 4066
-
Danny Auble authored
on node 0. Bug 4035
-
- 24 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Calling bit_unfmt() with a zero bit_size() bitmap leads to a later call to bit_nclear() with start=0 and stop=-1, leading to the ABRT. This scenario happened when cgroup.conf has ConstrainDevices=yes and task_cgroup_devices_create() tries to collect the GRES devices but gres_cpu_cnt=0, thus creating a p->cpus_bitmap = bit_alloc(gres_cpu_cnt); of zero size, which is passed as an argument to bit_unfmt(). gres_cpu_cnt is 0 because we have defined a gres.conf like this:
    Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
    Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
    Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
    Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
but have neither GresTypes nor a GRES option in the slurm.conf / node config def. Bug 3974
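The failure mode can be reproduced in miniature with a toy bitmap. This is a sketch under stated assumptions: `toy_bitmap_t` and the `toy_*` functions are hypothetical stand-ins and do not reproduce Slurm's bitstring API, and the guard in `toy_clear_all()` is only one possible way to handle a zero-size bitmap, not necessarily what the commit does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy bitmap; hypothetical, not Slurm's bitstr_t. */
typedef struct {
    int64_t nbits;
    unsigned char *bytes;
} toy_bitmap_t;

toy_bitmap_t *toy_alloc(int64_t nbits)
{
    assert(nbits >= 0);
    toy_bitmap_t *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->nbits = nbits;
    /* Always allocate at least one byte so a zero-size map is valid. */
    b->bytes = calloc((size_t)(nbits / 8) + 1, 1);
    if (!b->bytes) {
        free(b);
        return NULL;
    }
    return b;
}

void toy_free(toy_bitmap_t *b)
{
    if (b) {
        free(b->bytes);
        free(b);
    }
}

void toy_set(toy_bitmap_t *b, int64_t bit)
{
    assert(bit >= 0 && bit < b->nbits);
    b->bytes[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

int toy_test(const toy_bitmap_t *b, int64_t bit)
{
    assert(bit >= 0 && bit < b->nbits);
    return (b->bytes[bit / 8] >> (bit % 8)) & 1;
}

/* Clears bits start..stop inclusive; returns -1 for an invalid range.
 * With nbits == 0 the range 0..nbits-1 becomes 0..-1 and is rejected. */
int toy_clear_range(toy_bitmap_t *b, int64_t start, int64_t stop)
{
    if (start < 0 || stop < start || stop >= b->nbits)
        return -1;
    for (int64_t i = start; i <= stop; i++)
        b->bytes[i / 8] &= (unsigned char)~(1u << (i % 8));
    return 0;
}

/* Guarded wrapper: an empty bitmap has nothing to clear. */
int toy_clear_all(toy_bitmap_t *b)
{
    if (b->nbits == 0)
        return 0;
    return toy_clear_range(b, 0, b->nbits - 1);
}
```

Here the invalid range is reported through a return value; per the commit message, the real code path ended in an ABRT instead.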
-
- 23 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Running slurmctld under valgrind while operating with jobcomp/elasticsearch reported the following bytes definitely lost:
==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403== at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403== by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403== by 0x22856A: makespace (xstring.c:114)
==27403== by 0x2285D0: _xstrcat (xstring.c:132)
==27403== by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403== by 0x83C5BCD: ???
==27403== by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403== by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to the corresponding job_node->serialized_job, but the originally generated buffer wasn't freed afterwards. The fix consists of changing the transfer so that, instead of xstrdup'ing the char *, we just assign the pointer and NULL the buffer. The job_node->serialized_job was already xfree'd properly later when the job was indexed. Discovered while working on Bug 4065.
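The difference between duplicating the buffer and transferring ownership can be sketched in plain C. The struct and function names below are hypothetical, and malloc/free stand in for Slurm's xmalloc/xfree family.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the job node that holds the serialized job. */
struct job_node {
    char *serialized_job;
};

/* Leak-prone pattern: the node gets a copy, so the caller still owns
 * buf and must free it; forgetting to do so leaks the original. */
int copy_buffer(struct job_node *node, const char *buf)
{
    assert(node != NULL && buf != NULL);
    size_t len = strlen(buf) + 1;
    node->serialized_job = malloc(len);
    if (!node->serialized_job)
        return -1;
    memcpy(node->serialized_job, buf, len);
    return 0;
}

/* Ownership transfer: the node takes the pointer and the caller's
 * variable is set to NULL, so there is exactly one owner to free it. */
void transfer_buffer(struct job_node *node, char **buf)
{
    assert(node != NULL && buf != NULL);
    node->serialized_job = *buf;
    *buf = NULL;
}
```

With `transfer_buffer()` no second allocation is made, and a later free of the caller's now-NULL pointer is a harmless no-op.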
-
- 22 Aug, 2017 2 commits
-
-
Alejandro Sanchez authored
Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
-
Philip Kovacs authored
Bug 4095
-
- 21 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Given a configuration whose TopologyParam includes the Dragonfly option, if a job requested a --switches count, the timeout specified by either the job request or the max_switch_wait SchedulerParameters option was not respected. This was because the leaf_switch_count variable was not incremented in _eval_nodes_dfly() when needed, as it is in _eval_nodes_topo(), an execution path that already waits correctly for the switch count timeout. Bug 4056
-
- 17 Aug, 2017 1 commit
-
-
Morris Jette authored
Coverity CID 44649 Bug 4085
-
- 16 Aug, 2017 1 commit
-
-
Danny Auble authored
instead of local. Bug 3546
-
- 15 Aug, 2017 1 commit
-
-
Morris Jette authored
-
- 14 Aug, 2017 3 commits
-
-
Morris Jette authored
-
Danny Auble authored
This reverts commit 00a691b9.
-
Morris Jette authored
-
- 11 Aug, 2017 3 commits
-
-
Danny Auble authored
This will allow Dell's custom syscfg to work correctly. NOTE: Dell calls flat memory just memory. Bug 4034
-
Danny Auble authored
Bug 4059
-
Dominik Bartkiewicz authored
-
- 07 Aug, 2017 2 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
Bug 4019
-
- 04 Aug, 2017 4 commits
-
-
Morris Jette authored
truncation of core specification and not reserving the specified cores. Fixes Coverity CID 45174 and 45175. Bug 4053
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
the tree. Bug 4050
-