- 29 Aug, 2017 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
relies on the primary to do so. There is a potential race condition if the backup DBD tries to create/check the database at the same time as the primary. This patch removes this race by not allowing the backup to do the check/create. Bug 3827
-
- 24 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Calling bit_unfmt() with a zero bit_size() bitmap leads to a later call to bit_nclear() with start=0 and stop=-1, leading to the ABRT. This scenario happened when cgroup.conf has ConstrainDevices=yes and task_cgroup_devices_create() tries to collect the GRES devices but gres_cpu_cnt=0, thus creating a p->cpus_bitmap = bit_alloc(gres_cpu_cnt); of zero size, which is passed as an argument to bit_unfmt(). gres_cpu_cnt is 0 because we have defined a gres.conf like this:

    Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
    Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
    Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
    Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3

but have no GresTypes nor GRES option in the slurm.conf / node config def. Bug 3974
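The failure mode described above (a "clear everything" call on an empty bitmap turning into a range of 0..-1) can be sketched with a toy bitmap. This is only an illustration of the class of bug and one possible guard; `toy_bitmap_t`, `toy_nclear()` and `toy_clear_all()` are hypothetical names, not Slurm's bitstring API, and the sketch does not claim to reproduce what the commit actually changed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy bitmap for illustration; this is NOT Slurm's bitstr_t. */
typedef struct {
    int64_t nbits;        /* number of valid bits, may be 0 */
    unsigned char *bytes; /* backing storage, always at least 1 byte */
} toy_bitmap_t;

toy_bitmap_t *toy_alloc(int64_t nbits)
{
    toy_bitmap_t *b;

    if (nbits < 0)
        return NULL;
    b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->nbits = nbits;
    b->bytes = calloc((size_t)(nbits / 8) + 1, 1);
    if (!b->bytes) {
        free(b);
        return NULL;
    }
    return b;
}

void toy_free(toy_bitmap_t *b)
{
    if (b) {
        free(b->bytes);
        free(b);
    }
}

/* Clear bits start..stop (inclusive). Returns -1 on an invalid range
 * instead of aborting, so callers can see the bad arguments. */
int toy_nclear(toy_bitmap_t *b, int64_t start, int64_t stop)
{
    if (start < 0 || stop < start || stop >= b->nbits)
        return -1;
    for (int64_t i = start; i <= stop; i++)
        b->bytes[i / 8] &= (unsigned char)~(1u << (i % 8));
    return 0;
}

/* Clear the whole bitmap. A zero-size bitmap has nothing to clear;
 * returning early avoids computing stop = nbits - 1 = -1. */
int toy_clear_all(toy_bitmap_t *b)
{
    if (b->nbits == 0)
        return 0;
    return toy_nclear(b, 0, b->nbits - 1);
}
```

With a zero-size bitmap, `toy_nclear(b, 0, -1)` reports an invalid range, while `toy_clear_all(b)` succeeds as a no-op.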
-
- 23 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Running slurmctld under valgrind while operating with jobcomp/elasticsearch reported the following bytes definitely lost:

    ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
    ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
    ==27403==    by 0x22856A: makespace (xstring.c:114)
    ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
    ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
    ==27403==    by 0x83C5BCD: ???
    ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
    ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to the corresponding job_node->serialized_job, but the originally generated buffer wasn't freed afterwards. The fix consists of changing the transfer so that instead of xstrdup'ing the char * we just assign the pointer and NULL the buffer. The job_node->serialized_job was already xfree'd properly later when the job was indexed. Discovered while working on Bug 4065.
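The "assign the pointer and NULL the buffer" idea can be shown in plain C. This is a minimal sketch using standard `malloc`/`free` rather than Slurm's xmalloc/xfree wrappers; `toy_job_node` and `take_buffer()` are hypothetical names chosen for the example.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the record that owns the serialized job. */
struct toy_job_node {
    char *serialized_job;
};

/* Transfer ownership of *buffer to the node without copying.
 * After the call the caller's pointer is NULL, so exactly one owner
 * (the node) remains responsible for freeing the memory. Duplicating
 * the string instead would leave the original allocation unreferenced
 * unless the caller also freed it. */
void take_buffer(struct toy_job_node *node, char **buffer)
{
    node->serialized_job = *buffer;
    *buffer = NULL;
}
```

After `take_buffer(&node, &buf)`, `buf` is NULL and the memory is released once, through `node.serialized_job`.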
-
- 22 Aug, 2017 2 commits
-
-
Alejandro Sanchez authored
Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
-
Philip Kovacs authored
bug 4095
-
- 21 Aug, 2017 2 commits
-
-
Isaac Hartung authored
Print numbers using exponential format if required to fit in allocated field width. The sacctmgr and sshare commands are impacted. bug 1749
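As an illustration of falling back to exponential notation when a value does not fit its column, here is a small sketch built on standard `snprintf`. The function name `format_to_width()` and the policy of reducing precision step by step are assumptions made for the example; the actual sacctmgr and sshare code may do this differently.

```c
#include <stdio.h>

/* Write 'value' into 'out' using at most 'width' characters.
 * Plain integer-style formatting is tried first; if that is too wide,
 * exponential format is tried with decreasing precision.
 * Returns 0 on success, -1 if no representation fits. */
int format_to_width(double value, int width, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "%.0f", value);

    if (n >= 0 && n <= width && (size_t)n < out_len)
        return 0;

    for (int prec = width; prec >= 0; prec--) {
        n = snprintf(out, out_len, "%.*e", prec, value);
        if (n >= 0 && n <= width && (size_t)n < out_len)
            return 0;
    }
    return -1;
}
```

A value such as 12345 in a 10-character field is printed as is, while 123456789012 in an 8-character field falls back to an exponential form.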
-
Alejandro Sanchez authored
Given a configuration with TopologyParam including the Dragonfly option, if a job requested a --switches count, the count timeout specified by either the job request or the max_switch_wait SchedulerParameters option was not respected. This was due to the leaf_switch_count variable not being incremented in the _eval_nodes_dfly() function when needed, as is done in _eval_nodes_topo(), the latter being an execution path that already waits correctly for the switch count timeout. Bug 4056
-
- 18 Aug, 2017 2 commits
-
-
Brian Christiansen authored
-
Alejandro Sanchez authored
Add the following fields as environment variables: CLUSTER, DEPENDENCY, DERIVEDEC, EXITCODE, GROUPNAME, QOS, RESERVATION, USERNAME. The format of the LIMIT environment variable (the job's TimeLimit) has been changed to D-HH:MM:SS. Bug 3942
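For readers unfamiliar with the D-HH:MM:SS layout, the following sketch renders a time limit in that shape. It assumes the limit is available as a whole number of minutes and always prints the day field, even when it is zero; both points are assumptions of this example, and `format_time_limit()` is a hypothetical helper, not the function used by the commit. Special values such as an unlimited time limit are not handled.

```c
#include <stdio.h>

/* Render a time limit given in minutes as D-HH:MM:SS.
 * Returns 0 on success, -1 if the output buffer is too small. */
int format_time_limit(unsigned long minutes, char *out, size_t out_len)
{
    unsigned long days = minutes / (24UL * 60UL);
    unsigned long hours = (minutes / 60UL) % 24UL;
    unsigned long mins = minutes % 60UL;
    /* Seconds are always 00 because the input has minute granularity. */
    int n = snprintf(out, out_len, "%lu-%02lu:%02lu:00", days, hours, mins);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Under these assumptions, 1500 minutes is rendered as `1-01:00:00`.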
-
- 17 Aug, 2017 1 commit
-
-
Morris Jette authored
Coverity CID 44649 Bug 4085
-
- 16 Aug, 2017 1 commit
-
-
Danny Auble authored
instead of local. Bug 3546
-
- 15 Aug, 2017 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
bug 3217
-
Morris Jette authored
-
Morris Jette authored
If srun lacks application specification for some component, the next one specified will be used for earlier components.
-
- 14 Aug, 2017 3 commits
-
-
Morris Jette authored
-
Danny Auble authored
This reverts commit 00a691b9.
-
Morris Jette authored
-
- 12 Aug, 2017 1 commit
-
-
Morris Jette authored
Modify scontrol job hold/release and update to operate with heterogeneous job id specification (e.g. "scontrol hold 123+4").
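A heterogeneous job id such as "123+4" combines a job id and a component offset. The sketch below parses that notation with `strtol`; `parse_pack_job_id()` is a hypothetical helper written for this example and is not taken from scontrol. A missing "+offset" part is reported as offset -1, which is a convention of the sketch only.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse "JOBID" or "JOBID+OFFSET" (e.g. "123+4").
 * On success returns 0 and sets *job_id and *offset (-1 if absent).
 * On malformed input returns -1 and leaves the outputs untouched. */
int parse_pack_job_id(const char *s, long *job_id, long *offset)
{
    char *end;
    long id, off = -1;

    if (!s || *s < '0' || *s > '9')
        return -1; /* must start with a digit: no sign, no spaces */
    errno = 0;
    id = strtol(s, &end, 10);
    if (errno)
        return -1;

    if (*end == '+') {
        const char *p = end + 1;

        if (*p < '0' || *p > '9')
            return -1; /* rejects "123+" and "123+-4" */
        errno = 0;
        off = strtol(p, &end, 10);
        if (errno)
            return -1;
    }
    if (*end != '\0')
        return -1; /* trailing garbage */

    *job_id = id;
    *offset = off;
    return 0;
}
```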
-
- 11 Aug, 2017 5 commits
-
-
Alejandro Sanchez authored
Fix sview to avoid messages to stderr when modifying a block, partition, or reservation. bug 3217
-
Danny Auble authored
This will allow Dell's custom syscfg to work correctly. NOTE: Dell calls flat memory just memory. Bug 4034
-
Morris Jette authored
Doing so would break the current scheduling logic.
-
Danny Auble authored
Bug 4059
-
Dominik Bartkiewicz authored
-
- 10 Aug, 2017 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
- 07 Aug, 2017 2 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
Bug 4019
-
- 04 Aug, 2017 6 commits
-
-
Morris Jette authored
truncation of core specification and not reserving the specified cores. Fixes Coverity CID 45174 and 45175 Bug 4053
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
the tree. Bug 4050
-
Morris Jette authored
Modify launch/slurm plugin to signal all components of a pack job rather than just the one (modify to use a list of step context records).
-
Morris Jette authored
If prolog is running when attempting to signal a step, then return EAGAIN and retry rather than simply returning SLURM_ERROR and aborting.
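The retry-on-EAGAIN pattern described here can be sketched as a bounded loop. Everything below is illustrative: `signal_with_retry()` and the stub `fake_signal_step()` are hypothetical, the retry limit is arbitrary, and a real caller would normally sleep between attempts.

```c
#include <errno.h>

/* Callback type: returns 0 on success, EAGAIN if the step is not
 * ready yet (e.g. prolog still running), or another error code. */
typedef int (*signal_fn)(void *arg);

/* Call fn until it stops returning EAGAIN or max_tries is reached.
 * Returns the last result, which is EAGAIN if all tries were used. */
int signal_with_retry(signal_fn fn, void *arg, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int rc = fn(arg);

        if (rc != EAGAIN)
            return rc;
        /* A real implementation would pause here before retrying. */
    }
    return EAGAIN;
}

/* Test stub: arg points to an int counting how many more calls should
 * report "prolog still running" before the signal can be delivered. */
int fake_signal_step(void *arg)
{
    int *busy_calls = arg;

    if (*busy_calls > 0) {
        (*busy_calls)--;
        return EAGAIN;
    }
    return 0;
}
```

The difference from returning a generic error immediately is that a transient condition is retried, while any other error code is still passed straight back to the caller.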
-
- 03 Aug, 2017 1 commit
-
-
Morris Jette authored
Fix I/O race condition on step termination for srun launching multiple pack job groups. Without this change application output might be lost and/or the srun command might hang after some tasks exit.
-
- 02 Aug, 2017 4 commits
-
-
Marshall Garey authored
Would fail when trying to create the clustername file because the StateSaveLocation path didn't exist yet. Bug 3988
-
Marshall Garey authored
srun jobs that could start immediately and requested multiple partitions didn't run in the highest priority partition if the highest priority partition wasn't listed first. Note that scontrol show jobs may now show the partition list in priority order, since the job's partition list gets sorted by priority. Bug 4015
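Sorting a job's partition list so the highest priority partition comes first can be sketched with `qsort`. The `toy_part_t` type, its fields, and `sort_parts_by_priority()` are hypothetical and only illustrate the ordering; they are not Slurm's partition record or list API.

```c
#include <stdlib.h>

/* Hypothetical, simplified partition record. */
typedef struct {
    const char *name;
    int priority; /* larger value means higher priority */
} toy_part_t;

/* Comparator for descending priority; written without subtraction
 * to avoid integer overflow on extreme values. */
static int cmp_priority_desc(const void *a, const void *b)
{
    const toy_part_t *pa = a;
    const toy_part_t *pb = b;

    return (pb->priority > pa->priority) - (pb->priority < pa->priority);
}

/* Reorder the array so the highest priority partition is tried first. */
void sort_parts_by_priority(toy_part_t *parts, size_t n)
{
    qsort(parts, n, sizeof(*parts), cmp_priority_desc);
}
```

With the list sorted this way, picking the first usable entry selects the highest priority partition regardless of the order the user listed them in.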
-
Tim Wickberg authored
Bug 3956.
-
Morris Jette authored
Add pack_job_id and pack_job_offset to accounting database. Modified sacct to accept pack job ID specification using "#+#" notation. Modified sstat to accept pack job ID specification using "#+#" notation.
-