- 12 Sep, 2017 2 commits
-
-
Morris Jette authored
Enable them only with SchedulerParameters=enable_hetero_jobs OR when the MPI type is "none".
-
Brian Christiansen authored
Do pointer comparisons rather than strcmps for a ~80x speedup. Bug 3529.

e.g. 1000 nodes, 8000 tasks:

    [Sep 11 14:24:15.873639 20992 srvcn 0x7f8c1cdda700] _task_layout_hostfile: hostfile processing took usec=2152678 (orig)
    [Sep 11 14:27:46.173424 20992 srvcn 0x7f8c1c6d3700] _task_layout_hostfile: hostfile processing took usec=2142997 (orig)
    [Sep 11 14:32:32.245420 4037 srvcn 0x7f12de4e4700] _task_layout_hostfile: hostfile processing took usec=26198 (node ptrs)
    [Sep 11 14:36:12.88769 4037 srvcn 0x7f12de6e6700] _task_layout_hostfile: hostfile processing took usec=25515 (node ptrs)
    [Sep 11 14:41:38.339162 4037 srvcn 0x7f132c8d5700] _task_layout_hostfile: hostfile processing took usec=27459 (node ptrs)
    [Sep 11 15:16:59.575189 1874 srvcn 0x7f3dae3f0700] _task_layout_hostfile: hostfile processing took usec=30129 (node ptrs)
    [Sep 11 15:20:50.365004 1874 srvcn 0x7f3dc8b34700] _task_layout_hostfile: hostfile processing took usec=29884 (node ptrs)
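A minimal sketch of the idea behind the speedup (types and names here are illustrative, not Slurm's): once each hostname in the hostfile has been resolved to a single canonical node record, the layout loop can compare record addresses instead of calling strcmp() per task.

    #include <string.h>
    #include <stdbool.h>

    /* Hypothetical node record; not Slurm's node_record_t. */
    struct node_rec {
        const char *name;
    };

    /* Before: one O(strlen) string comparison per check. */
    static bool same_node_slow(const char *a, const char *b)
    {
        return strcmp(a, b) == 0;
    }

    /* After: each hostname is looked up once, so the hot loop only
     * compares pointers -- constant time regardless of name length. */
    static bool same_node_fast(const struct node_rec *a,
                               const struct node_rec *b)
    {
        return a == b;
    }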
-
- 11 Sep, 2017 2 commits
-
-
Morris Jette authored
-
Ole H Nielsen authored
-
- 08 Sep, 2017 6 commits
-
-
Dominik Bartkiewicz authored
If /proc was inaccessible, proc_name would leak. Put an explicit length cap in sprintf() to avoid a compiler warning; the size is checked immediately before this point, so this just makes the 10-character limit explicit. Bug 4062.
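A hedged sketch of both points (function and variable names are mine): the early-return path must free the buffer, and a %.10s precision bounds the copy so the compiler can prove there is no overflow.

    #include <stdio.h>
    #include <stdlib.h>

    /* dst must hold at least 11 bytes (10 chars + NUL). */
    static void store_proc_name(char dst[11], int proc_fd, char *proc_name)
    {
        if (proc_fd < 0) {      /* /proc inaccessible */
            free(proc_name);    /* the leak: this path returned early
                                 * without freeing */
            return;
        }
        /* proc_name's length was validated to <= 10 just above, so
         * the precision only makes that bound explicit. */
        sprintf(dst, "%.10s", proc_name);
        free(proc_name);
    }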
-
Morris Jette authored
-
Tim Wickberg authored
If the network path to shared storage used for the StateSaveLocation is separate from that used to communicate with the cluster, both the primary and backup controllers can end up acting as master on loss of the cluster network. Alter the HA takeover code path to make sure that the job state save file is not still being updated by the primary slurmctld. If it is, refuse to takeover and retry again later. Bug 3592.
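A sketch of the guard under stated assumptions (the path argument, the 60-second window, and the function name are mine, not slurmctld's): before taking over, the backup checks whether the job state save file has been written recently; if so, the primary is evidently still alive on the storage network and takeover is refused until a later retry.

    #include <stdbool.h>
    #include <sys/stat.h>
    #include <time.h>

    static bool primary_still_saving(const char *job_state_file)
    {
        struct stat st;

        if (stat(job_state_file, &st) != 0)
            return false;   /* no state file: nothing blocks takeover */
        /* 60s window is an assumed illustration, not Slurm's value */
        return (time(NULL) - st.st_mtime) < 60;
    }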
-
Dominik Bartkiewicz authored
Bug 4062.
-
Marshall Garey authored
-
Marshall Garey authored
Bug 4084
-
- 07 Sep, 2017 2 commits
-
-
Dominik Bartkiewicz authored
Bug 3824
-
Morris Jette authored
Do not run the Node Health Check on termination of the external step, as this happens when the job allocation ends and the job-level NHC will be executed anyway. Bug 4074
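In outline (the guard function is illustrative; SLURM_EXTERN_CONT is the special step ID Slurm assigns to the external step):

    #include <stdbool.h>
    #include <stdint.h>

    #define SLURM_EXTERN_CONT 0xfffffffc    /* external step's step ID */

    static bool step_wants_nhc(uint32_t step_id)
    {
        /* The extern step terminates with the allocation itself, and
         * the job-level NHC runs at that point anyway, so running it
         * here would do the same work twice. */
        return step_id != SLURM_EXTERN_CONT;
    }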
-
- 05 Sep, 2017 1 commit
-
-
Brian Christiansen authored
From ea621ec4
-
- 04 Sep, 2017 1 commit
-
-
Alejandro Sanchez authored
Initially, job memory limits were tested at submission time through _validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested again at scheduling time, leading to jobs incorrectly being scheduled against partitions where the job exceeded their MaxMemPer* limit (which can in turn be inherited from the system-wide limit). NOTE: A new WAIT_PN_MEM_LIMIT job_state_reason enum component was added to support this new waiting reason. Bug 2291.
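A minimal sketch of the per-partition test that now has to be repeated at scheduling time (the struct and field names are assumptions, not slurmctld's):

    #include <stdbool.h>
    #include <stdint.h>

    struct part_limits {
        uint64_t max_mem_per_cpu;   /* MaxMemPerCPU in MB, 0 = unlimited */
    };

    /* Submission-time validation alone is not enough: a job submitted
     * to several partitions must be re-tested against each candidate
     * partition's limit when the scheduler actually considers it. */
    static bool job_mem_fits(uint64_t mem_per_cpu,
                             const struct part_limits *part)
    {
        if (part->max_mem_per_cpu == 0)
            return true;
        return mem_per_cpu <= part->max_mem_per_cpu;
    }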
-
- 02 Sep, 2017 1 commit
-
-
Morris Jette authored
Modify step create logic so that all components of a heterogeneous job launched by a single srun command have the same step ID value.
-
- 01 Sep, 2017 4 commits
-
-
Danny Auble authored
checked on submit. This only mattered when submitting a job to multiple partitions. Bug 4066
-
Danny Auble authored
on node 0. Bug 4035
-
Morris Jette authored
Prevent a heterogeneous job allocation from including the same nodes in multiple components (required by MPI jobs spanning components).
-
Tim Wickberg authored
To avoid a reliance on the slurm.conf file (as this command is designed to help you construct that file), the path used to check free space was hard-coded as /tmp. But if TmpFS is meant to be elsewhere, we'd emit the wrong value here. Bug 3272.
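The shape of the fix, as a sketch (the function name is mine): query the filesystem that will actually back TmpFS rather than assuming /tmp.

    #include <stdint.h>
    #include <sys/statvfs.h>

    /* Total size in MB of the filesystem backing the given TmpFS
     * path, or 0 if the path cannot be examined. */
    static uint64_t tmp_disk_mb(const char *tmp_fs_path)
    {
        struct statvfs s;

        if (statvfs(tmp_fs_path, &s) != 0)
            return 0;
        return ((uint64_t)s.f_blocks * s.f_frsize) / (1024 * 1024);
    }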
-
- 31 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
unsupported core-based reservations.
-
- 30 Aug, 2017 3 commits
-
-
David Gloe authored
Statically linked Cray PMI applications still expect to use some file paths containing the old SLURM_ID_HASH format. Some Cray customers have certification requirements that make recompilation difficult. The attached patch defines a macro to convert the new SLURM_ID_HASH to the old format, and writes the files and symlinks necessary for statically linked Cray PMI applications to work. Bug 4114
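A sketch of what such a conversion macro looks like (the macro names and exact bit layout here are illustrative, not quoted from the Slurm header): the new hash packs the step ID into the upper 32 bits and the job ID into the lower 32, while the old format was the decimal concatenation jobid * 10^10 + stepid.

    #include <stdint.h>

    #define ID_HASH_JOB(hash)   ((uint32_t)((hash) & 0xffffffff))
    #define ID_HASH_STEP(hash)  ((uint32_t)((hash) >> 32))

    /* Rebuild the legacy hash that statically linked Cray PMI
     * binaries expect to see in their file paths. */
    #define ID_HASH_LEGACY(hash) \
        ((uint64_t)ID_HASH_JOB(hash) * 10000000000ULL + ID_HASH_STEP(hash))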
-
Danny Auble authored
frequency on the batch step. Bug 4073. Also see Bug 3510.
-
Tim Wickberg authored
-
- 29 Aug, 2017 5 commits
-
-
Brian Christiansen authored
Bug 4090
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
relies on the primary to do so. There is a potential race condition if the backup DBD tries to create/check the database at the same time as the primary. This patch removes this race by not allowing the backup to do the check/create. Bug 3827
-
- 25 Aug, 2017 2 commits
-
-
Morris Jette authored
These are required by OpenMPI
-
Morris Jette authored
Modify output of "--mpi=list" to avoid duplicates for version numbers in mpi/pmix plugin names.
-
- 24 Aug, 2017 2 commits
-
-
Morris Jette authored
Modify sbcast command and srun's --bcast option to support heterogeneous jobs. Bug 4099
-
Alejandro Sanchez authored
Calling bit_unfmt() with a zero bit_size() bitmap leads to a later call to bit_nclear() with start=0 and stop=-1, leading to the ABRT. This scenario happened when cgroup.conf has ConstrainDevices=yes and task_cgroup_devices_create() tries to collect the GRES devices but gres_cpu_cnt=0, thus creating a zero-size p->cpus_bitmap = bit_alloc(gres_cpu_cnt); which is then passed to bit_unfmt(). gres_cpu_cnt is 0 because we have defined a gres.conf like this:

    Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
    Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
    Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
    Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3

but have no GresTypes nor GRES option in slurm.conf or the node config definition. Bug 3974
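In outline, the failure and the guard (self-contained stand-ins, not Slurm's bitstring code): with nbits == 0 the clear range becomes 0..-1 and the bounds check aborts, so the zero-size case has to be rejected before parsing.

    #include <stdint.h>

    struct bitmap { int nbits; uint64_t *words; };

    static int unfmt_guarded(struct bitmap *b, const char *fmt)
    {
        if (!b || b->nbits == 0)
            return -1;      /* clearing bits 0..-1 would abort */
        /* ... parse fmt and set bits, as bit_unfmt() does ... */
        (void)fmt;
        return 0;
    }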
-
- 23 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Running slurmctld under valgrind while operating with jobcomp/elasticsearch reported the following bytes definitely lost:

    ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
    ==27403== at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==27403== by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
    ==27403== by 0x22856A: makespace (xstring.c:114)
    ==27403== by 0x2285D0: _xstrcat (xstring.c:132)
    ==27403== by 0x228CE0: _xstrfmtcat (xstring.c:291)
    ==27403== by 0x83C5BCD: ???
    ==27403== by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
    ==27403== by 0x18D8FC: job_completion_logger (job_mgr.c:13652)

It turns out the buffer generated in slurm_jobcomp_log_record() was xstrdup'ed to the corresponding job_node->serialized_job, but the originally generated buffer was never freed afterwards. The fix consists in changing the transfer so that instead of xstrdup'ing the char * we just assign the pointer and NULL the source buffer. job_node->serialized_job was already xfree'd properly later when the job was indexed. Discovered while working on Bug 4065.
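The shape of the fix, with plain malloc/free standing in for Slurm's xmalloc/xfree (struct and function names are illustrative): duplicate-and-forget leaked the source buffer; transferring ownership does not.

    #include <stdlib.h>

    struct job_node_sketch { char *serialized_job; };

    static void take_ownership(struct job_node_sketch *n, char **buf)
    {
        free(n->serialized_job);    /* drop any previous value */
        n->serialized_job = *buf;   /* hand the pointer over ... */
        *buf = NULL;                /* ... and disarm the source so the
                                     * caller can neither free nor
                                     * reuse it */
    }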
-
- 22 Aug, 2017 2 commits
-
-
Alejandro Sanchez authored
Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
-
Philip Kovacs authored
Bug 4095
-
- 21 Aug, 2017 2 commits
-
-
Isaac Hartung authored
Print numbers using an exponential format when required to fit in the allocated field width. The sacctmgr and sshare commands are affected. Bug 1749
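A sketch of the formatting rule (widths and precision are illustrative): try fixed-point first, and fall back to scientific notation only when the rendered string would overflow the column.

    #include <stdio.h>
    #include <string.h>

    static void print_number(double v, int width)
    {
        char buf[64];

        snprintf(buf, sizeof(buf), "%.2f", v);
        if ((int)strlen(buf) > width)               /* too wide for column */
            snprintf(buf, sizeof(buf), "%.2e", v);  /* e.g. 1.23e+09 */
        printf("%*s", width, buf);
    }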
-
Alejandro Sanchez authored
Given a configuration with TopologyParam including the Dragonfly option, if a job requested a --switches count, the timeout specified by either the job request or the max_switch_wait SchedulerParameters option was not respected. This was due to the leaf_switch_count variable not being incremented in the _eval_nodes_dfly() function when needed, as is done in _eval_nodes_topo(), the latter being an execution path which already waits correctly for the switch count timeout. Bug 4056
-
- 18 Aug, 2017 3 commits
-
-
Morris Jette authored
-
Brian Christiansen authored
-
Alejandro Sanchez authored
Add the following fields as environment variables: CLUSTER, DEPENDENCY, DERIVEDEC, EXITCODE, GROUPNAME, QOS, RESERVATION, USERNAME. The LIMIT environment variable's value (the job's TimeLimit) has been changed to the D-HH:MM:SS format. Bug 3942
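A minimal sketch of the export, using plain setenv() rather than Slurm's env helpers (the function and its arguments are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    static void export_job_env(const char *cluster, int days, int hours,
                               int mins, int secs)
    {
        char limit[32];

        setenv("CLUSTER", cluster, 1);
        /* LIMIT now carries the job's TimeLimit as D-HH:MM:SS */
        snprintf(limit, sizeof(limit), "%d-%02d:%02d:%02d",
                 days, hours, mins, secs);
        setenv("LIMIT", limit, 1);
    }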
-