- 08 Sep, 2017 2 commits
-
-
Marshall Garey authored
-
Marshall Garey authored
Bug 4084
-
- 07 Sep, 2017 2 commits
-
-
Dominik Bartkiewicz authored
bug 3824
-
Morris Jette authored
Do not run the Node Health Check on termination of the external step as this happens when the job allocation ends and the job NHC will be executed anyway. Bug 4074
-
- 05 Sep, 2017 1 commit
-
-
Brian Christiansen authored
From ea621ec4
-
- 04 Sep, 2017 1 commit
-
-
Alejandro Sanchez authored
Initially job mem limits were tested at submission time through _validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested again at scheduling time. As a result, jobs could be incorrectly scheduled in partitions whose MaxMemPer* limit they exceeded (a limit which can in turn be inherited from the system-wide limit). NOTE: New WAIT_PN_MEM_LIMIT job_state_reason enum component added to support this new waiting reason. Bug 2291.
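The scheduling-time re-check described above can be pictured with a minimal sketch. This is not the slurmctld code: the function names, the dictionary of partition limits, and the use of `None` for "no limit" are all assumptions made for illustration. Only the WAIT_PN_MEM_LIMIT reason name comes from the commit message.

```python
from typing import Dict, Optional, Tuple

# Reason name taken from the commit message; everything else is hypothetical.
WAIT_PN_MEM_LIMIT = "WAIT_PN_MEM_LIMIT"


def effective_limit(partition_limit_mb: Optional[int],
                    system_limit_mb: Optional[int]) -> Optional[int]:
    """Use the partition limit if set, else inherit the system-wide limit."""
    if partition_limit_mb is not None:
        return partition_limit_mb
    return system_limit_mb


def mem_wait_reason(job_mem_mb: int,
                    partition_limit_mb: Optional[int],
                    system_limit_mb: Optional[int] = None) -> Optional[str]:
    """Return a waiting reason if the job exceeds the limit, else None."""
    limit = effective_limit(partition_limit_mb, system_limit_mb)
    if limit is not None and job_mem_mb > limit:
        return WAIT_PN_MEM_LIMIT
    return None


def schedule(job_mem_mb: int,
             partitions: Dict[str, Optional[int]],
             system_limit_mb: Optional[int] = None
             ) -> Tuple[Optional[str], Optional[str]]:
    """Re-test the memory limit for every candidate partition at scheduling time.

    Returns (partition_name, None) for the first partition that accepts the
    job, or (None, reason) if every partition rejects it.
    """
    reason = None
    for name, limit in partitions.items():
        reason = mem_wait_reason(job_mem_mb, limit, system_limit_mb)
        if reason is None:
            return name, None
    return None, reason
```

With these assumptions, a 4096 MB job offered to a 2048 MB partition and an 8192 MB partition lands in the larger one, and it stays pending with WAIT_PN_MEM_LIMIT when only the smaller one is available.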
-
- 02 Sep, 2017 1 commit
-
-
Morris Jette authored
Modify step create logic so that all components of a heterogeneous job launched by a single srun command have the same step ID value.
-
- 01 Sep, 2017 4 commits
-
-
Danny Auble authored
checked on submit. This only mattered when submitting a job to multiple partitions. Bug 4066
-
Danny Auble authored
on node 0. Bug 4035
-
Morris Jette authored
Prevent a heterogeneous job allocation from including the same nodes in multiple components (required by MPI jobs spanning components).
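As a rough illustration of the constraint above, the check amounts to verifying that the node sets of the components are pairwise disjoint. The sketch below assumes each component is already expanded into a plain list of node names; the function name `shared_nodes` is hypothetical and is not taken from Slurm.

```python
from typing import List, Set


def shared_nodes(components: List[List[str]]) -> Set[str]:
    """Return the node names that appear in more than one component."""
    seen: Set[str] = set()
    shared: Set[str] = set()
    for nodes in components:
        current = set(nodes)
        shared |= seen & current  # nodes already claimed by an earlier component
        seen |= current
    return shared
```

An allocation would be acceptable under this model only when `shared_nodes(...)` returns an empty set.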
-
Tim Wickberg authored
To avoid a reliance on the slurm.conf file (as this command is designed to help you construct said file) the path to check the space was hard-coded as /tmp. But, if TmpFS is meant to be elsewhere we'd emit the wrong value here. Bug 3272.
-
- 31 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
unsupported core-based reservations.
-
- 30 Aug, 2017 3 commits
-
-
David Gloe authored
Statically linked Cray PMI applications still expect to use some file paths containing the old SLURM_ID_HASH format. Some Cray customers have certification requirements that make recompilation difficult. The attached patch defines a macro to convert the new SLURM_ID_HASH to the old format, and writes the files and symlinks necessary for statically linked Cray PMI applications to work. Bug 4114
-
Danny Auble authored
frequency on the batch step. Bug 4073 Also see Bug 3510
-
Tim Wickberg authored
-
- 29 Aug, 2017 5 commits
-
-
Brian Christiansen authored
Bug 4090
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
relies on the primary to do so. There is a potential race condition if the backup DBD tries to create/check the database at the same time as the primary. This patch removes this race by not allowing the backup to do the check/create. Bug 3827
-
- 25 Aug, 2017 2 commits
-
-
Morris Jette authored
These are required by OpenMPI
-
Morris Jette authored
Modify output of "--mpi=list" to avoid duplicates for version numbers in mpi/pmix plugin names.
-
- 24 Aug, 2017 2 commits
-
-
Morris Jette authored
Modify sbcast command and srun's --bcast option to support heterogeneous jobs. bug 4099
-
Alejandro Sanchez authored
Calling bit_unfmt() with a zero bit_size() bitmap leads to a later call to bit_nclear() with start=0 and stop=-1, leading to the ABRT. This scenario happened when cgroup.conf has ConstrainDevices=yes and task_cgroup_devices_create() tries to collect the GRES devices but gres_cpu_cnt=0, so that p->cpus_bitmap = bit_alloc(gres_cpu_cnt); creates a zero-size bitmap which is then passed as an argument to bit_unfmt(). gres_cpu_cnt is 0 because we have defined a gres.conf like this:
  Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
  Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
  Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
  Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
but have neither GresTypes nor a GRES option in the slurm.conf / node config def. Bug 3974
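The failure mode can be modelled in a few lines. The sketch below is a toy model, not Slurm's bitstring implementation: it borrows the name `bit_nclear` only to mirror the description, represents a bitmap as a Python list of booleans, and shows one possible guard (`clear_bitmap`, a hypothetical helper). The commit message does not say how the actual fix was made.

```python
from typing import List


def bit_nclear(bitmap: List[bool], start: int, stop: int) -> None:
    """Clear bits start..stop inclusive; reject ranges outside the bitmap."""
    if not (0 <= start <= stop < len(bitmap)):
        raise ValueError(
            f"invalid range {start}..{stop} for {len(bitmap)}-bit bitmap")
    for i in range(start, stop + 1):
        bitmap[i] = False


def clear_bitmap(bitmap: List[bool]) -> None:
    """Clear every bit, skipping the call entirely for a zero-size bitmap.

    Without the guard, a zero-size bitmap yields stop = len(bitmap) - 1 = -1,
    which is the start=0, stop=-1 call described in the commit message.
    """
    if not bitmap:
        return
    bit_nclear(bitmap, 0, len(bitmap) - 1)
```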
-
- 23 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
Running slurmctld under valgrind while operating with jobcomp/elasticsearch reported the following bytes definitely lost:
==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
==27403==    by 0x22856A: makespace (xstring.c:114)
==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
==27403==    by 0x83C5BCD: ???
==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to the corresponding job_node->serialized_job, but the originally generated buffer wasn't freed afterwards. The fix changes the transfer so that instead of xstrdup'ing the char * we just assign the pointer and NULL the buffer. The job_node->serialized_job was already xfree'd properly later when the job was indexed. Discovered while working on Bug 4065.
-
- 22 Aug, 2017 2 commits
-
-
Alejandro Sanchez authored
Otherwise the resulting URL may be invalid. Update documentation while here as well. Bug 4065.
-
Philip Kovacs authored
bug 4095
-
- 21 Aug, 2017 2 commits
-
-
Isaac Hartung authored
Print numbers using exponential format if required to fit in allocated field width. The sacctmgr and sshare commands are impacted. bug 1749
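The idea of falling back to exponential notation when a value does not fit its column can be sketched as follows. This is an illustrative Python model, not the sacctmgr or sshare code; the function name `fit_number`, the right alignment, and the `#` overflow marker are assumptions.

```python
def fit_number(value: int, width: int) -> str:
    """Format value within width columns, switching to exponential if needed."""
    plain = str(value)
    if len(plain) <= width:
        return plain.rjust(width)
    # Try exponential notation, dropping precision until the text fits.
    for precision in range(width, -1, -1):
        text = f"{value:.{precision}e}"
        if len(text) <= width:
            return text.rjust(width)
    # Nothing fits, even with zero digits after the decimal point.
    return "#" * width
```

For example, `fit_number(123456789012, 8)` yields `1.23e+11`, while a small value such as 42 is printed unchanged and padded to the field width.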
-
Alejandro Sanchez authored
Given a configuration whose TopologyParam includes the Dragonfly option, if a job requested a --switches count, the count timeout specified by either the job request or the max_switch_wait SchedulerParameters option was not respected. This was due to the leaf_switch_count variable not being incremented in the _eval_nodes_dfly() function when needed, as is done in _eval_nodes_topo(), the latter being an execution path which already waits correctly for the switch count timeout. Bug 4056
-
- 18 Aug, 2017 3 commits
-
-
Morris Jette authored
-
Brian Christiansen authored
-
Alejandro Sanchez authored
Add the following fields as environment variables: CLUSTER, DEPENDENCY, DERIVEDEC, EXITCODE, GROUPNAME, QOS, RESERVATION, USERNAME. The value format of the LIMIT env variable (the TimeLimit of the job) has been changed to D-HH:MM:SS. Bug 3942
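A value in the D-HH:MM:SS layout can be produced as in the sketch below. It assumes the time limit is available as a whole number of minutes and that the day field is always printed, even when it is zero; both points are assumptions, and `format_time_limit` is a hypothetical name.

```python
def format_time_limit(minutes: int) -> str:
    """Render a time limit given in minutes as D-HH:MM:SS."""
    total_seconds = minutes * 60
    days, rem = divmod(total_seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    mins, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{mins:02d}:{secs:02d}"
```

Under these assumptions a 90 minute limit becomes `0-01:30:00` and a 1500 minute limit becomes `1-01:00:00`.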
-
- 17 Aug, 2017 2 commits
-
-
Morris Jette authored
In srun, if only some steps are allocated and one step allocation fails, then delete all allocated steps.
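The all-or-nothing behaviour described above follows a common rollback pattern, sketched here in Python. The callables `create_step` and `delete_step` are placeholders supplied by the caller, not srun functions, and the reverse deletion order is a choice made for this sketch.

```python
from typing import Callable, List, TypeVar

Request = TypeVar("Request")
Step = TypeVar("Step")


def create_all_steps(requests: List[Request],
                     create_step: Callable[[Request], Step],
                     delete_step: Callable[[Step], object]) -> List[Step]:
    """Create one step per request; on any failure, delete those already made."""
    created: List[Step] = []
    for request in requests:
        try:
            created.append(create_step(request))
        except Exception:
            # Roll back in reverse order, then report the original failure.
            for step in reversed(created):
                delete_step(step)
            raise
    return created
```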
-
Morris Jette authored
Coverity CID 44649 Bug 4085
-
- 16 Aug, 2017 3 commits
-
-
Morris Jette authored
Set SLURM_NTASKS environment variable to reflect global task count (needed by MPI).
-
Morris Jette authored
Set SLURM_PROCID environment variable to reflect global task rank (needed by MPI).
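One way to picture global values for SLURM_NTASKS and SLURM_PROCID across the components of a heterogeneous job is to give each component a rank offset equal to the number of tasks in the components before it. The sketch below assumes ranks are assigned contiguously in component order, which the commit messages do not state; `global_task_env` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def global_task_env(component_task_counts: List[int]
                    ) -> List[Tuple[int, int, Dict[str, str]]]:
    """Return (component, local_rank, env) for every task in the job.

    SLURM_NTASKS is the task count summed over all components, and
    SLURM_PROCID is the local rank plus the offset of the component.
    """
    total = sum(component_task_counts)
    result = []
    offset = 0
    for component, count in enumerate(component_task_counts):
        for local_rank in range(count):
            env = {
                "SLURM_NTASKS": str(total),
                "SLURM_PROCID": str(offset + local_rank),
            }
            result.append((component, local_rank, env))
        offset += count
    return result
```

For two components with 2 and 3 tasks, every task sees SLURM_NTASKS=5, and the first task of the second component gets SLURM_PROCID=2 under this model.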
-
Danny Auble authored
instead of local. Bug 3546
-
- 15 Aug, 2017 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
bug 3217
-
Morris Jette authored
-