- 15 Sep, 2017 2 commits
-
-
Tim Wickberg authored
-
Dominik Bartkiewicz authored
-
- 14 Sep, 2017 2 commits
-
-
Tim Wickberg authored
A second PMI2_Init() within the same step is invalid, and cannot succeed. Return an error code back to the client end, and close the fd to force the step to terminate immediately. Due to a bug in our libpmi code, just returning a cmd=response_to_init with an appropriate rc number will not tear down the connection properly, so send back something else that will trigger the error path. Bug 3520.
-
Morris Jette authored
A request to cancel a pack step leader will result in that step being cancelled on all pack job components. Needed by MPI.
-
- 13 Sep, 2017 3 commits
-
-
Morris Jette authored
mpi/pmi2 plugin - a vestigial pointer could be referenced at shutdown, resulting in an invalid memory reference.
-
Josh Samuelson authored
Bug 4154.
-
Dominik Bartkiewicz authored
Use the partition specified in the reservation, rather than using the cluster's default. Bug 4030.
-
- 12 Sep, 2017 7 commits
-
-
Danny Auble authored
default path. This makes it so you don't always have to put AllowedDevicesFile in your cgroup.conf file if your etc dir is anything other than /etc/slurm.
-
Morris Jette authored
Don't flag a job as "SPAWNED" for debugger until all process information is available for all pack job components.
-
Tim Wickberg authored
Adding a newline prevents this error: conftest.c:154:8: error: if statement has empty body [-Werror,-Wempty-body]
-
Alejandro Sanchez authored
remote cluster correctly determine the select type. Bug 2329
-
Morris Jette authored
Modify "srun --mpi=list" output to match valid option input by removing the "mpi/" prefix on each line of output.
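The prefix removal described in this commit can be sketched as below. This is only an illustration: `strip_mpi_prefix` is a hypothetical helper name, not a function taken from the Slurm source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the Slurm source): return the plugin name
 * without a leading "mpi/" so the listed value can be passed straight back
 * to --mpi=. The returned pointer aliases the input string. */
static const char *strip_mpi_prefix(const char *plugin_type)
{
	static const char prefix[] = "mpi/";
	const size_t len = sizeof(prefix) - 1;

	assert(plugin_type != NULL);
	if (strncmp(plugin_type, prefix, len) == 0)
		return plugin_type + len;
	return plugin_type;	/* no prefix: leave the name untouched */
}
```

A name such as "mpi/pmi2" would then be listed as "pmi2", which is the form the option accepts as input.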
-
Morris Jette authored
Enable them only with SchedulerParameters=enable_hetero_jobs or when the MPI type is "none".
-
Brian Christiansen authored
Do pointer comparisons rather than strcmps. ~80x speedup. Bug 3529.
e.g. 1000 nodes, 8000 tasks:
[Sep 11 14:24:15.873639 20992 srvcn 0x7f8c1cdda700] _task_layout_hostfile: hostfile processing took usec=2152678 (orig)
[Sep 11 14:27:46.173424 20992 srvcn 0x7f8c1c6d3700] _task_layout_hostfile: hostfile processing took usec=2142997 (orig)
[Sep 11 14:32:32.245420 4037 srvcn 0x7f12de4e4700] _task_layout_hostfile: hostfile processing took usec=26198 (node ptrs)
[Sep 11 14:36:12.88769 4037 srvcn 0x7f12de6e6700] _task_layout_hostfile: hostfile processing took usec=25515 (node ptrs)
[Sep 11 14:41:38.339162 4037 srvcn 0x7f132c8d5700] _task_layout_hostfile: hostfile processing took usec=27459 (node ptrs)
[Sep 11 15:16:59.575189 1874 srvcn 0x7f3dae3f0700] _task_layout_hostfile: hostfile processing took usec=30129 (node ptrs)
[Sep 11 15:20:50.365004 1874 srvcn 0x7f3dc8b34700] _task_layout_hostfile: hostfile processing took usec=29884 (node ptrs)
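A minimal sketch of the general idea behind this change, assuming a simplified `struct node_rec` and hypothetical helper names (none of this is the actual Slurm code): each host name is resolved to a record pointer once, and the per-task inner loop then compares pointers instead of strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node record; Slurm's real structure differs. */
struct node_rec {
	const char *name;
};

/* Resolve a host name to its record once; this is the only place that
 * pays for string comparisons. Returns NULL if the name is unknown. */
static const struct node_rec *find_node(const struct node_rec *nodes,
					size_t node_cnt, const char *name)
{
	assert(nodes != NULL && name != NULL);
	for (size_t i = 0; i < node_cnt; i++) {
		if (strcmp(nodes[i].name, name) == 0)
			return &nodes[i];
	}
	return NULL;
}

/* Count the tasks placed on `target`. Each task already holds a resolved
 * pointer, so the test is a pointer comparison rather than a strcmp. */
static size_t count_tasks_on_node(const struct node_rec *const *task_nodes,
				  size_t task_cnt,
				  const struct node_rec *target)
{
	size_t cnt = 0;

	assert(task_nodes != NULL);
	for (size_t i = 0; i < task_cnt; i++) {
		if (task_nodes[i] == target)
			cnt++;
	}
	return cnt;
}
```

The saving comes from the nested loop over nodes and tasks: with 1000 nodes and 8000 tasks, the string comparison is no longer repeated for every node and task pair.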
-
- 11 Sep, 2017 2 commits
-
-
Morris Jette authored
-
Ole H Nielsen authored
-
- 08 Sep, 2017 6 commits
-
-
Dominik Bartkiewicz authored
If /proc was inaccessible, proc_name would leak. Put an explicit length cap in sprintf to avoid a warning. The size is checked immediately before this point, so this just makes the 10-char limit explicit. Bug 4062.
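The explicit length cap can be illustrated with a format precision. This is a sketch under assumptions: `build_proc_path`, the "/proc/<pid>/stat" layout, and the use of `snprintf` (rather than the `sprintf` named in the commit) are all chosen for the example and are not taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: build "/proc/<pid>/stat" from a PID string.
 * "%.10s" copies at most 10 characters of pid_str, which states the limit
 * in the format itself so the compiler can bound the output length. */
static int build_proc_path(char *buf, size_t buf_len, const char *pid_str)
{
	assert(buf != NULL && pid_str != NULL);
	if (strlen(pid_str) > 10)
		return -1;	/* the size check that precedes the format call */

	int n = snprintf(buf, buf_len, "/proc/%.10s/stat", pid_str);
	return (n < 0 || (size_t)n >= buf_len) ? -1 : 0;
}
```

The precision is redundant with the earlier length check at run time; its purpose is to make the bound visible to the compiler's format-overflow analysis.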
-
Morris Jette authored
-
Tim Wickberg authored
If the network path to shared storage used for the StateSaveLocation is separate from that used to communicate with the cluster, both the primary and backup controllers can end up acting as master on loss of the cluster network. Alter the HA takeover code path to make sure that the job state save file is not still being updated by the primary slurmctld. If it is, refuse to take over and retry later. Bug 3592.
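One plausible way to test whether the state file is still being updated is to compare its modification time with the current time. The sketch below assumes that approach; the function names, the quiet-period parameter, and the choice to refuse takeover when the file cannot be read are all assumptions for illustration, not the logic of the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical rule: the file counts as quiet only if it has not been
 * modified for at least quiet_secs seconds. */
static bool mtime_is_quiet(time_t mtime, time_t now, time_t quiet_secs)
{
	return (now - mtime) >= quiet_secs;
}

/* Hypothetical check run by a backup controller before taking over.
 * Returns true only when the state file exists and looks idle. If the
 * file cannot be examined, refuse (assumption) so the caller retries. */
static bool state_file_is_quiet(const char *path, time_t now,
				time_t quiet_secs)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return false;
	return mtime_is_quiet(st.st_mtime, now, quiet_secs);
}
```

A caller would pass `time(NULL)` as `now` and attempt the takeover again later whenever the function returns false.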
-
Dominik Bartkiewicz authored
Bug 4062.
-
Marshall Garey authored
-
Marshall Garey authored
Bug 4084
-
- 07 Sep, 2017 2 commits
-
-
Dominik Bartkiewicz authored
Bug 3824
-
Morris Jette authored
Do not run the Node Health Check on termination of the external step as this happens when the job allocation ends and the job NHC will be executed anyway. Bug 4074
-
- 05 Sep, 2017 1 commit
-
-
Brian Christiansen authored
From ea621ec4
-
- 04 Sep, 2017 1 commit
-
-
Alejandro Sanchez authored
Initially, job memory limits were tested at submission time through _validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested again at scheduling time. This led to jobs incorrectly being scheduled against partitions where the job exceeded the partition's MaxMemPer* limit (which can in turn be inherited from the system-wide limit). NOTE: New WAIT_PN_MEM_LIMIT job_state_reason enum component added to support this new waiting reason. Bug 2291.
-
- 02 Sep, 2017 1 commit
-
-
Morris Jette authored
Modify step create logic so that all components of a heterogeneous job launched by a single srun command have the same step ID value.
-
- 01 Sep, 2017 4 commits
-
-
Danny Auble authored
checked on submit. This only mattered when submitting a job to multiple partitions. Bug 4066
-
Danny Auble authored
on node 0. Bug 4035
-
Morris Jette authored
Prevent a heterogeneous job allocation from including the same nodes in multiple components (required by MPI jobs spanning components).
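The overlap test implied by this commit can be sketched with plain word-array bitmaps, one bit per node. The representation and the name `node_sets_overlap` are assumptions for the example; Slurm has its own bitmap type and API, which are not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: each component's allocation is a bitmap of `words`
 * 64-bit words with one bit per node. Two components conflict if any
 * node bit is set in both bitmaps. */
static bool node_sets_overlap(const uint64_t *a, const uint64_t *b,
			      size_t words)
{
	assert(a != NULL && b != NULL);
	for (size_t i = 0; i < words; i++) {
		if (a[i] & b[i])
			return true;
	}
	return false;
}
```

A scheduler using this idea would reject, or re-plan, a component whose candidate nodes overlap with those already chosen for another component of the same job.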
-
Tim Wickberg authored
To avoid a reliance on the slurm.conf file (as this command is designed to help you construct said file) the path to check the space was hard-coded as /tmp. But, if TmpFS is meant to be elsewhere we'd emit the wrong value here. Bug 3272.
-
- 31 Aug, 2017 1 commit
-
-
Alejandro Sanchez authored
unsupported core-based reservations.
-
- 30 Aug, 2017 3 commits
-
-
David Gloe authored
Statically linked Cray PMI applications still expect to use some file paths containing the old SLURM_ID_HASH format. Some Cray customers have certification requirements that make recompilation difficult. The attached patch defines a macro to convert the new SLURM_ID_HASH to the old format, and writes the files and symlinks necessary for statically linked Cray PMI applications to work. Bug 4114
-
Danny Auble authored
frequency on the batch step. Bug 4073 Also see Bug 3510
-
Tim Wickberg authored
-
- 29 Aug, 2017 5 commits
-
-
Brian Christiansen authored
Bug 4090
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
relies on the primary to do so. There is a potential race condition if the backup DBD tries to create/check the database at the same time as the primary. This patch removes this race by not allowing the backup to do the check/create. Bug 3827
-