- 27 Nov, 2017 2 commits
-
-
Danny Auble authored
since the ABI could change from one version to the other. Bug 4435
-
Felip Moll authored
Bug 4431
-
- 24 Nov, 2017 1 commit
-
-
Brian Christiansen authored
If a pending federated job exists on clusters 2 and 3 and squeue is run from cluster 1, then the active siblings can come and go depending on which cluster returns the job info first and on whether that cluster is the origin cluster. Only the origin cluster knows where the active siblings are.
-
- 22 Nov, 2017 4 commits
-
-
Brian Christiansen authored
from status commands. Bug 4341
-
Danny Auble authored
Add in strong_alias calls so these functions will appear in libslurm and not just libslurmfull. Otherwise test7.3 can fail due to a missing symbol if a gpu gres is allocated to the job. Bug 4415.
-
Tim Wickberg authored
This setting causes /unix to be omitted from the xauth string. Simplify the regex to handle this by adding / to the earlier match and dropping the /unix pattern. Bug 4417.
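A minimal sketch of the simplification described above, assuming `xauth list` entries look like `host/unix:display  MIT-MAGIC-COOKIE-1  cookie`. Both patterns and the sample lines are hypothetical illustrations, not the regex used in Slurm. Allowing `/` in the host part lets a single pattern accept entries with or without `/unix`.

```python
import re

# Hypothetical "before" pattern: requires a literal /unix after the host name.
OLD_STYLE = re.compile(
    r"^[a-zA-Z0-9.-]+/unix:(\d+)\s+MIT-MAGIC-COOKIE-1\s+([0-9a-fA-F]+)"
)

# Hypothetical "after" pattern: "/" is allowed in the host part and the
# explicit /unix pattern is dropped, so both entry styles match.
SIMPLIFIED = re.compile(
    r"^[a-zA-Z0-9./-]+:(\d+)\s+MIT-MAGIC-COOKIE-1\s+([0-9a-fA-F]+)"
)


def extract_cookie(pattern, line):
    """Return the cookie from an xauth-style line, or None if it does not match."""
    m = pattern.match(line)
    return m.group(2) if m else None


with_unix = "node01/unix:10  MIT-MAGIC-COOKIE-1  0123abcd"
without_unix = "node01:10  MIT-MAGIC-COOKIE-1  0123abcd"
```

With these sample lines, the old-style pattern rejects the entry without `/unix`, while the simplified pattern extracts the cookie from both.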
-
Morris Jette authored
bug 4400
-
- 21 Nov, 2017 5 commits
-
-
Dominik Bartkiewicz authored
Can cause slurmstepd to crash, as rlimit_name was pointing to part of the free'd env_name variable. Bug 4409.
-
Artem Polyakov authored
This patch has fixed the problem for me. We are going to do some more verification later today and update. But I would appreciate it if somebody else could test it as well. Signed-off-by: Danny Auble <da@schedmd.com>
-
Morris Jette authored
There was a list of pending pack job records under consideration for scheduling by the backfill plugin that was not being cleared between iterations of the backfill scheduler, resulting in various scheduling anomalies. bug 4371, 4400
-
Morris Jette authored
For heterogeneous job steps, the srun --open-mode option default value will be set to "append".
-
Patrice Peterson authored
The regex in x11_set_xauth() did not match FQDNs because it was missing a dot. Bug 4398.
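To illustrate this class of bug, here is a small sketch with two hypothetical host-name patterns (not the actual regex from x11_set_xauth()). A character class without `.` matches a short host name but stops at the first dot of an FQDN, so the rest of the pattern fails.

```python
import re

# Hypothetical pattern whose host character class lacks "."
HOST_NO_DOT = re.compile(r"^[a-zA-Z0-9-]+/unix:\d+")

# Same pattern with "." added to the character class
HOST_WITH_DOT = re.compile(r"^[a-zA-Z0-9.-]+/unix:\d+")


def matches(pattern, line):
    """True if the pattern matches at the start of the line."""
    return pattern.match(line) is not None


# Sample entries are made up for illustration.
short_entry = "node01/unix:10  MIT-MAGIC-COOKIE-1  0123abcd"
fqdn_entry = "node01.example.com/unix:10  MIT-MAGIC-COOKIE-1  0123abcd"
```

Here `matches(HOST_NO_DOT, fqdn_entry)` is false while `matches(HOST_WITH_DOT, fqdn_entry)` is true; both patterns accept the short host name.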
-
- 20 Nov, 2017 2 commits
-
-
Morris Jette authored
Add SchedulerParameters=whole_pack configuration parameter. If set, then hold, release and cancel operations on any component of a heterogeneous job will be applied to all components. bug 4374
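The described behavior can be sketched as a selection rule. This is only an illustration of the semantics stated above: the function name, arguments, and component identifiers are hypothetical and do not correspond to Slurm code.

```python
def affected_components(components, target, whole_pack):
    """Pick which components a hold/release/cancel request applies to.

    components: identifiers of all components of one heterogeneous job
    target:     the component named in the request
    whole_pack: True if SchedulerParameters contains whole_pack
    """
    if target not in components:
        raise ValueError("target is not a component of this job")
    # With whole_pack set, the operation spreads to every component;
    # otherwise it applies only to the component that was named.
    return list(components) if whole_pack else [target]
```

For a job with components `["100+0", "100+1", "100+2"]`, a request against `"100+1"` touches only that component by default and all three when whole_pack is set.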
-
Felip Moll authored
Bug 4393.
-
- 17 Nov, 2017 1 commit
-
-
Morris Jette authored
bug 4366
-
- 16 Nov, 2017 4 commits
-
-
Morris Jette authored
Correct printing error type based upon errno rather than returned rc.
-
Dominik Bartkiewicz authored
If PrologSlurmctld fails for pack job leader then kill all components of the job. bug 4379
-
Dominik Bartkiewicz authored
Add SLURM_PACK_JOB_NODELIST to PrologSlurmctld and EpilogSlurmctld environment. bug 4379
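A site prolog or epilog could read the pack job variables named in these commits roughly as follows. This is a sketch only: the helper name is hypothetical, the format of the values is not specified here, and the variables are assumed to be absent for jobs that are not heterogeneous.

```python
import os


def pack_job_info(env=None):
    """Collect the pack job variables from the environment, if present.

    Values are returned as raw strings; missing variables map to None.
    """
    env = os.environ if env is None else env
    return {
        "pack_job_id": env.get("SLURM_PACK_JOB_ID"),
        "pack_job_offset": env.get("SLURM_PACK_JOB_OFFSET"),
        "pack_job_nodelist": env.get("SLURM_PACK_JOB_NODELIST"),
    }


def is_pack_job(env=None):
    """Treat a job as heterogeneous if SLURM_PACK_JOB_ID is set (assumption)."""
    return pack_job_info(env)["pack_job_id"] is not None
```

Passing an explicit dictionary instead of the real environment makes the helper easy to exercise outside of slurmctld.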
-
Morris Jette authored
bug 4370
-
- 15 Nov, 2017 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
Prevent scheduling deadlock with multiple components of heterogeneous job in different partitions (i.e. one heterogeneous job component is higher priority in one partition and another component is lower priority in a different partition). bug 4370
-
Alejandro Sanchez authored
From within slurm_job_submit():
job_desc.pack_job_offset
From within slurm_job_modify():
job_rec.pack_job_id
job_rec.pack_job_id_set
job_rec.pack_job_offset
Bug 4372.
-
Felip Moll authored
bug 4368
-
Dominik Bartkiewicz authored
Add SLURM_PACK_JOB_ID and SLURM_PACK_JOB_OFFSET to PrologSlurmctld and EpilogSlurmctld environment. bug 4379
-
- 14 Nov, 2017 1 commit
-
-
Morris Jette authored
Avoid srun abort when trying to run on a heterogeneous job component that has ended. bug 4366
-
- 13 Nov, 2017 1 commit
-
-
Tim Wickberg authored
In a prior incarnation of the patch that introduced it, it was MaxQueryTimeLimit, and that was not updated with the code base when changed. Bug 4365.
-
- 10 Nov, 2017 2 commits
-
-
Tim Wickberg authored
-
Isaac Hartung authored
This now matches the sinfo documentation. Bug 4306.
-
- 09 Nov, 2017 5 commits
-
-
Morris Jette authored
launch/slurm plugin - Avoid using global variable for heterogeneous job steps, which could corrupt memory. bug 4333
-
Morris Jette authored
Ancient versions of OpenMPI and their derivatives (i.e. Cray MPI) are dependent upon communication ports being assigned to them by Slurm. Such MPI jobs will experience step launch failure if any component of a heterogeneous job step is unable to acquire the allocated ports. Non-heterogeneous job steps will retry step launch using a new set of communication ports (no change in Slurm behavior). NOTE: Correcting this would necessitate assigning the same set of ports to all components of the heterogeneous job (not possible today) plus changes to srun in order to better synchronize the step startup and error handling.
-
Dominik Bartkiewicz authored
Same logic as in commit fb296c70, which was done for energy. Bug 4336
-
Morris Jette authored
If heterogeneous job step is unable to acquire MPI reserved ports then avoid referencing NULL pointer. bug 4333
-
Tim Wickberg authored
Bug 3647.
-
- 08 Nov, 2017 3 commits
-
-
Tim Wickberg authored
%{_libdir} is /usr/lib64. Adding this can cause ordering problems with the library search path, and is redundant. And %{_libdir}/slurm is not needed by Slurm itself; it relies on rpath to resolve the install location instead. So, this is only useful on a non-standard install, in which case it's best left to the site to decide how to handle this instead. Bug 4344.
-
Brian Christiansen authored
Bug 4141
-
Brian Christiansen authored
When updating NumNodes, cpus, tasks, gres, memory, etc. need to be adjusted so that counts are correct. Co-authored with Danny Auble
-
- 07 Nov, 2017 3 commits
-
-
Danny Auble authored
This requires changing the resv_id and then also updating that on all the jobs. Since this requires having the job write lock we spawn a new thread instead of altering all the paths into _advance_resv_time(). Bug 4246
-
Alejandro Sanchez authored
The issue could be triggered when updating a partition's node list with node(s) that were already in the partition. This incorrectly increased node_record->part_cnt (the number of associated partitions) and thus incorrectly extended the node's array of pointers to associated partitions, leading to an array with repeated partition pointers. Bug 4318.
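The failure mode is a non-idempotent "add" operation. A minimal sketch of the guarded version follows, using a hypothetical Node type and function name rather than Slurm's node_record structure: the count and the list only grow when the partition is not already associated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node record (not Slurm's struct)."""
    name: str
    parts: list = field(default_factory=list)  # associated partition names
    part_cnt: int = 0                          # number of associated partitions


def associate(node, part_name):
    """Associate a partition with a node; a repeat is a no-op.

    Returns True if the association was added, False if it already existed.
    """
    if part_name in node.parts:
        return False  # already associated: do not bump part_cnt again
    node.parts.append(part_name)
    node.part_cnt += 1
    return True
```

Re-applying the same update, as happens when a partition is updated with nodes it already contains, then leaves part_cnt and the list unchanged.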
-
Brian Gilmer authored
On CLE 6.0 mungedir is /usr; a 'module unload' call then removes /usr/bin from PATH which is rather inconvenient. Bug 4334.
-
- 06 Nov, 2017 1 commit
-
-
Alejandro Sanchez authored
At least the node and partition write locks must be held before set_cluster_tres() is called. The slurmctld "cold-restart" path was not covered. Bug 4318.
-