- 20 Dec, 2017 14 commits
-
-
Danny Auble authored
-
Alejandro Sanchez authored
This mostly reverts commit 8325e9f8. We decided not to do this by default for the slurmd; see bug 4478 for more info/reasoning. We also didn't want to rely on #include <linux/oom.h>, since that header might not exist everywhere. Bug 4478.
-
Tim Wickberg authored
Format for inserted X11 cookies is now hostname/unix:display, rather than localhost:display. This seems to resolve some issues for different setups while not causing any issues in my test environments. This also helps avoid a conflict between multiple nodes in a larger job if the same display number is allocated on multiple nodes simultaneously. Bug 4420.
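A minimal sketch of why the new name format avoids the cross-node conflict described above. The helper names and node names are hypothetical, and this is not Slurm's cookie-handling code; only the two display-name formats come from the commit message.

```python
# Hypothetical helpers: only the two name formats are taken from the commit text.

def old_display_name(hostname: str, display: int) -> str:
    """Old format: the node's hostname is not part of the name."""
    return f"localhost:{display}"


def new_display_name(hostname: str, display: int) -> str:
    """New format: hostname/unix:display."""
    return f"{hostname}/unix:{display}"


# Two nodes of the same job that happen to allocate the same display number.
allocations = [("node01", 42), ("node02", 42)]

old_names = {old_display_name(host, disp) for host, disp in allocations}
new_names = {new_display_name(host, disp) for host, disp in allocations}

print(sorted(old_names))  # both nodes collapse to one name
print(sorted(new_names))  # each node keeps a distinct name
```

With the old format both entries share a single name, so one cookie would shadow the other; with the new format each node's entry stays distinct.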
-
Danny Auble authored
Bug 4483
-
Danny Auble authored
Bug 4541
-
Danny Auble authored
The slurm and pmix RPMs both attempt to install a libpmi. While Slurm's version should be preferred, the best way to handle this is to have both pmix and slurm packages split this off into a separate RPM that can conflict with each other, giving the admin a way to resolve the conflict manually. Bug 4511.
-
Brian Christiansen authored
The cluster name could have been freed before it was strdup'ed. This happens when a federated persistent connection is established and the cluster needs to sync up with the remote cluster. Bug 4503
-
Brian Christiansen authored
Remove the origin job when a remote sibling job was cancelled while the origin was down. e.g. (1) job originates on cluster1, (2) job runs on cluster2, (3) cluster1 goes down, (4) job is cancelled on cluster2, (5) cluster1 comes back up. The origin job should be removed if the sibling job was cancelled.
-
Tim Wickberg authored
The user_name is only populated if LaunchParameters=send_gids is set. Without this field set, the pam_setup call will get NULL instead of the user_name, leading to batch job launch failures. (The step launch code already handles this same issue separately.) Bug 4412.
-
Alejandro Sanchez authored
On a job [pack]allocation RPC request, if the allocation succeeds but sending the response message back to the client fails (i.e. srun was killed before it could receive the response), then modify the job_record pointer so that the job_state is set to FAILED, the exit_code is set as if the job got a SIGTERM signal, and the state_reason is set to FAIL_LAUNCH. Users querying the job with sacct can then discern that something bad happened in this scenario, instead of STATE being shown as COMPLETED and the ExitCode as 0:0. Bug 4513.
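As a rough illustration of what "the exit_code as if the job got a SIGTERM signal" means for the ExitCode column, the sketch below decodes a wait-style status into an `exit:signal` string. The function name is hypothetical, and it assumes the traditional Linux wait-status layout (low 7 bits hold the terminating signal, the next byte holds the exit code); it is not Slurm's own formatting code.

```python
import signal


def format_exit_code(status: int) -> str:
    """Render a wait-style status as 'exit:signal' (hypothetical helper).

    Assumes the traditional Linux layout: low 7 bits = terminating signal,
    bits 8-15 = exit code. Stopped and core-dump flags are ignored here.
    """
    term_signal = status & 0x7F
    if term_signal:
        return f"0:{term_signal}"
    return f"{(status >> 8) & 0xFF}:0"


clean_exit = format_exit_code(0)                       # job exited normally
killed_by_term = format_exit_code(int(signal.SIGTERM)) # status as if SIGTERM

print(clean_exit)
print(killed_by_term)
```

Under these assumptions a clean exit renders as `0:0`, while a status set as if the job received SIGTERM carries a non-zero signal part, which is what lets a user tell the two cases apart.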
-
Felip Moll authored
Use FREE_NULL_BUFFER instead, otherwise we could attempt to free_buffer this a second time if we jump to the rwfail label. Bug 4484.
-
Felip Moll authored
When printing fields in sacct with user-specified units (--units), the nnodes field showed an incorrect string. This commit reverts a65fa572 and avoids the unit conversion, which does not make sense outside the context of (deprecated) Blue Gene systems anyway. Bug 4490.
-
Felip Moll authored
Slurm may generate empty manifest files depending on configuration and library availability. Disable the new empty manifest check to allow builds to proceed with rpm 4.13+ / Fedora 25+. Bug 4453.
-
Morris Jette authored
-
- 19 Dec, 2017 5 commits
-
-
Danny Auble authored
before printing anything for a connection.
-
Morris Jette authored
field. Bug 4529
-
Morris Jette authored
fails. The description of the failure will be in the job's "Reason" field. Bug 4529
-
Morris Jette authored
buffer error. Bug 4529
-
Alejandro Sanchez authored
Bug 4222.
-
- 18 Dec, 2017 2 commits
-
-
Brian Christiansen authored
on startup. Just use the checkpointed job_id_sequence. get_next_job_id() will not use a jobid if it's in use in the system. Bug 4538
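The behaviour relied on here (the next id is skipped when it is already in use) can be sketched as follows. This is an illustrative model only: the function name, the wrap-around bounds, and the set of in-use ids are assumptions, not the actual get_next_job_id() implementation.

```python
def next_job_id(sequence: int, in_use: set, first_id: int, max_id: int) -> int:
    """Return the first free id at or after `sequence`, wrapping at max_id.

    Hypothetical model: `in_use` stands in for the jobs currently known
    to the system; the bounds are illustrative.
    """
    span = max_id - first_id + 1
    candidate = sequence
    for _ in range(span):
        # Wrap back to the start of the range when we run off either end.
        if candidate > max_id or candidate < first_id:
            candidate = first_id
        if candidate not in in_use:
            return candidate
        candidate += 1
    raise RuntimeError("no free job id available")


# The checkpointed sequence points at 100, but 100 and 101 are still in use.
print(next_job_id(100, {100, 101}, first_id=1, max_id=200))
```

Because in-use ids are skipped at allocation time, starting from the checkpointed sequence value does not risk handing out a duplicate id in this model.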
-
Morris Jette authored
node_features/knl_generic - If the plugin cannot fully load, do not spawn a background pthread (which would fail with an invalid memory reference).
-
- 15 Dec, 2017 4 commits
-
-
Morris Jette authored
completely instead of right after the parent job finishes. Bug 4516
-
Yair Yarom authored
bug 3582
-
Brian Christiansen authored
when a job requests no tasks and more memory than MaxMemPer{CPU|NODE}. e.g. sbatch --wrap="sleep 10" Bug 4515
-
Brian Christiansen authored
This will give expected results. Found while working on Bug 4515.
-
- 14 Dec, 2017 1 commit
-
-
Danny Auble authored
And print an appropriate fatal error message rather than relying upon a random errno value. Bug 4523
-
- 13 Dec, 2017 1 commit
-
-
Alejandro Sanchez authored
Bug 4478.
-
- 12 Dec, 2017 1 commit
-
-
Brian Christiansen authored
In the federation case, the origin job is completed in the database when a sibling job starts the job. The complete message is then sent again to the database when the job completes on the sibling cluster, this time updated with the sibling job's exit code. The jobcomp plugin didn't handle multiple updates to the same record. This change allows the existing record to be updated. Bug 4493
-
- 11 Dec, 2017 2 commits
-
-
David Gloe authored
Bug 4500 The pid files in slurm.conf and the systemd service files must match, or systemd will time out looking for the wrong pid file. Currently, the Cray slurm.conf template has different pid files for slurmctld and slurmd than the service files. There's no reason for us to use these nonstandard pid files, and it will save us some headaches to switch over.
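The mismatch described above can be checked mechanically. The sketch below compares `SlurmctldPidFile` from a slurm.conf with `PIDFile=` from a systemd unit; the helper names and the sample file contents (including the paths) are made up for illustration and are not the Cray template's actual values.

```python
import configparser


def slurm_conf_value(conf_text: str, key: str):
    """Return the value of `key` from slurm.conf-style text, or None."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" in line:
            name, value = line.split("=", 1)
            if name.strip().lower() == key.lower():
                return value.strip()
    return None


def unit_pid_file(unit_text: str):
    """Return PIDFile= from the [Service] section of a unit file, or None."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(unit_text)
    return parser.get("Service", "PIDFile", fallback=None)


# Illustrative contents only; the paths are assumptions for this example.
sample_conf = "\n".join([
    "ClusterName=example",
    "SlurmctldPidFile=/var/spool/slurm/slurmctld.pid  # nonstandard",
])
sample_unit = "\n".join([
    "[Unit]",
    "Description=Slurm controller daemon",
    "[Service]",
    "Type=forking",
    "PIDFile=/var/run/slurmctld.pid",
])

conf_pid = slurm_conf_value(sample_conf, "SlurmctldPidFile")
unit_pid = unit_pid_file(sample_unit)
print(conf_pid == unit_pid)  # a mismatch is what makes systemd time out
```

The same comparison can be repeated for `SlurmdPidFile` against the slurmd unit file.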
-
Marcin Stolarek authored
bug 4496
-
- 08 Dec, 2017 2 commits
-
-
Danny Auble authored
In HDF5 1.10+ they changed the hid_t from an int to a long int, which messes things up as they use the top 32 bits for stuff right off the bat. This fixes the scenario by handling the number with an int32_t instead of an int. Bug 3795
-
Morris Jette authored
Fix potential node reboot timeout problem for "scontrol reboot" command. bug 4203
-
- 07 Dec, 2017 4 commits
-
-
Tim Wickberg authored
The - character is treated as a range if not first or last in the [] brackets. Moving it in between . and / broke the regex subtly. Inadvertently broken by a268b644. Bug 4417.
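A small demonstration of the character-class behaviour described above. The patterns are illustrative stand-ins, not the actual regex from the Slurm source.

```python
import re

# '-' sits between '.' and '/', so '.-/' is parsed as a range
# (0x2E through 0x2F) and a literal '-' (0x2D) is no longer accepted.
broken = re.compile(r"[a-z.-/]+")

# '-' placed last in the brackets is a literal hyphen again.
fixed = re.compile(r"[a-z./-]+")

sample = "my-file.txt"
broken_ok = broken.fullmatch(sample) is not None
fixed_ok = fixed.fullmatch(sample) is not None

print(broken_ok)
print(fixed_ok)
```

The broken class still accepts `.` and `/`, which is why this kind of regression is easy to miss: only inputs containing a literal hyphen stop matching.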
-
Danny Auble authored
Bug 4169
-
Morris Jette authored
Found using test38.17
-
Felip Moll authored
Otherwise poll() cannot monitor these ports properly, leading to potential network traffic problems. Bug 4467.
-
- 06 Dec, 2017 3 commits
-
-
Danny Auble authored
until the prolog and extern step have fully run/launched. Only matters if running with PrologFlags=[contain|alloc]. Patch 2 of 2 Bug 4458
-
Danny Auble authored
Patch 1 of 2 Bug 4458
-
David Gloe authored
Due to the way Cray builds Slurm, the prefix and bindir paths include the Slurm version (/opt/slurm/<version>). This means every time we update to a new Slurm version we must update the Slurm ansible playbook. It also means that the slurm_playbook.yaml file must be built with Slurm to be used (it can't simply be copied directly). The attached patch updates the playbook to determine the version of Slurm to use from the module file, and hardcodes the sysconfdir setting we give in our Slurm installation guide. If a customer uses different paths, they can update the playbook to meet their needs. Bug 4360.
-
- 05 Dec, 2017 1 commit
-
-
Dominik Bartkiewicz authored
when trying to signal a step that is still running a prolog. Bug 4446
-