- 05 Oct, 2016 2 commits
-
-
Morris Jette authored
node_features/knl_cray plugin: Remove any KNL MCDRAM or NUMA features from node's configuration if capmc does NOT report the node as being KNL. For example, we don't want a non-KNL node with features="quad,cache".
-
Morris Jette authored
Add new knl.conf configuration parameter CapmcRetries. Modify capmc_suspend and capmc_resume to retry operations when the Cray State Manager is down. Add retry logic to node_features/knl_cray to handle the Cray State Manager being down. bug 3100
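A minimal sketch of bounded retry logic of the kind described above, assuming a caller-supplied operation that raises an error while the state manager is unreachable. The names `StateManagerDown`, `run_with_retries`, and `delay_seconds` are hypothetical and illustrative only; this is not the plugin's actual implementation.

```python
import time


class StateManagerDown(Exception):
    """Hypothetical error: the state manager could not be reached."""


def run_with_retries(operation, capmc_retries, delay_seconds=0.0, sleep=time.sleep):
    """Call operation() until it succeeds or capmc_retries retries are used up.

    capmc_retries mirrors the idea of the CapmcRetries parameter; how the real
    plugin counts or spaces its retries is an assumption here.
    """
    retries_used = 0
    while True:
        try:
            return operation()
        except StateManagerDown:
            if retries_used >= capmc_retries:
                raise  # give up and surface the last failure
            retries_used += 1
            sleep(delay_seconds)
```

With `capmc_retries=4`, an operation that fails twice and then succeeds is called three times in total; an operation that always fails is called five times before the error propagates.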
-
- 03 Oct, 2016 1 commit
-
-
Dominik Bartkiewicz authored
-
- 30 Sep, 2016 8 commits
-
-
Alejandro Sanchez authored
Otherwise they'll truncate when packed into the RPC and end up as some bizarre value at the controller. Bug 3098.
-
Dominik Bartkiewicz authored
Set completed time for pending/running runaway jobs to the max of (start, eligible, submit) times. Bug 3075
-
Morris Jette authored
Added new SchedulerParameters options step_retry_count and step_retry_time to control scheduling behaviour of job steps waiting for resources. bug 3121
-
Morris Jette authored
This change can reduce RPCs sent by srun by up to 50%.
-
Artem Polyakov authored
Avoid using slurm_forward_data because it causes thread spawn that introduces unwanted delays. Bug 3102.
-
Tim Wickberg authored
-
Morris Jette authored
This can happen due to signal/retry logic. The result is vestigial srun ports that are no longer valid, with slurmctld sending messages to those invalid ports to notify them about available resources as other steps complete.
-
Morris Jette authored
-
- 29 Sep, 2016 9 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Alejandro Sanchez authored
Also correct the value of NICE_OFFSET used within the perl API. Bug 3098.
-
Artem Polyakov authored
Bug 3051.
-
Tim Wickberg authored
Otherwise updates would be rejected for running jobs even if there would be no impact. Most common when the job_submit plugin is overriding QOS/GRES values on everything; without this change an update to "comment" or other fields would fail with ESLURM_JOB_NOT_PENDING. Bug 3117.
-
Tim Wickberg authored
Never run NHC, even in the edge case where NHC_NO would still launch NHC afterwards. Bug 3105.
-
Tim Wickberg authored
Switch to list_for_each, and check if the access list actually changed after each update before updating last_part_update. This prevents the backfill scheduler from resetting mid-cycle unnecessarily. Bug 3123.
-
Tim Wickberg authored
Switch to list_for_each, and check if the access list actually changed after each update before updating last_part_update. This prevents the backfill scheduler from resetting mid-cycle unnecessarily. Bug 3123.
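The "only bump the timestamp when something actually changed" idea can be sketched as follows. The data layout (a list of dicts with an `access` key) and the names `refresh_access_lists` and `compute_access` are hypothetical; the real code operates on slurmctld's own structures.

```python
def refresh_access_lists(partitions, compute_access, now, last_update):
    """Recompute each partition's access list and return the update timestamp.

    The timestamp only moves to `now` if at least one list really changed, so
    a consumer that watches the timestamp (a scheduler loop, for instance) is
    not told to restart its work for a no-op refresh.
    """
    changed = False
    for part in partitions:
        new_list = compute_access(part)
        if new_list != part["access"]:
            part["access"] = new_list
            changed = True
    return now if changed else last_update
```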
-
Morris Jette authored
Add protocol version to slurmd startup communications for slurmstepd to permit changes in the protocol.
-
- 28 Sep, 2016 3 commits
-
-
Morris Jette authored
Add "flag" field to launch_tasks_request_msg. Remove the following fields (moved into flags): multi_prog, task_flags, user_managed_io, pty, buffered_stdio, and labelio. More flags to be added later.
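As an illustration of folding several boolean fields into one flags word, here is a small sketch. The bit positions and the names `LaunchFlags` and `pack_flags` are assumptions made for the example and do not reflect the values used in the real message; `task_flags` is left out because the source does not say whether it is a boolean.

```python
from enum import IntFlag


class LaunchFlags(IntFlag):
    """Hypothetical bit assignments for the former boolean fields."""
    MULTI_PROG = 1 << 0
    USER_MANAGED_IO = 1 << 1
    PTY = 1 << 2
    BUFFERED_STDIO = 1 << 3
    LABEL_IO = 1 << 4


def pack_flags(multi_prog=False, user_managed_io=False, pty=False,
               buffered_stdio=False, labelio=False):
    """Combine the individual booleans into a single flags value."""
    flags = LaunchFlags(0)
    if multi_prog:
        flags |= LaunchFlags.MULTI_PROG
    if user_managed_io:
        flags |= LaunchFlags.USER_MANAGED_IO
    if pty:
        flags |= LaunchFlags.PTY
    if buffered_stdio:
        flags |= LaunchFlags.BUFFERED_STDIO
    if labelio:
        flags |= LaunchFlags.LABEL_IO
    return flags
```

A single integer field leaves room for new options without changing the message layout again, which matches the note that more flags are to be added later.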
-
Tim Wickberg authored
Remove from build system, and delete L/P specific files. Run autogen.sh as well.
-
Morris Jette authored
Add "sbatch_wait_nodes" to SchedulerParameters to control default sbatch behaviour with respect to waiting for all allocated nodes to be ready for use. Job can override the configuration option using the --wait-all-nodes=# option. bug 3120
-
- 27 Sep, 2016 2 commits
-
-
Morris Jette authored
Prior logic would treat an execute line like this: $ sbatch --wait-all-nodes -N3 tmp with "-N3" as the argument to the "--wait-all-nodes" option. See bug 3120
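The same failure mode can be reproduced with Python's standard `getopt` module, used here purely as a stand-in for sbatch's own option parser. When a long option is declared as taking a required argument and no `=value` is attached, the next word on the command line is consumed as its value. The helper name `parse_required` is hypothetical.

```python
import getopt


def parse_required(argv):
    """Parse argv with --wait-all-nodes declared as requiring an argument."""
    opts, rest = getopt.getopt(argv, "N:", ["wait-all-nodes="])
    return dict(opts), rest


# "-N3" is swallowed as the value of --wait-all-nodes, so the node count is lost.
swallowed = parse_required(["--wait-all-nodes", "-N3", "tmp"])

# Attaching the value with "=" keeps "-N3" as its own option.
explicit = parse_required(["--wait-all-nodes=1", "-N3", "tmp"])
```

In the first call the parsed options contain only `--wait-all-nodes` with the value `-N3`; in the second, both `--wait-all-nodes` and `-N` are recognised.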
-
Morris Jette authored
Add salloc/sbatch/srun option --use-min-nodes to prefer smaller node counts when a range of node counts is specified (e.g. "-N 2-4"). bug 2996
-
- 26 Sep, 2016 1 commit
-
-
Morris Jette authored
Add salloc/sbatch/srun --priority option of "TOP" to set job priority to the highest possible value. This option is only available to Slurm operators and administrators. bug 3115
-
- 24 Sep, 2016 2 commits
-
-
Morris Jette authored
bug 3090
-
Morris Jette authored
Make sure no attempt is made to schedule a requeued job until all steps are cleaned (Node Health Check completes for all steps on a Cray). bug 3082
-
- 23 Sep, 2016 1 commit
-
-
Morris Jette authored
Make sure no attempt is made to schedule a requeued job until all steps are cleaned (Node Health Check completes for all steps on a Cray). bug 3082
-
- 22 Sep, 2016 6 commits
-
-
Dominik Bartkiewicz authored
Otherwise limit is checking the node count against the midplane count. Bug 3049.
-
Alejandro Sanchez authored
Check if node names are contiguous with respect to the node list assigned to the partition, rather than just monotonically increasing. Bug 3006.
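The distinction between "monotonically increasing" and "contiguous with respect to the partition's node list" can be sketched like this. The function name `is_contiguous` and the node names in the usage note are hypothetical, and plain Python lists stand in for the bitmaps or hostlists the real code would use.

```python
def is_contiguous(selected, partition_nodes):
    """Return True if the selected nodes occupy consecutive positions
    in the partition's ordered node list."""
    if not selected:
        return True
    position = {name: i for i, name in enumerate(partition_nodes)}
    try:
        # A set collapses any repeated names before the span check.
        used = sorted({position[name] for name in selected})
    except KeyError:
        return False  # a selected node is not in the partition at all
    return used[-1] - used[0] + 1 == len(used)
```

For a partition holding `nid1, nid2, nid4, nid5`, the selection `nid2, nid4` is contiguous because the two nodes are adjacent in the partition even though their numbers are not, while `nid1, nid4` is increasing but skips `nid2` and is therefore not contiguous.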
-
Tim Wickberg authored
-
Janne Blomqvist authored
Bugs 2681 and 2703. Conflicts: NEWS
-
Adam Moody authored
-
Alejandro Sanchez authored
license of a certain type.
-
- 21 Sep, 2016 5 commits
-
-
Morris Jette authored
node_features/knl_cray plugin: Increase default CapmcTimeout parameter from 10 to 60 seconds. bug 3100
-
Morris Jette authored
capmc_suspend/resume - If a request to modify NUMA or MCDRAM state on a set of nodes, or to reboot a set of nodes, fails, then just requeue the job and abort the entire operation rather than trying to operate on individual nodes. bug 3100
-
Morris Jette authored
Allow a node's PowerUp state flag to be cleared using update_node RPC. bug 3100
-
Morris Jette authored
When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode) then pass to the ResumeProgram the job ID assigned to the nodes in the SLURM_JOB_ID environment variable. bug 3100
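A resume script could pick up the job ID along these lines. This is a sketch only: the helper name `describe_resume` and the message format are hypothetical, and the node list is assumed to be handed to the program by the caller. The variable is treated as optional, since a resume that is not tied to a job would presumably leave it unset.

```python
import os


def describe_resume(nodes, environ=os.environ):
    """Build a log line for a resume request, including the job ID if present."""
    job_id = environ.get("SLURM_JOB_ID")
    if job_id is None:
        return f"resuming {nodes} (no job ID supplied)"
    return f"resuming {nodes} for job {job_id}"
```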
-
Morris Jette authored
Don't log error for job end_time being zero if node health check is still running. bug 3053
-