- 22 Sep, 2016 (8 commits)
- Morris Jette authored
- Morris Jette authored
- Morris Jette authored
- Brian Christiansen authored
- Alejandro Sanchez authored: license of a certain type.
- Brian Christiansen authored: This reverts commit 54a270f7.
- Danny Auble authored: the fed_mgr.
- Brian Christiansen authored
- 21 Sep, 2016 (11 commits)
- Tim Wickberg authored
- Morris Jette authored: node_features/knl_cray plugin: increase the default CapmcTimeout parameter from 10 to 60 seconds (bug 3100).
- Morris Jette authored: capmc_suspend/resume: if a request to modify NUMA or MCDRAM state on a set of nodes, or to reboot a set of nodes, fails, then requeue the job and abort the entire operation rather than trying to operate on individual nodes (bug 3100).
- Morris Jette authored: Allow a node's PowerUp state flag to be cleared using the update_node RPC (bug 3100).
- Morris Jette authored: When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode), pass the job ID assigned to the nodes to the ResumeProgram via the SLURM_JOB_ID environment variable (bug 3100).
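A site-provided ResumeProgram can now read the job ID from its environment. The sketch below is illustrative only (the function name and output format are assumptions, not Slurm's shipped code); it models how such a script might combine the node-list argument Slurm passes with the newly exported SLURM_JOB_ID variable.

```python
import os
import sys

def describe_resume(argv, environ):
    # ResumeProgram is invoked with the node list as its argument; after
    # this change, SLURM_JOB_ID names the job assigned to those nodes.
    # (Illustrative sketch, not Slurm's actual code.)
    nodes = argv[1] if len(argv) > 1 else ""
    job_id = environ.get("SLURM_JOB_ID", "unknown")
    return "resuming %s for job %s" % (nodes, job_id)

if __name__ == "__main__":
    print(describe_resume(sys.argv, os.environ))
```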
- Morris Jette authored: Don't log an error for a job end_time of zero if the node health check is still running (bug 3053).
- Morris Jette authored: capmc_suspend/resume: if a request to modify NUMA or MCDRAM state on a set of nodes, or to reboot a set of nodes, fails, then requeue the job and abort the entire operation rather than trying to operate on individual nodes (bug 3100).
- Morris Jette authored: Allow a node's PowerUp state flag to be cleared using the update_node RPC (bug 3100).
- Morris Jette authored: When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode), pass the job ID assigned to the nodes to the ResumeProgram via the SLURM_JOB_ID environment variable (bug 3100).
- Brian Christiansen authored: Previous logic duplicated checking of the error codes returned from job_allocate(); job_allocate() already sets the job state to FAILED if there was an actual issue.
- Brian Christiansen authored: Was checking only for ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE and ENFORCE_ALL; however, _slurm_rpc_allocate_resources() and _slurm_rpc_submit_batch_job() both check for ANY and ALL.
- 20 Sep, 2016 (13 commits)
- Danny Auble authored: to siblings (if not already connected). This will happen when the next message is sent to them.
- Tim Wickberg authored
- Danny Auble authored
- Danny Auble authored: sibling clusters.
- Danny Auble authored: back a message to the caller.
- Danny Auble authored: a federation connection; with someone adding and removing the cluster from the federation many times simultaneously, the cluster could fail to be found.
- Danny Auble authored
- Danny Auble authored
- Danny Auble authored
- Danny Auble authored
- Morris Jette authored: Don't log an error for a job end_time of zero if the node health check is still running (bug 3053).
- Tim Wickberg authored: Fixes build issue caused by 844830d4.
- Ben Matthews authored
- 19 Sep, 2016 (8 commits)
- Danny Auble authored: connection
- Danny Auble authored
- Danny Auble authored: error.
- Danny Auble authored
- Danny Auble authored: at startup. Starting it up when you get a connection from another cluster could cause delays in processing the request.
- Danny Auble authored: want to wait only for message_timeout instead of forever; otherwise we could hit deadlock if the other side is trying to do the same thing.
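The deadlock this commit avoids is the classic one where two clusters each hold their own resource and wait forever for the other's. A minimal sketch of the bounded-wait idea (names and structure are assumptions for illustration, not fed_mgr's actual code):

```python
import threading

def send_to_sibling(peer_lock, message_timeout):
    # Bound the wait: if both sides blocked forever on each other's
    # lock, neither would make progress. With a timeout, the loser
    # backs off and can retry later.
    # (Illustrative sketch, not Slurm's actual code.)
    if peer_lock.acquire(timeout=message_timeout):
        try:
            return "sent"
        finally:
            peer_lock.release()
    return "timed out"
```

If the acquire times out, the caller can report an error or retry rather than hanging, which is the behavior the commit describes.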
- Danny Auble authored
- Danny Auble authored: processed at a time. Otherwise you could hit issues when rapidly adding and removing a cluster from a federation; probably not likely in real life, but in testing it is a different story.
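Serializing federation updates with a single lock is the simplest way to keep rapid add/remove sequences from interleaving. A hedged sketch of that pattern (class and method names are assumptions for illustration, not slurmctld's data structures):

```python
import threading

class FederationRegistry:
    # One mutex serializes add/remove so only one cluster update is
    # processed at a time; concurrent updates cannot interleave.
    # (Illustrative sketch, not Slurm's actual code.)
    def __init__(self):
        self._lock = threading.Lock()
        self.clusters = set()

    def add_cluster(self, name):
        with self._lock:
            self.clusters.add(name)

    def remove_cluster(self, name):
        with self._lock:
            self.clusters.discard(name)
```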