- 20 Jan, 2017 40 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
In favor of just using the -a option to show the tracking federated jobs. This allows scontrol -a show jobs to show the tracking jobs as well.
-
Brian Christiansen authored
-
Brian Christiansen authored
to indicate whether the job was requeue held or not. This enables the federation to trigger off whether the job was requeue held or not.
-
Brian Christiansen authored
So that the origin job can tell a remote cluster to cancel the job but mark the job as requeued in the database. See the note about the KILL_* flags actually using 12 bits instead of the noted 8 bits.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Follows pattern from c5ace562
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
If a job was requeued while in the completing state, the database wasn't being updated with the requeue state.
-
Brian Christiansen authored
When a fed job is requeued, it needs to be requeued to the clusters that it was submitted to.
-
Brian Christiansen authored
When a fed job is requeued and new sibling jobs are submitted to the other sibling clusters, the restart_cnt needs to go to the siblings in case the job runs on a remote sibling.
-
Brian Christiansen authored
The federation needs to make a job_desc when requeueing jobs to siblings.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Since a persistent connection can only be established by SlurmUser, this prevents users other than SlurmUser from calling the RPCs. It also requires that all slurmctlds in the federation have the same SlurmUser.
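
The access rule described above can be illustrated with a small model. This is a sketch only, not Slurm code: `PersistConn`, `open_persist_conn`, `handle_sibling_rpc`, the uid value, and the "ACCESS_DENIED" string are all hypothetical names chosen for the illustration. It assumes that the uid is checked once when the persistent connection is set up, and that sibling RPCs are then accepted only when they arrive on such a connection.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical uid standing in for the SlurmUser shared by every
# slurmctld in the federation.
SLURM_USER_UID = 64030


@dataclass(frozen=True)
class PersistConn:
    """An established persistent connection from a peer cluster."""
    peer_cluster: str
    uid: int


def open_persist_conn(peer_cluster: str, auth_uid: int,
                      slurm_user_uid: int = SLURM_USER_UID) -> Optional[PersistConn]:
    """Set up a persistent connection only if the caller is SlurmUser."""
    if auth_uid != slurm_user_uid:
        return None
    return PersistConn(peer_cluster, auth_uid)


def handle_sibling_rpc(conn: Optional[PersistConn], rpc_name: str) -> str:
    """Accept a sibling RPC only when it arrives on a persistent connection."""
    if conn is None:
        # No persistent connection means the sender was never verified
        # as SlurmUser, so the request is rejected.
        return "ACCESS_DENIED"
    return "OK"
```

Because the check happens at connection setup, the handler never compares uids itself, which is also why every slurmctld in the federation has to run as the same SlurmUser in this model.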
-
Brian Christiansen authored
-
Brian Christiansen authored
If the job can't start now, just submit the job to all siblings.
-
Brian Christiansen authored
_update_sibling_job_siblings()
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
like it does in slurm_send_recv_msg. The resp needs to be initialized before _check_send is called.
-
Brian Christiansen authored
Sibling jobs have to get a lock from the origin cluster in order to attempt to allocate nodes. If a sibling job gets the allocation, it lets the origin cluster know, and the origin cluster will set the other sibling jobs, if any, into a REVOKED state and purge them. If the sibling job is the only sibling, it assumes the lock and attempts to start the job to avoid extra communication. If nodes can't be allocated, the job releases the lock so that another cluster can try.
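
The locking scheme described above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the slurmctld implementation: `OriginJob`, `attempt_start`, and the state strings other than REVOKED are hypothetical, and the RPCs between clusters are replaced by direct method calls.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class OriginJob:
    """Origin cluster's view of one federated job (hypothetical model)."""
    siblings: Set[str]
    lock_holder: Optional[str] = None
    states: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for cluster in self.siblings:
            self.states[cluster] = "PENDING"

    def try_lock(self, cluster: str) -> bool:
        """Grant the lock if it is free or already held by this cluster."""
        if self.lock_holder in (None, cluster):
            self.lock_holder = cluster
            return True
        return False

    def unlock(self, cluster: str) -> None:
        """Release the lock so another cluster can try."""
        if self.lock_holder == cluster:
            self.lock_holder = None

    def started(self, cluster: str) -> None:
        """Record the start and revoke every other sibling job."""
        if self.lock_holder != cluster:
            raise RuntimeError("start reported without holding the lock")
        for sib in self.siblings:
            self.states[sib] = "RUNNING" if sib == cluster else "REVOKED"


def attempt_start(origin: OriginJob, cluster: str, nodes_available: bool) -> bool:
    """One scheduling attempt by the sibling job on `cluster`."""
    if origin.siblings == {cluster}:
        # Only sibling: assume the lock without asking the origin.
        origin.lock_holder = cluster
    elif not origin.try_lock(cluster):
        return False  # another cluster is currently trying
    if nodes_available:
        origin.started(cluster)
        return True
    origin.unlock(cluster)  # could not allocate nodes, let others try
    return False
```

In this model a cluster that fails to allocate nodes always releases the lock before returning, so no sibling can block the others indefinitely.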
-
Brian Christiansen authored
-
Brian Christiansen authored
for fed sibling jobs that don't start.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
To handle JOB_REVOKED
-
Brian Christiansen authored
-