Commit e1147ea9 authored by Moe Jette

Improve fault-tolerance for batch jobs. If a node fails to respond to the
batch_job_launch RPC, then deallocate those resources and requeue the job.
If a node registers and fails to show a batch job that should have a
script running there (node zero of allocation), then consider the job
complete.
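
The two recovery rules above can be illustrated with a minimal sketch. This is not the code from the commit: the names `BatchJob`, `JobState`, `on_launch_timeout`, and `on_node_registration` are hypothetical, and the model assumes that the first entry in a job's node list is "node zero", where the batch script runs.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class JobState(Enum):
    PENDING = auto()   # queued (or requeued), holding no resources
    RUNNING = auto()   # allocated; script expected on node zero
    COMPLETE = auto()


@dataclass
class BatchJob:
    # Hypothetical job record; nodes[0] is assumed to be node zero.
    job_id: int
    nodes: list = field(default_factory=list)
    state: JobState = JobState.RUNNING


def on_launch_timeout(job: BatchJob, node: str) -> None:
    """The node never answered the launch RPC: free resources and requeue."""
    if job.state is JobState.RUNNING and node in job.nodes:
        job.nodes = []                 # deallocate the job's resources
        job.state = JobState.PENDING   # requeue the job


def on_node_registration(job: BatchJob, node: str, reported_job_ids: set) -> None:
    """A node registered: if node zero does not report the job, treat it as done."""
    is_node_zero = bool(job.nodes) and job.nodes[0] == node
    if (job.state is JobState.RUNNING
            and is_node_zero
            and job.job_id not in reported_job_ids):
        job.state = JobState.COMPLETE
```

In this sketch, a registration from any node other than node zero leaves the job untouched, since only node zero is expected to be running the script.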
parent 9d351634