Improve fault tolerance for batch jobs.

If a node fails to respond to the batch_job_launch RPC, deallocate the
job's resources and requeue the job. If a node registers without
reporting a batch job whose script should be running there (node zero
of the allocation), consider the job complete.
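
The two rules above can be sketched roughly as follows. This is an illustrative model only, not the code from this change: `BatchJob`, `JobState`, `on_launch_rpc_failure` and `on_node_registration` are hypothetical names, and the sketch assumes the controller tracks an ordered node list per job, with the batch script running on the first node.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional, Set


class JobState(Enum):
    PENDING = auto()   # queued, waiting for resources
    RUNNING = auto()
    COMPLETE = auto()


@dataclass
class BatchJob:
    # Hypothetical job record; field names are assumptions for illustration.
    job_id: int
    nodes: List[str]                  # allocation, in order
    state: JobState = JobState.RUNNING

    @property
    def node_zero(self) -> Optional[str]:
        # The batch script is assumed to run on the first allocated node.
        return self.nodes[0] if self.nodes else None


def on_launch_rpc_failure(job: BatchJob) -> None:
    """Rule 1: no response to the launch RPC, so release and requeue."""
    job.nodes = []                    # deallocate the job's resources
    job.state = JobState.PENDING      # requeue the job


def on_node_registration(node: str, reported_job_ids: Set[int],
                         jobs: List[BatchJob]) -> List[int]:
    """Rule 2: a registering node zero that does not report a running
    batch job means the script is gone, so mark the job complete."""
    completed = []
    for job in jobs:
        if (job.state is JobState.RUNNING
                and job.node_zero == node
                and job.job_id not in reported_job_ids):
            job.state = JobState.COMPLETE
            completed.append(job.job_id)
    return completed
```

Only node zero is checked in the second rule, since that is where the batch script is expected to run; a missing job on any other node of the allocation is left alone in this sketch.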