Commit ed56303c authored by Morris Jette

Fix for possible SEGV

Here's what seems to have happened:

- A job was pending, waiting for resources.
- slurm.conf was changed to remove some nodes, and a scontrol reconfigure was done.
- As a result of the reconfigure, the pending job became non-runnable, due to "Requested node configuration is not available". The scheduler set the job state to JOB_FAILED and called delete_job_details.
- scontrol reconfigure was done again.
- read_slurm_conf called _restore_job_dependencies.
- _restore_job_dependencies called build_feature_list for each job in the job list.
- When build_feature_list tried to reference the now deleted job details for the failed job, it got a segmentation fault.

The problem was reported by a customer on Slurm 2.2.7.  I have not been able to reproduce it on 2.4.0-pre3, although the relevant code looks the same. There may be a timing window. The attached patch attempts to fix the problem by adding a check to _restore_job_dependencies.  If the job state is JOB_FAILED, the job is skipped.
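The check described above can be sketched as follows. This is a minimal illustration, not the actual Slurm code: the struct layouts, the `features` field, and the `*_sketch` function names are hypothetical stand-ins, and the additional NULL test on `details` is an assumption added for defensiveness rather than something the patch is stated to contain.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for slurmctld's job structures. */
enum job_state { JOB_PENDING, JOB_RUNNING, JOB_FAILED };

struct job_details {
    const char *features;          /* illustrative field only */
};

struct job_record {
    int job_id;
    enum job_state state;
    struct job_details *details;   /* NULL after the details are deleted */
    struct job_record *next;
};

/* Stand-in for build_feature_list: it dereferences job->details
 * unconditionally, which is what faults when the details are gone. */
static int build_feature_list_sketch(struct job_record *job)
{
    return job->details->features != NULL;
}

/* Stand-in for _restore_job_dependencies: walk the job list and skip
 * failed jobs before touching their details.
 * Returns the number of jobs actually processed. */
static int restore_job_dependencies_sketch(struct job_record *head)
{
    int processed = 0;

    for (struct job_record *job = head; job != NULL; job = job->next) {
        /* The state check is the fix described in the message; the
         * NULL check is an extra guard assumed here for safety. */
        if (job->state == JOB_FAILED || job->details == NULL)
            continue;
        (void) build_feature_list_sketch(job);
        processed++;
    }
    return processed;
}
```

With a three-job list in which the middle job is in JOB_FAILED with its details already deleted, the walk processes the other two jobs and never dereferences the missing details.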

Regards,
Martin

This is an alternative solution to bug316980fix.patch
parent 2ca7a0fc