select/cray: special case for "no such resId"
For robustness, it does make sense to call the RELEASE method multiple
times. The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011,
suggests the following in section 3.3.5, "Improve RELEASE method response":
"Periodically send RELEASE method requests until the RELEASE method
response indicates the reservation is gone (via an error response)".

The typical error message for this case is (also shown in the document on
page 11):

  [2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
  [2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543

There is already at least one use case for this type of error: cancelling
a salloc session via scancel:
 1. scancel causes job_signal() to be invoked,
 2. job_signal() defers to select_g_job_signal(),
    - select/cray:select_p_job_signal() calls do_basil_release() before
      forwarding SIGKILL,
    - at this stage the reservation is already released,
 3. in the likely case of a running salloc session, job_signal() then calls
    deallocate_nodes(),
 4. this calls select_g_job_fini(),
    - the salloc default action for select/cray:select_p_job_fini() is to
      call do_basil_release(),
    - since at this stage the reservation has already been released, the
      error message results.

I don't like this error message myself, but avoiding it in all call paths
is complicated. Also, for robustness, I would very much prefer that
do_basil_release() is always called from deallocate_nodes().

Hence this patch creates a custom error class for the case "no entry for
resId xxx". This allows the calling function to still catch the error, but
the unnecessary warning is no longer printed in the logfiles.
The callers of this method are:
 * do_basil_release() - which is already set up to handle the error and
   non-error cases;
 * basil_safe_release() - this does no extra error checking, since it is
   called when trying to remove orphaned reservations; any failure in
   attempting to release the reservation will result in repeated
   "orphaned ALPS reservation ..." messages.
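
The idea of a dedicated error class for "no entry for resId" can be sketched roughly as follows. This is a minimal illustration and not the code of the patch: the enum, the function names classify_release() and release_is_done(), and the logged counter are all hypothetical, and matching on the message text is an assumption made here for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error classes; the names are illustrative only. */
enum release_status {
	REL_OK,		/* RELEASE succeeded */
	REL_NO_RESID,	/* "No entry for resId N": reservation already gone */
	REL_FAILED	/* any other backend error */
};

/*
 * Classify the error text of a RELEASE response (NULL means no error).
 * Only genuine failures are logged; *logged counts the lines written.
 */
static enum release_status classify_release(const char *err, int *logged)
{
	assert(logged != NULL);	/* caller bug, not a runtime condition */

	if (err == NULL)
		return REL_OK;
	if (strstr(err, "No entry for resId") != NULL)
		return REL_NO_RESID;	/* reported to the caller, not to the log */

	fprintf(stderr, "error: ALPS BACKEND error: %s\n", err);
	(*logged)++;
	return REL_FAILED;
}

/* Caller's view: 0 if the reservation is gone, -1 if the release failed. */
static int release_is_done(const char *err, int *logged)
{
	return classify_release(err, logged) == REL_FAILED ? -1 : 0;
}
```

With this split, a second RELEASE on an already released reservation is still visible to the caller as REL_NO_RESID, but it produces no log line, while any other backend error is logged as before.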