Commit d26dc971 authored by Moe Jette

select/cray: special case for "no such resId"

For robustness, it does make sense to call the RELEASE method multiple times.

The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011, suggests in
section 3.3.5, "Improve RELEASE method response" the following:

 "Periodically send RELEASE method requests until the RELEASE method
  response indicates the reservation is gone (via an error response)".

The typical error message for this case is (also shown in the document on page 11):

[2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
[2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543

There is already at least one use case for this type of error: cancelling a salloc session via scancel:
 1. scancel causes job_signal() to be invoked,
 2. job_signal() defers to select_g_job_signal(),
    - select/cray:select_p_job_signal() calls do_basil_release() before forwarding SIGKILL,
    - at this stage the reservation is already released,
 3. in the likely case of a running salloc session, job_signal() then calls deallocate_nodes(),
 4. this calls select_g_job_fini(),
    - the salloc default action for select/cray:select_p_job_fini() is to call do_basil_release(),
    - since at this stage the reservation has already been released, the error message results.
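
The call sequence above can be modelled with a small mock. This is only a sketch
of the double release, not the select/cray code: struct mock_job and every
mock_* function are hypothetical names invented for illustration, and the real
job_signal()/deallocate_nodes() do far more than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a job holding one ALPS reservation. */
struct mock_job {
	uint32_t resv_id;        /* reservation ID, e.g. 2087 */
	bool     resv_exists;    /* true until the first successful release */
	int      failed_releases;/* count of "no entry for resId" outcomes */
};

/* Stand-in for do_basil_release(): fails once the reservation is gone. */
static int mock_basil_release(struct mock_job *job)
{
	assert(job != NULL);
	if (!job->resv_exists) {
		job->failed_releases++;	/* corresponds to "No entry for resId" */
		return -1;
	}
	job->resv_exists = false;
	return 0;
}

/* Step 2: the signal path releases the reservation before SIGKILL. */
static int mock_job_signal_path(struct mock_job *job)
{
	return mock_basil_release(job);
}

/* Step 4: the job-fini path releases again, by default. */
static int mock_job_fini_path(struct mock_job *job)
{
	return mock_basil_release(job);
}

/* Steps 1-4 in order, as triggered by scancel on a running salloc session. */
static void mock_scancel(struct mock_job *job)
{
	(void) mock_job_signal_path(job);	/* succeeds */
	(void) mock_job_fini_path(job);		/* reservation already gone */
}
```

Running mock_scancel() on a job with an existing reservation leaves exactly one
failed release, which is the case that produced the error message.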

I don't like this error message myself, but avoiding it in all call paths is complicated. Also,
for robustness, I would very much prefer that do_basil_release() is always called from
deallocate_nodes().

Hence this patch creates a custom error class for the case "no entry for resId xxx". This
allows the calling function to still catch the error, but the unnecessary warning is no
longer printed in the logfiles.
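
As a rough illustration of the idea (not the code in this patch), the error can
be given its own class so the caller still sees a failure while nothing is
logged. The enum values and function names below are assumptions made up for
this sketch; only the "No entry for resId" text is taken from the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error classes; the names are illustrative only. */
enum release_error {
	REL_OK = 0,		/* reservation released */
	REL_NO_RESID,		/* "No entry for resId": already gone */
	REL_BACKEND_ERROR	/* any other ALPS backend failure */
};

/* Classify an ALPS error message; NULL means the request succeeded. */
static enum release_error classify_release_error(const char *alps_msg)
{
	if (alps_msg == NULL)
		return REL_OK;
	if (strstr(alps_msg, "No entry for resId") != NULL)
		return REL_NO_RESID;
	return REL_BACKEND_ERROR;
}

/* Only genuine backend failures are worth a line in the log file. */
static bool should_log_release_error(enum release_error err)
{
	return err == REL_BACKEND_ERROR;
}

/* Return the error class to the caller; log only when appropriate. */
static enum release_error report_release(const char *alps_msg)
{
	enum release_error err = classify_release_error(alps_msg);

	if (should_log_release_error(err)) {
		assert(alps_msg != NULL);
		fprintf(stderr, "error: ALPS error: %s\n", alps_msg);
	}
	return err;
}
```

With this shape, a caller can treat REL_NO_RESID as "reservation is gone" and
carry on, while other backend errors are still logged and returned.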

The callers of this method are:
 * do_basil_release() - which already is set up to handle error/non-error case;
 * basil_safe_release() - this does no extra error checking, since it is called
   when trying to remove orphaned reservations; any failure in attempting to
   release the reservation will result in repeated "orphaned ALPS reservation ..."
   messages.

parent 164ed8df