1. 17 Apr, 2011 8 commits
    • select/cray: special case for "no such resId" · d26dc971
      Moe Jette authored
      For robustness, it makes sense to call the RELEASE method multiple times.
      
      The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011, suggests in
      section 3.3.5, "Improve RELEASE method response" the following:
      
       "Periodically send RELEASE method requests until the RELEASE method
        response indicates the reservation is gone (via an error response)".
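The recommendation quoted above amounts to a bounded retry loop. The following is only a minimal sketch of that idea in C, not the select/cray implementation: `release_outcome`, `release_fn`, `release_until_gone()` and the stand-in backend `fake_release()` are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one RELEASE request (not the select/cray types). */
enum release_outcome {
    RELEASE_ACCEPTED,  /* request accepted, reservation may still exist */
    RELEASE_GONE,      /* error response: reservation no longer exists */
    RELEASE_FAILED     /* any other error */
};

/* Hypothetical callback that sends one RELEASE request. */
typedef enum release_outcome (*release_fn)(unsigned int resv_id, void *ctx);

/*
 * Send RELEASE requests until the response says the reservation is gone.
 * Returns the number of requests sent on success, 0 if max_tries ran out,
 * and -1 on any other error.
 */
static int release_until_gone(release_fn send_release, unsigned int resv_id,
                              void *ctx, int max_tries)
{
    assert(send_release != NULL && max_tries > 0);
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        enum release_outcome rc = send_release(resv_id, ctx);
        if (rc == RELEASE_GONE)
            return attempt;
        if (rc == RELEASE_FAILED)
            return -1;
        /* A real caller would wait here before the next request. */
    }
    return 0;
}

/* Stand-in backend: *ctx counts requests still needed before cleanup ends. */
static enum release_outcome fake_release(unsigned int resv_id, void *ctx)
{
    int *pending = ctx;
    (void)resv_id;
    if (*pending > 0) {
        (*pending)--;
        return RELEASE_ACCEPTED;
    }
    return RELEASE_GONE;
}
```

With `fake_release()` and a counter of 2, the loop sends three requests: two are accepted and the third reports that the reservation is gone.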
      
      The typical error message for this case is (also shown in the document on page 11):
      
      [2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
      [2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543
      
      There is already at least one use case for this type of error: cancelling a salloc session via scancel:
       1. scancel causes job_signal() to be invoked,
       2. job_signal() defers to select_g_job_signal(),
          - select/cray:select_p_job_signal() calls do_basil_release() before forwarding SIGKILL,
          - at this stage the reservation is already released,
       3. in the likely case of a running salloc session, job_signal() then calls deallocate_nodes(),
       4. this calls select_g_job_fini(),
          - the salloc default action for select/cray:select_p_job_fini() is to call do_basil_release(),
          - since at this stage the reservation has already been released, the error message results.
      
      I don't like this error message myself, but avoiding it in all call paths is complicated. Also,
      for robustness, I would very much prefer that do_basil_release() is always called from
      deallocate_nodes().
      
      Hence this patch creates a custom error class for the case "no entry for resId xxx". This
      allows the calling function to still catch the error, but the unnecessary warning is no
      longer printed in the logfiles.
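The idea of a dedicated error class can be sketched as follows. This is an illustration under stated assumptions, not the code of the patch: the enum `release_status` and the helpers `classify_release_error()` and `should_log_release_error()` are hypothetical, and matching on the message text is just one way such a class could be detected.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical error classes; not the actual select/cray definitions. */
enum release_status {
    REL_OK = 0,
    REL_NO_RESID,  /* "No entry for resId N": reservation already gone */
    REL_ERROR      /* any other failure */
};

/* Classify an ALPS error string; NULL or "" is treated as success. */
static enum release_status classify_release_error(const char *msg)
{
    if (msg == NULL || msg[0] == '\0')
        return REL_OK;
    if (strstr(msg, "No entry for resId") != NULL)
        return REL_NO_RESID;
    return REL_ERROR;
}

/*
 * The caller still sees REL_NO_RESID and can react to it, but only
 * genuine failures are worth a line in the log file.
 */
static int should_log_release_error(enum release_status status)
{
    return status == REL_ERROR;
}
```

In this sketch the message "apsched: No entry for resId 2087" maps to `REL_NO_RESID`, which the caller can still test for, while `should_log_release_error()` returns 0 for it.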
      
      The callers of this method are:
       * do_basil_release() - which already is set up to handle error/non-error case;
       * basil_safe_release() - this does no extra error checking, since it is called
         when trying to remove orphaned reservations; any failure in attempting to
         release the reservation will result in repeated "orphaned ALPS reservation ..."
         messages.
    • select/cray: always release reservation before calling apkill · 164ed8df
      Moe Jette authored
      This patch implements the same principle as an earlier one to fix issues
      when signalling aprun job steps via apkill: to avoid race conditions
      where further aprun lines get started while the current one is still in
      progress, always release the reservation first.
    • select/cray: refactor Basil 4.0 table · 25f5fbc8
      Moe Jette authored
      This refactors the code to parse Basil 4.0 response data, removing
      code that is applicable to both Basil 3.1 and 4.0.
    • Major update of Cray documentation. · f95315c9
      Moe Jette authored
    • Add some more Cray-specific tools. · fbe72f99
      Moe Jette authored
    • updated libalps test logic and add more notes · ed422a91
      Moe Jette authored
    • -- Added contribs/cray/libalps_test_programs.tar.gz with tools to validate · 2bf4d43e
      Moe Jette authored
          SLURM's logic used to support Cray systems.
    • modify srun wrapper to take single character options without a space between · c3c4e48b
      Moe Jette authored
      the key and value (e.g. "-N2" gets translated to "-N 2" for the perl parser).
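The option rewriting described in the last commit above ("-N2" becomes "-N 2") can be illustrated with a short sketch. The srun wrapper itself is a Perl script; the C function below is only a hypothetical illustration of the transformation, and it assumes that every single-letter option takes a value, which need not hold for the real wrapper.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite a fused short option such as "-N2" as "-N 2".
 * Arguments that are not of the form "-<letter><value>" are copied as-is.
 * Returns 0 on success, -1 if the result does not fit into out.
 */
static int split_short_option(const char *arg, char *out, size_t out_len)
{
    int n;

    assert(arg != NULL && out != NULL && out_len > 0);
    if (strlen(arg) > 2 && arg[0] == '-' && isalpha((unsigned char)arg[1]))
        n = snprintf(out, out_len, "-%c %s", arg[1], arg + 2);
    else
        n = snprintf(out, out_len, "%s", arg);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

Long options such as "--nodes=2" and bare flags such as "-N" pass through unchanged, because they do not match the "-<letter><value>" pattern.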
  2. 16 Apr, 2011 8 commits
  3. 15 Apr, 2011 3 commits
  4. 14 Apr, 2011 7 commits
  5. 13 Apr, 2011 5 commits
  6. 12 Apr, 2011 3 commits
  7. 11 Apr, 2011 6 commits