1. 19 Apr, 2011 3 commits
  2. 18 Apr, 2011 6 commits
  3. 17 Apr, 2011 13 commits
    • adds a bunch of cosmetic changes from Gerrit. · 24a02ef9
      Moe Jette authored
      24a02ef9
    • job_submit/lua: expose priority-related job_record fields to lua interface · b22385b7
      Moe Jette authored
      This allows scripted modification of job records, by exposing the
       * job_ptr->direct_set_prio
       * job_ptr->priority
       * job_ptr->details->nice
      fields to the job_submit.lua script.
      b22385b7
    • slurmctld: allow job_submit plugin to modify/set the priority/nice values · 98d7059f
      Moe Jette authored
      This allows the job_submit plugin to directly set priority values. If it
      assigns a priority value different from 0 and NO_VAL, the priority is marked
      as "fixed" via job_ptr->direct_set_prio.
      
      To enable this, the permission check for directly set priority is now done
      before calling the job_submit plugin, which also allows the plugin to
      influence the nice value of the job.
      98d7059f
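The rule described in this commit can be sketched as follows. This is a minimal Python model, not SLURM code: the `Job` class and `apply_plugin_priority` are hypothetical names, and the `NO_VAL` constant is assumed to be the "no value set" sentinel the message refers to.

```python
from dataclasses import dataclass

NO_VAL = 0xFFFFFFFE  # assumption: sentinel meaning "no value set"


@dataclass
class Job:
    """Hypothetical stand-in for the priority fields of a job record."""
    priority: int = NO_VAL
    direct_set_prio: bool = False


def apply_plugin_priority(job: Job, plugin_priority: int) -> Job:
    """Record the priority chosen by a (hypothetical) job_submit plugin.

    A value other than 0 and NO_VAL is treated as an explicit priority,
    so the job's priority is marked as fixed.
    """
    job.priority = plugin_priority
    if plugin_priority not in (0, NO_VAL):
        job.direct_set_prio = True
    return job
```

In this model a plugin that leaves the priority at `NO_VAL`, or sets it to 0, does not mark the priority as fixed.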
    • job_submit: allow job_submit plugin to put job on hold · fecf4769
      Moe Jette authored
      This reorders the code of _job_create() to the effect that the job_submit plugin
      is able to put a job on hold (by setting the job priority to 0). To prevent the
      user from releasing such jobs, jobs put on hold by the job_submit plugin use
      WAIT_HELD rather than WAIT_HELD_USER.
      fecf4769
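A small sketch of the hold rule from this commit, assuming a simplified model in which a hold comes either from the plugin or from the user. `HoldReason`, `hold_reason` and `user_may_release` are hypothetical names for illustration; only the two reason names are taken from the commit message.

```python
from enum import Enum
from typing import Optional


class HoldReason(Enum):
    WAIT_HELD = "WAIT_HELD"
    WAIT_HELD_USER = "WAIT_HELD_USER"


def hold_reason(priority: int, held_by_plugin: bool) -> Optional[HoldReason]:
    """Return the hold reason for a job, or None if the job is not held.

    Priority 0 means "on hold". Holds placed by the job_submit plugin get
    WAIT_HELD; in this simplified model every other hold is a user hold.
    """
    if priority != 0:
        return None
    return HoldReason.WAIT_HELD if held_by_plugin else HoldReason.WAIT_HELD_USER


def user_may_release(reason: Optional[HoldReason]) -> bool:
    """Assumption: a user may only release holds marked WAIT_HELD_USER."""
    return reason is HoldReason.WAIT_HELD_USER
```

Under these assumptions, a job held by the plugin cannot be released by its owner, which is the effect the commit describes.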
    • select/cray: unconditionally release reservations · 6d36b50c
      Moe Jette authored
      This increases robustness in releasing ALPS reservations. Previously
      the reservation was only released through
       * select_g_job_fini() for interactive (salloc) sessions;
       * batch_finish() by slurmstepd for batch sessions.
      
      That arrangement was a single point of failure for batch jobs, since a failure
      of batch_finish() would mean that the reservation could only be released
      much later, through the detection of orphaned ALPS reservations in
      basil_inventory().
      
      For batch jobs that terminate normally this means that the RELEASE method is
      called twice: first in job_complete(), and then in batch_finish(). The Basil
      1.2 design document by Ben Landsteiner (dated 15 Feb 2011) suggests in section
      3.3.5 repeated calls of RELEASE as one possible way of improving the response
      of the RELEASE method. There will be additional "entry not found" messages in
      the apschedMMDD logs, but (due to the preceding patch) not in the SLURM logs.
      
      For jobs that have to be terminated (e.g. job_timed_out, job_requeue, job_fail),
      this patch will mean that the RELEASE is called much sooner and thus is 
      expected to improve efficiency.
      
      For interactive salloc sessions that are cancelled via scancel, there is no
      longer a warning message about the already released ALPS reservation
      (since the release happens first through select_p_job_signal and then through
       job_complete -> deallocate_nodes -> select_p_job_fini).
      6d36b50c
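The idea of calling RELEASE repeatedly until the backend reports that the reservation is gone can be modelled as below. This is an illustrative Python sketch only: `FakeAlps` is a hypothetical in-memory stand-in for the ALPS reservation table, and `release_until_gone` is not a SLURM function. A real caller would wait between attempts; the sketch omits the delay.

```python
class FakeAlps:
    """Hypothetical in-memory stand-in for the ALPS reservation table."""

    def __init__(self, reservation_ids):
        self.reservations = set(reservation_ids)

    def release(self, res_id):
        """Return True if a reservation was released, False if none existed."""
        if res_id in self.reservations:
            self.reservations.remove(res_id)
            return True
        return False


def release_until_gone(alps, res_id, max_attempts=5):
    """Send RELEASE until the backend reports "no entry" for res_id.

    Returns the number of RELEASE calls that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        if not alps.release(res_id):
            return attempt  # backend confirms the reservation is gone
    raise RuntimeError(f"reservation {res_id} still present")
```

With this model, releasing an existing reservation takes two calls (one that releases, one that confirms), which mirrors the double RELEASE described for batch jobs that terminate normally.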
    • select/cray: special case for "no such resId" · d26dc971
      Moe Jette authored
      For robustness, it does make sense to call the RELEASE method multiple times.
      
      The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011, suggests in
      section 3.3.5, "Improve RELEASE method response" the following:
      
       "Periodically send RELEASE method requests until the RELEASE method
        response indicates the reservation is gone (via an error response)".
      
      The typical error message for this case is (also shown in the document on page 11):
      
      [2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
      [2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543
      
      There is already at least one use case for this type of error: cancelling a salloc session via scancel:
       1. scancel causes job_signal() to be invoked,
       2. job_signal() defers to select_g_job_signal(),
          - select/cray:select_p_job_signal() calls do_basil_release() before forwarding SIGKILL,
          - at this stage the reservation is already released,
       3. in the likely case of a running salloc session, job_signal() then calls deallocate_nodes(),
       4. this calls select_g_job_fini(),
          - the salloc default action for select/cray:select_p_job_fini() is to call do_basil_release(),
          - since at this stage the reservation has already been released, the error message results.
      
      I don't like this error message myself, but avoiding it in all call paths is complicated. Also,
      for robustness, I would very much prefer that do_basil_release() is always called from
      deallocate_nodes().
      
      Hence this patch creates a custom error class for the case "no entry for resId xxx". This
      allows the calling function to still catch the error, but the unnecessary warning is no
      longer printed in the logfiles.
      
      The callers of this method are:
       * do_basil_release() - which already is set up to handle error/non-error case;
       * basil_safe_release() - this does no extra error checking: it is called
         when trying to remove orphaned reservations, so any failure in attempting to
         release the reservation will result in repeated "orphaned ALPS reservation ..."
         messages.
      d26dc971
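The "custom error class" approach from this commit can be illustrated with a short Python model. It is a sketch under stated assumptions, not the select/cray implementation: `BasilError`, `NoSuchReservation`, `basil_release` and `release_quietly` are hypothetical names, and the reservation table is a plain set.

```python
import logging


class BasilError(Exception):
    """Hypothetical base class for backend errors."""


class NoSuchReservation(BasilError):
    """Hypothetical error class for "no entry for resId xxx"."""


def basil_release(reservations, res_id):
    """Release res_id from the (modelled) reservation table."""
    if res_id not in reservations:
        raise NoSuchReservation(f"No entry for resId {res_id}")
    reservations.remove(res_id)


def release_quietly(reservations, res_id, log=logging.getLogger("sketch")):
    """Return True if the reservation is gone after the call.

    The caller can still catch the "no entry" case, but nothing is
    logged for it; any other backend error is logged as before.
    """
    try:
        basil_release(reservations, res_id)
    except NoSuchReservation:
        return True  # already released earlier: not worth a log line
    except BasilError as err:
        log.error("ALPS error: %s", err)
        return False
    return True
```

Calling `release_quietly` twice on the same reservation succeeds both times in this model, and only errors other than the "no entry" case reach the log.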
    • select/cray: always release reservation before calling apkill · 164ed8df
      Moe Jette authored
      This patch implements the same principle as an earlier one to fix issues
      when signalling aprun job steps via apkill: to avoid race conditions
      where further aprun lines get started while the current one is still in
      progress, always release the reservation first.
      164ed8df
    • select/cray: refactor Basil 4.0 table · 25f5fbc8
      Moe Jette authored
      This refactors the code to parse Basil 4.0 response data, removing
      code that is applicable to both Basil 3.1 and 4.0.
      25f5fbc8
    • Major update of Cray documentation. · f95315c9
      Moe Jette authored
      f95315c9
    • Add some more Cray-specific tools. · fbe72f99
      Moe Jette authored
      fbe72f99
    • updated libalps test logic and add more notes · ed422a91
      Moe Jette authored
      ed422a91
    • -- Added contribs/cray/libalps_test_programs.tar.gz with tools to validate · 2bf4d43e
      Moe Jette authored
          SLURM's logic used to support Cray systems.
      2bf4d43e
    • modify srun wrapper to take single character options without a space between · c3c4e48b
      Moe Jette authored
      the key and value (e.g. "-N2" gets translated to "-N 2" for the perl parser).
      c3c4e48b
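The translation described here ("-N2" becoming "-N 2") can be sketched as follows. The actual wrapper is written in Perl; this Python version is only an illustration, `split_short_options` is a hypothetical name, and the set of value-taking option letters is an assumption chosen for the example.

```python
import re

# A single dash, one letter, then a value glued directly to it.
_SHORT_OPT_WITH_VALUE = re.compile(r"^-([A-Za-z])(.+)$")


def split_short_options(argv, value_options=frozenset("Nn")):
    """Split glued short options such as "-N2" into "-N", "2".

    Only letters listed in value_options are split; everything else,
    including long options and bare flags, is passed through unchanged.
    """
    out = []
    for arg in argv:
        match = _SHORT_OPT_WITH_VALUE.match(arg)
        if match and match.group(1) in value_options:
            out.extend(["-" + match.group(1), match.group(2)])
        else:
            out.append(arg)
    return out
```

Restricting the split to known value-taking letters keeps arguments that merely start with a dash from being rewritten by accident.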
  4. 16 Apr, 2011 8 commits
  5. 15 Apr, 2011 3 commits
  6. 14 Apr, 2011 7 commits