1. 29 Jan, 2011 14 commits
    • scontrol: disable wait_job on Cray systems · 6a06a145
      Moe Jette authored
      On Cray, wait_job means to confirm the already existing ALPS reservation. This
      is handled already:
       * for salloc by select_g_job_ready() - hence no need to call again;
       * for batch jobs it is done in the stepdmanager.
      Hence just print a warning to the user.
      
      13_scontrol-no-wait_job.diff
    • salloc: add support for Cray · c036763e
      Moe Jette authored
      This adds support for execution of salloc on a local Cray system,
      disabling node sharing (still not supported on XT/XE).
      
      It further disables running salloc within salloc, as it leads to errors: since
      Cray uses process group / PAGG IDs for tracking its reservations, running
      salloc from within salloc invariably leads to an ALPS resource allocation error.
      
      Thirdly, it disables Cray node allocation on non-Cray systems, since this
      requires that the host on which salloc spawns the shell process is capable
      of Cray task launch.
      
      If it is not, then the remote slurmctld will reserve the requested nodes, but
      the local host running salloc will neither be able to confirm the ALPS
      reservation (due to the absence of a local apbasil command), nor would it be
      able to run jobs on the compute nodes.
      
      To distinguish this case from general task launch (we use a frontend host where
      salloc could end up running jobs on different clusters, depending on the value
      exported via $SLURM_CONF), the following combination of conditions is tested:
      
       * Cray build support has been enabled (HAVE_CRAY);
       * the loaded slurm.conf uses select/cray (required on Cray hosts);
       * the local host does not have support for apbasil (HAVE_NATIVE_CRAY undefined).
      
      Since the 'apbasil' command is only available on native Cray systems, this
      combination of conditions seems sufficient to prevent accidentally using
      salloc on a host which does not support it.
      
      (For sbatch the case is different, since the job script runs on the remote host.)
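
The three-part test described above can be sketched as a single predicate. This is only an illustration, not the code from 11_salloc.diff: the function name `salloc_host_unsupported` is hypothetical, and the two boolean parameters are assumed stand-ins for the HAVE_CRAY and HAVE_NATIVE_CRAY compile-time macros, while `select_type` stands in for the select plugin name read from slurm.conf.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): returns true when salloc
 * should refuse to run on this host, i.e. when all three conditions hold.
 */
static bool salloc_host_unsupported(bool have_cray, bool have_native_cray,
                                    const char *select_type)
{
    /* Condition 1: Cray build support has been enabled. */
    if (!have_cray)
        return false;

    /* Condition 2: the loaded slurm.conf uses select/cray. */
    if (select_type == NULL || strcmp(select_type, "select/cray") != 0)
        return false;

    /* Condition 3: the local host has no apbasil support. */
    return !have_native_cray;
}
```

With these assumptions, a frontend host built with Cray support but without native parts would be rejected only while its slurm.conf selects select/cray; pointing $SLURM_CONF at a non-Cray cluster makes the predicate false again.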
      
      11_salloc.diff
      done with minor change for Cray emulation
    • select/cray: do the inventory immediately before each schedule · 100defe0
      Moe Jette authored
      This puts the Basil inventory immediately before each (backfill) schedule. 
      
      Of the alternatives considered, this is the most robust and least
      wasteful solution. The reason is that ALPS keeps internal node state, which
      can be changed
       * by the administrator (xtprocadmin),
       * by the node health checker programs (setting some nodes into 'suspect'),
       * by ALPS itself.
      
      Tracking this only periodically, e.g. every HealthCheckInterval, could miss
      some state changes. The result would not be a crash, but a subsequently
      failed ALPS reservation, which would require undoing some of the slurm state.
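
The trade-off can be illustrated with a toy model. Everything here is an assumption made for the sketch: the `node_view` structure, `refresh_inventory` and `schedule_pass` are hypothetical names, and the "inventory" is reduced to copying an array, whereas the real patch issues a Basil INVENTORY request.

```c
#include <stdbool.h>
#include <stddef.h>

enum node_state { NODE_UP, NODE_DOWN };

/* Hypothetical model: 'alps' is the state ALPS holds, 'cached' is the
 * scheduler's own copy of it. */
struct node_view {
    const enum node_state *alps;
    enum node_state *cached;
    size_t count;
};

/* Stand-in for an inventory query: copy the ALPS state into the cache. */
static void refresh_inventory(struct node_view *v)
{
    for (size_t i = 0; i < v->count; i++)
        v->cached[i] = v->alps[i];
}

/* Return the index of the first node the scheduler believes is up, or -1.
 * When inventory_first is set, the cache is refreshed before choosing. */
static int schedule_pass(struct node_view *v, bool inventory_first)
{
    if (inventory_first)
        refresh_inventory(v);
    for (size_t i = 0; i < v->count; i++)
        if (v->cached[i] == NODE_UP)
            return (int)i;
    return -1;
}
```

If ALPS has marked node 0 down since the last refresh, a pass without the inventory still picks node 0 from the stale cache (the case that would later fail at reservation time), while a pass with the inventory picks the next node that is actually up.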
      
      Also added inventory to plugin/sched/wiki and wiki2 at get_node time.
      
      09_Cray-INVENTORY-directly-before-schedule.diff
    • 04_Cray-autoconf-rules.diff · a97bbf4f
      Moe Jette authored
      select/cray: update compile-time and runtime support for Cray build
      
      These changes update build support for Cray XT/XE:
       1. renamed '--cray-xt' to '--cray', since XE systems are also supported;
       2. autoconf rules to cover the various possible build cases:
          a) --enable-cray=off: HAVE_CRAY/HAVE_NATIVE_CRAY undefined,
          b) --enable-cray=on:  HAVE_CRAY defined
             b1) local host is a native Cray system: HAVE_NATIVE_CRAY defined
                 (requires installation of mysql-devel and libexpat-devel packages),
             b2) local host is not a native Cray system: the conditionally built
                 parts (basil_interface.c, libalps.la) are not built;
       3. updated configure logic:
          - since Cray support depends on MySQL, reordered tests in configure.ac,
          - reordered logic with regard to changes in (2),
          - an AM_CONDITIONAL to build native-Cray parts conditionally,
          - updated configure messages (XT/XE);
       4. run-time read_conf test to ensure use of select/cray is properly supported,
       5. an update of the NEWS file due to the change in (1) ==> may have a conflict
          in case you have a locally-updated copy.
      
      I have compile-tested the three possible scenarios in (2).
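
The build cases in (2) map onto the two preprocessor macros roughly as follows. This is a sketch for orientation only: the function `cray_build_case` and its return strings are hypothetical, and only the macro names HAVE_CRAY and HAVE_NATIVE_CRAY come from the commit message.

```c
/*
 * Hypothetical helper: report which of the build cases (a), (b1), (b2)
 * was selected at configure time, based on the macros named above.
 */
static const char *cray_build_case(void)
{
#if defined(HAVE_CRAY) && defined(HAVE_NATIVE_CRAY)
    /* b1: native Cray host, conditionally built parts are compiled in */
    return "b1: native Cray system";
#elif defined(HAVE_CRAY)
    /* b2: Cray support enabled, but native-only parts are left out */
    return "b2: Cray support without native parts";
#else
    /* a: --enable-cray=off, neither macro is defined */
    return "a: no Cray support";
#endif
}
```

Compiled with neither macro defined, the function reports case (a); adding `-DHAVE_CRAY` or `-DHAVE_CRAY -DHAVE_NATIVE_CRAY` to the compiler flags selects (b2) or (b1) respectively.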
    • -- Preserve NodeHostName when reordering nodes due to system topology. · 55ebc2dd
      Moe Jette authored
          03_Bug-fix_slurmctld-swap-both-NodeAddr-and-NodeHostname-when-reordering.diff
    • 01_Cray-scontrol-warning-node-update.diff · f8ca2840
      Moe Jette authored
      scontrol: warn user that base node state can not be changed on Cray
      
      The base node state (UP, DOWN, ALLOCATED, ...) is handled by ALPS and inferred
      from reading the output of ALPS inventory requests.
      
      To avoid inconsistencies, it is not possible for a user to alter this node state.
      This patch adds a warning to scontrol if a user wants to change node state through
      slurm:
      
      palu> scontrol update NodeName=nid00171 State=DOWN
      State=DOWN can not be changed through slurm: use native Cray tools such as e.g. xtprocadmin(8)
      
      The 'meta' states such as DRAIN can still be changed.
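
The distinction between base and 'meta' states can be sketched as a simple filter on the requested state name. This is a simplification and not the scontrol code: the function name `cray_state_update_allowed` is hypothetical, and the list of base states contains only the three examples named in the message above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/*
 * Hypothetical check: on Cray, reject updates to a base node state and
 * let every other state (for example DRAIN) through.
 */
static bool cray_state_update_allowed(const char *state)
{
    /* Base states named in the commit message; the real list may differ. */
    static const char *const base_states[] = { "UP", "DOWN", "ALLOCATED" };
    const size_t n = sizeof(base_states) / sizeof(base_states[0]);

    for (size_t i = 0; i < n; i++)
        if (strcasecmp(state, base_states[i]) == 0)
            return false;   /* managed by ALPS, not changeable via slurm */
    return true;
}
```

Under these assumptions, the request from the example above (State=DOWN) is refused and would trigger the warning, whereas State=DRAIN passes the check.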
    • svn merge -r22275:22267 https://eris.llnl.gov/svn/slurm/trunk · 3e7505dd
      Moe Jette authored
      This reverses some patches from Gerrit that were out of date; the plan is
      now to work forward from the start.
    • -- Updated configure option "--enable-cray" to support interaction with Cray · 6d20c856
      Moe Jette authored
          XT/XE systems, and build on native Cray XT/XE systems (auto-detected).
          Building on native Cray systems requires the cray-MySQL-devel-enterprise
          rpm and expat XML parser library/headers.
      
      select/cray: update compile-time and runtime support for Cray build
      
      These changes update build support for Cray XT/XE:
       1. renamed '--cray-xt' to '--cray', since XE systems are also supported;
       2. autoconf rules to cover the various possible build cases:
          a) --enable-cray=off: HAVE_CRAY/HAVE_NATIVE_CRAY undefined,
          b) --enable-cray=on:  HAVE_CRAY defined
             b1) local host is a native Cray system: HAVE_NATIVE_CRAY defined
                 (requires installation of mysql-devel and libexpat-devel packages),
             b2) local host is not a native Cray system: the conditionally built
                 parts (basil_interface.c, libalps.la) are not built;
       3. updated configure logic:
          - since Cray support depends on MySQL, reordered tests in configure.ac,
          - reordered logic with regard to changes in (2),
          - an AM_CONDITIONAL to build native-Cray parts conditionally,
          - updated configure messages (XT/XE);
       4. run-time read_conf test to ensure use of select/cray is properly supported,
       5. an update of the NEWS file due to the change in (1) ==> may have a conflict
          in case you have a locally-updated copy.
      
      I have compile-tested the three possible scenarios in (2).
    • -- Set Cray node order based upon ALPS_NIDORDER configuration. · 04bfa3c1
      Moe Jette authored
          03_Cray-BASIL-node-ranking.diff
      select/cray: perform node ranking
      
      This supplies the select function-pointer to request a reordering of nodes based
      on the current Cray node ordering. 
      
      The Cray node ordering is set internally via the ALPS_NIDORDER configuration
      variable that controls the way ALPS considers nodes.
      
      This ordering in turn determines the order of nodes as they appear subsequently
      in the Inventory output. The present patch exploits this fact and uses an
      auto-incrementing number to reflect the node ranking (counting is reversed 
      since the parser returns the nodes in stack/LIFO order).
      
      The node ranking is performed on slurmctld (re-)configuration, hence the tests
      are more stringent: exit if Inventory fails (this condition is extremely rare)
      and if no nodes are powered up (also a condition that can be cured by restarting
      slurmctld only when the system is ready).
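
The reversed counting can be sketched as follows. This is an illustration under stated assumptions, not the code from 03_Cray-BASIL-node-ranking.diff: the `node_entry` list and `assign_ranks` are hypothetical, and the choice that the first node in the Inventory output receives rank 1 is made here only to show the idea.

```c
#include <stddef.h>

/* Hypothetical list element: the parser pushes nodes onto a stack, so the
 * head of the list is the LAST node that appeared in the inventory. */
struct node_entry {
    unsigned int node_id;
    unsigned int rank;
    struct node_entry *next;
};

/*
 * Assign ranks so that they follow inventory order even though the list
 * is in LIFO order. Returns the number of nodes ranked; 0 means that no
 * nodes were found, which the caller would treat as a fatal condition.
 */
static unsigned int assign_ranks(struct node_entry *head)
{
    unsigned int total = 0;

    for (struct node_entry *n = head; n != NULL; n = n->next)
        total++;

    /* Count down while walking: the head (last in inventory) gets the
     * highest rank, the tail (first in inventory) gets rank 1. */
    unsigned int rank = total;
    for (struct node_entry *n = head; n != NULL; n = n->next)
        n->rank = rank--;

    return total;
}
```

For an inventory listing nodes 10, 20, 30 in that order, the parser's list would start at node 30, and the sketch assigns ranks 1, 2 and 3 to nodes 10, 20 and 30.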
    • -- Preserve node's NodeHostName field when reordering for topology. · dbf26340
      Moe Jette authored
          03_node-reordering-NodeHostName.diff
    • -- For Cray systems, resolve node attributes and coordinates from ALPS. · fd2dfdb9
      Moe Jette authored
          02_Cray-BASIL-node-attributes-and-coordinates.diff
    • -- Prevent changing a node's Reason or State on a Cray system. · 6e1842fa
      Moe Jette authored
          02_salloc-no-node-update.diff
    • Cray BASIL API: basic support added to the select/cray plugin. · 832898b7
      Moe Jette authored
          01_Cray-BASIL-basic-support.diff plus
          01_changes-from-first-revision-of-patch-01.diff
    • Do not attempt to read the batch script for non-batch jobs. This patch · d0093b8e
      Moe Jette authored
          eliminates some inappropriate error messages. 01_interactive-no-script.diff
  2. 28 Jan, 2011 3 commits
  3. 27 Jan, 2011 3 commits
  4. 26 Jan, 2011 3 commits
  5. 25 Jan, 2011 2 commits
  6. 24 Jan, 2011 2 commits
  7. 22 Jan, 2011 1 commit
  8. 21 Jan, 2011 5 commits
  9. 20 Jan, 2011 2 commits
  10. 19 Jan, 2011 3 commits
  11. 18 Jan, 2011 1 commit
  12. 15 Jan, 2011 1 commit