1. 27 Apr, 2011 14 commits
    • Result of running autogen.sh on snowflake · 9b79edb5
      Moe Jette authored
      9b79edb5
    • Merged pull request #5 from SchedMD/master. · 2a75c50b
      Morris Jette authored
      assorted changes for Cray system use of proctrack/sgi_job
      2a75c50b
    • select/cray: add linking dependency for libjob.so · 0112a79b
      Morris Jette authored
      This adds a link test for libjob.so
       * salloc needs direct support (Makefile.am),
       * however, X_AC_SGI_JOB comes much later in configure.ac.
      
      (Since the libjob interface has hardly changed for 2.6 kernels, an
       alternative would be to integrate its 19 ioctls into slurm.)
      0112a79b
    • salloc: support for Cray container IDs · 0344cdf3
      Morris Jette authored
      This adds detection and use of SGI process aggregate job container IDs for
      salloc interactive sessions.
      
      The preferred and documented way to support this on a Cray system is by
      enabling the provided pam_job.so via /etc/pam.d/common-session.
      
      There is a header dependency on job.h. This file depends on the optional
      cray-libjob-devel package, which installs into /opt/cray/job/<version>.
      
      However, this package is not always installed and may not be up to date.
      Hence the patch "cheats" by duplicating the known prototype of job_getjid().
      0344cdf3
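The prototype duplication described in this commit can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the `jid_t` typedef and the `job_getjid()` signature are assumed to mirror the declaration in libjob's job.h, the use of `(jid_t)-1` as the failure value is likewise an assumption, and `has_job_container()` is a hypothetical helper introduced only for this example.

```c
#include <stdint.h>
#include <sys/types.h>

/*
 * Declared locally instead of including <job.h>, which may not be installed.
 * ASSUMPTION: this mirrors the declaration in libjob's job.h.
 */
typedef uint64_t jid_t;
extern jid_t job_getjid(pid_t pid);

/*
 * Hypothetical helper: does a returned ID denote a real job container?
 * ASSUMPTION: libjob signals failure with (jid_t)-1.
 */
int has_job_container(jid_t jid)
{
	return jid != (jid_t)-1;
}
```

A caller linked against libjob.so would pass the result of `job_getjid(getpid())` to the helper. The local declaration only has to stay in sync with the library for this to be safe, which is the trade-off the commit message calls "cheating".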
    • select/cray: warn user of wrong ProctrackType · 4545a1f0
      Morris Jette authored
      To work properly, select/cray requires proctrack/sgi_job. In fact, due to the
      way the container functions are called by the slurmstepd, it will only work
      properly with this plugin.
      
      I have considered alternatives, such as falling back to using the SID to
      confirm the allocation. But this attempt to support configuration errors
      creates other problems, such as less stringent error checking.
      4545a1f0
    • slurmstepd: use (SGI) container ID to confirm ALPS reservation · 35f5133e
      Morris Jette authored
      This equips the slurmstepd to use the SGI process aggregate container
      ID to confirm the ALPS reservation.
      
      The way it is coded allows ALPS to return a temporary error once when
      confirming the reservation; the job will then be requeued.
      
      On the other hand, it is mandatory that the Cray job service works correctly;
      errors are therefore returned if
       * job container creation fails or
       * the job is not attached to a container (anticipating later failure in
         _fork_all_tasks() when slurm_container_add() will fail for the same reason).
      
      The patch relies on the internals of the proctrack/sgi_job plugin in order to
      avoid duplicating code. This dependency is made explicit by a configuration
      check in a subsequent patch.
      
      With these two pieces in place, the frontends are set to DRAINING if a system
      administrator forgets to enable the /etc/init.d/job service, as shown in the
      following log entries:
      
      slurmd log:
      [2011-04-26T14:56:04] Launching batch job 134 for UID 21215
      [2011-04-26T14:56:04] [134] no PAGG ID: job service disabled on this host?
      [2011-04-26T14:56:04] [134] could not confirm ALPS resId 253
      [2011-04-26T14:56:04] [134] job_manager exiting abnormally, rc = 4014
      
      slurmctld log:
      [2011-04-26T14:56:03] ALPS RESERVATION #253, JobId 134: BASIL -n 2 -N 1 -d 1 -m 16000
      [2011-04-26T14:56:03] sched: Allocate JobId=134 NodeList=nid000[16-17] #CPUs=24
      [2011-04-26T14:56:04] error: slurmd error 4014 running JobId=134 on \
      			front_end=gele2: Slurmd could not set up environment for batch job
      [2011-04-26T14:56:04] update_front_end: set state of gele2 to DRAINING
      [2011-04-26T14:56:04] completing job 134
      35f5133e
    • select/cray: use pagg ID to confirm reservations · ac21f730
      Morris Jette authored
      This uses the SGI process aggregate container ID to confirm the job reservation.
      
      It falls back to using the alloc_sid in case of failure. This fallback
      should be considered only a last resort: session IDs are known not to be
      unique across multiple login nodes, so the confirmation of ALPS reservations
      will fail whenever there is a SID collision (the likelihood increases with
      system size).
      ac21f730
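The fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code of the patch: `choose_confirm_cookie()` and `NO_PAGG_ID` are hypothetical names, and treating both `(uint64_t)-1` and 0 as "no container" is an assumption made for this example.

```c
#include <stdint.h>
#include <sys/types.h>

/* ASSUMPTION: (uint64_t)-1 marks a failed PAGG ID lookup. */
#define NO_PAGG_ID ((uint64_t)-1)

/*
 * Hypothetical helper: pick the value used to confirm an ALPS reservation.
 * Prefers the PAGG container ID; falls back to the session ID (alloc_sid)
 * only when no usable container ID is available.
 */
uint64_t choose_confirm_cookie(uint64_t pagg_id, pid_t alloc_sid)
{
	/* ASSUMPTION: 0 is also treated as "no container". */
	if (pagg_id != NO_PAGG_ID && pagg_id != 0)
		return pagg_id;

	/* Last resort: SIDs can collide across login nodes. */
	return (uint64_t)alloc_sid;
}
```

Keeping the preference order in one place makes it easy to see that the session ID is used only when the container lookup has failed.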
    • select/cray: add data structure support for confirmation cookie · af59681f
      Morris Jette authored
      This extends the Cray-specific select_jobinfo struct with a confirmation cookie
      field, which is to be used by later patches to store the session SID or PAGG
      container ID.
      
      There is a slight incompatibility with regard to pack/unpack, due to the new
      confirm_cookie format. Since not many Cray installations exist yet, I
      suggest not doing an extra bump of the API version.
      af59681f
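To illustrate why adding a cookie field affects pack/unpack compatibility, here is a minimal sketch. Everything in it is hypothetical: `struct jobinfo_sketch`, its `reservation_id` field, the helper names, and the big-endian layout are assumptions for illustration and do not reproduce the actual select_jobinfo struct or Slurm's pack routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the Cray-specific select_jobinfo struct. */
struct jobinfo_sketch {
	uint32_t reservation_id;  /* ASSUMED pre-existing field */
	uint64_t confirm_cookie;  /* new field: session SID or PAGG container ID */
};

#define JOBINFO_OLD_SIZE 4u   /* packed size without the cookie */
#define JOBINFO_NEW_SIZE 12u  /* packed size with the 64-bit cookie */

/* Write the low nbytes of v into buf, most significant byte first. */
static void put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes from buf as a big-endian unsigned integer. */
static uint64_t get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t v = 0;
	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Pack both fields; returns the number of bytes written. */
size_t jobinfo_pack(const struct jobinfo_sketch *ji, uint8_t *buf)
{
	put_be(buf, ji->reservation_id, 4);
	put_be(buf + 4, ji->confirm_cookie, 8);
	return JOBINFO_NEW_SIZE;
}

/* Unpack both fields from a buffer written by jobinfo_pack(). */
void jobinfo_unpack(struct jobinfo_sketch *ji, const uint8_t *buf)
{
	ji->reservation_id = (uint32_t)get_be(buf, 4);
	ji->confirm_cookie = get_be(buf + 4, 8);
}
```

In this sketch a peer that still expects `JOBINFO_OLD_SIZE` bytes would misread the stream, which is the kind of incompatibility the commit message mentions.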
    • ck/cgroup: return value convention of slurm_container_find · 496a256b
      Morris Jette authored
      This is just for consistency with other proctrack plugins, which all return 0
      to indicate "not found", rather than (uint64_t)-1.
      496a256b
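The return value convention can be sketched as below. This is a simplified illustration, not the plugin code: `struct container_entry`, `container_find_sketch()`, and the table lookup are hypothetical, and only the "0 means not found" convention is taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical mapping from a process ID to its container ID. */
struct container_entry {
	pid_t    pid;
	uint64_t cont_id;
};

/* Convention from the commit message: 0 means "not found". */
#define CONTAINER_NOT_FOUND ((uint64_t)0)

/*
 * Hypothetical lookup: return the container ID of pid, or 0 if pid is
 * not in the table (rather than (uint64_t)-1).
 */
uint64_t container_find_sketch(const struct container_entry *tbl,
			       size_t n, pid_t pid)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].pid == pid)
			return tbl[i].cont_id;
	}
	return CONTAINER_NOT_FOUND;
}
```

With this convention a caller can test the result with a plain `if (!id)`, the same way regardless of which proctrack plugin is loaded.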
    • select/cray: set default TopologyPlugin · b56260ef
      Morris Jette authored
      The select/cray plugin discovers the topology as part of its initialisation and
      generates a node ranking. No further topology information is required by the
      plugin, hence this patch sets the default TopologyPlugin to topology/none.
      b56260ef
    • modified cray.shtml · 0fcaaf46
      Danny Auble authored
      0fcaaf46
    • 14fed23d
  2. 26 Apr, 2011 19 commits
  3. 25 Apr, 2011 7 commits