1. 27 Apr, 2011 9 commits
    • slurmstepd: use (SGI) container ID to confirm ALPS reservation · 35f5133e
      Morris Jette authored
      This equips the slurmstepd to use the SGI process aggregate container
      ID to confirm the ALPS reservation.
      
      As coded, ALPS may return a temporary error once while confirming the
      reservation; in that case the job is requeued.
      
      On the other hand, it is mandatory that the Cray job service works correctly,
      therefore errors are returned if
       * job container creation fails or
       * the job is not attached to a container (anticipating later failure in
         _fork_all_tasks() when slurm_container_add() will fail for the same reason).
      
      The patch relies on the internals of the proctrack/sgi_job plugin in order to
      avoid duplicating code. A subsequent patch makes this dependency explicit
      through a configuration check.
      
      With these two pieces in place, the frontends are set to DRAINING if a system
      administrator forgets to enable the /etc/init.d/job service, as shown in the
      following log entries:
      
      slurmd log:
      [2011-04-26T14:56:04] Launching batch job 134 for UID 21215
      [2011-04-26T14:56:04] [134] no PAGG ID: job service disabled on this host?
      [2011-04-26T14:56:04] [134] could not confirm ALPS resId 253
      [2011-04-26T14:56:04] [134] job_manager exiting abnormally, rc = 4014
      
      slurmctld log:
      [2011-04-26T14:56:03] ALPS RESERVATION #253, JobId 134: BASIL -n 2 -N 1 -d 1 -m 16000
      [2011-04-26T14:56:03] sched: Allocate JobId=134 NodeList=nid000[16-17] #CPUs=24
      [2011-04-26T14:56:04] error: slurmd error 4014 running JobId=134 on \
      			front_end=gele2: Slurmd could not set up environment for batch job
      [2011-04-26T14:56:04] update_front_end: set state of gele2 to DRAINING
      [2011-04-26T14:56:04] completing job 134
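
The decision logic described in this commit message can be sketched as follows. This is only an illustration of one reading of the text (a missing container is a hard error, a temporary ALPS error leads to a requeue); the enum and function names are hypothetical and are not taken from the Slurm sources, and treating a container ID of 0 as "no container" is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of asking ALPS to confirm a reservation. */
enum alps_status { ALPS_CONFIRMED, ALPS_TRANSIENT_ERROR, ALPS_PERMANENT_ERROR };

/* Hypothetical decision taken by the step daemon. */
enum launch_decision { LAUNCH_JOB, REQUEUE_JOB, FAIL_JOB };

/*
 * Assumption: pagg_id == 0 means the job is not attached to a container
 * (for example because the job service is disabled on this host).
 */
static enum launch_decision decide_launch(uint64_t pagg_id, enum alps_status st)
{
    if (pagg_id == 0)
        return FAIL_JOB;        /* the job service is mandatory */

    switch (st) {
    case ALPS_CONFIRMED:
        return LAUNCH_JOB;
    case ALPS_TRANSIENT_ERROR:
        return REQUEUE_JOB;     /* temporary error: let the job be requeued */
    default:
        return FAIL_JOB;
    }
}

/* Small usage example covering the three cases. */
static void example_decide_launch(void)
{
    assert(decide_launch(0, ALPS_CONFIRMED) == FAIL_JOB);
    assert(decide_launch(253, ALPS_CONFIRMED) == LAUNCH_JOB);
    assert(decide_launch(253, ALPS_TRANSIENT_ERROR) == REQUEUE_JOB);
}
```

Checking the container ID first mirrors the log excerpt above, where the missing PAGG ID is reported before the failure to confirm the reservation.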
    • select/cray: use pagg ID to confirm reservations · ac21f730
      Morris Jette authored
      This uses the SGI process aggregate (PAGG) container ID to confirm the job reservation.
      
      If that fails, it falls back to the alloc_sid. This fallback should be
      treated strictly as a last resort: session IDs are not unique across
      multiple login nodes, so the confirmation of ALPS reservations fails
      whenever there is a SID collision (the likelihood increases with system
      size).
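
A minimal sketch of the "container ID first, session ID as last resort" selection described above. The struct, enum, and function names are hypothetical, and the convention that 0 marks a missing ID is an assumption made for the example rather than something taken from the plugin code.

```c
#include <assert.h>
#include <stdint.h>

/* Records which kind of ID ended up being used as the confirmation value. */
enum cookie_source { COOKIE_NONE, COOKIE_PAGG, COOKIE_SID };

struct cookie_choice {
    enum cookie_source source;
    uint64_t value;
};

/*
 * Prefer the PAGG container ID. Fall back to the session ID only when no
 * container ID is available (assumption: 0 means "not available").
 */
static struct cookie_choice choose_cookie(uint64_t pagg_id, uint32_t alloc_sid)
{
    struct cookie_choice c = { COOKIE_NONE, 0 };

    if (pagg_id != 0) {
        c.source = COOKIE_PAGG;
        c.value = pagg_id;
    } else if (alloc_sid != 0) {
        c.source = COOKIE_SID;  /* last resort: SIDs can collide across login nodes */
        c.value = alloc_sid;
    }
    return c;
}

/* Small usage example. */
static void example_choose_cookie(void)
{
    assert(choose_cookie(42, 7).source == COOKIE_PAGG);
    assert(choose_cookie(0, 7).source == COOKIE_SID);
    assert(choose_cookie(0, 0).source == COOKIE_NONE);
}
```

Returning the source alongside the value lets a caller log when the weaker session-ID fallback was taken.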
    • select/cray: add data structure support for confirmation cookie · af59681f
      Morris Jette authored
      This extends the Cray-specific select_jobinfo struct with a confirmation cookie
      field, which is to be used by later patches to store the session SID or PAGG
      container ID.
      
      There is a slight pack/unpack incompatibility due to the new
      confirm_cookie format. Since not many Cray installations exist yet, I
      suggest not bumping the API version for this change.
    • ck/cgroup: return value convention of slurm_container_find · 496a256b
      Morris Jette authored
      This is just for consistency with other proctrack plugins, which all return 0
      to indicate "not found", rather than (uint64_t)-1.
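
The convention can be illustrated with a small lookup that returns 0 when nothing matches. This is a sketch under assumptions: the table type and the function `find_container` are hypothetical and do not reproduce the plugin's actual implementation or signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical mapping from a tracked process to its container ID. */
struct tracked_proc {
    pid_t pid;
    uint64_t cont_id;   /* valid container IDs are assumed to be non-zero */
};

/* Returns the container ID for pid, or 0 to indicate "not found". */
static uint64_t find_container(const struct tracked_proc *tab, size_t n, pid_t pid)
{
    assert(tab != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (tab[i].pid == pid)
            return tab[i].cont_id;
    }
    return 0;   /* not found: 0 rather than (uint64_t)-1 */
}

/* Small usage example. */
static void example_find_container(void)
{
    const struct tracked_proc tab[] = { { 100, 5 }, { 101, 6 } };

    assert(find_container(tab, 2, 101) == 6);
    assert(find_container(tab, 2, 999) == 0);
}
```

With 0 as the sentinel, callers can test the result with a plain `if (id == 0)` regardless of which plugin produced it.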
    • select/cray: set default TopologyPlugin · b56260ef
      Morris Jette authored
      The select/cray plugin discovers the topology as part of its initialisation and
      generates a node ranking. No further topology information is required by the
      plugin, hence this patch sets the default TopologyPlugin to topology/none.
    • modified cray.shtml · 0fcaaf46
      Danny Auble authored
    • 14fed23d
  2. 26 Apr, 2011 19 commits
  3. 25 Apr, 2011 10 commits
  4. 23 Apr, 2011 1 commit
  5. 22 Apr, 2011 1 commit