1. 08 Jan, 2013 6 commits
    • Morris Jette's avatar
      Merge branch 'slurm-2.5' · e5c8de12
      Morris Jette authored
      e5c8de12
    • Rod Schultz's avatar
      Report node state as MAINT only if not allocated jobs · 2af5ce33
      Rod Schultz authored
      One of our testers has observed that when a long-running job continues to run after a maintenance reservation comes into effect, sinfo reports the node as being in the allocated state while scontrol shows it to be in the maintenance state.
      
      This can happen when a node is not completely allocated (select cons_res, a partition which is not Shared=EXCLUSIVE, jobs allocated without --exclusive, or jobs that are allocated only some of the CPUs on a node).
      
      Execution paths in scontrol leading up to calls to node_state_string (slurm_protocol_defs.c) or node_state_string_compact test for allocated_cpus less than total_cpus on the node and set the node state to MIXED rather than ALLOCATED, while similar paths in sinfo do not.
      
      I think this is probably a bug, since the MIXED state is defined, and I think it is desirable that both commands return the same result.
      
      The problem can be fixed with two logic changes (in multiple places):
      
      1) node_state_string and node_state_string_compact have to check for mixed as well as allocated before returning the MAINT state. This means that the reported state for the node with the allocated job will be MIXED.
      
      2) sinfo must also check for allocated_cpus less than total_cpus and set the state to MIXED before calling either node_state_string or node_state_string_compact.
      
      The attached patch (against 2.5.1) makes these changes. The attached script is a test case.
      2af5ce33
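
The two logic changes described in commit 2af5ce33 can be illustrated with a minimal sketch. This is not the Slurm source: the enum values, the `FLAG_MAINT` bit, and the function names `derive_base_state` and `state_string` are hypothetical stand-ins chosen for this example, and the real `node_state_string` handles many more states and flags.

```c
#include <stdint.h>

/* Hypothetical, simplified state values; the real Slurm definitions differ. */
enum base_state { STATE_IDLE, STATE_ALLOCATED, STATE_MIXED };

/* Hypothetical maintenance flag bit, for illustration only. */
#define FLAG_MAINT 0x100u

/* Change 2: derive the base state from the CPU counts before formatting,
 * so a partially allocated node is MIXED rather than ALLOCATED. */
static enum base_state derive_base_state(uint16_t alloc_cpus,
                                         uint16_t total_cpus)
{
    if (alloc_cpus == 0)
        return STATE_IDLE;
    if (alloc_cpus < total_cpus)
        return STATE_MIXED;
    return STATE_ALLOCATED;
}

/* Change 1: report MAINT only when the node is neither allocated nor mixed. */
static const char *state_string(enum base_state base, unsigned int flags)
{
    if ((flags & FLAG_MAINT) &&
        base != STATE_ALLOCATED && base != STATE_MIXED)
        return "MAINT";

    switch (base) {
    case STATE_ALLOCATED:
        return "ALLOCATED";
    case STATE_MIXED:
        return "MIXED";
    default:
        return "IDLE";
    }
}
```

With both checks in one place, any caller that passes the same CPU counts and flags gets the same string, which is the consistency between sinfo and scontrol that the commit message asks for.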
    • Morris Jette's avatar
    • Morris Jette's avatar
      Added support for job arrays. · 2993b423
      Morris Jette authored
      Phase 1 of effort. See the -a/--array option in "man sbatch" for details.
      Creates job records using sbatch. Reports job arrays using scontrol or
      squeue. More work coming soon...
      2993b423
    • Danny Auble's avatar
      Get rid of errors when using 64 bit bitmaps (nothing sets USE_64BIT_BITSTR · 18c9ecd7
      Danny Auble authored
      today) so bitmaps are always 32 bits.  If one would like to use 64-bit
      bitmaps, just #define USE_64BIT_BITSTR in config.h.
      18c9ecd7
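
As a rough illustration of the compile-time switch mentioned in commit 18c9ecd7, the sketch below selects a word type based on whether `USE_64BIT_BITSTR` is defined. Only the macro name comes from the commit message; `bitword_t`, `BITWORD_BITS`, and `words_needed` are hypothetical names for this example and are not taken from the Slurm headers.

```c
#include <stdint.h>

/* USE_64BIT_BITSTR would normally be defined in config.h.
 * Uncomment the next line to try the 64-bit variant of this sketch. */
/* #define USE_64BIT_BITSTR 1 */

#ifdef USE_64BIT_BITSTR
typedef int64_t bitword_t;      /* hypothetical name for the bitmap word */
#define BITWORD_BITS 64
#else
typedef int32_t bitword_t;
#define BITWORD_BITS 32
#endif

/* Number of words needed to hold nbits bits, rounding up. */
static int words_needed(int nbits)
{
    return (nbits + BITWORD_BITS - 1) / BITWORD_BITS;
}
```

Code that hard-codes a 32-bit word size anywhere (shifts, masks, format strings) is the kind of place where errors would surface once the macro is defined, which is presumably what the commit cleans up.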
    • Danny Auble's avatar
      Convert hostlist functions on a multi dimensional system to use a bitmap · eb7500c9
      Danny Auble authored
      instead of a large array.  This appears to speed up the process a great
      deal. Before, we were seeing times of over 6000 usecs just to memset the
      array for a 5D system.  With this patch, the whole process takes around
      1000 usecs on average, with many runs well under that.
      eb7500c9
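
The idea behind commit eb7500c9 can be sketched as follows: map each multi-dimensional coordinate to a linear index and keep one bit per coordinate, so that clearing the structure touches far less memory than clearing one array element per coordinate. This is an illustrative sketch only. The dimension count of 5 matches the "5D system" in the message, but `DIM_SIZE`, `grid_bitmap_t`, and every function name here are assumptions, not the hostlist implementation.

```c
#include <stdint.h>
#include <stdlib.h>

#define DIMS 5          /* 5D system, as in the commit message */
#define DIM_SIZE 16     /* assumed extent of each dimension */

/* Hypothetical bitmap covering every coordinate of the grid. */
typedef struct {
    uint8_t *bits;
    size_t nbits;
} grid_bitmap_t;

/* Flatten a coordinate into a linear bit index (row-major order). */
static size_t grid_index(const int coord[DIMS])
{
    size_t idx = 0;
    for (int d = 0; d < DIMS; d++)
        idx = idx * DIM_SIZE + (size_t)coord[d];
    return idx;
}

/* Allocate a zeroed bitmap with one bit per grid coordinate. */
static grid_bitmap_t *grid_alloc(void)
{
    grid_bitmap_t *g = malloc(sizeof(*g));
    if (!g)
        return NULL;
    g->nbits = 1;
    for (int d = 0; d < DIMS; d++)
        g->nbits *= DIM_SIZE;
    g->bits = calloc((g->nbits + 7) / 8, 1);
    if (!g->bits) {
        free(g);
        return NULL;
    }
    return g;
}

/* Mark a coordinate as present. */
static void grid_set(grid_bitmap_t *g, const int coord[DIMS])
{
    size_t i = grid_index(coord);
    g->bits[i >> 3] |= (uint8_t)(1u << (i & 7));
}

/* Return 1 if the coordinate is marked, 0 otherwise. */
static int grid_test(const grid_bitmap_t *g, const int coord[DIMS])
{
    size_t i = grid_index(coord);
    return (g->bits[i >> 3] >> (i & 7)) & 1;
}

static void grid_free(grid_bitmap_t *g)
{
    if (g) {
        free(g->bits);
        free(g);
    }
}
```

Under these assumed sizes the grid has 16^5 coordinates, so a bitmap needs 128 KiB where an array of one byte per coordinate would need 1 MiB, and wider element types would need proportionally more.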
  2. 07 Jan, 2013 1 commit
  3. 04 Jan, 2013 5 commits
    • jette's avatar
      Use local no-mem functions · 3a6bd336
      jette authored
      Make sure out of memory gets logged properly for slurmctld in foreground
      
      Fix slurmd and slurmdbd to log out of memory to stdout in foreground
      3a6bd336
    • jette's avatar
      Use local no-mem functions · 5e1d0210
      jette authored
      5e1d0210
    • Mark A. Grondona's avatar
      mpi/mvapich: Don't set MPIRUN_PROCESSES by default · fd5b0e56
      Mark A. Grondona authored
      The MPIRUN_PROCESSES variable set by the mpi/mvapich plugin probably
      is not needed for most if not all recent versions of mvapich.
      This environment variable also negatively affects job scalability
      since its length is proportional to the number of tasks in a job.
      In fact, for very large jobs, the increased environment size can
      lead to failures in execve(2).
      
      Since MPIRUN_PROCESSES *might* be required in some older versions of
      mvapich, this patch skips setting that variable unless
      SLURM_NEED_MVAPICH_MPIRUN_PROCESSES is set in the job's environment.
      (Thus, by default MPIRUN_PROCESSES is not set, but the old behavior
      may be restored by setting the environment variable above.)
      fd5b0e56
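
The opt-in behavior described in commit fd5b0e56 amounts to a presence check on one environment variable. The sketch below shows that check in isolation. The variable name comes from the commit message; the helper names `opt_in_present` and `should_set_mpirun_processes` are hypothetical, and the plugin's actual code is not reproduced here.

```c
#include <stdlib.h>

/* Pure helper: the variable counts as "set" whenever a value exists,
 * regardless of what that value is. */
static int opt_in_present(const char *value)
{
    return value != NULL;
}

/* Decide whether MPIRUN_PROCESSES should be exported for this job.
 * By default it is not; setting SLURM_NEED_MVAPICH_MPIRUN_PROCESSES
 * in the job's environment restores the old behavior. */
static int should_set_mpirun_processes(void)
{
    return opt_in_present(getenv("SLURM_NEED_MVAPICH_MPIRUN_PROCESSES"));
}
```

Keeping the default off avoids an environment entry whose length grows with the task count, which the commit message identifies as the cause of execve(2) failures on very large jobs.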
    • jette's avatar
      b196f153
    • jette's avatar
      Fix logic in hostset_create for invalid input · 33cb1e40
      jette authored
      33cb1e40
  4. 03 Jan, 2013 16 commits
  5. 02 Jan, 2013 1 commit
    • Morris Jette's avatar
      Revert commit b2c18ec1 · ac27d503
      Morris Jette authored
      The original patch works fine to avoid cancelling a job when all
      of its nodes go unresponsive, but I don't see any way to easily
      address nodes coming back into service. We want to cancel jobs
      that have some up nodes and some down nodes, but the nodes will
      come back into service individually rather than all at once.
      ac27d503
  6. 31 Dec, 2012 1 commit
  7. 29 Dec, 2012 3 commits
  8. 28 Dec, 2012 7 commits