1. 24 Aug, 2011 5 commits
  2. 23 Aug, 2011 1 commit
  3. 22 Aug, 2011 2 commits
  4. 19 Aug, 2011 1 commit
    • Morris Jette's avatar
      Treat duplicate switch name in topology.conf as fatal error · d2a30013
      Morris Jette authored
      One of our testers created an illegal topology.conf file.
      
      He has a config you probably wouldn't see in production, but can see in
      testing when you are sometimes given a collection of miscellaneous
      resources.
      
                |-- nodes
      switch1 --|
                |-- switch2 -- nodes
      
      He tried the topology.conf file below. Switch s1 is defined twice. Slurm
      accepted this config, but wouldn't allocate nodes from both switches to
      one job.
      
      SwitchName=s1 Nodes=xna[14-26]
      SwitchName=s2 Nodes=xna[41-43]
      SwitchName=s1 Switches=s2
      
      I believe slurm shouldn't allow the second definition of switch s1. The
      attached patch checks for duplicate switch names.
      Patch from Rod Schultz, Bull.
      d2a30013
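The duplicate check the patch describes can be sketched as follows. This is a minimal illustration, not Slurm's actual topology parser: the struct and the function name `find_duplicate_switch` are assumptions.

```c
#include <string.h>

/* Illustrative record type; Slurm's real topology parser differs. */
struct switch_record {
    const char *name;
};

/* Return the index of the first record whose name repeats an earlier
 * one, or -1 if every switch name is unique. A caller would treat a
 * non-negative result as a fatal topology.conf error. */
int find_duplicate_switch(const struct switch_record *recs, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (strcmp(recs[i].name, recs[j].name) == 0)
                return j;
    return -1;
}
```

With the config above, the second `SwitchName=s1` line would be flagged and the daemon would refuse to start rather than silently keeping one definition.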
  5. 17 Aug, 2011 1 commit
  6. 16 Aug, 2011 1 commit
  7. 12 Aug, 2011 2 commits
  8. 11 Aug, 2011 2 commits
  9. 10 Aug, 2011 3 commits
  10. 09 Aug, 2011 3 commits
    • Morris Jette's avatar
      Cray srun wrapper, map --share and --exclusive options · 08538cb8
      Morris Jette authored
This change applies only to Cray systems and only when the srun
wrapper for aprun is used. Map --exclusive to "-F exclusive" and
--share to "-F share". Note this does not consider the partition's
Shared configuration, so it is an imperfect mapping of options.
      08538cb8
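The option mapping described above can be sketched as a small helper; the function name is assumed for illustration and is not the wrapper's actual code.

```c
#include <stddef.h>

/* Map srun's --exclusive / --share flags to the aprun -F argument.
 * Returns NULL when neither flag is set, in which case the wrapper
 * would add no -F option at all. */
const char *map_share_option(int exclusive, int share)
{
    if (exclusive)
        return "-F exclusive";
    if (share)
        return "-F share";
    return NULL;
}
```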
    • Morris Jette's avatar
      Cray DOWN node will be treated as transient condition · 493aa97a
      Morris Jette authored
A node DOWN to ALPS will be marked DOWN to SLURM only after reaching
SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This
change makes SLURM's handling of the node DOWN state more consistent
with ALPS. This change affects only Cray systems.
      493aa97a
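The timeout behavior above amounts to a simple state decision; a sketch under assumed names (the enum and function are illustrative, not Slurm's internal API):

```c
#include <time.h>

typedef enum { NODE_NO_RESPOND, NODE_DOWN } node_state_t;

/* Decide the Slurm-visible state of a node that ALPS reports DOWN:
 * treat the outage as transient (NO_RESPOND) until SlurmdTimeout
 * seconds have elapsed, then mark the node DOWN. */
node_state_t alps_down_state(time_t down_since, time_t now,
                             time_t slurmd_timeout)
{
    if (now - down_since >= slurmd_timeout)
        return NODE_DOWN;
    return NODE_NO_RESPOND;
}
```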
    • Morris Jette's avatar
Fix node state acctg for Cray. · acfa9aca
      Morris Jette authored
      Fix the node state accounting to be consistent with the node state
      set by ALPS.
      acfa9aca
  11. 05 Aug, 2011 2 commits
  12. 04 Aug, 2011 2 commits
    • Morris Jette's avatar
      Require SchedulerTimeSlice be at least 5 secs · c9b0eafe
      Morris Jette authored
      Require SchedulerTimeSlice configuration parameter to be at least 5 seconds
      to avoid thrashing slurmd daemon.
      Addresses Cray bug 774692
      c9b0eafe
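The new floor can be sketched as a validation step at configuration-read time; the function name and message text here are assumptions, not Slurm's actual code.

```c
#include <stdio.h>

#define SCHED_TIMESLICE_MIN 5  /* seconds; the floor this commit adds */

/* Return 1 if the configured SchedulerTimeSlice is acceptable,
 * 0 if it is below the minimum and should be rejected. */
int validate_timeslice(int secs)
{
    if (secs < SCHED_TIMESLICE_MIN) {
        fprintf(stderr,
                "SchedulerTimeSlice=%d is below the %d second minimum\n",
                secs, SCHED_TIMESLICE_MIN);
        return 0;
    }
    return 1;
}
```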
    • Morris Jette's avatar
      Job step now gets all of job's GRES by default · 1078426e
      Morris Jette authored
Change in GRES behavior for job steps: a job step's default generic
resource allocation will be set to that of the job. If a job step's --gres
value is set to "none" then none of the generic resources which have been
allocated to the job will be allocated to the job step.
Add srun environment variable SLURM_STEP_GRES to set the default --gres
value for a job step.
      1078426e
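The precedence described above (inherit by default, empty on "none", otherwise the step's own value) can be sketched as below; `resolve_step_gres` is a hypothetical name for illustration.

```c
#include <string.h>
#include <stddef.h>

/* Resolve a step's generic-resource request: with no --gres (and no
 * SLURM_STEP_GRES) the step inherits the job's GRES; an explicit
 * "none" yields an empty allocation; otherwise the step's own value
 * is used. */
const char *resolve_step_gres(const char *step_gres, const char *job_gres)
{
    if (step_gres == NULL)
        return job_gres;
    if (strcmp(step_gres, "none") == 0)
        return "";
    return step_gres;
}
```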
  13. 03 Aug, 2011 2 commits
  14. 02 Aug, 2011 2 commits
  15. 01 Aug, 2011 2 commits
  16. 29 Jul, 2011 1 commit
  17. 28 Jul, 2011 1 commit
    • Morris Jette's avatar
      Add ability to limit job's leaf switch count · 08e9f248
      Morris Jette authored
Add the ability for a user to limit the number of leaf switches in a job's
allocation using the --switches option of salloc, sbatch and srun. There is
also a new SchedulerParameters value of max_switch_wait, which a SLURM
administrator can use to set a maximum job delay and prevent a user job
from blocking lower priority jobs for too long. Based on work by Rod
Schultz, Bull.
      08e9f248
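The administrative cap amounts to clamping the user's requested wait; a minimal sketch with an assumed function name:

```c
/* Clamp the user's requested maximum wait for the desired leaf-switch
 * count to the administrator's max_switch_wait limit, so one job
 * cannot block lower-priority work indefinitely. */
int effective_switch_wait(int requested_secs, int max_switch_wait_secs)
{
    if (requested_secs > max_switch_wait_secs)
        return max_switch_wait_secs;
    return requested_secs;
}
```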
  18. 22 Jul, 2011 2 commits
  19. 21 Jul, 2011 1 commit
    • Morris Jette's avatar
      Restore node configuration information on slurmctld restart · f729d72b
      Morris Jette authored
Restore node configuration information (CPUs, memory, etc.) for powered
down nodes when the slurmctld daemon restarts, rather than waiting for the
node to be restored to service and getting the information from the node.
(NOTE: Only relevant if FastSchedule=0.)
      f729d72b
  20. 20 Jul, 2011 1 commit
    • Morris Jette's avatar
      Fix select/cons_res task distribution bug · b70cc235
      Morris Jette authored
      Fix bug in select/cons_res task distribution logic when tasks-per-node=0.
      Eliminates misleading slurmctld message
      "error:  cons_res: _compute_c_b_task_dist oversubscribe."
      This problem was introduced in SLURM version 2.2.5 in order to fix
      a task distribution problem when cpus_per_task=0. Patch from Rod Schultz, Bull.
      b70cc235
  21. 14 Jul, 2011 1 commit
    • Morris Jette's avatar
Set environment variables with job memory limits · dbd292c7
      Morris Jette authored
      Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both
      interactive (salloc) and batch jobs if the job has a memory limit. For Cray
      systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the
      memory limit.
      dbd292c7
  22. 13 Jul, 2011 1 commit
    • Morris Jette's avatar
Limit batch jobs in front-end mode to a single CPU · 344daaa1
      Morris Jette authored
      For front-end configurations (Cray and IBM BlueGene), bind each batch job to
      a unique CPU to limit the damage which a single job can cause. Previously any
      single job could use all CPUs causing problems for other jobs or system
      daemons. This addresses a problem reported by Steve Trofinoff, CSCS.
      344daaa1
  23. 12 Jul, 2011 1 commit