1. 23 Mar, 2016 3 commits
    • task/cgroup: Fix for task binding anomaly · efa83a02
      Morris Jette authored
      Here's how to reproduce on smd-server with 2 sockets, 6 cores per
      socket and 2 threads per core: run the following command line
      3 times in quick succession (so all are active at the same time):
      srun --cpus-per-task=4 -m block sleep 30
      What was happening is that the first job would be allocated cores 0+1.
      The second job would be allocated cores 2+3.
      The third job would test use of cores 0-3, then exit because the
       job only needs 4 CPUs. The resulting core binding would include
       NO CPUs. The new logic tests that the core being considered for
       use actually has some resources available to the job before
       incrementing the counter that is compared against the needed
       CPU count.
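The counting change described in this commit can be illustrated with a simplified Python model. This is a sketch, not the task/cgroup source: the function `bind_cores`, the `check_available` switch, the one-count-per-core granularity, and the assumption that the third job was allocated cores 4+5 are all hypothetical.

```python
def bind_cores(allowed_cores, total_cores, needed, check_available):
    """Walk the cores in order and return the ones the task is bound to.

    check_available=False models the old logic: every core tested bumps
    the counter, whether or not the job can use it.
    check_available=True models the new logic: only cores with resources
    available to the job bump the counter.
    """
    bound = []
    counted = 0
    for core in range(total_cores):
        if counted >= needed:
            break  # enough cores counted, stop searching
        available = core in allowed_cores
        if check_available and not available:
            continue  # new logic: do not count cores the job cannot use
        counted += 1
        if available:
            bound.append(core)
    return bound


# Assumption: jobs 1 and 2 hold cores 0+1 and 2+3, job 3 was given cores 4+5.
third_job_cores = {4, 5}
old_binding = bind_cores(third_job_cores, 12, 4, check_available=False)
new_binding = bind_cores(third_job_cores, 12, 4, check_available=True)
print("old:", old_binding)
print("new:", new_binding)
```

In this model the old path counts cores 0-3 and stops with an empty binding, while the new path skips them and binds the cores that actually belong to the job.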
    • task/cgroup: Fix for task layout logic when disabled resources. · 6c14b969
      Morris Jette authored
      Specifically add the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag when
      loading configuration from HWLOC library. Previous logic in
      task/cgroup did not do this, which was different behaviour from
      how slurmd gets configuration information. Here's the HWLOC
      documentation:
      HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM
      Detect the whole system, ignore reservations and offline settings.
      Gather all resources, even if some were disabled by the administrator.
      For instance, ignore Linux Cpusets and gather all processors and memory
      nodes, and ignore the fact that some resources may be offline.
      
      Without this flag, I was occasionally observing a bad core count, which
      resulted in the logic laying out tasks incorrectly and generating an error:
      task/cgroup: task[0] infinite loop broken while trying to provision compute elements using cyclic
      
      bug 2502
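The mismatch this commit removes can be pictured with a toy Python model. It does not call hwloc; `visible_cores` and the set of disabled cores are hypothetical, and the only point is that two components loading the same topology with different flags disagree on the core count.

```python
def visible_cores(all_cores, disabled, whole_system):
    """Return the cores a topology load would report (toy model, not hwloc)."""
    if whole_system:
        # Models HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM: report everything.
        return sorted(all_cores)
    # Without the flag, disabled or offline resources are filtered out.
    return sorted(c for c in all_cores if c not in disabled)


all_cores = set(range(12))
disabled = {10, 11}  # hypothetical cores hidden by a cpuset or taken offline

slurmd_view = visible_cores(all_cores, disabled, whole_system=True)
plugin_view_old = visible_cores(all_cores, disabled, whole_system=False)
plugin_view_new = visible_cores(all_cores, disabled, whole_system=True)

print(len(slurmd_view), len(plugin_view_old), len(plugin_view_new))
```

With the flag set on both sides, the plugin and slurmd work from the same core count, so a layout computed against one view stays valid in the other.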
    • Danny Auble's avatar
  2. 22 Mar, 2016 1 commit
  3. 21 Mar, 2016 3 commits
  4. 18 Mar, 2016 2 commits
    • Fix typo. · 886df85b
      Tim Wickberg authored
    • Fix for srun abort on SIGSTOP+SIGCONT · 1ed38f26
      Morris Jette authored
      Avoid possibly aborting an srun that gets simultaneous SIGSTOP+SIGCONT while
          creating the job step. The result of those signals is that the signal
          handler gets an argument (the signal received) of zero.
      
      Here's a log, window 1:
      $ srun hostname
      srun: Job step creation temporarily disabled, retrying
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 0
      srun: Cancelled pending job step
      
      Window 2:
      $  kill -STOP 18696 ; kill -CONT 18696
      $  kill -STOP 18696 ; kill -CONT 18696
      $  kill -STOP 18696 ; kill -CONT 18696
      ....
      
      bug 2494
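One way to picture the guard is a handler that treats a signal number of zero as spurious instead of as a request to cancel. The Python sketch below is an illustration under that assumption; the commit message does not show the actual patch, and `StepCreation` and `handle_signal` are hypothetical names. Signal 18 in the log above is SIGCONT on x86 Linux.

```python
import signal


class StepCreation:
    """Hypothetical stand-in for srun's pending job step state."""

    def __init__(self):
        self.cancelled = False
        self.seen = []

    def handle_signal(self, signum, frame=None):
        self.seen.append(signum)
        if signum == 0:
            # Not a real signal: assume it is a side effect of
            # SIGSTOP+SIGCONT arriving together, and keep retrying.
            return
        if signum == signal.SIGCONT:
            # Resumed after a stop: nothing to cancel.
            return
        # Any other signal cancels the pending job step.
        self.cancelled = True


step = StepCreation()
# Replay the tail of the log: repeated SIGCONT, then a zero.
for signum in [signal.SIGCONT] * 3 + [0]:
    step.handle_signal(signum)
cancelled_after_stop_cont = step.cancelled

step.handle_signal(signal.SIGINT)
cancelled_after_sigint = step.cancelled
print(cancelled_after_stop_cont, cancelled_after_sigint)
```

The handler is called directly here rather than installed with `signal.signal`, since a real delivery of signal number zero cannot be triggered from a script.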
  5. 17 Mar, 2016 2 commits
  6. 16 Mar, 2016 6 commits
  7. 15 Mar, 2016 5 commits
  8. 14 Mar, 2016 3 commits
  9. 11 Mar, 2016 2 commits
  10. 10 Mar, 2016 2 commits
  11. 09 Mar, 2016 2 commits
    • cray job requeue bug · fec5e03b
      Morris Jette authored
      Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
      allocated to a requeued job as non-usable on job termination.
      
      Specifically, each job has a "cleaning/cleaned" flag. Once a job
      terminates, the cleaning flag is set, then after the job node health
      check completes, the value gets set to cleaned. If the job is requeued,
      on its second (or subsequent) termination, the select/cray plugin
      is called to launch the NHC. The plugin sees the "cleaned" flag
      already set, so it logs:
      error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
      and returns, never launching the NHC. Since the termination of the
      job NHC triggers releasing job resources (CPUs, memory, and GRES),
      those resources are never released for use by other jobs.
      
      Bug 2384
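The flag lifecycle described above can be modelled in a few lines of Python. This is a sketch of the failure mode, not the select/cray plugin: the `Job` class, `job_fini`, `requeue`, and the idea that the fix resets the flag at requeue time are assumptions made for illustration.

```python
class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.clean_state = "none"  # "none" -> "cleaning" -> "cleaned"
        self.resources_held = True
        self.nhc_runs = 0


def job_fini(job):
    """Model of job termination: launch the NHC unless already 'cleaned'."""
    if job.clean_state == "cleaned":
        # Mirrors the logged error path: return without launching the NHC.
        return False
    job.clean_state = "cleaning"
    job.nhc_runs += 1
    # NHC completion marks the job cleaned and releases its resources.
    job.clean_state = "cleaned"
    job.resources_held = False
    return True


def requeue(job, reset_flag):
    """Model of a requeue; reset_flag=True is the assumed fix."""
    if reset_flag:
        job.clean_state = "none"
    job.resources_held = True  # the job runs again and holds resources


def run_twice(reset_flag):
    job = Job(1283858)
    job_fini(job)
    requeue(job, reset_flag)
    job_fini(job)
    return job


old_job = run_twice(reset_flag=False)
new_job = run_twice(reset_flag=True)
print(old_job.resources_held, new_job.resources_held)
```

Without the reset, the second termination returns early and the resources stay held; with it, the NHC runs a second time and they are released.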
    • Correctly parse nids in slurmconfgen_smw.py · 88ccc111
      David Gloe authored
      An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
      On some systems those values differ, causing the generated slurm.conf file to
      be incorrect.
      
      Bug 2532.
  12. 08 Mar, 2016 5 commits
  13. 07 Mar, 2016 1 commit
    • add additional tuning notes for mysql/mariadb · 49dc5d8d
      Tim Wickberg authored
      In particular, it seems that MariaDB has lowered the default for
      innodb_lock_wait_timeout, which can cause issues for the various
      rollup processes on systems with high job counts.
  14. 05 Mar, 2016 2 commits
  15. 04 Mar, 2016 1 commit