1. 28 Mar, 2016 3 commits
    • Danny Auble's avatar
      When a stepd is about to shut down and send its response to srun · ea470f71
      Danny Auble authored
      make the wait to return data only hit after 500 nodes and configurable
      based on the TcpTimeout value.
      ea470f71
    • Morris Jette's avatar
      task/cgroup - Fix task binding to CPUs bug · ddf6d9a4
      Morris Jette authored
      There was a subtle bug in how tasks were bound to CPUs which could result
      in an "infinite loop" error. The problem was that various socket/core/thread
      calculations were based upon the resources allocated to a step rather than
      all resources on the node, so rounding errors could occur. Consider for
      example a node with 2 sockets, 6 cores per socket and 2 threads per core.
      On the idle node, a job requesting 14 CPUs is submitted. That job would
      be allocated 4 cores on the first socket and 3 cores on the second socket.
      The old logic would get the number of sockets for the job at 2 and the
      number of cores at 7, then calculate the number of cores per socket at
      7/2 or 3 (rounding down to an integer). The logic laying out tasks
      would bind the first 3 cores on each socket to the job, then not find any
      remaining cores, report the "infinite loop" error to the user, and run
      the job without one of the expected cores. The problem gets even worse
      when some cores on the node are already allocated. In a more extreme case,
      a job might be allocated 6 cores on one socket and 1 core on a second
      socket. In that case, 3 of that job's cores would be unused.
      bug 2502
      ddf6d9a4
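The arithmetic in the commit message above can be checked with a short sketch. This is a simplified model, not the task/cgroup code: the function name `cores_bound_old` and the list-per-socket representation are assumptions made for illustration.

```python
# Simplified model (assumption, not Slurm source): each list entry is the
# number of cores a job was allocated on one socket of the node.

def cores_bound_old(alloc_per_socket):
    """Old approach: derive cores-per-socket from the job's own allocation."""
    sockets = sum(1 for cores in alloc_per_socket if cores > 0)
    total_cores = sum(alloc_per_socket)
    cores_per_socket = total_cores // sockets  # integer division rounds down
    # No more than cores_per_socket cores get bound on any one socket.
    return sum(min(cores, cores_per_socket) for cores in alloc_per_socket)

for alloc in ([4, 3], [6, 1]):
    needed = sum(alloc)
    bound = cores_bound_old(alloc)
    print(alloc, "needed:", needed, "bound:", bound, "unused:", needed - bound)
```

With the 4+3 allocation one core is left unbound, and with the 6+1 allocation three are, matching the figures in the message.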
    • Morris Jette's avatar
      Fix for srun signal handling threading problem · c8d36dba
      Morris Jette authored
      This is a revision to commit 1ed38f26.
      The root problem is that a pthread is passed an argument which is
      a pointer to a variable on the stack. If that variable is overwritten,
      the signal number received will be garbage, and that bad signal
      number will be interpreted by srun, possibly aborting the request.
      c8d36dba
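The hazard described above is specific to C, but a rough Python analogue may help: a thread is handed a reference to storage the caller later reuses, versus a private copy of the value. Everything here (`run`, `slot`, the use of an `Event` to delay the thread) is hypothetical and only mimics the ordering problem; it is not how srun is written.

```python
import threading

def run(pass_by_reference):
    slot = [18]             # stands in for the stack variable holding the signal number
    go = threading.Event()  # lets us force the thread to read late
    seen = []
    arg = slot if pass_by_reference else slot[0]  # shared storage vs. private copy

    def handler():
        go.wait()  # the thread is scheduled after the caller has moved on
        seen.append(arg[0] if pass_by_reference else arg)

    worker = threading.Thread(target=handler)
    worker.start()
    slot[0] = 0  # caller reuses the storage before the thread has read it
    go.set()
    worker.join()
    return seen[0]

print(run(pass_by_reference=True))   # thread sees the overwritten value
print(run(pass_by_reference=False))  # thread sees the value it was given
```

Passing the value itself (or heap storage owned by the thread) removes the dependency on the caller's variable staying intact.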
  2. 26 Mar, 2016 1 commit
    • Morris Jette's avatar
      Revert commit efa83a02 · c1dde86c
      Morris Jette authored
      The previous commit obviously fixed a problem, but introduced a different
      set of problems. This will be pursued later, perhaps in version 16.05.
      c1dde86c
  3. 25 Mar, 2016 3 commits
  4. 24 Mar, 2016 1 commit
  5. 23 Mar, 2016 4 commits
    • Morris Jette's avatar
      gang scheduling bug fix · 5f1e78f6
      Morris Jette authored
      Fix gang scheduling resource selection bug which could prevent multiple jobs
          from being allocated the same resources. Bug was introduced in 15.08.6,
          commit 44f491b8
      5f1e78f6
    • Morris Jette's avatar
      task/cgroup: Fix for task binding anomaly · efa83a02
      Morris Jette authored
      Here's how to reproduce on smd-server with 2 sockets, 6 cores per
      socket and 2 threads per core, just run the following command line
      3 times in quick succession (all active at the same time):
      srun --cpus-per-task=4 -m block sleep 30
      What was happening is the first job would be allocated cores 0+1.
      The second job would be allocated cores 2+3.
      The third job would test use of cores 0-3 then exit because the
      job only needs 4 CPUs. The resulting core binding would include
      NO CPUs. The new logic tests that the core being considered for
      use actually has some resources available to the job before
      updating the counter which is being tested against the needed
      CPU counter.
      efa83a02
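A toy model of the counter problem described above, assuming 2 threads per core and a third job whose cores are 4 and 5. The function `bind_cores` and its parameters are invented for illustration, and the model does not try to reproduce exactly which cores the real code examined before stopping.

```python
THREADS_PER_CORE = 2  # matches the node described in the commit message

def bind_cores(total_cores, job_cores, needed_cpus, check_available):
    """Walk the node's cores until enough CPUs have been counted.

    check_available=False models the old behaviour (count every core
    examined); True models the new behaviour (count only cores that
    have resources available to the job).
    """
    bound, counted = [], 0
    for core in range(total_cores):
        if counted >= needed_cpus:
            break
        available = core in job_cores
        if check_available and not available:
            continue  # new logic: skip the core without touching the counter
        counted += THREADS_PER_CORE
        if available:
            bound.append(core)
    return bound

print(bind_cores(12, {4, 5}, 4, check_available=False))  # old: empty binding
print(bind_cores(12, {4, 5}, 4, check_available=True))   # new: cores 4 and 5
```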
    • Morris Jette's avatar
      task/cgroup: Fix for task layout logic with disabled resources. · 6c14b969
      Morris Jette authored
      Specifically, add the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag when
      loading configuration from the HWLOC library. Previous logic in
      task/cgroup did not do this, which was different behaviour from
      how slurmd gets configuration information. Here's the HWLOC
      documentation:
      HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM
      Detect the whole system, ignore reservations and offline settings.
      Gather all resources, even if some were disabled by the administrator.
      For instance, ignore Linux Cpusets and gather all processors and memory
      nodes, and ignore the fact that some resources may be offline.
      
      Without this flag, I was occasionally observing a bad core count, which
      resulted in the logic laying out tasks incorrectly and generating an error:
      task/cgroup: task[0] infinite loop broken while trying to provision compute elements using cyclic
      
      bug 2502
      6c14b969
  6. 21 Mar, 2016 2 commits
    • Morris Jette's avatar
      Change point where burst buffer env vars are set · 54f314e7
      Morris Jette authored
      burst_buffer/cray: Set environment variables just before starting job rather
          than at job submission time to reflect persistent buffers created or
          modified while the job is pending.
      bug 2545
      54f314e7
    • Danny Auble's avatar
      Fix deadlock issue with burst_buffer/cray when a newly created burst · dcfa6ec0
      Danny Auble authored
      buffer is found.
      
      Bug 2576
      
      What happened was a function was doing a double read lock, which isn't
      awesome to begin with, but not really horrible (if all you are doing is
      read locks anyway). The problem was that after the first lock was taken, a
      different thread was going for a write lock, so when the second
      read lock came in it created a deadlock.
      dcfa6ec0
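The sequence above can be traced with a small single-threaded model. It assumes the lock makes new readers wait behind a pending writer, which is the usual reason a recursive read lock can deadlock; the commit message does not state the lock's policy, and the class below is a toy, not Slurm's locking code.

```python
class WriterPreferringRWLock:
    """Toy model of a read/write lock that favours waiting writers (assumption)."""

    def __init__(self):
        self.readers = 0
        self.writers_waiting = 0

    def try_read(self):
        # New readers queue behind any waiting writer so writers cannot starve.
        if self.writers_waiting:
            return False
        self.readers += 1
        return True

    def try_write(self):
        # A writer needs exclusive access; with readers present it must wait.
        if self.readers:
            self.writers_waiting += 1
            return False
        return True

lock = WriterPreferringRWLock()
first = lock.try_read()    # thread A: first read lock succeeds
writer = lock.try_write()  # thread B: waits for A to release
second = lock.try_read()   # thread A again: queued behind B
print(first, writer, second)
```

Thread A cannot release its first read lock until the second one is granted, and the second is stuck behind a writer that is waiting on A, so neither thread makes progress.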
  7. 18 Mar, 2016 2 commits
    • Morris Jette's avatar
      Added SchedulerParameters option of "bf_min_prio_reserve" · 45560872
      Morris Jette authored
      Jobs below the specified threshold will not have resources reserved for them.
      bug 2565
      45560872
    • Morris Jette's avatar
      Fix for srun abort on SIGSTOP+SIGCONT · 1ed38f26
      Morris Jette authored
      Avoid possibly aborting srun that gets simultaneous SIGSTOP+SIGCONT while
          creating the job step. Such signals could result in the signal handler
          getting an argument (the signal received) of zero.
      
      Here's a log, window 1:
      $ srun hostname
      srun: Job step creation temporarily disabled, retrying
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 0
      srun: Cancelled pending job step
      
      Window 2:
      $  kill -STOP 18696 ; kill -CONT 18696
      $  kill -STOP 18696 ; kill -CONT 18696
      $  kill -STOP 18696 ; kill -CONT 18696
      ....
      
      bug 2494
      1ed38f26
  8. 17 Mar, 2016 2 commits
    • Morris Jette's avatar
      Change calculation of node's allocated CPUs · ec50cb2f
      Morris Jette authored
      Change how a node's allocated CPU count is calculated to avoid double
          counting CPUs allocated to multiple jobs at the same time.
          Previous logic would sum the maximum number of CPUs allocated by each
          partition for any time slice, which could double count CPUs allocated
          to multiple jobs. New logic ORs bitmap of allocated CPUs for every
          partition and time slice, then counts the total for a given node.
          This avoids double counting CPUs allocated to multiple jobs, but
          does not remove from the count CPUs which have been allocated to
          jobs which might be suspended by the gang scheduler (either for
          time slicing or preemption).
      ec50cb2f
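A sketch of the two calculations described above, using Python integers as CPU bitmaps for a single node. The partition names, the sample bitmaps, and both function names are made up for illustration.

```python
def popcount(bitmap):
    """Number of set bits, i.e. CPUs marked allocated."""
    return bin(bitmap).count("1")

# rows[partition] = one bitmap of allocated CPUs per time slice (sample data)
rows = {
    "part_a": [0b00001111, 0b00000011],
    "part_b": [0b00111100],
}

def alloc_cpus_old(rows):
    # Old: sum, over partitions, of the largest per-slice CPU count.
    return sum(max(popcount(b) for b in slices) for slices in rows.values())

def alloc_cpus_new(rows):
    # New: OR every bitmap together, then count the set bits once.
    merged = 0
    for slices in rows.values():
        for bitmap in slices:
            merged |= bitmap
    return popcount(merged)

print(alloc_cpus_old(rows))  # CPUs 2 and 3 are counted for both partitions
print(alloc_cpus_new(rows))  # each CPU is counted at most once
```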
    • Tim Wickberg's avatar
      Prevent uid update from corrupting assoc_hash table. · 60b58b70
      Tim Wickberg authored
      The uid is used as part of the hash function, so the old reference must
      be removed and the hash recalculated whenever the uid may change. Otherwise
      _delete_assoc_hash will not find the entry again when the association is
      removed, causing slurmctld to segfault.
      
      Bug 2560.
      60b58b70
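A minimal illustration of why the key must not change while an entry sits in the table. The hash function, table size, and helper names below are hypothetical; only the remove, update, re-insert pattern reflects the commit message.

```python
HASH_SIZE = 8  # arbitrary size for the sketch

class Assoc:
    def __init__(self, assoc_id, uid):
        self.assoc_id = assoc_id
        self.uid = uid

def bucket(assoc):
    # Hypothetical hash: the uid contributes to the bucket index.
    return (assoc.assoc_id + assoc.uid) % HASH_SIZE

table = [[] for _ in range(HASH_SIZE)]

def add(assoc):
    table[bucket(assoc)].append(assoc)

def delete(assoc):
    chain = table[bucket(assoc)]
    if assoc in chain:
        chain.remove(assoc)
        return True
    return False  # looked in the wrong bucket: stale entry stays behind

def set_uid_unsafe(assoc, uid):
    assoc.uid = uid  # entry is now filed under an outdated bucket

def set_uid_safe(assoc, uid):
    delete(assoc)    # remove under the old uid
    assoc.uid = uid
    add(assoc)       # re-insert under the new uid

a = Assoc(1, 1000); add(a); set_uid_unsafe(a, 1001)
b = Assoc(2, 1000); add(b); set_uid_safe(b, 1001)
unsafe_found, safe_found = delete(a), delete(b)
print(unsafe_found, safe_found)
```

In C the stale entry left behind by the unsafe path would be a dangling pointer once the association is freed, which is consistent with the segfault the message reports.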
  9. 16 Mar, 2016 8 commits
  10. 15 Mar, 2016 2 commits
  11. 14 Mar, 2016 2 commits
  12. 12 Mar, 2016 1 commit
  13. 11 Mar, 2016 3 commits
  14. 10 Mar, 2016 5 commits
    • Morris Jette's avatar
      cray job requeue bug · 536c8451
      Morris Jette authored
      Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
      allocated to a requeued job as non-usable on job termination.
      
      Specifically, each job has a "cleaning/cleaned" flag. Once a job
      terminates, the cleaning flag is set, then after the job node health
      check completes, the value gets set to cleaned. If the job is requeued,
      on its second (or subsequent) termination, the select/cray plugin
      is called to launch the NHC. The plugin sees the "cleaned" flag
      already set, then logs:
      error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
      and returns, never launching the NHC. Since the termination of the
      job NHC triggers releasing job resources (CPUs, memory, and GRES),
      those resources are never released for use by other jobs.
      
      Bug 2384
      536c8451
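The flag sequence described above can be modelled as a tiny state machine. The `Job` class and function names are invented, and resetting the flags on requeue is only an assumed remedy used to contrast the two outcomes; the commit message does not say how the plugin was actually changed.

```python
class Job:
    def __init__(self):
        self.cleaning = False
        self.cleaned = False
        self.resources_held = False

def start(job):
    job.resources_held = True

def job_fini(job):
    """Termination hook, modelled on the behaviour in the commit message."""
    if job.cleaned:
        return False  # "Cleaned flag already set": NHC is never launched
    job.cleaning = True
    # ... node health check runs and completes ...
    job.cleaning = False
    job.cleaned = True
    job.resources_held = False  # NHC completion releases CPUs, memory, GRES
    return True

def requeue(job, reset_flags):
    if reset_flags:  # assumed fix, for illustration only
        job.cleaning = job.cleaned = False

def resources_leaked_after_requeue(reset_flags):
    job = Job()
    start(job); job_fini(job)        # first run terminates normally
    requeue(job, reset_flags)
    start(job); job_fini(job)        # second termination
    return job.resources_held

print(resources_leaked_after_requeue(reset_flags=False))  # resources stay held
print(resources_leaked_after_requeue(reset_flags=True))   # resources released
```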
    • David Gloe's avatar
      Correctly parse nids in slurmconfgen_smw.py · e050806e
      David Gloe authored
      An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
      On some systems those values differ, causing the generated slurm.conf file to
      be incorrect.
      
      Bug 2532.
      e050806e
    • Bill Brophy's avatar
      Fix route/topology plugin to prevent segfault in sbcast. · 0dfc924c
      Bill Brophy authored
      route_p_split_hostlist was not thread-safe, and would cause
      one of several segfaults depending on where in the initialization
      code each thread was.
      
      Bug 2495.
      0dfc924c
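One common way to make lazy initialization safe when several threads call in at once is to guard it with a lock. The sketch below shows that general pattern only; the names (`get_switch_table`, `_build_switch_table`) and the table contents are hypothetical, and the commit message does not say which approach the actual fix took.

```python
import threading

_init_lock = threading.Lock()
_switch_table = None
init_calls = 0

def _build_switch_table():
    # Stand-in for expensive, non-reentrant initialization work.
    global init_calls
    init_calls += 1
    return {"switch0": ["node1", "node2"]}

def get_switch_table():
    global _switch_table
    with _init_lock:  # only one thread at a time may check and initialize
        if _switch_table is None:
            _switch_table = _build_switch_table()
    return _switch_table

threads = [threading.Thread(target=get_switch_table) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(init_calls)
```

Because the check and the assignment happen under the same lock, no thread can observe a half-built table.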
    • Tim Wickberg's avatar
      Fix displayed value for RoutePlugin. · db8491f1
      Tim Wickberg authored
      Was incorrectly displaying "(null)" even when loaded successfully.
      db8491f1
    • Morris Jette's avatar
      Add NEWS for commit 3bb2e602 · a0be0dc5
      Morris Jette authored
      a0be0dc5
  15. 09 Mar, 2016 1 commit
    • Morris Jette's avatar
      cray job requeue bug · fec5e03b
      Morris Jette authored
      Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
      allocated to a requeued job as non-usable on job termination.
      
      Specifically, each job has a "cleaning/cleaned" flag. Once a job
      terminates, the cleaning flag is set, then after the job node health
      check completes, the value gets set to cleaned. If the job is requeued,
      on its second (or subsequent) termination, the select/cray plugin
      is called to launch the NHC. The plugin sees the "cleaned" flag
      already set, then logs:
      error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
      and returns, never launching the NHC. Since the termination of the
      job NHC triggers releasing job resources (CPUs, memory, and GRES),
      those resources are never released for use by other jobs.
      
      Bug 2384
      fec5e03b