1. 12 Apr, 2016 1 commit
  2. 11 Apr, 2016 3 commits
  3. 09 Apr, 2016 1 commit
    • backfill scheduling enhancement · e62a9270
      Morris Jette authored
      When determining when a pending job will be able to start, rather
        than removing one running job at a time and then testing whether
        the pending job can be scheduled, remove multiple running jobs
        that all end at about the same time before testing. This reduces
        the number of calls to the job placement logic, which is time
        consuming.
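      The idea, as a compilable sketch (all names and the END_WINDOW
      value are hypothetical, not Slurm's internal API): free whole
      batches of running jobs whose end times fall within a window,
      then run the expensive placement test once per batch.

        #include <stdio.h>
        #include <stdbool.h>
        #include <time.h>

        #define END_WINDOW 60  /* seconds; jobs ending this close
                                * together form one batch (assumed) */

        /* Stand-in for releasing one running job's resources. */
        static void release_resources(int job_id) {
            printf("released job %d\n", job_id);
        }

        /* Stand-in for the costly placement test of the pending job. */
        static bool try_place_pending_job(void) {
            static int calls = 0;
            return ++calls >= 2;  /* pretend it fits after 2 batches */
        }

        int main(void) {
            /* running jobs sorted by end time (seconds from now) */
            int    job_id[] = { 101, 102, 103, 104 };
            time_t end[]    = { 100, 130, 400, 420 };
            int n = 4, i = 0;

            while (i < n) {
                time_t batch_end = end[i] + END_WINDOW;
                /* Remove every job ending inside the window before
                 * testing, so placement runs once per batch rather
                 * than once per removed job. */
                while (i < n && end[i] <= batch_end) {
                    release_resources(job_id[i]);
                    i++;
                }
                if (try_place_pending_job()) {
                    printf("pending job can start in ~%ld seconds\n",
                           (long) batch_end);
                    return 0;
                }
            }
            printf("no start time found\n");
            return 0;
        }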
  4. 08 Apr, 2016 1 commit
  5. 07 Apr, 2016 2 commits
  6. 06 Apr, 2016 7 commits
  7. 05 Apr, 2016 1 commit
    • Fix backfill scheduler race condition · d8b18ff8
      Morris Jette authored
      Fix backfill scheduler race condition that could cause invalid pointer in
          select/cons_res plugin. Bug introduced in 15.08.9, commit:
          efd9d35e
      
      The scenario is as follows:
      1. Backfill scheduler is running, then releases locks
      2. Main scheduling loop starts a job "A"
      3. Backfill scheduler resumes, finds job "A" in its queue and
         resets its partition pointer.
      4. Job "A" completes and tries to remove its resource allocation
         record from the select/cons_res data structure, but fails to
         find it because it is looking in the table for the wrong
         partition.
      5. Job "A" record gets purged from slurmctld
      6. The select/cons_res plugin attempts to operate on the resource
         allocation data structure, finds a pointer into the now-purged
         data structure of job "A", and aborts or gets SEGV
      Bug 2603
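      A minimal sketch of the defensive pattern behind such a fix,
      assuming hypothetical names (job_record_t, bf_job_still_pending);
      slurmctld's real structures differ. After re-acquiring its locks,
      backfill must re-validate a queued job before touching its cached
      partition pointer:

        #include <stdbool.h>
        #include <stddef.h>

        typedef enum { JOB_PENDING, JOB_RUNNING, JOB_COMPLETE } job_state_t;

        typedef struct {
            int job_id;
            job_state_t state;
            void *part_ptr;        /* partition this job is tied to */
        } job_record_t;

        /* The main scheduler may have started the job while backfill
         * held no locks (step 2 above); if so, resetting part_ptr
         * would leave select/cons_res searching the wrong table. */
        static bool bf_job_still_pending(const job_record_t *job)
        {
            return job != NULL && job->state == JOB_PENDING;
        }

        int main(void)
        {
            job_record_t job_a = { 1234, JOB_RUNNING, NULL };
            if (bf_job_still_pending(&job_a))
                job_a.part_ptr = NULL;  /* reset only while pending */
            return 0;
        }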
  8. 04 Apr, 2016 2 commits
  9. 02 Apr, 2016 2 commits
  10. 01 Apr, 2016 1 commit
    • Rename "Shared" to "OverSubscribe" · 5fe0915e
      Morris Jette authored
      Rename partition configuration from "Shared" to "OverSubscribe". Rename
          salloc, sbatch, srun option from "--shared" to "--oversubscribe". The old
          options will continue to function. Output field names also changed in
          scontrol, sinfo, squeue, and sview.
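      A minimal before/after example (partition name, node list, and
      job command are illustrative):

        # slurm.conf -- old form, still accepted:
        #   PartitionName=debug Nodes=tux[1-4] Shared=FORCE:2
        PartitionName=debug Nodes=tux[1-4] OverSubscribe=FORCE:2

        # Command line -- old option, still accepted: --shared
        srun --oversubscribe -N2 my_app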
  11. 31 Mar, 2016 2 commits
  12. 30 Mar, 2016 5 commits
  13. 29 Mar, 2016 1 commit
  14. 28 Mar, 2016 4 commits
    • 8ee976b4
      Danny Auble authored
    • When a stepd is about to shut down and send its response to srun · ea470f71
      Danny Auble authored
      make the wait to return data take effect only beyond 500 nodes,
      and make it configurable based on the TcpTimeout value.
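      A rough sketch of the intended behavior only; the threshold name
      and the scaling rule here are assumptions, not the actual
      implementation:

        #include <stdio.h>

        #define MANY_NODES 500  /* below this, reply without waiting */

        /* Wait applied before a stepd returns data to srun, scaled
         * from the TcpTimeout value configured in slurm.conf. */
        static int stepd_reply_wait(int node_cnt, int tcp_timeout)
        {
            if (node_cnt <= MANY_NODES)
                return 0;           /* small steps: no added wait */
            return tcp_timeout;     /* large steps: follow config */
        }

        int main(void)
        {
            printf("wait = %d sec\n", stepd_reply_wait(1000, 2));
            return 0;
        }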
    • task/cgroup - Fix task binding to CPUs bug · ddf6d9a4
      Morris Jette authored
      There was a subtle bug in how tasks were bound to CPUs which could
      result in an "infinite loop" error. The problem was that various
      socket/core/thread calculations were based upon the resources
      allocated to a step rather than all resources on the node, and
      rounding errors could occur. Consider for example a node with 2
      sockets, 6 cores per socket and 2 threads per core. On the idle
      node, a job requesting 14 CPUs is submitted. That job would be
      allocated 4 cores on the first socket and 3 cores on the second
      socket. The old logic would get the number of sockets for the job
      at 2 and the number of cores at 7, then calculate the number of
      cores per socket at 7/2 or 3 (rounding down to an integer). The
      logic laying out tasks would bind the first 3 cores on each socket
      to the job, then not find any remaining cores, report the
      "infinite loop" error to the user, and run the job without one of
      the expected cores. The problem gets even worse when there are
      already allocated cores on a node. In a more extreme case, a job
      might be allocated 6 cores on one socket and 1 core on a second
      socket. In that case, 3 of that job's cores would be unused.
      bug 2502
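      The arithmetic above as a standalone demo (not Slurm code): the
      node has 2 sockets x 6 cores x 2 threads, and the job holds 4
      cores on socket 0 and 3 on socket 1.

        #include <stdio.h>

        int main(void)
        {
            int alloc_cores[2] = { 4, 3 };  /* per-socket allocation */
            int sockets = 2;
            int total = alloc_cores[0] + alloc_cores[1];     /* 7 */

            /* Old logic: cores-per-socket from the step's resources */
            int per_socket = total / sockets;  /* 7/2 = 3, rounds down */
            printf("old logic binds %d of %d cores\n",
                   per_socket * sockets, total);             /* 6 of 7 */

            /* Fix: walk each socket's actual allocation instead */
            int bound = 0;
            for (int s = 0; s < sockets; s++)
                bound += alloc_cores[s];
            printf("per-socket walk binds %d of %d cores\n",
                   bound, total);                            /* 7 of 7 */
            return 0;
        }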
    • Fix for srun signal handling threading problem · c8d36dba
      Morris Jette authored
      This is a revision to commit 1ed38f26
      The root problem is that a pthread is passed an argument which is
      a pointer to a variable on the stack. If that variable is
      over-written, the signal number received will be garbage, and that
      bad signal number may cause srun to abort the request.
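      The bug class in miniature (illustrative, not srun's code): the
      fix is to hand the thread the signal value itself rather than a
      pointer to a stack variable that may be overwritten.

        #include <pthread.h>
        #include <stdint.h>
        #include <stdio.h>

        static void *handle_signal(void *arg)
        {
            /* Buggy pattern: int sig = *(int *) arg; the parent's
             * stack variable may already hold garbage by now. */
            int sig = (int)(intptr_t) arg;  /* value, not pointer */
            printf("handling signal %d\n", sig);
            return NULL;
        }

        int main(void)
        {
            int signum = 2;  /* e.g. SIGINT caught by srun */
            pthread_t tid;
            /* Copy the value into the argument itself so the thread
             * never dereferences the parent's stack. */
            pthread_create(&tid, NULL, handle_signal,
                           (void *)(intptr_t) signum);
            pthread_join(tid, NULL);
            return 0;
        }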
  15. 26 Mar, 2016 1 commit
    • Revert commit efa83a02 · c1dde86c
      Morris Jette authored
      The previous commit obviously fixed a problem, but introduced a different
      set of problems. This will be pursued later, perhaps in version 16.05.
  16. 25 Mar, 2016 3 commits
  17. 24 Mar, 2016 1 commit
  18. 23 Mar, 2016 2 commits
    • gang scheduling bug fix · 5f1e78f6
      Morris Jette authored
      Fix gang scheduling resource selection bug which could prevent multiple jobs
          from being allocated the same resources. Bug was introduced in 15.08.6,
          commit 44f491b8
    • task/cgroup: Fix for task binding anomaly · efa83a02
      Morris Jette authored
      Here's how to reproduce on smd-server with 2 sockets, 6 cores per
      socket and 2 threads per core: just run the following command line
      3 times in quick succession (all active at the same time):
      srun --cpus-per-task=4 -m block sleep 30
      What was happening is that the first job would be allocated cores
      0+1 and the second job cores 2+3. The third job would test use of
      cores 0-3, then exit the loop because the job only needs 4 CPUs.
      The resulting core binding would include NO CPUs. The new logic
      tests that the core being considered for use actually has some
      resources available to the job before updating the counter which
      is tested against the needed CPU count.
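      A sketch of that check (names and layout invented for
      illustration, not task/cgroup's real data structures):

        #include <stdbool.h>
        #include <stdio.h>

        #define NCORES        12  /* 2 sockets x 6 cores per socket */
        #define CPUS_PER_CORE  2  /* 2 threads per core */

        int main(void)
        {
            /* cores 0-3 are held by the two earlier sleep jobs */
            bool avail[NCORES] = { false, false, false, false,
                                   true,  true,  true,  true,
                                   true,  true,  true,  true };
            int needed_cpus = 4;  /* srun --cpus-per-task=4 */
            int counted = 0;

            for (int c = 0; c < NCORES && counted < needed_cpus; c++) {
                /* Old logic counted cores 0-3 toward the 4 needed
                 * CPUs even though the job could not use them, so the
                 * third job ended up with an empty CPU mask. The fix:
                 * check availability before updating the counter. */
                if (!avail[c])
                    continue;
                printf("bind core %d\n", c);
                counted += CPUS_PER_CORE;
            }
            return 0;
        }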