1. 02 Dec, 2016 1 commit
  2. 01 Dec, 2016 1 commit
  3. 30 Nov, 2016 2 commits
    • cray/burst_buffer - Increase timer · b4763c75
      Morris Jette authored
      cray/burst_buffer - Increase time to synchronize operations between threads
          from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
          This should fix a race condition between a thread performing a buffer
          creation (setup) and a thread looking for unexpected buffers. If a
          buffer is found during the time window allowed for creation, its
          space will be counted twice: first by the status checking thread
          and second by the thread doing the creation. The deallocation only
          happens once, so the used space information can be left with an
          invalid value.
      bug 3295
      b4763c75
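The window described in b4763c75 can be illustrated with a minimal sketch. This is not the burst buffer plugin's code: `struct pool_usage`, `creation_in_progress()` and `status_thread_count()` are hypothetical names, and the sketch assumes the status thread simply skips any buffer whose creation started within the window.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the plugin's space accounting. */
struct pool_usage {
	uint64_t used_space;
};

/* True if a buffer whose setup began at create_start may still be
 * under creation by another thread at time now. */
static bool creation_in_progress(time_t create_start, time_t now,
				 int window_sec)
{
	return difftime(now, create_start) <= (double) window_sec;
}

/* Status thread: count the space of a buffer it did not expect,
 * unless a creating thread is assumed to be about to count it.
 * Returns true if the space was added here. */
bool status_thread_count(struct pool_usage *pool, uint64_t buf_size,
			 time_t create_start, time_t now, int window_sec)
{
	if (creation_in_progress(create_start, now, window_sec))
		return false;	/* the creating thread will count it */
	pool->used_space += buf_size;
	return true;
}
```

With a setup that takes 17 seconds, a 5 second window lets the status thread count the buffer before the creating thread does, while a 60 second window does not.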
    • sbcast - prevent segfault in slurmd from multiple zlib compressed transfers · 8c5765c9
      Tim Wickberg authored
      static variable means multiple active decompression streams will corrupt
      zlib's internal state, which can lead to a segfault.
      
      Bug 3299.
      8c5765c9
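The hazard named in 8c5765c9 is general to any function that keeps per-stream state in a static variable. The toy sketch below does not use zlib and is not the slurmd code. `feed_shared()`, `feed_stream()` and `struct stream_ctx` are hypothetical, with a byte counter standing in for the per-stream state that a real decompressor keeps (zlib, for example, keeps it in a `z_stream`).

```c
#include <stddef.h>

/* Hypothetical per-transfer state; a real decompressor holds far more. */
struct stream_ctx {
	unsigned long bytes_seen;
};

/* Problem pattern: one hidden static shared by every caller, so two
 * interleaved transfers overwrite each other's state. */
unsigned long feed_shared(size_t len)
{
	static unsigned long bytes_seen;

	bytes_seen += len;
	return bytes_seen;
}

/* Safer pattern: the caller owns one context per active transfer. */
unsigned long feed_stream(struct stream_ctx *ctx, size_t len)
{
	ctx->bytes_seen += len;
	return ctx->bytes_seen;
}
```

Two transfers fed alternately through `feed_stream()` each keep their own running total, whereas `feed_shared()` mixes them into one.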
  4. 29 Nov, 2016 3 commits
  5. 28 Nov, 2016 5 commits
  6. 22 Nov, 2016 7 commits
    • Added SchedulingParameters option of "bf_job_part_count_reserve" · 209822a8
      Morris Jette authored
      Added SchedulingParameters option of "bf_job_part_count_reserve". Jobs below
          the specified threshold will not have resources reserved for them.
      bug 3275
      209822a8
    • Make it so we don't purge job start messages until after we purge step · 178a929b
      Danny Auble authored
      messages.  Hopefully this will reduce the number of messages lost when
      filling up memory when the database/DBD is down.
      178a929b
    • Correct malloc data type · a12e1a1c
      Morris Jette authored
      sched/backfill plugin: Make malloc match data type (defined as uint32_t and
          allocated as int). No failures observed, but if type "int" is smaller than
          "uint32_t", it could result in an invalid memory reference.
      a12e1a1c
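One common way to keep an allocation size tied to the declared element type, as a12e1a1c requires, is to take `sizeof` of the dereferenced pointer. The sketch below uses standard `calloc()` rather than Slurm's own allocation wrappers, and `alloc_counters()` is a hypothetical helper, not a function from the backfill plugin.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n zeroed uint32_t counters.
 * The caller releases the array with free(). */
uint32_t *alloc_counters(size_t n)
{
	uint32_t *counters;

	/* sizeof(*counters) follows the declared type of the pointer, so
	 * the element size cannot drift away from uint32_t the way a
	 * hard-coded sizeof(int) could. */
	counters = calloc(n, sizeof(*counters));
	return counters;
}
```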
    • Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e
      Sergey Meirovich authored
      Fix API call: slurm_job_cpus_allocated_str_on_node_id() and
      in turn slurm_job_cpus_allocated_str_on_node() to return correct
      results for anything but the first node. This was caused by missing logic
      to calculate the first bit belonging to a particular node. Lookup was always
      starting from bit 0.
      
      Bug 3266.
      0ed6488e
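The offset calculation that 0ed6488e describes as missing can be sketched as follows. The layout is an assumption made for illustration: each node owns a consecutive run of bits, one per core, and the bitmap is stored one bit per byte. `first_bit_for_node()` and `cpus_on_node()` are hypothetical names, not the Slurm API.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first bit that belongs to node_id, assuming node i owns
 * cores_per_node[i] consecutive bits and the nodes are laid out in order. */
size_t first_bit_for_node(const uint16_t *cores_per_node, size_t node_id)
{
	size_t offset = 0;

	for (size_t i = 0; i < node_id; i++)
		offset += cores_per_node[i];
	return offset;
}

/* Count the allocated cores on node_id. Starting the scan at bit 0 for
 * every node would only give the right answer for the first node. */
unsigned int cpus_on_node(const uint8_t *bitmap,
			  const uint16_t *cores_per_node, size_t node_id)
{
	size_t start = first_bit_for_node(cores_per_node, node_id);
	unsigned int count = 0;

	for (size_t i = 0; i < cores_per_node[node_id]; i++)
		count += bitmap[start + i] ? 1 : 0;
	return count;
}
```

For nodes with 4, 2 and 8 cores, the first bits are 0, 4 and 6 respectively.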
    • backfill algorithm logic · e089b63a
      Morris Jette authored
      After one second of wall time, simulate the termination of all remaining
         running jobs in order to respond in a reasonable time frame.
      bug 3275
      e089b63a
    • Modify backfill algorithm · 6008b021
      Morris Jette authored
      Modify backfill algorithm to improve performance with large numbers of
          running jobs. Group running jobs that end in a "similar" time frame using a
          time window that grows exponentially rather than linearly. The original
          window sizes were (in units of minutes):
          0, 1, 2, 3, 4, 5, 6, 7, ... minutes
          The new window sizes are:
          0.5, 1, 2, 4, 8, 16, 32, ... minutes
          This can dramatically reduce the number of instances where the very
          time-consuming "can the pending job run now" operation is executed,
          especially if there are 1000+ running jobs.
      bug 3275
      6008b021
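The effect of the window growth in 6008b021 can be estimated with a small sketch. It assumes the windows are laid end to end from the current time, which may differ from how the backfill plugin applies them. `count_windows()` is a hypothetical helper that works in seconds, so 0.5 minutes becomes 30.

```c
/* Count how many consecutive windows are needed to span horizon_sec.
 * exponential == 0: sizes 0, 60, 120, 180, ... seconds (old scheme)
 * exponential != 0: sizes 30, 60, 120, 240, ... seconds (new scheme) */
unsigned int count_windows(unsigned int horizon_sec, int exponential)
{
	unsigned int covered = 0;
	unsigned int count = 0;
	unsigned int window = exponential ? 30 : 0;

	while (covered < horizon_sec) {
		covered += window;
		count++;
		window = exponential ? window * 2 : window + 60;
	}
	return count;
}
```

Under these assumptions, a 24 hour horizon (86400 seconds) takes 55 linearly growing windows but only 12 exponentially growing ones, so the pending job is tested against far fewer groups of ending jobs.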
    • testsuite - fix job id output in test17.39 · 44241006
      Nicolas Joly authored
      44241006
  7. 21 Nov, 2016 2 commits
  8. 18 Nov, 2016 1 commit
  9. 14 Nov, 2016 1 commit
  10. 13 Nov, 2016 1 commit
  11. 11 Nov, 2016 5 commits
  12. 10 Nov, 2016 3 commits
  13. 09 Nov, 2016 2 commits
  14. 08 Nov, 2016 5 commits
    • Upgrade "scontrol reboot" logic · 861bab6c
      Morris Jette authored
      Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
          "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,
          which could lead to inconsistencies. Add "ASAP" option to "scontrol reboot"
          command that will drain a node in order to reboot it as soon as possible,
          then return it to service.
      bug 3210
      861bab6c
    • Permit cancellation of jobs in configuring state. · 6957bd9f
      Morris Jette authored
      bug 3213
      6957bd9f
    • select/linear plugin modified to better support heterogeneous clusters · 243fbb0d
      Morris Jette authored
      select/linear plugin modified to better support heterogeneous clusters when
          topology/none is also configured. Note that use of the select/cons_res
          plugin is strongly recommended for heterogeneous clusters.
          OverSubscribe=exclusive can be used if whole node allocations are
          desired.
      bug 3212
      243fbb0d
    • Alejandro Sanchez's avatar
      9e7e12dc
    • sched/backfill - avoid starting requeued job · 69af50af
      Morris Jette authored
      If a job is started by the main scheduling logic and requeued while
        the backfill scheduler has locks released, that can result in an
        invalid data structure in select/cons_res. Namely, the backfill
        scheduler's attempt to start the job would clear the job resources
        node_bitmap. That leaves a NULL pointer in the select/cons_res
        plugin, generating an abort. (That pointer is needed to clean up
        the job allocation records when the Epilog or Cray Node Health
        Check, NHC, are complete and the resources become available for
        another job.)
      bug 3230
      69af50af
  15. 07 Nov, 2016 1 commit