1. 30 Nov, 2016 2 commits
    • cray/burst_buffer - Increase timer · b4763c75
      Morris Jette authored
      cray/burst_buffer - Increase time to synchronize operations between threads
          from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
          This should fix a race condition between a thread performing a buffer
          creation (setup) and a thread looking for unexpected buffers. If a
          buffer is found during the time window allowed for creation, its
          space will be counted twice: first by the status checking thread
          and second by the thread doing the creation. The deallocation only
          happens once, so the used space information can be left with an
          invalid value.
      bug 3295
      b4763c75
    • sbcast - prevent segfault in slurmd from multiple zlib compressed transfers · 8c5765c9
      Tim Wickberg authored
      A static variable is shared across transfers, so multiple active
      decompression streams will corrupt zlib's internal state, which can lead
      to a segfault.
      
      Bug 3299.
      8c5765c9
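The failure mode described above can be illustrated with a minimal sketch. It does not use zlib or the actual sbcast code: `struct xfer_state`, `feed_shared()` and `feed_private()` are hypothetical names, and the byte counter merely stands in for the internal state that a real decompression stream carries between calls.

```c
#include <stddef.h>

/* Hypothetical per-transfer state, standing in for a zlib stream. */
struct xfer_state {
    unsigned long total_out;   /* bytes produced so far for this transfer */
};

/* Problematic pattern: one static state shared by every caller. */
unsigned long feed_shared(size_t chunk_len, int first_chunk)
{
    static struct xfer_state st;

    if (first_chunk)
        st.total_out = 0;      /* a second transfer wipes the first one's state */
    st.total_out += chunk_len;
    return st.total_out;
}

/* Safer pattern: each transfer owns its state and passes it in. */
unsigned long feed_private(struct xfer_state *st, size_t chunk_len)
{
    st->total_out += chunk_len;
    return st->total_out;
}
```

When two transfers interleave their chunks, `feed_shared()` reports totals that mix both transfers, while `feed_private()` keeps them separate. With a real decompressor the mixed state is not just a wrong count but corrupted internal pointers, which is consistent with the segfault reported here.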
  2. 29 Nov, 2016 1 commit
    • Fix SuspendExcNodes and SuspendExcParts on slurmctld SIGHUP. · bb06dd65
      Alejandro Sanchez authored
      On a reconfig, exc_node_bitmap was cleared but not rebuilt, because
      last_work_scan was declared as a local static variable in
      _do_power_work(). The fix is to make it global within the plugin and
      reinitialize it to 0 in _init_power_config().
      
      Bug 3078.
      bb06dd65
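A rough sketch of the pattern behind this fix follows. It is not the plugin source: `init_power_config()` and `do_power_work()` are simplified, hypothetical analogues of the functions named above, and a boolean stands in for exc_node_bitmap. The assumption is that the bitmap is rebuilt only when last_work_scan is 0.

```c
#include <stdbool.h>
#include <time.h>

/* File scope (not a local static) so that the init routine can reset it. */
static time_t last_work_scan = 0;
static bool exc_bitmap_built = false;   /* stands in for exc_node_bitmap */

/* Hypothetical analogue of _init_power_config(): runs on every reconfig. */
void init_power_config(void)
{
    exc_bitmap_built = false;   /* reconfig clears the exclusion bitmap */
    last_work_scan = 0;         /* force a rebuild on the next work pass */
}

/* Hypothetical analogue of _do_power_work(): reports whether the bitmap is valid. */
bool do_power_work(time_t now)
{
    if (last_work_scan == 0)
        exc_bitmap_built = true;    /* first pass since init: rebuild */
    last_work_scan = now;
    return exc_bitmap_built;
}
```

Had last_work_scan stayed a local static inside the work function, the init routine could not have reset it, and the bitmap would have remained cleared after a SIGHUP.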
  3. 28 Nov, 2016 3 commits
  4. 22 Nov, 2016 5 commits
    • Correct malloc data type · a12e1a1c
      Morris Jette authored
      sched/backfill plugin: Make malloc match data type (defined as uint32_t and
          allocated as int). No failures observed, but if type "int" were smaller
          than "uint32_t", it could result in an invalid memory reference.
      a12e1a1c
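One common way to avoid this class of mismatch is to size the allocation from the pointer itself rather than from a type name. The sketch below uses the standard `calloc()` and a hypothetical helper, `alloc_counts()`; it is not the backfill plugin code, which uses Slurm's own allocation wrappers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n zero-filled uint32_t counters. Returns NULL on failure. */
uint32_t *alloc_counts(size_t n)
{
    /* sizeof(*counts) follows the declared element type, so the
     * allocation cannot silently disagree with uint32_t. */
    uint32_t *counts = calloc(n, sizeof(*counts));

    return counts;
}
```

If the element type is later changed, the allocation size changes with it and no second edit is needed.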
    • Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e
      Sergey Meirovich authored
      Fix API call slurm_job_cpus_allocated_str_on_node_id(), and in turn
      slurm_job_cpus_allocated_str_on_node(), to return correct results for
      any node other than the first. The bug was caused by missing logic to
      calculate the first bit that belongs to a particular node; the lookup
      always started from bit 0.
      
      Bug 3266.
      0ed6488e
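The missing offset calculation can be sketched as follows, assuming a flat bitmap in which each node's bits directly follow those of the previous node. The array `cores_per_node` and the function `first_bit_for_node()` are hypothetical; the real job resources structure encodes per-node sizes differently.

```c
#include <stddef.h>

/* Return the index of the first bitmap bit that belongs to node_id,
 * given the number of bits (cores) each node contributes. */
size_t first_bit_for_node(const unsigned int *cores_per_node, size_t node_id)
{
    size_t first_bit = 0;

    /* Skip over the bits of every node that precedes node_id. */
    for (size_t i = 0; i < node_id; i++)
        first_bit += cores_per_node[i];
    return first_bit;
}
```

For nodes with 4, 8 and 2 cores, the lookups start at bits 0, 4 and 12 respectively. Always starting at bit 0 gives the right answer only for the first node, which matches the symptom described above.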
    • backfill algorithm logic · e089b63a
      Morris Jette authored
      After one second of wall time, simulate the termination of all remaining
         running jobs in order to respond in a reasonable time frame.
      bug 3275
      e089b63a
    • Modify backfill algorithm · 6008b021
      Morris Jette authored
      Modify backfill algorithm to improve performance with large numbers of
          running jobs. Group running jobs that end in a "similar" time frame using a
          time window that grows exponentially rather than linearly. The original
          window sizes were (in minutes):
          0, 1, 2, 3, 4, 5, 6, 7, ...
          The new window sizes are (in minutes):
          0.5, 1, 2, 4, 8, 16, 32, ...
          This can dramatically reduce the number of instances where the very time
          consuming "can the pending job run now" operation is executed, especially
          if there are 1000+ running jobs.
      bug 3275
      6008b021
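The two window sequences listed above can be written as small functions. This is only an illustration of the arithmetic, in seconds to avoid fractional minutes; `linear_window_secs()` and `exp_window_secs()` are hypothetical names, and the cap on the index is an arbitrary choice for this sketch, not something the commit specifies.

```c
/* Old scheme: window i is i minutes wide (0, 1, 2, 3, ...). */
unsigned long linear_window_secs(unsigned int i)
{
    return 60UL * i;
}

/* New scheme: 0.5 minutes, then doubling (0.5, 1, 2, 4, 8, ...). */
unsigned long exp_window_secs(unsigned int i)
{
    if (i > 20)
        i = 20;     /* arbitrary cap for this sketch, avoids shift overflow */
    return 30UL << i;
}
```

At index 6 the exponential scheme already groups jobs ending within a 32 minute window, whereas the linear scheme at the same index covers 6 minutes, so far fewer distinct groups are needed when many jobs are running.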
    • testsuite - fix job id output in test17.39 · 44241006
      Nicolas Joly authored
      44241006
  5. 14 Nov, 2016 1 commit
  6. 13 Nov, 2016 1 commit
  7. 11 Nov, 2016 3 commits
  8. 10 Nov, 2016 2 commits
  9. 09 Nov, 2016 2 commits
  10. 08 Nov, 2016 4 commits
  11. 07 Nov, 2016 1 commit
  12. 05 Nov, 2016 1 commit
  13. 04 Nov, 2016 2 commits
    • cray/burst_buffer - Preserve job ID · 42a90020
      Morris Jette authored
      cray/burst_buffer - Preserve job ID and don't translate to job array ID
        after slurmctld restart. Prior logic would not set array_task_id to
        NO_VAL, so all job-buffer IDs would be reported in the form
        "JobID=0_0(123)" rather than "JobID=123".
      42a90020
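The effect of leaving array_task_id unset can be sketched with a small formatter. `format_job_id()` is a hypothetical function, not the plugin's reporting code, and the value of `NO_VAL` is defined locally on the assumption that it matches the sentinel in slurm.h.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed to match Slurm's "not set" sentinel for 32-bit fields. */
#define NO_VAL 0xfffffffeU

/* Write "JobID=123" for a plain job, or "JobID=A_T(123)" for an array task. */
int format_job_id(char *buf, size_t len, uint32_t job_id,
                  uint32_t array_job_id, uint32_t array_task_id)
{
    if (array_task_id == NO_VAL)
        return snprintf(buf, len, "JobID=%u", (unsigned int) job_id);
    return snprintf(buf, len, "JobID=%u_%u(%u)",
                    (unsigned int) array_job_id,
                    (unsigned int) array_task_id,
                    (unsigned int) job_id);
}
```

With array_task_id left at 0 after a restart, the array branch is taken and job 123 is reported as "JobID=0_0(123)". Setting the field to NO_VAL selects the plain form.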
    • Burst_buffer/cray space tracking fix · 1548086f
      Morris Jette authored
      cray/burst_buffer - Internally track both allocated and unusable space.
          The reported UsedSpace in a pool is now the allocated space (previously
          it was the unusable space). Base available space on whichever value
          leaves the least free space.
      bug 3222
      1548086f
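The rule "whichever value leaves the least free space" amounts to subtracting the larger of the two tracked values from the pool total. A minimal sketch, with a hypothetical function name and a clamp at zero added as an assumption for the case where the tracked value exceeds the total:

```c
#include <stdint.h>

/* Free space in a pool, based on the larger of allocated and unusable space. */
uint64_t pool_free_space(uint64_t total, uint64_t allocated, uint64_t unusable)
{
    uint64_t used = (allocated > unusable) ? allocated : unusable;

    /* Clamp rather than wrap around if the pool is over-committed. */
    return (used >= total) ? 0 : total - used;
}
```

For a pool of 100 units with 30 allocated and 50 unusable, this yields 50 free units; with 60 allocated and 50 unusable, it yields 40.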
  14. 01 Nov, 2016 3 commits
  15. 28 Oct, 2016 1 commit
    • Fix issue in the priority/multifactor plugin where on a slurmctld restart · be924b88
      Danny Auble authored
      more time than should be allowed would be accounted for.
      
      This only happened for jobs in the completing state when the slurmctld
      was shut down.
      
      This will be improved further in 17.02: the job's end_time_exp is not
      stored, and it is needed to determine whether the job has already been
      through the decay_thread at end of job.
      
      Bug 3162
      be924b88
  16. 27 Oct, 2016 4 commits
  17. 26 Oct, 2016 4 commits