1. 23 May, 2019 4 commits
  2. 22 May, 2019 2 commits
    • Use correct rank for cloud stepd's. · e7d4d593
      Marshall Garey authored
      Job steps that run on cloud nodes and use the alias_list - in other
      words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
      directly back to the slurmctld. To make that happen, we set the parent
      rank of each stepd to -1. However, we also set the rank of each stepd to
      0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
      the slurmctld, they would tell slurmctld to clean up node 0 in the step
      allocation. So, multi-node step allocations weren't cleaning up after
      the steps completed and would cause subsequent job steps to hang. The
      step allocations would only clean up properly at the end of the job.
      
      Ensure that each stepd uses the correct rank so that job steps are
      properly cleaned up after each step completes.
      
      Bug 6467.
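      A minimal sketch of the idea, with hypothetical names rather than the
      real slurmstepd structures: when cloud stepds bypass the message tree,
      the parent rank stays -1, but each stepd should report its own node
      index instead of a hard-coded 0.
      
      /* Illustration only - hypothetical fields, not the real slurmstepd code. */
      struct stepd_ctx {
          int parent_rank;  /* -1 means: no tree, talk directly to slurmctld */
          int rank;         /* this stepd's node index within the step */
      };
      
      /* Give each cloud stepd its own node index as the rank, so the
       * REQUEST_STEP_COMPLETE it sends cleans up the right node in the
       * step allocation instead of always node 0. */
      static void init_cloud_stepd(struct stepd_ctx *ctx, int nodeid)
      {
          ctx->parent_rank = -1;
          ctx->rank = nodeid;   /* previously hard-coded to 0 */
      }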
    • Move two NEWS entries to appropriate maintenance release. · 09a7da34
      Alejandro Sanchez authored
      They were associated with these two commits:
      
      b4d7de48
      6871185a
      
      Bug 5562.
  3. 21 May, 2019 3 commits
  4. 17 May, 2019 2 commits
  5. 16 May, 2019 1 commit
    • Fix archive loading events. · 0d0f9deb
      Marshall Garey authored
      There was a syntax error in the MySQL query for inserting the event
      records into the event table, caused by commit 3d61b6aa. The syntax error was
      a semicolon in the middle of the query, for example:
      
      insert into "voyager_event_table" (time_start, time_end, node_name,
      cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
      '1539298628', 'v1', '', 'cold-start', '1017', '0',
      '1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...
      
      Bug 7025.
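      A hedged sketch in C, with hypothetical helper names, of how a
      multi-row INSERT can be assembled so that value tuples are joined by
      commas and the statement is terminated by a single semicolon, instead
      of the mid-query semicolons shown above.
      
      #include <stdio.h>
      
      /* Illustration only: build one INSERT with comma-separated value
       * tuples; a ";" is appended once after the last tuple, never between. */
      static void build_event_insert(char *buf, size_t len,
                                     const char **tuples, int ntuples)
      {
          size_t off = 0;
          off += snprintf(buf + off, len - off,
                          "insert into \"voyager_event_table\" "
                          "(time_start, time_end, node_name, cluster_nodes, "
                          "reason, reason_uid, state, tres) values ");
          for (int i = 0; i < ntuples; i++)
              off += snprintf(buf + off, len - off, "%s%s",
                              i ? ", " : "", tuples[i]);
          snprintf(buf + off, len - off, ";");
      }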
  6. 13 May, 2019 1 commit
  7. 10 May, 2019 2 commits
    • Only archive 50k records at a time. · ddd49896
      Marshall Garey authored
      Trying to archive too many records at once can result in archive files
      that are too big to read or even too big to be written. Only archive 50k
      records at a time, like we only purge 50k records at a time.
      
      Bug 6033.
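      A rough sketch of the batching idea, assuming a hypothetical helper
      that writes out records in fixed-size chunks rather than all at once:
      
      /* Hypothetical helper: writes up to 'limit' records to the current
       * archive file and returns how many it actually wrote. */
      extern int archive_next_batch(int limit);
      
      #define MAX_ARCHIVE_RECORDS 50000   /* mirror the 50k purge batch size */
      
      /* Keep archiving in fixed-size batches until a partial batch signals
       * that no more records remain, so no single file grows unbounded. */
      static void archive_in_batches(void)
      {
          int n;
          do {
              n = archive_next_batch(MAX_ARCHIVE_RECORDS);
          } while (n == MAX_ARCHIVE_RECORDS);
      }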
    • Handle duplicate archive file names. · 1e234c3d
      Marshall Garey authored
      The time period of the archive file currently depends on submit or start
      time and whether the purge period is in hours, days, or months.
      Previously, if the archive file name already existed, we would overwrite
      the old archive file with the assumption that these are duplicate
      records being archived after an archive load. However, that could result
      in lost records in a couple of ways:
      
        * If there were runaway jobs that were part of an old archive file's
        time period and are later fixed and then purged, the old file would
        be overwritten.
        * If jobs or steps are purged but there are still jobs or steps in
        that time period that are pending or running, the pending or running
        jobs and steps won't be purged. When they finish and are purged, the
        old file would be overwritten.
      
      Instead of overwriting the old file, we append a number to the file name
      to create a new file. This will also be important in an upcoming commit.
      
      Bug 6033.
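      A minimal sketch, with hypothetical names, of appending a numeric
      suffix instead of overwriting an existing archive file:
      
      #include <stdio.h>
      #include <unistd.h>
      
      /* If 'name' is taken, try "name.1", "name.2", ... until a free file
       * name is found, so an existing archive file is never overwritten. */
      static void pick_archive_name(char *out, size_t len, const char *name)
      {
          snprintf(out, len, "%s", name);
          for (int i = 1; access(out, F_OK) == 0; i++)
              snprintf(out, len, "%s.%d", name, i);
      }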
  8. 06 May, 2019 1 commit
    • Fix seff memory display overflow · bab13dfd
      Felip Moll authored
      When the tres_usage_in_max field is empty it is recorded as '' in the
      database, which leads find_tres_count_in_string() to return INFINITE64.
      Seff treats INFINITE64 as a valid value. This patch fixes that issue.
      
      Bug 6817
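      seff itself is a Perl tool; the following is only a C-flavored sketch
      of the check, assuming INFINITE64 is the sentinel returned when a TRES
      value is absent:
      
      #include <stdint.h>
      
      #define INFINITE64 0xffffffffffffffffULL   /* assumed "no value" sentinel */
      
      /* Treat INFINITE64 as "nothing recorded" instead of a real byte count,
       * so an empty tres_usage_in_max no longer inflates the memory figure. */
      static uint64_t mem_or_zero(uint64_t tres_count)
      {
          return (tres_count == INFINITE64) ? 0 : tres_count;
      }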
  9. 03 May, 2019 1 commit
  10. 02 May, 2019 2 commits
    • Fix resubmit to sibling default on fed requeue · 822fe77e
      Broderick Gardner authored
      On requeue, the origin cluster's job record is copied in order to
      resubmit the job to sibling clusters. If the job was originally
      submitted to accept the cluster's default account, partition, etc.,
      those fields have since been filled in on the origin. Here we add
      flags to indicate that those fields need to be cleared on
      resubmission to siblings.
      Bug 6064
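      A hedged sketch, using hypothetical flag and field names rather than
      the actual Slurm symbols, of marking default-filled fields so they can
      be cleared before the job is resubmitted to siblings:
      
      #include <stdint.h>
      #include <stdlib.h>
      
      /* Hypothetical flags recording which fields came from cluster defaults. */
      #define USE_DEF_ACCOUNT   0x01
      #define USE_DEF_PARTITION 0x02
      
      struct fed_job_desc {
          uint32_t default_flags;
          char *account;
          char *partition;
      };
      
      /* Before resubmitting to a sibling, clear anything the origin cluster
       * filled in from its own defaults so the sibling applies its own. */
      static void clear_origin_defaults(struct fed_job_desc *desc)
      {
          if (desc->default_flags & USE_DEF_ACCOUNT) {
              free(desc->account);
              desc->account = NULL;
          }
          if (desc->default_flags & USE_DEF_PARTITION) {
              free(desc->partition);
              desc->partition = NULL;
          }
      }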
    • Fix clearing federation cluster lock on requeue · 47909f8e
      Broderick Gardner authored
      This is a holdover from when the fed job_info list was added.
      The cluster lock has to be cleared from both the job_ptr and
      the job_info.
      Bug 6064
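      A small sketch of the point being made, with hypothetical structures
      standing in for the federated job bookkeeping: the lock has to be
      cleared in both places, not just on the job record.
      
      #include <stdint.h>
      #include <stddef.h>
      
      /* Hypothetical stand-ins, not the actual Slurm structures. */
      struct fed_job_info { uint32_t cluster_lock; };
      struct job_record   { uint32_t fed_cluster_lock;
                            struct fed_job_info *fed_info; };
      
      /* Clear the federation cluster lock in both the job record and the
       * fed job_info entry; clearing only one leaves a stale lock behind. */
      static void clear_fed_cluster_lock(struct job_record *job_ptr)
      {
          job_ptr->fed_cluster_lock = 0;
          if (job_ptr->fed_info)
              job_ptr->fed_info->cluster_lock = 0;
      }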
  11. 30 Apr, 2019 1 commit
  12. 29 Apr, 2019 5 commits
  13. 26 Apr, 2019 3 commits
  14. 24 Apr, 2019 3 commits
  15. 23 Apr, 2019 2 commits
  16. 22 Apr, 2019 1 commit
  17. 18 Apr, 2019 4 commits
  18. 16 Apr, 2019 2 commits