1. 22 May, 2019 3 commits
    • Update Elastic Computing docs with TCPTimeout info · c06b1c27
      Ben Roberts authored
      Bug 6995
      c06b1c27
    • Use correct rank for cloud stepd's. · e7d4d593
      Marshall Garey authored
      Job steps that run on cloud nodes and use the alias_list - in other
      words, SlurmctldParameters=cloud_dns is not in slurm.conf - all talk
      directly back to the slurmctld. To make that happen, we set the parent
      rank of each stepd to -1. However, we also set the rank of each stepd to
      0. This meant that when each stepd sent a REQUEST_STEP_COMPLETE RPC to
      the slurmctld, they would tell slurmctld to clean up node 0 in the step
      allocation. So, multi-node step allocations weren't cleaning up after
      the steps completed and would cause subsequent job steps to hang. The
      step allocations would only clean up properly at the end of the job.
      
      Ensure that each stepd uses the correct rank so that job steps are
      properly cleaned up after each step completes.
      
      Bug 6467.
      e7d4d593
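The effect described in e7d4d593 can be illustrated with a toy model. This is only a sketch, not slurmctld code: the function `complete_step` and the set-based bookkeeping are hypothetical, and the only assumption taken from the commit message is that each completion message names one node index to clean up.

```python
def complete_step(node_count, ranks_reported):
    """Return the step-allocation node indices that are still marked busy.

    Hypothetical model: each completion message clears exactly the node
    index (rank) that the reporting stepd names.
    """
    busy = set(range(node_count))
    for rank in ranks_reported:
        busy.discard(rank)
    return busy


# Old behaviour: every stepd of a 3-node step reports rank 0.
stuck = complete_step(3, [0, 0, 0])

# Fixed behaviour: each stepd reports its own rank.
clean = complete_step(3, [0, 1, 2])

print("still busy with rank 0 everywhere:", sorted(stuck))
print("still busy with correct ranks:", sorted(clean))
```

In this model, nodes 1 and 2 stay busy when every stepd reports rank 0, which matches the hang on subsequent job steps that the commit message describes.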
    • Move two NEWS entries to appropriate maintenance release. · 09a7da34
      Alejandro Sanchez authored
      They were associated with these two commits:
      
      b4d7de48
      6871185a
      
      Bug 5562.
      09a7da34
  2. 21 May, 2019 6 commits
  3. 17 May, 2019 2 commits
  4. 16 May, 2019 2 commits
    • Fix archive loading events. · 0d0f9deb
      Marshall Garey authored
      Commit 3d61b6aa introduced a syntax error in the MySQL query that inserts
      event records into the event table: a semicolon in the middle of the
      query, for example:
      
      insert into "voyager_event_table" (time_start, time_end, node_name,
      cluster_nodes, reason, reason_uid, state, tres) values ('1538669453',
      '1539298628', 'v1', '', 'cold-start', '1017', '0',
      '1=8,2=4000,5=8,1001=4,1002=1');, (<... another record>);, ...
      
      Bug 7025.
      0d0f9deb
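A well-formed multi-row insert separates the value tuples with commas only and ends with a single terminator. The sketch below shows one way to build such a statement. It is not the slurmdbd code: `build_insert` is a hypothetical helper, `sqlite3` stands in for MySQL so the example can run anywhere, and the second record is invented purely for illustration.

```python
import sqlite3


def build_insert(table, columns, rows):
    """Build one multi-row INSERT using '?' placeholders (hypothetical helper)."""
    row_tpl = "(" + ", ".join("?" for _ in columns) + ")"
    values = ", ".join(row_tpl for _ in rows)  # tuples joined by commas only
    sql = f'insert into "{table}" ({", ".join(columns)}) values {values};'
    params = [field for row in rows for field in row]
    return sql, params


columns = ["time_start", "time_end", "node_name", "cluster_nodes",
           "reason", "reason_uid", "state", "tres"]
rows = [
    ("1538669453", "1539298628", "v1", "", "cold-start", "1017", "0",
     "1=8,2=4000,5=8,1001=4,1002=1"),
    # Invented second record, only here to exercise the multi-row path.
    ("1539298628", "1539300000", "v2", "", "cold-start", "1017", "0",
     "1=8,2=4000,5=8,1001=4,1002=1"),
]

conn = sqlite3.connect(":memory:")
conn.execute('create table "voyager_event_table" ('
             + ", ".join(f"{c} text" for c in columns) + ")")

sql, params = build_insert("voyager_event_table", columns, rows)
conn.execute(sql, params)
inserted = conn.execute('select count(*) from "voyager_event_table"').fetchone()[0]
print(sql)
print("rows inserted:", inserted)
```

Using placeholders also keeps the record values out of the statement text, so a stray semicolon can only come from the code that assembles the query.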
    • Fix regression caused by 34e9d41b. · c77d7895
      Marshall Garey authored
      This commit caused loading usage table archive files to fail.
      Specifically, wckey and assoc hourly/daily/monthly usage tables and the
      cluster usage tables archive files would all fail to load.
      
      Bug 7025.
      c77d7895
  5. 15 May, 2019 2 commits
  6. 13 May, 2019 1 commit
  7. 10 May, 2019 7 commits
    • Document behavior of duplicate archive file names. · 7e7fd1bc
      Marshall Garey authored
      Bug 6033.
      7e7fd1bc
    • Prevent infinite loop if 0 records are archived. · df5f748d
      Marshall Garey authored
      If _get_oldest_record() finds a record to archive/purge, then archive
      should always archive at least one record. If for whatever reason it
      fails to archive any records (_archive_table() returns 0), then we
      don't want to call continue, but want to return an error. Calling continue
      to go back to the beginning of the while loop would result in an
      infinite loop.
      
      Bug 6033.
      df5f748d
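The control flow described in df5f748d can be sketched as follows. This is a simplified model, not the actual slurmdbd loop: `archive_all` and its callback are hypothetical, and a non-empty list stands in for _get_oldest_record() finding something to archive.

```python
def archive_all(pending, archive_table):
    """Archive until `pending` is empty; return 0 on success, -1 on error.

    `archive_table` is a hypothetical callback that returns how many of the
    leading records in `pending` it managed to archive.
    """
    while pending:  # a record to archive/purge was found
        archived = archive_table(pending)
        if archived == 0:
            # Work exists but nothing was archived. Looping again would see
            # the same records and never terminate, so report an error.
            return -1
        del pending[:archived]
    return 0


ok = archive_all(list(range(5)), lambda recs: min(2, len(recs)))
stuck = archive_all(list(range(5)), lambda recs: 0)
print(ok, stuck)
```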
    • Make archive job sql query consistent with purge. · 90471db8
      Marshall Garey authored
      Bug 6033.
      90471db8
    • Only archive 50k records at a time. · ddd49896
      Marshall Garey authored
      Trying to archive too many records at once can result in archive files
      that are too big to read or even too big to be written. Only archive 50k
      records at a time, like we only purge 50k records at a time.
      
      Bug 6033.
      ddd49896
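Capping the batch size, as ddd49896 describes, bounds the size of each archive file. A minimal sketch of the idea follows. The constant name `MAX_ARCHIVE_RECORDS` and the generator `batches` are hypothetical; only the 50k limit comes from the commit message.

```python
MAX_ARCHIVE_RECORDS = 50_000  # limit from the commit message; name is made up


def batches(records, size=MAX_ARCHIVE_RECORDS):
    """Yield consecutive slices of `records` holding at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


sizes = [len(batch) for batch in batches(list(range(120_000)))]
print(sizes)
```

Each batch would then be written to its own archive file, which is presumably why unique archive file names matter for this series of commits.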
    • Handle duplicate archive file names. · 1e234c3d
      Marshall Garey authored
      The time period of the archive file currently depends on submit or start
      time and whether the purge period is in hours, days, or months.
      Previously, if the archive file name already exists, we would overwrite
      the old archive file with the assumption that these are duplicate
      records being archived after an archive load. However, that could result
      in lost records in a couple of ways:
      
        * If there were runaway jobs that were part of an old archive file's
        time period and are later fixed and then purged, the old file would
        be overwritten.
        * If jobs or steps are purged but there are still jobs or steps in
        that time period that are pending or running, the pending or running
        jobs and steps won't be purged. When they finish and are purged, the
        old file would be overwritten.
      
      Instead of overwriting the old file, we append a number to the file name
      to create a new file. This will also be important in an upcoming commit.
      
      Bug 6033.
      1e234c3d
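The "append a number instead of overwriting" approach from 1e234c3d might look roughly like this. The exact suffix format Slurm uses is not stated in the commit message, so the `.N` suffix, the helper `unique_archive_path` and the file name `job_archive` are all assumptions made for illustration.

```python
import os
import tempfile


def unique_archive_path(path):
    """Return `path` if unused, else `path.N` for the smallest unused N >= 1."""
    if not os.path.exists(path):
        return path
    n = 1
    while os.path.exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "job_archive")  # made-up archive file name
    names = []
    for _ in range(3):
        target = unique_archive_path(base)
        with open(target, "w") as fh:
            fh.write("records\n")  # earlier files are never overwritten
        names.append(os.path.basename(target))

print(names)
```

Note that this check-then-create pattern is not safe against concurrent writers. It is adequate only when a single process writes the archive directory.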
    • Remove unused static variable high_buffer_size. · 3ffb4b4c
      Marshall Garey authored
      It was set but never read.
      
      Bug 6033.
      3ffb4b4c
    • Use correct signed/unsigned types. · 4a26e486
      Marshall Garey authored
      Change a few variables in archiving to use the correct signed or
      unsigned type to avoid implicit casting.
      
      Bug 6033.
      4a26e486
  8. 09 May, 2019 1 commit
  9. 08 May, 2019 1 commit
    • Renumber newly added flags to avoid a conflict in 19.05. · 26ccbec1
      Tim Wickberg authored
      These conflict with JOB_MEM_SET/JOB_RESIZED in 19.05. Since 19.05rc1
      has shipped - but no 18.08 maintenance releases have shipped with these
      new flags - it is safer to renumber them here to avoid the merge conflict
      going into 19.05.
      
      Bug 6064.
      26ccbec1
  10. 06 May, 2019 1 commit
    • Fix seff memory display overflow · bab13dfd
      Felip Moll authored
      When the tres_usage_in_max field is empty, it is recorded as '' in the database,
      which leads find_tres_count_in_string() to return INFINITE64. Seff treats
      INFINITE64 as a valid value. This patch fixes this issue.
      
      Bug 6817
      bab13dfd
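The failure mode in bab13dfd comes from treating a sentinel as data. The sketch below shows one possible guard. It is not the seff patch itself (seff is a Perl script): `find_tres_count` is a hypothetical Python analogue of find_tres_count_in_string(), the value of `INFINITE64` and the memory TRES id of 2 are assumptions, and mapping the sentinel to 0 is just one way to handle it.

```python
INFINITE64 = 0xFFFFFFFFFFFFFFFF  # assumed value of the 64-bit sentinel


def find_tres_count(tres_str, tres_id):
    """Return the count for `tres_id` in an 'id=count,...' string.

    Returns INFINITE64 when the id is absent, including for an empty string.
    """
    for item in tres_str.split(","):
        key, sep, value = item.partition("=")
        if sep and key == str(tres_id):
            return int(value)
    return INFINITE64


def max_memory(tres_usage_in_max, mem_tres_id=2):
    """Return the recorded memory value, or 0 when nothing was recorded."""
    count = find_tres_count(tres_usage_in_max, mem_tres_id)
    return 0 if count == INFINITE64 else count


print(max_memory(""))                 # empty field: sentinel is filtered out
print(max_memory("1=8,2=4000,5=8"))   # populated field: value is returned
```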
  11. 03 May, 2019 1 commit
  12. 02 May, 2019 3 commits
  13. 30 Apr, 2019 1 commit
  14. 29 Apr, 2019 9 commits
    • Update test7.20 to catch passing/failing het jobs · 8c4fdffe
      Brian Christiansen authored
      when one offset passes and the other fails.
      
      Bug 6892
      8c4fdffe
    • Add test7.20 · 1460a6b5
      Nate Rini authored
      Bug 6513.
      1460a6b5
    • Add NEWS for previous two commits · 00a8e724
      Brian Christiansen authored
      Bug 6513
      00a8e724
    • Fix bad sbatch het offset output · 4657ab94
      Brian Christiansen authored
      Bug 6513
      
      First offset is good but second is bad -- didn't request task count.
      
      $ cat etc/job_submit.lua
      function slurm_job_submit(job_desc, part_list, submit_uid)
          slurm.log_user("submit1\nstuff")
          slurm.log_user("submit2")
          slurm.log_user("submit3")

          -- slurm.log_user("case 0")
          if job_desc.num_tasks == slurm.NO_VAL or job_desc.num_tasks == nil then
              slurm.log_user("Batch submit error:  Must specify either number of nodes or number of tasks!")
              -- reject the job
              return slurm.ERROR
          end

          return slurm.SUCCESS
      end

      function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
          slurm.log_user("modify1")
          slurm.log_user("modify2")
          slurm.log_user("modify3")
          return slurm.SUCCESS
      end
      
      slurm.log_user("initialized")
      return slurm.SUCCESS
      
      Output in which the second offset's messages carry no index:
      $ sbatch -Ablah2 -n1 --wrap="hostname" : -J asdfl
      sbatch: error: 0: initialized
      sbatch: error: 0: submit1
      sbatch: error: 0: stuff
      sbatch: error: 0: submit2
      sbatch: error: 0: submit3
      sbatch: error: submit1
      sbatch: error: stuff
      sbatch: error: submit2
      sbatch: error: submit3
      sbatch: error: Batch submit error:  Must specify either number of nodes or number of tasks!
      sbatch: error: Batch job submission failed: Unspecified error
      
      Output in which every message carries its offset index:
      $ sbatch -Ablah2 -n1 --wrap="hostname" : -J asdfl
      sbatch: error: 0: initialized
      sbatch: error: 0: submit1
      sbatch: error: 0: stuff
      sbatch: error: 0: submit2
      sbatch: error: 0: submit3
      sbatch: error: 1: submit1
      sbatch: error: 1: stuff
      sbatch: error: 1: submit2
      sbatch: error: 1: submit3
      sbatch: error: 1: Batch submit error:  Must specify either number of nodes or number of tasks!
      sbatch: error: Batch job submission failed: Unspecified error
      
      srun already handles this
      4657ab94
    • Break up packed job user messages to prepend index. · a415b8f6
      Nate Rini authored
      Was dumping this:
      $ srun -A test7.21-account.1 --qos test7.21-qos.1 -n5 : -n3 : -n1 /bin/true
      srun: error: 0: submit1
      srun: error: submit2
      srun: error: submit3
      srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
      
      Will now dump this:
      $ srun -A test7.21-account.1 --qos test7.21-qos.1 -n5 : -n3 : -n1 /bin/true
      srun: error: 0: initialized
      srun: error: 0: submit1
      srun: error: 0: submit2
      srun: error: 0: submit3
      srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
      
      Bug 6513.
      a415b8f6
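The behaviour shown in a415b8f6 amounts to splitting the user message on newlines and prefixing every line with the pack offset. A minimal sketch under that assumption follows. The helper `prefix_messages` is hypothetical and is not the code used by srun or sbatch.

```python
def prefix_messages(offset, user_msg):
    """Split a multi-line user message and prefix each line with the offset."""
    return [f"{offset}: {line}" for line in user_msg.splitlines() if line]


lines = prefix_messages(0, "submit1\nstuff\nsubmit2\nsubmit3")
for line in lines:
    print("srun: error:", line)
```

Prefixing only the whole message would label just the first line, which is the pattern visible in the "Was dumping this" output above.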
    • Fix printing duplicate error messages of lua rejected jobs · 297a6880
      Nate Rini authored
      Regression from 70b4e06d.
      
      Bug 6892.
      297a6880
    • Nate Rini's avatar
      8920863a
    • Brian Christiansen's avatar
    • Fix unnecessary reloading of submit plugins · b50ac244
      Brian Christiansen authored
      Bug 6895
      b50ac244