1. 10 May, 2018 1 commit
    • Fix different issues when requesting memory per cpu/node. · bf4cb0b1
      Alejandro Sanchez authored
      
      
      First issue was identified on multi-partition requests. job_limits_check()
      was overriding the original memory requests, so the next partition
      Slurm validated limits against was not using the original values. The
      solution consists of adding three members to the job_details struct to
      preserve the original requests. This issue is reported in bug 4895.
      
      Second issue was memory enforcement behavior being different depending on
      whether the job request was issued against a reservation or not.
      
      Third issue had to do with the automatic adjustments Slurm did underneath
      when the memory request exceeded the limit. These adjustments included
      increasing pn_min_cpus (even incorrectly beyond the number of cpus
      available on the nodes) or different tricks increasing cpus_per_task and
      decreasing mem_per_cpu.
      
      Fourth issue was identified when requesting the special case of 0 memory,
      which was handled inside the select plugin after the partition validations
      and could thus be used to incorrectly bypass the limits.
      
      Issues 2-4 were identified in bug 4976.
      
      Patch also includes an entire refactor of how and when job memory is
      set to default values (if not requested initially) and how and when
      limits are validated.
      
      Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
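The multi-partition problem described in the first issue can be illustrated with a toy model. This is only a sketch under stated assumptions: `limits_check` is a stand-in written for this example, not Slurm's `job_limits_check()`, the member name `orig_pn_min_memory` is assumed for illustration, and the "adjust to fit the limit" step is a simplification of whatever the real check rewrote.

```python
from dataclasses import dataclass


@dataclass
class JobDetails:
    pn_min_memory: int       # working value, rewritten during validation
    orig_pn_min_memory: int  # assumed name for the preserved user request


def limits_check(job, max_mem, restore):
    """Toy per-partition check; returns the value it validated."""
    if restore:
        # Start every partition check from the user's original request.
        job.pn_min_memory = job.orig_pn_min_memory
    seen = job.pn_min_memory
    # The toy check adjusts the working value to fit this partition's limit.
    job.pn_min_memory = min(job.pn_min_memory, max_mem)
    return seen


def validate(request, partition_limits, restore):
    """Run the toy check against each partition limit in order."""
    job = JobDetails(pn_min_memory=request, orig_pn_min_memory=request)
    return [limits_check(job, limit, restore) for limit in partition_limits]


# A 4000 MB request checked against two partitions (limits 2000 and 8000).
without_restore = validate(4000, [2000, 8000], restore=False)
with_restore = validate(4000, [2000, 8000], restore=True)
print(without_restore, with_restore)
```

Without the preserved copy, the second partition is validated against the value left behind by the first one; with it, both partitions see the original request.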
  2. 09 May, 2018 9 commits
  3. 08 May, 2018 3 commits
    • Prevent slurmd from launching steps if prolog fail · 3b029021
      Brian Christiansen authored
      Bug 5146
    • Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
      Tim Wickberg authored
      Caused by a corrupted protocol_version field value being received
      by the slurmstepd, as we cannot safely write/read a uint16_t across
      the pipe as if it was an int.
      
      Regression caused by commit 90b116c2.
      
      Bug 5133.
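One plausible way such a corruption shows up only on a big-endian platform can be modelled with Python's `struct` module. This is an assumption-laden sketch, not the actual Slurm code path: it packs a value as a 4-byte int and then reinterprets the first two bytes as a 16-bit value, once per byte order. The value `0x2000` is a made-up protocol version.

```python
import struct


def read_u16_from_int_bytes(value, byte_order):
    """Pack value as a 4-byte int, then read its first 2 bytes as a uint16.

    byte_order is "<" (little-endian) or ">" (big-endian).
    """
    raw = struct.pack(byte_order + "i", value)
    (narrow,) = struct.unpack(byte_order + "H", raw[:2])
    return narrow


version = 0x2000  # hypothetical protocol version value

little = read_u16_from_int_bytes(version, "<")
big = read_u16_from_int_bytes(version, ">")
print(hex(little), hex(big))
```

On little-endian layouts the low-order bytes come first, so the mismatch in width goes unnoticed; on big-endian layouts the first two bytes are the high-order half, and the value is lost.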
    • Fix checkpointing requeued jobs in a bad state · f9f395af
      Brian Christiansen authored
      Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
      in. The issue is that if job_set_alloc_tres gets called while in the
      PENDING|COMPLETING state, the job's tres_alloc_str will be freed. If
      this job then gets checkpointed in this state (PENDING|COMPLETING + no
      tres_alloc_str), on startup the controller would crash because it
      expected the job to have a tres_alloc_str/cnt when in the COMPLETING
      state. This could be triggered if starting the controller without the
      dbd up. When the dbd comes up, the assoc_cache_mgr calls
      _update_job_tres() which calls job_set_alloc_tres. It could also be
      triggered by adding new tres.
      
      This most likely started happening in 17.11.5 because of commit
      865b672f which introduced calling _update_job_tres() on each job
      after the dbd comes up.
      
      Bugs 5137,4522
  4. 04 May, 2018 1 commit
  5. 03 May, 2018 5 commits
  6. 02 May, 2018 6 commits
  7. 01 May, 2018 1 commit
    • Fix total TRES Billing on partitions. · 3686dd9c
      Danny Auble authored
      Turns out the partition's billing TRES was working off the sum of
      the node_ptrs, which contain the max of all partitions they are in.
      
      This isn't correct since each partition's billing can be different.
      
      Set it correctly here.
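The difference between the two ways of totalling billing can be shown with a small worked example. The node names, CPU counts, and the single per-CPU weight below are invented for illustration, and the functions are a toy model rather than Slurm's implementation.

```python
def per_node_max(nodes, partitions):
    """For each node, keep the largest billing value over all its partitions."""
    best = {}
    for part in partitions.values():
        for name in part["nodes"]:
            value = nodes[name] * part["cpu_weight"]
            best[name] = max(best.get(name, 0.0), value)
    return best


def billing_from_node_max(part, node_max):
    """Total obtained by summing the per-node maxima (the misleading way)."""
    return sum(node_max[name] for name in part["nodes"])


def billing_per_partition(part, nodes):
    """Total computed with this partition's own weight."""
    return sum(nodes[name] * part["cpu_weight"] for name in part["nodes"])


nodes = {"n1": 16, "n2": 16}  # hypothetical CPU counts per node
partitions = {
    "cheap": {"nodes": ["n1", "n2"], "cpu_weight": 1.0},
    "premium": {"nodes": ["n2"], "cpu_weight": 4.0},
}

node_max = per_node_max(nodes, partitions)
print(billing_from_node_max(partitions["cheap"], node_max))  # 80.0
print(billing_per_partition(partitions["cheap"], nodes))     # 32.0
```

Because node `n2` also belongs to the more expensive partition, summing per-node maxima inflates the total for `cheap`, while the per-partition calculation uses only that partition's weight.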
  8. 30 Apr, 2018 3 commits
  9. 28 Apr, 2018 2 commits
  10. 26 Apr, 2018 1 commit
  11. 25 Apr, 2018 2 commits
  12. 23 Apr, 2018 2 commits
  13. 19 Apr, 2018 3 commits
    • Fix incorrect error thrown when cancelling part of a job array. · 8432f9f6
      Marshall Garey authored
      Fix an issue in the bit manipulation logic introduced in commit 892ffa89.
      
      Bug 4997.
    • Fix 'squeue -o %s' on Cray systems. · d3398004
      Tim Wickberg authored
      Replace select_p_select_jobinfo_sprint() with the same NO-OP
      that the other plugins (except alps and bluegene) have implemented.
      
      Bug 5077.
    • Fix for update job function. · 164d4878
      Danny Auble authored
      Time limit was incorrectly changed when swapping qos. Now we check
      qos/assoc/partition all together and deny the change if not consistent.
      Fixed how we check coordinator permissions. Rearranged the function to update
      QOS before updating partition in order to enforce AllowQOS and DenyQOS options.
      One memory leak and some comments fixed too.
      
      Bug 4685
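Why the order of the QOS and partition updates matters can be sketched with a toy update routine. Everything here is hypothetical: `update_job`, the `allow_qos` table, and the QOS and partition names are invented for this example and only model an AllowQOS-style check, not the real update function.

```python
def update_job(job, new_qos, new_partition, allow_qos, qos_first):
    """Apply a combined QOS and partition change in the chosen order."""
    updated = dict(job)

    def set_qos():
        updated["qos"] = new_qos

    def set_partition():
        # The partition check looks at whatever QOS the job holds right now.
        if updated["qos"] not in allow_qos[new_partition]:
            raise ValueError(
                "QOS %s not allowed in partition %s"
                % (updated["qos"], new_partition)
            )
        updated["partition"] = new_partition

    steps = [set_qos, set_partition] if qos_first else [set_partition, set_qos]
    for step in steps:
        step()
    return updated


job = {"qos": "normal", "partition": "batch"}
allow_qos = {"batch": {"normal"}, "gpu": {"high"}}

accepted = update_job(job, "high", "gpu", allow_qos, qos_first=True)

try:
    update_job(job, "high", "gpu", allow_qos, qos_first=False)
    rejected = False
except ValueError:
    rejected = True
```

With the QOS applied first, the partition check sees the QOS the job will actually run with; in the other order it is evaluated against the stale QOS.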
  14. 17 Apr, 2018 1 commit
    • Make UnavailableNodes value in job reason be correct for each job · fc4e5ac9
      Morris Jette authored
      1. Identifies nodes which are unavailable to a specific job,
         adding a call to filter_by_node_owner() in select_nodes()
         where the node list is generated.
      2. Removes the "unavail_node_str" argument to select_nodes()
         as it is no longer useful. This string was originally
         generated once at the start of the job scheduling logic for all jobs,
         but since each job can have a different set of
         unavailable nodes (dedicated to user, group, etc.),
         the same string for all jobs can be misleading.
      
      Bug 4932.
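The reason a single shared string can mislead is easy to see in a toy model where nodes may be dedicated to one user. The function below is a hypothetical illustration, not `filter_by_node_owner()` itself, and the node and user names are made up.

```python
def unavailable_nodes(job_user, node_owners):
    """Return the nodes this user's job cannot use, in sorted order.

    node_owners maps node name to the owning user, or None if shared.
    """
    return sorted(
        name
        for name, owner in node_owners.items()
        if owner is not None and owner != job_user
    )


node_owners = {"n1": None, "n2": "alice", "n3": "bob"}

alice_view = unavailable_nodes("alice", node_owners)
bob_view = unavailable_nodes("bob", node_owners)
print(alice_view, bob_view)
```

Each job gets its own list, so a string computed once for all jobs would be wrong for at least one of them.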