1. 20 Dec, 2019 3 commits
  2. 19 Dec, 2019 1 commit
  3. 18 Dec, 2019 3 commits
    • Fix incorrect SLURM_CLUSTER_NAME env var in batch step · 910c38e2
      Douglas Wightman authored
      In a multi-cluster environment a job may submit more jobs as part of
      its workflow. This fixes situations where the variable is inherited
      incorrectly on sub-jobs.
      
      Bug 7998
    • Ensure x11 is setup before launching a job step · 71df4fae
      Marshall Garey authored
      srun waits for the prolog to finish before launching a job step.
      In _is_prolog_finished(), slurmctld checks the state reason:
      
      	if (job_ptr) {
      		is_running = (job_ptr->state_reason != WAIT_PROLOG);
      	}
      
      But if the job is updated during the job prolog, then _update_job() will
      change the state_reason, and then slurmctld will tell srun that the
      prolog is completed even if it isn't. If srun launches a job step before
      the extern sets up x11, then the job step won't have x11 information. To
      fix this, don't change state_reason in _update_job() if it equals
      WAIT_PROLOG.
      
      Bug 7525
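The guard described in this commit message can be sketched as follows. This is a minimal illustration, not the slurmctld code: `struct job`, the reduced `enum state_reason`, and `update_state_reason()` are hypothetical stand-ins, and only the `WAIT_PROLOG` comparison is taken from the message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-ins for the real slurmctld definitions. */
enum state_reason { WAIT_NO_REASON = 0, WAIT_PROLOG, WAIT_HELD_USER };

struct job {
	enum state_reason state_reason;
};

/* Mirrors the check quoted in the commit message: the prolog counts as
 * finished once the state reason is no longer WAIT_PROLOG. */
static bool is_prolog_finished(const struct job *job_ptr)
{
	assert(job_ptr);
	return job_ptr->state_reason != WAIT_PROLOG;
}

/* Sketch of the update path: leave state_reason untouched while the
 * prolog is still running, so the check above stays truthful. */
static void update_state_reason(struct job *job_ptr,
				enum state_reason new_reason)
{
	assert(job_ptr);
	if (job_ptr->state_reason == WAIT_PROLOG)
		return;
	job_ptr->state_reason = new_reason;
}
```

With this guard, an update that arrives during the prolog no longer makes the prolog look finished to a waiting srun.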
    • Fix for requesting specific nodes when using cons_tres topology. · d0bf6d68
      Douglas Wightman authored
      This in turn fixes allocation requests that were accepted even though
      they should have been rejected because the requested nodes didn't have
      a shared network.
      
      Bug 8210
  4. 17 Dec, 2019 1 commit
  5. 16 Dec, 2019 1 commit
  6. 10 Dec, 2019 1 commit
  7. 09 Dec, 2019 1 commit
  8. 02 Dec, 2019 1 commit
  9. 26 Nov, 2019 2 commits
  10. 21 Nov, 2019 2 commits
    • Fix misleading error for immediate alloc requests and defer combination. · 1b13f532
      Alejandro Sanchez authored
      When an allocation request was done with the immediate=1 argument and
      SchedulerParameters included defer, Slurm was returning a misleading
      ESLURM_FRAGMENTATION error. The logic now returns a more appropriate
      ESLURM_CAN_NOT_START_IMMEDIATELY error for this scenario by decoupling
      defer from the too fragmented logic in job_allocate().
      
      Note that this doesn't change behavior: with the immediate + defer
      combination, defer still takes precedence, meaning individual
      submit-time allocation attempts are avoided regardless of immediate.
      
      Bug 5175.
    • Reject unrunnable jobs submitted to reservations. · ab52c868
      Marshall Garey authored
      This effectively reverts commit 73351553. That commit's message is,
      
           "Improve support for overlapping advanced reservations.
            Patch from Bill Brophy, Bull."
      
      Jobs submitted to reservations that request more resources than are on a
      node will pend forever because of that commit. Reverting that commit
      causes those jobs to be immediately rejected. Also, that commit doesn't
      appear to "improve support for overlapping advanced reservations" in any
      way.
      
      The job is already immediately rejected if it asks for more resources
      than are on a node without being submitted to a reservation, or if the
      job asks for more nodes than are currently in the reservation. So, this
      commit just makes behavior consistent.
      
      Bug 5175.
  11. 15 Nov, 2019 1 commit
    • Fix both socket-[un]constrained GRES allocation issues. · efcd853a
      Michael Hinton authored
      Do not assume that these sock_gres_t pointers always exist:
      bits_by_sock
      bits_by_sock[s]
      
      If they don't, that means there are no GRES constrained to the current
      iteration's socket `s`, and so the logic shouldn't allocate the current
      iteration's GRES `g`.
      
      Analogously, do not assume that the sock_gres_t member pointer
      bits_any_sock is always valid. If it isn't, there are no
      socket-unconstrained GRES available to satisfy the job request, so the
      logic should not allocate the current iteration's GRES `g`.
      
      Otherwise, job/node struct members holding GRES allocation information
      would end up being incorrect, leading to improper allocations and also
      leading to errors logged in slurmctld log at deallocation time like:
      
      error: gres/gpu: job <X> dealloc node <Y> GRES count underflow (0 < 1)
      
      Bug 7827
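The defensive checks described here might look roughly like the sketch below. `struct sock_gres_sketch` is a hypothetical simplification: the member names `bits_by_sock` and `bits_any_sock` come from the commit message, but plain 64-bit masks stand in for Slurm's bitstrings and both helper functions are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for sock_gres_t. */
struct sock_gres_sketch {
	int sock_cnt;             /* number of entries in bits_by_sock */
	uint64_t **bits_by_sock;  /* per-socket GRES masks, may be NULL */
	uint64_t *bits_any_sock;  /* socket-unconstrained mask, may be NULL */
};

/* True only if GRES `g` is constrained to socket `s`. A missing array or
 * a missing per-socket entry means "nothing to allocate here". */
static bool sock_gres_usable(const struct sock_gres_sketch *sg, int s, int g)
{
	if (!sg || !sg->bits_by_sock || s < 0 || s >= sg->sock_cnt ||
	    !sg->bits_by_sock[s])
		return false;
	if (g < 0 || g >= 64)
		return false;
	return ((*sg->bits_by_sock[s] >> g) & 1u) != 0;
}

/* True only if GRES `g` is available without a socket constraint. */
static bool any_sock_gres_usable(const struct sock_gres_sketch *sg, int g)
{
	if (!sg || !sg->bits_any_sock || g < 0 || g >= 64)
		return false;
	return ((*sg->bits_any_sock >> g) & 1u) != 0;
}
```

Treating a NULL pointer as "no usable GRES" keeps the allocation counters from being incremented for devices that were never available, which is the bookkeeping mismatch the underflow error reports.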
  12. 14 Nov, 2019 1 commit
  13. 12 Nov, 2019 2 commits
    • Initialize db_flags correctly in slurmdb_unpack_job_cond(). · 6158e479
      Marcin Stolarek authored
      For older RPCs we should initialize db_flags with SLURMDB_JOB_FLAG_NOTSET.
      (Which is treated differently than SLURMDB_JOB_FLAG_NONE, which is 0.)
      
      Bug 8029.
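The distinction between "not set" and "none" can be illustrated with a small sketch. Everything here is an assumption except that the NONE value is 0, which the message states: the NOTSET bit value, the `PROTO_WITH_DB_FLAGS` threshold, and the reduced `struct job_cond` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define JOB_FLAG_NONE   0x0u   /* 0, as stated in the commit message */
#define JOB_FLAG_NOTSET 0x1u   /* assumed value, for illustration only */
#define PROTO_WITH_DB_FLAGS 2  /* hypothetical protocol version */

/* Hypothetical, reduced stand-in for the job condition structure. */
struct job_cond {
	uint32_t db_flags;
};

/* Sketch of an unpack step: newer peers send db_flags on the wire,
 * older peers do not, so the field is marked NOTSET instead of being
 * left at 0, which would read as an explicit NONE. */
static void unpack_job_cond(struct job_cond *cond, int protocol_version,
			    uint32_t wire_flags)
{
	assert(cond);
	if (protocol_version >= PROTO_WITH_DB_FLAGS)
		cond->db_flags = wire_flags;
	else
		cond->db_flags = JOB_FLAG_NOTSET;
}
```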
    • Fix regression caused by c55f6d65. · 4c1ed636
      Dominik Bartkiewicz authored
      Remove the TIME_FLOAT flag from the reservation to ensure _job_overlap()
      does not add the current time on top of the start_time. The prior
      approach was incorrect for non-TIME_FLOAT reservations and would
      lead to valid reservations being rejected.
      
      Bug 7458, 7908.
  14. 11 Nov, 2019 2 commits
  15. 08 Nov, 2019 2 commits
    • Fix issues with --gpu-bind while using cgroups · 5b13fbb3
      Michael Hinton authored
      CUDA_VISIBLE_DEVICES was not being set to the correct GPU indexes when
      cgroups were being used. These issues were exhibited with at least the
      map_gpu and mask_gpu binding options.
      
      The issue was that usable_gres is a bitmask of GRESs in the step's
      cgroup, but bit_test() was looking at bit i, which is the index of the
      global gres_list (not constrained by cgroups).
      
      Bug 7509
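The index mismatch can be made concrete with a sketch. This is not the actual patch: it assumes the usable mask is indexed by position within the step's cgroup, uses a plain 64-bit mask in place of a Slurm bitstring, and introduces the hypothetical helpers `global_to_local()` and `device_usable()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Translate a global device index into its position among the devices
 * that belong to the step's cgroup. Returns -1 if the device is not in
 * the cgroup at all. `in_cgroup` has one entry per global device. */
static int global_to_local(const bool *in_cgroup, int n_global,
			   int global_idx)
{
	int local = 0;

	if (!in_cgroup || global_idx < 0 || global_idx >= n_global ||
	    !in_cgroup[global_idx])
		return -1;
	for (int i = 0; i < global_idx; i++) {
		if (in_cgroup[i])
			local++;
	}
	return local;
}

/* Test the cgroup-relative mask with the translated index rather than
 * the global one. */
static bool device_usable(uint64_t usable_mask, const bool *in_cgroup,
			  int n_global, int global_idx)
{
	int local = global_to_local(in_cgroup, n_global, global_idx);

	return local >= 0 && local < 64 &&
	       ((usable_mask >> local) & 1u) != 0;
}
```

For example, with four GPUs on the node and only GPUs 2 and 3 in the cgroup, bit 0 of the mask refers to global GPU 2; testing bit 2 directly would miss it.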
    • Fix regression on update from older versions with DefMemPerCPU · 6abe1e75
      Felip Moll authored
      In 19.05 the JOB_MEM_SET flag was added, along with a conditional check
      on this flag that changed pn_min_memory when validating job limits. As
      a result, after an upgrade, jobs left pending (PD) by an earlier version
      lacked this flag, and their memory was set incorrectly when their limits
      were checked before starting. This patch addresses the issue by adding
      the flag to jobs from an older protocol version when loading the state
      files.
      
      Bug 8011
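A sketch of the state-load fixup described above, under stated assumptions: the flag's bit value, the `PROTO_WITH_MEM_SET_FLAG` threshold, the reduced `struct job_state`, and `fixup_loaded_job()` are all hypothetical, and whether the real patch applies further conditions is not stated in the message.

```c
#include <assert.h>
#include <stdint.h>

#define JOB_MEM_SET_SKETCH 0x1u    /* assumed bit value, illustration only */
#define PROTO_WITH_MEM_SET_FLAG 2  /* hypothetical protocol version */

/* Hypothetical, reduced stand-in for a job record loaded from state. */
struct job_state {
	uint32_t bit_flags;
};

/* Jobs saved by an older protocol version never had the flag, so it is
 * added at load time; jobs from current versions are left as saved. */
static void fixup_loaded_job(struct job_state *job, int protocol_version)
{
	assert(job);
	if (protocol_version < PROTO_WITH_MEM_SET_FLAG)
		job->bit_flags |= JOB_MEM_SET_SKETCH;
}
```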
  16. 07 Nov, 2019 1 commit
    • Allow coordinators to delete users. · 0d579734
      Marshall Garey authored
      Previously, coordinators could delete specific associations, but could
      not delete users. Allow coordinators to delete users if the users are
      only part of accounts that the coordinator is over.
      
      Bug 7413.
  17. 31 Oct, 2019 5 commits
  18. 29 Oct, 2019 1 commit
  19. 28 Oct, 2019 2 commits
  20. 25 Oct, 2019 2 commits
    • Enforce PART_NODES if only Partition is specified · c8ce5a53
      Albert Gil authored
      Bug 7490
    • Avoid abort in dev-build · fe945037
      Marshall Garey authored
      If not enforcing QOS, it's possible to submit a job without a qos. If
      submitting such a job to multiple partitions where at least one has a
      qos, slurmctld would abort in a development build. A non-development
      build didn't segfault only because _find_qos_part doesn't dereference
      the NULL pointer. Prevent the abort.
      
      Bug 7171
  21. 24 Oct, 2019 1 commit
  22. 23 Oct, 2019 1 commit
  23. 22 Oct, 2019 2 commits
    • Fix abort initializing a configuration without acct_gather.conf. · a301635f
      Gavin Howard authored
      Previous logic would only call s_p_hashtbl_create() to create the hashtable
      when the file acct_gather.conf could be successfully stat()'d. This led to
      a subsequent attempt to pack the non-created hashtable into a buffer which
      triggered the abort.
      
      This makes it so the hashtable is unconditionally created, whether or
      not the file exists.
      
      Bug 7893.
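The shape of this fix can be sketched as below, assuming a POSIX system. `struct conf_table` and `load_conf()` are hypothetical stand-ins for the real hashtable type and loader; only the idea of creating the table before, and independently of, the `stat()` result is taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical stand-in for the parsed-configuration hashtable. */
struct conf_table {
	bool file_found;
};

/* The table is allocated first, regardless of whether the file exists,
 * so later code that packs it never sees a missing table. */
static struct conf_table *load_conf(const char *path)
{
	struct stat st;
	struct conf_table *tbl;

	assert(path);
	tbl = calloc(1, sizeof(*tbl));
	if (!tbl)
		return NULL;
	if (stat(path, &st) == 0) {
		/* Real code would parse the file into the table here. */
		tbl->file_found = true;
	}
	return tbl;
}
```

The caller owns the returned table and releases it with `free()` in this sketch.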
    • auth/munge - truncate FQDN to shortname for AllocNodes. · 50eaa012
      Michael Hinton authored
      gethostbyaddr() can potentially return a fully-qualified domain name,
      which breaks backwards compatibility with the shortname AllocNodes
      expected pre 19.05.
      
      Bug 7653.
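Truncating a fully-qualified name to its short form can be sketched in a few lines. The helper name `truncate_to_shortname()` is hypothetical and this is not the auth/munge code; it simply cuts the string at the first dot and assumes the input is a host name rather than a dotted IP address.

```c
#include <assert.h>
#include <string.h>

/* Cut a host name at the first '.', in place. A name without a dot is
 * already a short name and is left unchanged. */
static void truncate_to_shortname(char *hostname)
{
	char *dot;

	assert(hostname);
	dot = strchr(hostname, '.');
	if (dot)
		*dot = '\0';
}
```

Applied to a buffer holding `node01.cluster.example.com`, this leaves `node01`, which matches the short form that AllocNodes comparisons expected before 19.05.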
  24. 21 Oct, 2019 1 commit