1. 20 Dec, 2017 9 commits
    • slurm.spec - move libpmi libraries to a separate package. · 5fb94be9
      Danny Auble authored
      The slurm and pmix RPMs both attempt to install a libpmi. While Slurm's
      version should be preferred, the best way to handle this is to have both
      pmix and slurm packages split this off into a separate RPM that can
      conflict with each other, giving the admin a way to resolve the conflict
      manually.
      
      Bug 4511.
    • Fix reading of variable outside of locks · e3d979c0
      Brian Christiansen authored
      The cluster name could have been freed before it was strdup'ed.
      This happens when a federated persistent connection is established and the
      cluster needs to sync up with the remote cluster.
      
      Bug 4503
    • Sync origin job with cancelled sibling job · 852c2bf3
      Brian Christiansen authored
      Remove the origin job when a remote sibling job was cancelled while the
      origin was down.
      
      e.g.
      job originated on cluster1
      job runs on cluster2
      cluster1 goes down
      job is cancelled on cluster2
      cluster1 comes back up. The origin job should be removed if the sibling
      job was cancelled.
    • Set the user_name field if not provided. · 45be4cc0
      Tim Wickberg authored
      The user_name is only populated if LaunchParameters=send_gids is set.
      Without this field set, the pam_setup call will get NULL instead of the
      user_name, leading to batch job launch failures. (The step launch code
      already handles this same issue separately.)
      
      Bug 4412.
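The fallback described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual Slurm patch: `resolve_user_name` is a hypothetical helper name, and the lookup uses the POSIX `getpwuid_r` call rather than whatever internal routine Slurm uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <pwd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not Slurm code): return a heap-allocated user name.
 * Prefer the name supplied in the launch request; if it is missing, look it
 * up from the uid so later consumers never see NULL for a valid user.
 * Returns NULL only if the uid has no passwd entry or allocation fails.
 * The caller frees the result.
 */
static char *resolve_user_name(const char *supplied, uid_t uid)
{
	struct passwd pwd;
	struct passwd *result = NULL;
	char buf[4096];

	/* Name was sent with the request: just copy it. */
	if (supplied && supplied[0] != '\0')
		return strdup(supplied);

	/* Name was not sent: fall back to a passwd lookup by uid. */
	if (getpwuid_r(uid, &pwd, buf, sizeof(buf), &result) != 0 || !result)
		return NULL;

	return strdup(result->pw_name);
}
```

A caller would pass the possibly-NULL name from the request together with the job's uid, and hand the returned string to the code that needs a user name.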
    • Fix for record job state on successful allocation but failed reply message. · 8dd79221
      Alejandro Sanchez authored
      On a job [pack]allocation RPC request, if the allocation succeeded but
      sending the response message back to the client failed (e.g. srun was
      killed before it could receive the response), then modify the job_record
      so that the job_state is set to FAILED, the exit_code is set as if the job
      got a SIGTERM signal, and the state_reason is set to FAIL_LAUNCH. Users
      querying the job with sacct can then discern that something bad happened
      in this scenario, instead of STATE being shown as COMPLETED and the
      ExitCode as 0:0.
      
      Bug 4513.
    • Prevent possible double-xfree on buffer in stepd_completion. · 973ac201
      Felip Moll authored
      Use FREE_NULL_BUFFER instead, otherwise we could attempt to
      free_buffer this a second time if we jump to the rwfail label.
      
      Bug 4484.
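The free-and-clear idiom this commit relies on can be sketched as below. This is a self-contained illustration under stated assumptions, not Slurm's source: `demo_buf_t`, `demo_free_buf`, `DEMO_FREE_NULL_BUFFER`, and `demo_handle_message` are hypothetical stand-ins for the real buffer type, `free_buffer`, `FREE_NULL_BUFFER`, and the completion handler.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a packed message buffer. */
typedef struct demo_buf {
	char *head;
	size_t size;
} demo_buf_t;

/* Counts real frees so the behavior can be observed. */
static int demo_free_count = 0;

static demo_buf_t *demo_init_buf(size_t size)
{
	demo_buf_t *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->head = malloc(size);
	if (!b->head) {
		free(b);
		return NULL;
	}
	b->size = size;
	return b;
}

static void demo_free_buf(demo_buf_t *b)
{
	free(b->head);
	free(b);
	demo_free_count++;
}

/* Free the buffer once and clear the pointer so later cleanup is a no-op. */
#define DEMO_FREE_NULL_BUFFER(b)	\
do {					\
	if (b)				\
		demo_free_buf(b);	\
	(b) = NULL;			\
} while (0)

static int demo_handle_message(int fail_after_free)
{
	demo_buf_t *buffer = demo_init_buf(64);

	if (!buffer)
		return -1;

	/* ... unpack fields from the buffer here ... */
	DEMO_FREE_NULL_BUFFER(buffer);	/* early release on the normal path */

	if (fail_after_free)
		goto rwfail;
	return 0;

rwfail:
	DEMO_FREE_NULL_BUFFER(buffer);	/* safe: pointer is already NULL */
	return -1;
}
```

With a plain free call in place of the first macro use, the jump to `rwfail` would release the same buffer twice; clearing the pointer makes the second cleanup harmless.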
    • Fix the display for NNodes in sacct output when using --units option. · 763e396e
      Felip Moll authored
      When printing fields in sacct with user specified units (--units), the
      nnodes field showed an incorrect string. This commit reverts a65fa572
      and avoids the unit conversion, which does not make sense outside the
      context of Blue Gene systems (deprecated) anyway.
      
      Bug 4490.
    • slurm.spec - fix issue with rpm 4.13+ / Fedora 25+. · a9388ec7
      Felip Moll authored
      Slurm may generate empty manifest files depending on configuration
      and library availability. Disable the new empty manifest check to
      allow builds to proceed with rpm 4.13+ / Fedora 25+.
      
      Bug 4453.
    • Add news entry for 3552b25f. · 313a6cca
      Morris Jette authored
  2. 19 Dec, 2017 5 commits
  3. 18 Dec, 2017 2 commits
  4. 15 Dec, 2017 4 commits
  5. 14 Dec, 2017 1 commit
  6. 13 Dec, 2017 1 commit
  7. 12 Dec, 2017 1 commit
    • Update jobcomp records on duplicate inserts · 003f65c3
      Brian Christiansen authored
      In the federation case, the origin job is completed in the database when
      a sibling job starts the job. The complete message is then sent again to
      the database when the job is completed on the sibling cluster but
      it is updated with the sibling job's exit code.
      
      The jobcomp plugin didn't handle the multiple updates to the record.
      This change allows the existing record to be updated.
      
      Bug 4493
  8. 11 Dec, 2017 2 commits
  9. 08 Dec, 2017 2 commits
    • Fix Slurm to work correctly with HDF5 1.10+. · 006d172a
      Danny Auble authored
      In 1.10+ they changed the hid_t from an int to a long int which
      messes things up as they use the top 32 bits for stuff right off
      the bat.  This fixes the scenario by handling the number with an int32_t
      instead of an int.
      
      Bug 3795
    • Fix node reboot timeout problem · 41c6ac6a
      Morris Jette authored
      Fix potential node reboot timeout problem for "scontrol reboot" command.
      
      Bug 4203
  10. 07 Dec, 2017 4 commits
  11. 06 Dec, 2017 3 commits
    • Wait to start a batch step · a42d325f
      Danny Auble authored
      until the prolog and extern step are fully run/launched.
      
      Only matters if running with PrologFlags=[contain|alloc].
      
      patch 2 of 2
      
      Bug 4458
    • Only say a prolog is done running after the extern step is launched. · 71e89f7b
      Danny Auble authored
      Patch 1 of 2
      
      Bug 4458
    • Replace slurm_playbook.yaml.in with a version that auto-detects the Slurm version. · d54eb247
      David Gloe authored
      Due to the way Cray builds Slurm, the prefix and bindir paths include the
      Slurm version (/opt/slurm/<version>). This means every time we update to a
      new Slurm version we must update the Slurm ansible playbook. It also means
      that the slurm_playbook.yaml file must be built with Slurm to be used
      (it can't simply be copied directly).
      
      The attached patch updates the playbook to determine the version of Slurm
      to use from the module file, and hardcodes the sysconfdir setting we give
      in our Slurm installation guide. If a customer uses different paths, they
      can update the playbook to meet their needs.
      
      Bug 4360.
  12. 05 Dec, 2017 6 commits