1. 04 Aug, 2017 2 commits
  2. 02 Aug, 2017 2 commits
    • Fix starting ctld w/out existing StateSaveLocation · ec78d45a
      Marshall Garey authored
      slurmctld would fail when trying to create the clustername file
      because the StateSaveLocation path didn't exist yet (see the
      sketch below).
      
      Bug 3988
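
      A minimal sketch of the kind of fix this implies, assuming the
      cure is to create the missing path components before writing the
      clustername file; the helper below is illustrative, not Slurm's
      actual code:

        #include <errno.h>
        #include <limits.h>
        #include <stdio.h>
        #include <sys/stat.h>

        /* Create every missing component of path (illustrative helper). */
        static int mkdir_recursive(const char *path, mode_t mode)
        {
            char buf[PATH_MAX];
            snprintf(buf, sizeof(buf), "%s", path);
            for (char *p = buf + 1; *p; p++) {
                if (*p != '/')
                    continue;
                *p = '\0';                    /* cut at this component */
                if (mkdir(buf, mode) && errno != EEXIST)
                    return -1;
                *p = '/';                     /* restore the separator */
            }
            if (mkdir(buf, mode) && errno != EEXIST)
                return -1;                    /* create the final directory */
            return 0;
        }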
    • Fix srun jobs to run in high prio partition · 948de46b
      Marshall Garey authored
      srun jobs that could start immediately and requested multiple
      partitions didn't run in the highest priority partition if that
      partition wasn't listed first.

      It's possible that scontrol show jobs will now display the
      partition list in priority order, since the job's partition list
      gets sorted by priority (see the sketch below).
      
      Bug 4015
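
      As an illustration of such a sort (the struct and field names are
      hypothetical; Slurm's real part_record differs), a qsort comparator
      that places the highest priority partition first:

        #include <stdlib.h>

        /* Hypothetical partition record for illustration only. */
        typedef struct {
            const char *name;
            unsigned int priority;
        } part_rec_t;

        /* qsort comparator: highest priority first, so the scheduler
         * tries the best partition before the others. */
        static int _sort_part_prio_desc(const void *a, const void *b)
        {
            const part_rec_t *pa = a, *pb = b;
            if (pb->priority > pa->priority)
                return 1;
            if (pb->priority < pa->priority)
                return -1;
            return 0;
        }

        /* usage: qsort(parts, nparts, sizeof(part_rec_t), _sort_part_prio_desc); */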
  3. 01 Aug, 2017 2 commits
  4. 31 Jul, 2017 1 commit
  5. 28 Jul, 2017 2 commits
    • Fix issue with an alternate munge key when communicating on a persistent connection. · 591dc036
      Danny Auble authored
      
      Bug 4009
    • jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a
      Alejandro Sanchez authored
      jobcomp/elasticsearch saves/loads its state to/from
      elasticsearch_state. Since the jobcomp API isn't designed with
      save/load state operations in mind, the plugin's _save_state()
      isn't extern and isn't available from outside the plugin itself,
      so it is tightly coupled to the fini() function. This state
      doesn't follow the same execution path as the rest of the Slurm
      state, which is all independently scheduled in save_all_state(),
      so we save it manually on an RPC of type REQUEST_CONTROL (see the
      sketch below).
      
      This ensures that when the Primary ctld issues a REQUEST_CONTROL
      to the Backup currently running in controller mode, the Backup
      saves the state, so that when the Primary assumes control again
      it can process the saved pending jobs. The other direction was
      already handled: when the Primary is running in controller mode
      and the Backup issues a REQUEST_CONTROL, the Primary shuts down,
      and on breaking out of the ctld main() while(1) loop there was
      already a g_slurm_jobcomp_fini() call in place.
      
      Bug 3908
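
      A hedged sketch of the idea; every name except REQUEST_CONTROL is
      invented for illustration. On receiving REQUEST_CONTROL, the
      Backup flushes jobcomp state before standing down, since the
      normal save_all_state() path does not cover this plugin:

        #include <stdint.h>

        enum { REQUEST_CONTROL = 1 };           /* value illustrative */

        /* Stubs standing in for the real paths. */
        static void jobcomp_save_state(void) { /* reach plugin's _save_state() */ }
        static void reply_and_stand_down(void) { /* ack RPC, relinquish control */ }

        void handle_rpc(uint16_t msg_type)
        {
            switch (msg_type) {
            case REQUEST_CONTROL:
                /* Persist jobcomp state before the Backup stands down,
                 * so the Primary can later process the saved records. */
                jobcomp_save_state();
                reply_and_stand_down();
                break;
            default:
                break;
            }
        }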
  6. 27 Jul, 2017 1 commit
    • Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5
      Alejandro Sanchez authored
      When more than one ping cycle is spawned simultaneously (for
      instance REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the
      selected nodes), we did not track a separate ping_start time for
      each cycle (a sketch of per-cycle tracking follows below). When
      ping_begin() is called, the information about the previous ping
      cycle is lost. Then when ping_end() is called for the first of the
      two cycles, we set ping_start=0, which is then incorrectly used to
      check whether the last cycle ran for more than PING_TIMEOUT
      seconds (100s), wrongly triggering the:
      
       error("Node ping apparently hung, many nodes may be DOWN or configured "
             "SlurmdTimeout should be increased");
      
      Bug 3914
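
      A sketch of the shape of the fix, assuming the remedy is
      per-cycle bookkeeping (all names illustrative): each spawned ping
      cycle carries its own start time, so ending one cycle cannot zero
      the timestamp another cycle still depends on:

        #include <time.h>

        #define PING_TIMEOUT 100        /* seconds, per the message above */

        struct ping_cycle {
            time_t start;               /* this cycle's own start time */
        };

        static void ping_begin(struct ping_cycle *c)
        {
            c->start = time(NULL);
        }

        static int ping_hung(const struct ping_cycle *c)
        {
            /* Compare against this cycle's own start; a shared global
             * ping_start reset to 0 by another cycle would misfire here. */
            return difftime(time(NULL), c->start) > PING_TIMEOUT;
        }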
  7. 26 Jul, 2017 3 commits
  8. 24 Jul, 2017 2 commits
  9. 21 Jul, 2017 3 commits
  10. 19 Jul, 2017 3 commits
  11. 18 Jul, 2017 1 commit
    • Fix issue preventing multiple jobs from an array from starting. · b40bd8d3
      Dominik Bartkiewicz authored
      By removing the real locks we can get into a race condition where
      the prolog starts and finishes before we get here, and then we
      end up waiting forever.

      Making the mutex static seemed to help in many cases, but didn't
      completely close the window. Changing slurm_cond_wait to
      slurm_cond_timedwait fixed the scenario where we would hit the
      window, without degrading the performance the original commit
      provides (see the sketch below).

      There were also spots where, if the job or step didn't exist, the
      condition variable was never signaled, providing another place
      where this could get stuck and never start the job.
      
      Fix regression from commit 52ce3ff0
      
      Bug 3977
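
      A generic sketch of the slurm_cond_wait to slurm_cond_timedwait
      change, written with plain pthreads and illustrative names: a
      timed wait rechecks the predicate periodically, so a signal that
      fired before we began waiting cannot strand us forever:

        #include <pthread.h>
        #include <time.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
        static int prolog_done;         /* predicate set by the prolog path */

        void wait_for_prolog(void)
        {
            struct timespec ts;

            pthread_mutex_lock(&lock);
            while (!prolog_done) {
                clock_gettime(CLOCK_REALTIME, &ts);
                ts.tv_sec += 1;         /* re-evaluate the predicate every second */
                pthread_cond_timedwait(&cond, &lock, &ts);
            }
            pthread_mutex_unlock(&lock);
        }

        void prolog_finished(void)
        {
            pthread_mutex_lock(&lock);
            prolog_done = 1;            /* set predicate under the lock */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
        }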
  12. 14 Jul, 2017 1 commit
    • Fix issue with whole gres not being printed out with Slurm tools. · 028bf3e1
      Danny Auble authored
      This is a regression from commit fec995e0.

      It turns out using tok here was erroneous for situations where
      the gres had a type and name and potentially a count
      (e.g. network:gigabit:1).

      _get_gres_req_cnt() would alter the incoming char *config, which
      is what tok pointed to. So when we printed it back into the
      requested string it would only contain the text up to the first
      ':'. Since we didn't need to '\0' out the first character anyway
      (we skip over it), I just kept track of the character replaced by
      '\0' for the number portion and put it back when we are done
      copying it (see the sketch below).
      
      Related to bug 3521
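
      An illustrative, stand-alone version of the trick (not the actual
      _get_gres_req_cnt() code): remember the byte overwritten with
      '\0' while reading the count, then restore it so the full spec
      string can still be printed:

        #include <ctype.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        static long get_count(char *spec)
        {
            char *colon = strrchr(spec, ':');
            long count = 1;
            char saved;

            if (!colon || !isdigit((unsigned char)colon[1]))
                return count;           /* no trailing count portion */

            saved = *colon;             /* the byte we are about to clobber */
            *colon = '\0';              /* spec now reads "name[:type]" */
            count = strtol(colon + 1, NULL, 10);
            *colon = saved;             /* restore: spec prints whole again */
            return count;
        }

        int main(void)
        {
            char spec[] = "network:gigabit:1";
            printf("count=%ld spec=%s\n", get_count(spec), spec);
            return 0;                   /* prints: count=1 spec=network:gigabit:1 */
        }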
  13. 13 Jul, 2017 6 commits
  14. 07 Jul, 2017 4 commits
  15. 05 Jul, 2017 1 commit
  16. 30 Jun, 2017 2 commits
    • Burst buffer size unit changes · 7e161809
      Alejandro Sanchez authored
      burst_buffer logic modified to support sizes in both SI and IEC
      units (e.g. M/MiB for powers of 1024, MB for powers of 1000); see
      the sketch below.

      Bug 3922
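
      A small sketch of that convention, hypothetical and limited to
      the mega-scale suffixes for brevity (not Slurm's actual parser):

        #include <stdint.h>
        #include <string.h>

        /* "M"/"MiB" are treated as IEC (powers of 1024), "MB" as SI
         * (powers of 1000), per the convention described above. */
        static uint64_t unit_multiplier(const char *suffix)
        {
            if (!strcmp(suffix, "M") || !strcmp(suffix, "MiB"))
                return 1024ULL * 1024ULL;   /* IEC, binary */
            if (!strcmp(suffix, "MB"))
                return 1000ULL * 1000ULL;   /* SI, decimal */
            return 1;                       /* plain bytes */
        }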
    • Fix potential DBD message corruption. · 3e00ede5
      Dominik Bartkiewicz authored
      This patch removes a window in which a message bound for the DBD
      could be packed with the non-DBD packing. This would result in a
      packed msg_type but nothing else. When that message was given to
      the DBD it would complain forever about an unpacking error (see
      the sketch below).
      
      Bug 3891 and 3939
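
      A schematic of the failure mode, with all types and names
      invented for illustration: the packer has to match the message's
      destination, otherwise a DBD-bound message carries only its
      msg_type:

        #include <stdint.h>

        /* All types and names below are illustrative stand-ins. */
        typedef struct { int pos; } buf_t;      /* wire buffer */
        typedef struct { uint16_t msg_type; int dbd_bound; } msg_t;

        static void pack16(uint16_t v, buf_t *b) { (void)v; b->pos += 2; }
        static void pack_dbd_body(const msg_t *m, buf_t *b) { (void)m; b->pos += 1; }
        static void pack_generic_body(const msg_t *m, buf_t *b) { (void)m; b->pos += 1; }

        /* If a DBD-bound message falls through to the generic packer,
         * only msg_type lands in the buffer and the DBD loops forever
         * on an unpacking error when it receives the message. */
        void pack_msg(const msg_t *msg, buf_t *buf)
        {
            pack16(msg->msg_type, buf);
            if (msg->dbd_bound)
                pack_dbd_body(msg, buf);        /* DBD wire format */
            else
                pack_generic_body(msg, buf);    /* regular RPC format */
        }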
  17. 29 Jun, 2017 1 commit
  18. 28 Jun, 2017 3 commits