1. 16 Aug, 2017 1 commit
  2. 15 Aug, 2017 4 commits
  3. 14 Aug, 2017 3 commits
  4. 12 Aug, 2017 1 commit
  5. 11 Aug, 2017 5 commits
  6. 10 Aug, 2017 2 commits
  7. 07 Aug, 2017 2 commits
  8. 04 Aug, 2017 6 commits
  9. 03 Aug, 2017 1 commit
    • pack job step I/O race condition fix · 71a34f56
      Morris Jette authored
      Fix an I/O race condition on step termination for srun launching multiple
         pack job groups. Without this change, application output might be
         lost and/or the srun command might hang after some tasks exit.
  10. 02 Aug, 2017 4 commits
  11. 01 Aug, 2017 3 commits
  12. 31 Jul, 2017 1 commit
  13. 28 Jul, 2017 3 commits
    • Fix issue when using an alternate munge key when communicating on a persistent connection. · 591dc036
      Danny Auble authored
      
      Bug 4009
    • jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a
      Alejandro Sanchez authored
      jobcomp/elasticsearch saves/loads its state to/from elasticsearch_state. Since
      the jobcomp API isn't designed with save/load state operations, the plugin's
      _save_state() isn't extern and thus not callable from outside the plugin itself,
      so it is tightly coupled to the fini() function. This state doesn't follow the
      same execution path as the rest of the Slurm states, which are all independently
      scheduled in save_all_state(). So we save it manually here on an RPC of type
      REQUEST_CONTROL.
      
      This ensures that when the Primary ctld issues a REQUEST_CONTROL to the Backup
      which is currently in controller mode, the Backup will save the state, so that
      when the Primary assumes control again it can process the saved pending jobs.
      The other direction was already handled: when the Primary is running in
      controller mode and the Backup issues a REQUEST_CONTROL, the Primary is shut
      down, and on breaking out of the ctld main() while(1) loop there was already
      a g_slurm_jobcomp_fini() call in place.
      
      Bug 3908
    • Perform pack job limits check at submit time · 058b99b6
      Morris Jette authored
      Perform the limit check on the heterogeneous job as a whole at submit time to
         reject jobs that will never be able to run. Accepting pack jobs
         that can never start would significantly affect scheduling
         in general (blocking the queue).
  14. 27 Jul, 2017 1 commit
    • Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5
      Alejandro Sanchez authored
      When more than one ping cycle is spawned simultaneously (for instance
      REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes),
      we do not track a separate ping_start time for each cycle. When ping_begin()
      is called, the information about the previous ping cycle is lost. Then, when
      ping_end() is called for the first of the two cycles, we set ping_start=0,
      which is then incorrectly used to check whether the last cycle ran for more
      than PING_TIMEOUT seconds (100s), spuriously triggering the:
      
       error("Node ping apparently hung, many nodes may be DOWN or configured "
             "SlurmdTimeout should be increased");
      
      Bug 3914
  15. 26 Jul, 2017 3 commits