  1. 06 Aug, 2018 1 commit
      Modify slurm_send_only_node_msg() to catch issues with socket. · 06582da8
      Tim Wickberg authored
      There are subtle issues involved in treating a TCP transmission
      as a unidirectional message delivery layer.
      
      The original code path looks like: connect(), write(), close().
      But Linux handles the write() and close() asynchronously behind the
      scenes, and does not block until that write() has been ACK'd by the
      remote end. So the write() and close() may succeed, even with data
      still in flight. A communication error - and message loss - would
      have been silently ignored, leading to unreliable message transmission.
      
      Worse yet, one side of the connection would believe it sent the message,
      while the receive side swears it never saw the packets. This leads to
      infrequent and yet seemingly impossible data loss, and a very tough
      bug to chase down.
      
      This teardown code tries to force the connection to shut down in an
      orderly manner, giving Slurm a chance to catch a connection problem
      and the upstream calling path an opportunity to retransmit.
      
      This teardown code is based on an approach described in Section 7.5
      of "UNIX Network Programming" Volume 1 (Third Edition), specifically
      the subsection regarding SO_LINGER, which also covers why SO_LINGER is
      not sufficient to prevent this issue.
      
      Bug 5164.
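
The orderly teardown described in this commit message can be illustrated with a minimal sketch. It is not the code from the commit: `orderly_close()` and its `timeout_ms` parameter are hypothetical names, and the sketch only assumes the general pattern of half-closing the socket with `shutdown()` and then waiting for the peer's end-of-file before calling `close()`. Seeing the EOF indicates that the peer closed its side after our data and FIN were delivered; an error or timeout gives the caller a reason to retransmit.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's actual function): half-close the
 * socket, then wait for the peer to close its side before releasing fd.
 * Returns 0 if the peer's EOF was seen, -1 on error or timeout, with
 * errno describing the failure. The fd is always closed on return.
 */
static int orderly_close(int fd, int timeout_ms)
{
	char buf[64];
	int rc = -1, saved_errno = 0;

	/* Queue a FIN behind any unsent data; we will write nothing more. */
	if (shutdown(fd, SHUT_WR) < 0) {
		saved_errno = errno;
		goto done;
	}

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		/* timeout_ms applies to each wait, not to the whole loop */
		int n = poll(&pfd, 1, timeout_ms);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			saved_errno = errno;
			break;
		}
		if (n == 0) {		/* peer did not close in time */
			saved_errno = ETIMEDOUT;
			break;
		}

		ssize_t r = read(fd, buf, sizeof(buf));
		if (r == 0) {		/* EOF: peer closed its side */
			rc = 0;
			break;
		}
		if (r < 0) {
			if (errno == EINTR)
				continue;
			saved_errno = errno;	/* e.g. ECONNRESET */
			break;
		}
		/* r > 0: unexpected data from the peer; discard it. */
	}

done:
	close(fd);
	errno = saved_errno;
	return rc;
}
```

A caller would replace a bare `close(fd)` after its final `write()` with `orderly_close(fd, timeout_ms)` and treat a return of -1 as a signal that the message may not have arrived.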
  2. 04 Aug, 2018 1 commit
  3. 31 Jul, 2018 1 commit
  4. 27 Jul, 2018 2 commits
  5. 19 Jul, 2018 4 commits
  6. 18 Jul, 2018 3 commits
  7. 17 Jul, 2018 3 commits
  8. 13 Jul, 2018 1 commit
  9. 12 Jul, 2018 3 commits
  10. 09 Jul, 2018 1 commit
  11. 06 Jul, 2018 1 commit
      Fix leaking freezer cgroups. · 7f9c4f73
      Marshall Garey authored
      Continuation of 923c9b37.
      
      There is a delay in the cgroup system when moving a PID from one cgroup
      to another. It is usually short, but if we don't wait for the PID to
      move before removing cgroup directories the PID previously belonged to,
      we could leak cgroups. This was previously fixed in the cpuset and
      devices subsystems. This uses the same logic to fix the freezer
      subsystem.
      
      Bug 5082.
  12. 04 Jul, 2018 1 commit
  13. 03 Jul, 2018 1 commit
  14. 26 Jun, 2018 4 commits
  15. 25 Jun, 2018 1 commit
  16. 22 Jun, 2018 1 commit
  17. 20 Jun, 2018 1 commit
      Make job_start_data() multi partition aware on REQUEST_JOB_WILL_RUN. · 35a13703
      Alejandro Sanchez authored
      Previously the function tested only the first partition in the
      job_record. Now it detects whether the job request names multiple
      partitions and, if so, loops through them until the job will run in one
      of them or the list is exhausted. If the job won't run in any partition,
      it returns the error code from the last partition tested.
      
      Bug 5185.
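
The loop described in this commit message can be sketched in a few lines. The names here are hypothetical and do not come from Slurm: `will_run_any()` stands in for the multi-partition logic, `will_run_fn` is an assumed callback type for the per-partition test, and `demo_will_run()` is a stub with made-up partition names and error codes used only to exercise the loop.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical per-partition test: returns 0 if the job would run in
 * the partition, or a nonzero error code otherwise.
 */
typedef int (*will_run_fn)(int job_id, const char *partition);

/*
 * Try each partition in order and stop at the first one where the job
 * would run. If none qualifies, return the error code from the last
 * partition tested. *chosen (if non-NULL) receives the matching
 * partition name, or NULL when there is no match.
 */
static int will_run_any(int job_id, const char *const *parts,
			size_t n_parts, will_run_fn test,
			const char **chosen)
{
	int rc = -1;	/* returned unchanged only for an empty list */

	if (chosen)
		*chosen = NULL;

	for (size_t i = 0; i < n_parts; i++) {
		rc = test(job_id, parts[i]);
		if (rc == 0) {
			if (chosen)
				*chosen = parts[i];
			break;
		}
	}
	return rc;
}

/* Demo stub with invented rules: only "gpu" accepts the job. */
static int demo_will_run(int job_id, const char *partition)
{
	(void) job_id;
	if (strcmp(partition, "gpu") == 0)
		return 0;
	return strcmp(partition, "debug") == 0 ? 2 : 1;
}
```

With the stub above, a job listing `debug`, `batch`, `gpu` succeeds on the third partition, while a job listing only `batch`, `debug` fails with the code reported for `debug`, the last partition tried.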
  18. 19 Jun, 2018 2 commits
  19. 18 Jun, 2018 1 commit
  20. 15 Jun, 2018 2 commits
  21. 12 Jun, 2018 3 commits
  22. 08 Jun, 2018 2 commits