1. 20 Aug, 2018 21 commits
  2. 18 Aug, 2018 6 commits
  3. 17 Aug, 2018 13 commits
    • Morris Jette's avatar
      This patch is for v17.11 and fixes several problems: · 94df0b8c
      Morris Jette authored
      1. With current kernels, the CPU frequency actually applied is close
         to, but not exactly, the value set by the user. This changes the
         logic accordingly.
      2. The original logic would cause the test to hang indefinitely
         if the submitted job never ends. This adds timeout checks
         on the job wait and a 1-minute time limit on the job.
      3. Improve/simplify the parsing logic.
      
      Bug 5584
      94df0b8c
    • Brian Christiansen's avatar
      Remove drain on node when reboot nextstate used · 11220088
      Brian Christiansen authored
      Currently the only valid nextstate states are down and resume/idle, so
      the node shouldn't be in a drain state after transitioning into either
      of these states.
      
      Bug 5544
      11220088
    • Tim Wickberg's avatar
      Merge branch 'slurm-18.08' · 6e1d24fa
      Tim Wickberg authored
      6e1d24fa
    • Tim Wickberg's avatar
      Use attached threads with pthread_join(). · b6c569f3
      Tim Wickberg authored
      This also clears up a potential race around ping_thread_cnt
      as it was protected by ping_mutex in one location and shutdown_mutex
      in another.
      b6c569f3
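The pattern this commit describes can be sketched as follows: worker threads are created joinable (attached) and collected with pthread_join(), and the shared counter is only ever touched under one mutex. This is a minimal illustration, not the Slurm code itself. The names ping_mutex and ping_thread_cnt come from the commit message; N_PINGS, pings_done, ping_worker() and run_pings() are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define N_PINGS 4	/* hypothetical number of controllers to ping */

/* A single mutex guards every access to the shared counters. */
static pthread_mutex_t ping_mutex = PTHREAD_MUTEX_INITIALIZER;
static int ping_thread_cnt = 0;	/* threads currently running */
static int pings_done = 0;	/* hypothetical result counter */

static void *ping_worker(void *arg)
{
	(void) arg;
	/* ... the actual ping would happen here ... */
	pthread_mutex_lock(&ping_mutex);
	pings_done++;
	ping_thread_cnt--;
	pthread_mutex_unlock(&ping_mutex);
	return NULL;
}

/* Start the workers, wait for all of them, return the completed count. */
int run_pings(void)
{
	pthread_t tids[N_PINGS];
	int started = 0, done;

	pthread_mutex_lock(&ping_mutex);
	pings_done = 0;
	pthread_mutex_unlock(&ping_mutex);

	for (int i = 0; i < N_PINGS; i++) {
		pthread_mutex_lock(&ping_mutex);
		ping_thread_cnt++;
		pthread_mutex_unlock(&ping_mutex);
		/* NULL attributes: the thread is joinable by default. */
		if (pthread_create(&tids[i], NULL, ping_worker, NULL) != 0) {
			pthread_mutex_lock(&ping_mutex);
			ping_thread_cnt--;
			pthread_mutex_unlock(&ping_mutex);
			break;
		}
		started++;
	}

	/* Joining replaces polling a counter until it drops to zero. */
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	pthread_mutex_lock(&ping_mutex);
	done = pings_done;
	pthread_mutex_unlock(&ping_mutex);
	return done;
}
```

Because every thread is joined, the caller knows all workers have finished without having to inspect ping_thread_cnt, and using one mutex throughout avoids the mixed-lock race mentioned above.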
    • Tim Wickberg's avatar
      Move node_name_short and node_name_long into slurmctld_conf struct. · 625a786a
      Tim Wickberg authored
      - Do not look them up again in the backup controller.
      - In the backup controller, stop comparing them at all and instead
        use the backup_idx value to decide if we're ourselves.
      625a786a
    • Tim Wickberg's avatar
      5c5da0c2
    • Tim Wickberg's avatar
      Remove RESPONSE_SLURM_RC message handling. · 27f1f2ef
      Tim Wickberg authored
      This is not a valid response here - backup and primary
      must always be running the same version, so do not attempt to
      handle this here.
      27f1f2ef
    • Tim Wickberg's avatar
      Only ping higher priority controllers. · d5ebe6b9
      Tim Wickberg authored
      There's no point in pinging controllers with a lower priority
      than yourself - they'll already be pinging you. As we did nothing
      with that data, don't bother to collect it, especially as lower
      priority controllers being unavailable will delay the next pass
      through this loop.
      d5ebe6b9
    • Tim Wickberg's avatar
      Call _controller_index() in one location to set backup_inx. · dd12326c
      Tim Wickberg authored
      Reference backup_inx directly after startup, and exit much
      earlier if this host is not a valid controller. Return a
      non-zero exit code in this situation as well.
      dd12326c
    • Tim Wickberg's avatar
      390a76d2
    • Tim Wickberg's avatar
      Consolidate and harden SlurmctldHost and ControlMachine config parsing. · da33c488
      Tim Wickberg authored
      Collapse into a single function so we can appropriately warn
      if a mix of options is in use.
      
      This also avoids a confusing-looking xmalloc with the count padded by two,
      which was being used to build out space for ControlMachine if SlurmctldHost
      was not defined. This would have also masked off a series of off-by-one
      errors, and has led to attempts to connect to 0.0.0.0 instead of a segfault.
      (Some code was intentionally using this over-provisioning as a way to
      treat this as a NULL-terminated list, but this was then technically
      incorrect in cases where the old-style BackupController was set since the
      NULL would happen at the third position in the array, which is an invalid
      memory access.)
      da33c488
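One way to avoid depending on over-allocated, NULL-terminated arrays is to carry an explicit count and bounds-check every lookup. The sketch below only illustrates that idea under stated assumptions: struct host_list and the host_list_*() helpers are hypothetical and are not Slurm's configuration structures or its xmalloc API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: the array is exactly `count` entries long. */
struct host_list {
	char **hosts;
	size_t count;
};

/* Release every entry and reset the list to empty. */
void host_list_free(struct host_list *hl)
{
	for (size_t i = 0; i < hl->count; i++)
		free(hl->hosts[i]);
	free(hl->hosts);
	hl->hosts = NULL;
	hl->count = 0;
}

/* Copy n names into the list. Returns 0 on success, -1 on failure. */
int host_list_init(struct host_list *hl, const char *const *names, size_t n)
{
	hl->count = 0;
	hl->hosts = calloc(n ? n : 1, sizeof(char *));
	if (!hl->hosts)
		return -1;
	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(names[i]) + 1;
		hl->hosts[i] = malloc(len);
		if (!hl->hosts[i]) {
			host_list_free(hl);
			return -1;
		}
		memcpy(hl->hosts[i], names[i], len);
		hl->count++;
	}
	return 0;
}

/* Bounds-checked lookup: never reads past the allocated entries. */
const char *host_list_get(const struct host_list *hl, size_t i)
{
	return (i < hl->count) ? hl->hosts[i] : NULL;
}
```

With the count stored next to the array, an off-by-one index yields NULL from host_list_get() instead of silently reading padding or memory beyond the allocation.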
    • Tim Wickberg's avatar
      Refactor and combine _backup_index and _valid_controller into _controller_index(). · dcd9b0d1
      Tim Wickberg authored
      And document why these are handled the way they are here.
      dcd9b0d1
    • Tim Wickberg's avatar
      Fix incorrect order of operations. · 3d0fff75
      Tim Wickberg authored
      This results in an out-of-bounds access (if control_machine was not
      being intentionally over-alloced to avoid it), the wrong address,
      and other subtle problems.
      
      C's order of operations meant this was resolving as:
      i = (_backup_index() != -1);
      which is either 0 or 1.
      
      Through sheer luck, this still results in the correct answer for the primary
      (_backup_index() is -1, and then i = (-1 != -1) is still 0 which is correct),
      and first backup controller (_backup_index() is 1, and then i = (1 != -1) is
      still 1 which is also correct), but any further backup controllers will end
      up with the address of the first backup.
      3d0fff75
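The precedence problem described in this commit can be reproduced in isolation. In C, != binds more tightly than =, so an unparenthesized assignment stores the result of the comparison rather than the function's return value. The sketch below is an illustration only: fake_backup_index() is a hypothetical stand-in for _backup_index(), and the mapping of the primary to index 0 in fixed_assign() is an assumption based on the commit message.

```c
/* Hypothetical stand-in for _backup_index(): -1 on the primary,
 * otherwise the index of this backup controller. */
static int fake_backup_index(int idx)
{
	return idx;
}

/* Buggy form: parsed as i = (fake_backup_index(idx) != -1),
 * so i can only ever be 0 or 1. */
int buggy_assign(int idx)
{
	int i;

	i = fake_backup_index(idx) != -1;
	return i;
}

/* Intended form: parenthesize the assignment, then compare. */
int fixed_assign(int idx)
{
	int i;

	if ((i = fake_backup_index(idx)) == -1)
		i = 0;	/* assumption: the primary uses slot 0 */
	return i;
}
```

For inputs -1, 1, 2 and 3, buggy_assign() returns 0, 1, 1 and 1, while fixed_assign() returns 0, 1, 2 and 3, which matches the behavior described above: the primary and first backup happen to work, and every later backup collapses to index 1.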