1. 22 Feb, 2018 6 commits
    • Only launch a single io_timeout_thread · 738890aa
      Felip Moll authored
      Only a single io_timeout_thread should be created for each sls struct.
      
      Creating multiple, while seemingly harmless in operation, can lead to
      fatal() messages when srun shuts down by destroying mutex locks that
      are in use by threads that srun doesn't expect to still have running.
      
      Regression caused by a1185f04.
      
      Bug 4596
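The "one thread per struct" rule described in this commit can be sketched as below. This is a minimal illustration and not the actual srun code: `struct launch_state`, `start_io_timeout_thread_once()`, and the other names are hypothetical stand-ins for the sls struct and its helpers. A guard flag protected by the struct's mutex ensures a second call does not spawn another thread, and teardown joins the thread before the mutex is destroyed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-launch "sls" state; not Slurm's struct. */
struct launch_state {
	pthread_mutex_t lock;
	bool io_timeout_thread_created;	/* guard flag (assumed name) */
	pthread_t io_timeout_thread;
	int threads_started;		/* counter kept only for illustration */
};

/* Placeholder worker; a real one would watch for I/O timeouts. */
static void *io_timeout_worker(void *arg)
{
	(void) arg;
	return NULL;
}

void launch_state_init(struct launch_state *s)
{
	pthread_mutex_init(&s->lock, NULL);
	s->io_timeout_thread_created = false;
	s->threads_started = 0;
}

/* Returns true if this call created the thread, false if one already
 * existed or creation failed. */
bool start_io_timeout_thread_once(struct launch_state *s)
{
	bool created = false;

	pthread_mutex_lock(&s->lock);
	if (!s->io_timeout_thread_created &&
	    pthread_create(&s->io_timeout_thread, NULL,
			   io_timeout_worker, s) == 0) {
		s->io_timeout_thread_created = true;
		s->threads_started++;
		created = true;
	}
	pthread_mutex_unlock(&s->lock);
	return created;
}

/* Join the thread (if any) before destroying the mutex it may use. */
void launch_state_destroy(struct launch_state *s)
{
	if (s->io_timeout_thread_created)
		pthread_join(s->io_timeout_thread, NULL);
	pthread_mutex_destroy(&s->lock);
}
```

Calling `start_io_timeout_thread_once()` twice on the same struct starts exactly one thread, so shutdown only has to account for a single worker.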
    • Continuation of b564ef0a for newly created reservations. · 2adde3cb
      Morris Jette authored
      Bug 4806.
    • Preserve and fix node features on reconfig or restart · e58f5123
      Felip Moll authored
      This patch fixes features going unrecognized when a node features plugin
      is active and features are also defined for nodes in slurm.conf.
      
      It also preserves KNL node features, including active and available
      modes, when slurmctld daemons are reconfigured.
      
      Features not belonging to the node features plugin are reset to what is
      in slurm.conf when restarting or reconfiguring.
      
      Bug 4734
    • Make MAINT and OVERLAP flags order agnostic on overlap test. · b564ef0a
      Alejandro Sanchez authored
      The _resv_overlap function only checked the flags of the reservation
      being updated, not those of the other existing reservations. As a
      result, whether the overlap permitted by these flags was honored
      depended on the order in which reservations were updated.
      
      Bug 4806.
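The order dependence described in this commit can be illustrated with a small sketch. It is not the real `_resv_overlap` code: the struct, the flag values, and both function names are assumptions made for the example, and only the time window is compared (node sets and everything else are ignored). The one-sided check consults only the updated reservation's flags; the symmetric check consults both, so swapping the arguments cannot change the result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values only; these are not Slurm's definitions. */
#define FLAG_MAINT   0x1u
#define FLAG_OVERLAP 0x2u

/* Hypothetical, heavily simplified reservation record. */
struct resv {
	uint32_t flags;
	int start;	/* inclusive */
	int end;	/* exclusive */
};

static bool allows_overlap(const struct resv *r)
{
	return (r->flags & (FLAG_MAINT | FLAG_OVERLAP)) != 0;
}

static bool times_overlap(const struct resv *a, const struct resv *b)
{
	return (a->start < b->end) && (b->start < a->end);
}

/* Order-dependent variant: only the updated reservation's flags count. */
bool conflict_one_sided(const struct resv *updated,
			const struct resv *existing)
{
	return times_overlap(updated, existing) && !allows_overlap(updated);
}

/* Order-agnostic variant: either reservation's flags permit the overlap. */
bool conflict_symmetric(const struct resv *a, const struct resv *b)
{
	return times_overlap(a, b) &&
	       !allows_overlap(a) && !allows_overlap(b);
}
```

With one MAINT reservation and one plain reservation sharing a time window, the one-sided check reports a conflict only when the plain reservation is the one being updated, while the symmetric check reports no conflict in either order.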
    • Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP]. · 14596246
      Alejandro Sanchez authored
      After commit b31fa177, we do not defer slurmd node registration if
      HealthCheckProgram fails. So at slurmd startup, slurmd executes:
      
      run_script_health_check();
      _spawn_registration_engine();
      
      and does not keep spinning if NHC fails. Now, when nodes managed by
      the Power Save logic are requested to POWER_UP because a job has been
      allocated resources, NHC is executed at slurmd startup before the
      node registers.
      
      The problem arises when this NHC execution fails and the NHC program
      decides to update the node to DRAIN. Because the job was already
      allocated before this update, it will still attempt to start RUNNING,
      but it might fail since NHC detected that something is wrong.
      
      This change detects DRAIN/FAIL node update requests, then checks
      whether the node is ALLOC/MIXED and POWER_[SAVE|UP]. If so, it forces
      a requeue so that the job doesn't start on a failed node.
      
      Bug 4689.
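The requeue condition described in this commit can be written as a single predicate. The sketch below is only an illustration of that condition: the enums, flag values, and `should_requeue_jobs()` are hypothetical and do not correspond to Slurm's actual node state encoding or slurmctld functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified node model; not Slurm's state encoding. */
enum base_state { ST_IDLE, ST_ALLOC, ST_MIXED, ST_DOWN };
enum update_req { REQ_NONE, REQ_DRAIN, REQ_FAIL, REQ_RESUME };

#define F_POWER_SAVE 0x1u	/* illustrative flag values */
#define F_POWER_UP   0x2u

/* True when jobs on the node should be requeued: the update asks for
 * DRAIN or FAIL, the node already has jobs allocated (ALLOC or MIXED),
 * and the node is flagged POWER_SAVE or POWER_UP. */
bool should_requeue_jobs(enum update_req req, enum base_state base,
			 uint32_t power_flags)
{
	bool drain_or_fail = (req == REQ_DRAIN) || (req == REQ_FAIL);
	bool has_jobs = (base == ST_ALLOC) || (base == ST_MIXED);
	bool power_managed =
		(power_flags & (F_POWER_SAVE | F_POWER_UP)) != 0;

	return drain_or_fail && has_jobs && power_managed;
}
```

All three conditions must hold; a DRAIN request on an allocated node that carries neither power flag leaves its jobs in place in this sketch.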
    • Move a warning to debug() from error() on PSS stat collection error. · 10c90b25
      Felip Moll authored
      The error() can frequently throw scary-sounding messages for
      short-lived processes that disappear while the stats are collected.
      
      Bug 4759.
  2. 21 Feb, 2018 15 commits
  3. 20 Feb, 2018 7 commits
  4. 16 Feb, 2018 3 commits
  5. 15 Feb, 2018 3 commits
  6. 14 Feb, 2018 2 commits
  7. 13 Feb, 2018 4 commits