1. 18 Mar, 2016 1 commit
    • Fix for srun abort on SIGSTOP+SIGCONT · 1ed38f26
      Morris Jette authored
      Avoid possibly aborting srun that gets simultaneous SIGSTOP+SIGCONT while
          creating the job step. The result is that the signal handler gets an
          argument (the signal received) of zero.
      
      Here's a log, window 1:
      $ srun hostname
      srun: Job step creation temporarily disabled, retrying
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 18
      srun: I Got signal 0
      srun: Cancelled pending job step
      
      Window 2:
      $  kill -STOP 18696 ; kill -CONT 18696
      $  kill -STOP 18696 ; kill -CONT 18696
      $  kill -STOP 18696 ; kill -CONT 18696
      ....
      
      bug 2494
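
      A minimal sketch of the guard this fix implies, with a hypothetical
      handler name; an illustration, not the actual Slurm patch:

      #include <signal.h>
      #include <unistd.h>

      /* Hypothetical handler: under a simultaneous SIGSTOP+SIGCONT the
       * handler can be entered with a signal number of zero, so treat
       * that as a spurious wakeup instead of cancelling the step. */
      static void _handle_signal(int signo)
      {
              static const char msg[] = "srun: I Got signal\n";

              if (signo == 0)         /* STOP+CONT race, not a real signal */
                      return;
              /* write(2) is async-signal-safe, unlike printf(3) */
              (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
      }

      int main(void)
      {
              signal(SIGCONT, _handle_signal);
              pause();                /* wait here for a signal to arrive */
              return 0;
      }
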
  2. 17 Mar, 2016 2 commits
    • Change calculation of node's allocated CPUs · ec50cb2f
      Morris Jette authored
      Change how a node's allocated CPU count is calculated to avoid double
          counting CPUs allocated to multiple jobs at the same time.
          Previous logic would sum the maximum number of CPUs allocated by each
          partition for any time slice, which could double count CPUs allocated
          to multiple jobs. New logic ORs bitmap of allocated CPUs for every
          partition and time slice, then counts the total for a given node.
          This avoids double counting CPUs allocated to multiple jobs, but
          does not remove from the count CPUs which have been allocated to
          jobs which might be suspended by the gang scheduler (either for
          time slicing or preemption).
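
      The difference between the two counting schemes can be shown with
      plain bitmasks (hypothetical values standing in for the per-partition
      allocated-CPU bitmaps):

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              /* Allocated-CPU masks for three partitions on one 8-CPU
               * node; CPUs 0-2 appear in more than one partition. */
              uint64_t part_alloc[] = { 0x03, 0x07, 0x0c };
              uint64_t combined = 0;
              int old_count = 0;

              for (int i = 0; i < 3; i++) {
                      old_count += __builtin_popcountll(part_alloc[i]); /* old: sum */
                      combined |= part_alloc[i];                        /* new: OR  */
              }
              /* the sum reports 7 CPUs, the OR'd bitmap reports 4 */
              printf("summed: %d, OR'd: %d\n",
                     old_count, __builtin_popcountll(combined));
              return 0;
      }
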
    • Prevent uid update from corrupting assoc_hash table. · 60b58b70
      Tim Wickberg authored
      The uid is used as part of the hash function, so the old reference must
      be removed and the hash recalculated if the uid may change; otherwise
      _delete_assoc_hash will not find the entry when the association is
      removed, causing slurmctld to segfault.
      
      Bug 2560.
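
      A sketch of the delete-mutate-reinsert pattern the fix implies;
      _delete_assoc_hash is named above, while the struct and
      _add_assoc_hash are illustrative assumptions:

      #include <sys/types.h>

      typedef struct assoc {
              uid_t uid;              /* part of the hash key */
              /* ... other association fields ... */
      } assoc_t;

      void _delete_assoc_hash(assoc_t *assoc);  /* removes by current uid */
      void _add_assoc_hash(assoc_t *assoc);     /* inserts by current uid */

      /* Since the uid feeds the hash function, the entry must be pulled
       * out of the table before the uid changes and re-inserted after;
       * otherwise a later delete hashes to the wrong bucket. */
      static void _update_assoc_uid(assoc_t *assoc, uid_t new_uid)
      {
              if (assoc->uid == new_uid)
                      return;
              _delete_assoc_hash(assoc);
              assoc->uid = new_uid;
              _add_assoc_hash(assoc);
      }
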
  3. 16 Mar, 2016 8 commits
  4. 15 Mar, 2016 2 commits
  5. 14 Mar, 2016 2 commits
  6. 12 Mar, 2016 1 commit
  7. 11 Mar, 2016 3 commits
  8. 10 Mar, 2016 5 commits
    • cray job requeue bug · 536c8451
      Morris Jette authored
      Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
      allocated to a requeued job as non-usable on job termination.
      
      Specifically, each job has a "cleaning/cleaned" flag. Once a job
      terminates, the cleaning flag is set, then after the job node health
      check completes, the value gets set to cleaned. If the job is requeued,
      on its second (or subsequent) termination, the select/cray plugin
      is called to launch the NHC. The plugin sees the "cleaned" flag
      already set, so it logs:
      error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
      and returns without launching the NHC. Since completion of the job's
      NHC is what triggers the release of job resources (CPUs, memory, and
      GRES), those resources are never released for use by other jobs.
      
      Bug 2384
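
      One plausible shape of the state handling, with hypothetical names;
      the real logic lives in the select/cray plugin:

      /* Hypothetical cleaning/cleaned state for a job record. */
      enum clean_state { CLEAN_NONE, CLEANING, CLEANED };

      struct job_record {
              int job_id;
              enum clean_state clean;
      };

      void _spawn_nhc(struct job_record *job);   /* assumed helper */

      static void _job_fini(struct job_record *job)
      {
              /* A requeued job terminates more than once, so a prior
               * CLEANED state must not block a fresh NHC run. */
              if (job->clean == CLEANED)
                      job->clean = CLEAN_NONE;
              if (job->clean == CLEANING)
                      return;                 /* NHC already in flight */
              job->clean = CLEANING;
              _spawn_nhc(job);                /* sets CLEANED when done */
      }
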
    • Correctly parse nids in slurmconfgen_smw.py · e050806e
      David Gloe authored
      An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
      On some systems those values differ, causing the generated slurm.conf file to
      be incorrect.
      
      Bug 2532.
    • Fix route/topology plugin to prevent segfault in sbcast. · 0dfc924c
      Bill Brophy authored
      route_p_split_hostlist was not thread-safe, and would cause
      one of several segfaults depending on where in the initialization
      code each thread was.
      
      Bug 2495.
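
      A common way to make one-time initialization like this thread-safe
      is pthread_once; a sketch under assumed names:

      #include <pthread.h>

      static pthread_once_t init_once = PTHREAD_ONCE_INIT;

      static void _route_init(void)
      {
              /* build topology tables, etc. (illustrative) */
      }

      /* Every caller funnels through pthread_once, so no thread can
       * observe the plugin's state half-initialized. */
      int route_split_hostlist_safe(void)
      {
              pthread_once(&init_once, _route_init);
              /* ... split the hostlist using the initialized state ... */
              return 0;
      }
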
    • Fix displayed value for RoutePlugin. · db8491f1
      Tim Wickberg authored
      Was incorrectly displaying "(null)" even when loaded successfully.
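
      For context, glibc's printf prints "(null)" when %s is given a NULL
      pointer (strictly undefined behavior), so a never-initialized field
      displays like this:

      #include <stdio.h>

      int main(void)
      {
              const char *route_plugin = NULL;    /* value never filled in */
              /* glibc prints "RoutePlugin = (null)" here */
              printf("RoutePlugin = %s\n", route_plugin);
              return 0;
      }
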
    • Add NEWS for commit 3bb2e602 · a0be0dc5
      Morris Jette authored
  9. 09 Mar, 2016 2 commits
    • cray job requeue bug · fec5e03b
      Morris Jette authored
      Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
      allocated to a requeued job as non-usable on job termination.
      
      Specifically, each job has a "cleaning/cleaned" flag. Once a job
      terminates, the cleaning flag is set, then after the job node health
      check completes, the value gets set to cleaned. If the job is requeued,
      on its second (or subsequent) termination, the select/cray plugin
      is called to launch the NHC. The plugin sees the "cleaned" flag
      already set, so it logs:
      error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
      and returns without launching the NHC. Since completion of the job's
      NHC is what triggers the release of job resources (CPUs, memory, and
      GRES), those resources are never released for use by other jobs.
      
      Bug 2384
    • Correctly parse nids in slurmconfgen_smw.py · 88ccc111
      David Gloe authored
      An error in slurmconfgen_smw.py caused it to parse the nic as the nid.
      On some systems those values differ, causing the generated slurm.conf file to
      be incorrect.
      
      Bug 2532.
  10. 08 Mar, 2016 2 commits
  11. 07 Mar, 2016 1 commit
  12. 05 Mar, 2016 2 commits
  13. 04 Mar, 2016 3 commits
  14. 03 Mar, 2016 5 commits
    • Defer slurmd registration until NodeHealthCheck · 7fb0c981
      Thomas Hamel authored
      We want to introduce a new behavior in the way slurmd uses the
      HealthCheckProgram. The idea is to avoid a race condition between the
      first HealthCheckProgram run and the node accepting jobs. The slurmd
      daemon will initialize and then loop on HealthCheckProgram execution
      before registering with slurmctld. It will stay in this loop until
      the HealthCheckProgram returns successfully (while in the loop, the
      node remains DOWN).
      
      On our clusters we use NHC as the HealthCheckProgram. NHC drains the
      node if the check fails and removes the drain if it succeeds; this
      behavior fits our purpose well. It permits us to start slurmd at boot
      without setting up a complex boot sequence in the init system: slurmd
      simply waits for the node to be ready before registering.
      
      The HealthCheckProgram is not run during slurmd startup if
      HealthCheckInterval is 0.
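
      The described startup sequence reduces to a retry loop; a sketch
      with assumed helper names:

      #include <unistd.h>

      int run_health_check_program(void);     /* 0 on success (assumed) */
      void register_with_slurmctld(void);     /* assumed */

      /* Keep the node out of service (DOWN) until the first successful
       * HealthCheckProgram run; skip the loop entirely if the interval
       * is 0, matching the note above. */
      static void _wait_until_healthy(int health_check_interval)
      {
              if (health_check_interval == 0)
                      return;
              while (run_health_check_program() != 0)
                      sleep(health_check_interval);
      }

      /* startup order: slurmd init, then
       * _wait_until_healthy(interval); register_with_slurmctld(); */
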
    • Danny Auble · 72f13426
    • Brian Christiansen · 5c43d754
    • Increase step GRES variable size · 7f0bdc84
      Morris Jette authored
      Step GRES value changed from type "int" to "int64_t" to support larger
      values. Previous logic could fail for step allocation values over 32
      bits. Other GRES values are already 64-bit.
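
      The overflow the change guards against is easy to demonstrate
      (illustrative values; converting an out-of-range value to int32_t is
      implementation-defined):

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              /* a byte-counted GRES of 8 GiB is 2^33, which does not
               * fit in 32 bits */
              int64_t requested = 8LL * 1024 * 1024 * 1024;
              int32_t truncated = (int32_t) requested;  /* loses high bits */

              printf("int64_t: %lld  int32_t: %d\n",
                     (long long) requested, truncated);
              return 0;
      }
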
    • Force close on exec on first 256 file descriptors when launching a
      slurmstepd to close potential open ones. · f502f1e5
      Danny Auble authored
      
      It was pointed out that slurmd, when using acct_gather_energy/ipmi,
      links to freeipmi, which could open /dev/ipmi0 as root without the
      close-on-exec flag set while launching a step, leaving it open in the
      user's application.
      
      This sets the flag on the first 256 file descriptors to mitigate the
      concern.
      
      Reported by Maksym Planeta.
      
      Bug 2506
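
      A sketch of the mitigation: flag descriptors close-on-exec before
      the exec. The real code presumably handles the stdio descriptors
      separately; this illustration starts at 3:

      #include <fcntl.h>

      /* Mark fds 3..255 close-on-exec so anything a linked library left
       * open (e.g. /dev/ipmi0) does not leak into the exec'd step. */
      static void _set_cloexec_first_256(void)
      {
              for (int fd = 3; fd < 256; fd++) {
                      int flags = fcntl(fd, F_GETFD);

                      if (flags >= 0)
                              (void) fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
              }
      }
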
  15. 02 Mar, 2016 1 commit