1. 17 Jan, 2013 1 commit
    • Add support for configurable keep alive time for srun/slurmstep communications · 5752c6ce
      Morris Jette authored
      Added "KeepAliveTime" configuration parameter
      
      From Matthieu Hautreux:
      TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, which can leave the system deadlocked if a remote host
      disappears while the local host is waiting on a read (a write would instead
      fail with EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive
      with a relatively large timeout value (5 minutes) could improve the resilience
      of SLURM to unexpected packet/connection loss without much impact on the
      scalability of the solution. The timeout could be made configurable in case it
      is found too aggressive for particular configurations.
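The keepalive idea described in this commit message can be sketched with the standard socket options. This is a minimal illustration, not the code from the commit: `set_keepalive()` is a hypothetical helper name, and the `TCP_KEEPIDLE` option is assumed to be available only on platforms that define it (Linux does).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not a SLURM function): enable TCP keepalive on fd
 * and, where the platform supports it, send the first probe after
 * idle_secs seconds without traffic. Returns 0 on success, -1 on error.
 */
int set_keepalive(int fd, int idle_secs)
{
	int on = 1;

	/* Ask the kernel to probe idle connections at all. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
#ifdef TCP_KEEPIDLE
	/* Idle time before the first probe; platform-specific option. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
		       &idle_secs, sizeof(idle_secs)) < 0)
		return -1;
#endif
	return 0;
}
```

With keepalive enabled, a blocked read on a connection whose peer has vanished eventually fails instead of waiting forever. Calling `set_keepalive(fd, 300)` would correspond to the 5 minute value suggested in the message.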
  2. 16 Jan, 2013 4 commits
  3. 15 Jan, 2013 2 commits
    • QoS limits enforcement: correct a bug with 0-valued per user used limits · 4136520d
      Matthieu Hautreux authored
      QoS limits enforcement on the controller side is based on a list of used_limits
      per user. When a user is not yet added to the list, which is common when the
      controller is restarted and the user has no running jobs, the current logic is
      to not check some of the "per user limits" and let the submission succeed.
      However, if one of these limits is a zero-valued limit, the check should
      fail, because no job can be submitted at all without necessarily
      crossing the limit.
      
      This patch ensures that even when a user is not yet present in the per user
      used_limits list, the 0-valued limits are correctly treated.
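The rule described in this commit message can be illustrated with a small sketch. It is an assumption-laden simplification, not the controller code: `struct used_limits`, `LIMIT_UNSET`, and `submit_allowed()` are hypothetical names, and only a single per-user submit limit is modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker meaning "no limit configured". */
#define LIMIT_UNSET UINT32_MAX

/* Hypothetical per-user usage record; NULL means the user is not yet listed. */
struct used_limits {
	uint32_t submit_jobs;
};

/*
 * Return true if one more job may be submitted. A user without a usage
 * record is treated as having zero usage instead of skipping the check,
 * so a limit of 0 rejects the submission.
 */
bool submit_allowed(uint32_t max_submit_jobs, const struct used_limits *used)
{
	uint32_t in_use;

	if (max_submit_jobs == LIMIT_UNSET)
		return true;
	in_use = used ? used->submit_jobs : 0;
	return in_use < max_submit_jobs;
}
```

The key point is the `used ? ... : 0` fallback: the missing record no longer bypasses the comparison, so `submit_allowed(0, NULL)` is false while a non-zero limit still lets a new user submit.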
    • Merge priority/multifactor2 plugin into priority/multifactor · 6596f17e
      David Bigagli authored
      Add PriorityFlags value of "TICKET_BASED".
  4. 14 Jan, 2013 3 commits
  5. 11 Jan, 2013 1 commit
  6. 10 Jan, 2013 3 commits
  7. 09 Jan, 2013 2 commits
  8. 08 Jan, 2013 3 commits
  9. 03 Jan, 2013 5 commits
  10. 28 Dec, 2012 1 commit
  11. 22 Dec, 2012 1 commit
  12. 21 Dec, 2012 3 commits
    • Correct job time limit for sched/backfill when job has QOS with NO_RESERVE flag · 4652e982
      Morris Jette authored
      If sched/backfill starts a job with a QOS having NO_RESERVE and no job
      time limit, start it with the partition time limit (or one year if the
      partition has no time limit) rather than NO_VAL (140 year time limit).
      
      If a standby job, which in this case has the NO_RESERVE flag set, is
      submitted without a time limit and is backfilled, it will get an EndTime
      far into the future.
      
      JobId=99 Name=cmdll
         UserId=eckert(1043) GroupId=eckert(1043)
         Priority=12083 Account=sa QOS=standby
         JobState=RUNNING Reason=None Dependency=(null)
         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
         RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
         SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
         StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00
      
      So I looked at the code in /src/plugins/sched/backfill:
      
                      if (job_ptr->start_time <= now) {
                              int rc = _start_job(job_ptr, resv_bitmap);
                              if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
                                      job_ptr->time_limit = orig_time_limit;
                                      job_ptr->end_time = job_ptr->start_time +
                                                          (orig_time_limit * 60);
      
      Using the debugger, I found that if the job does not have a specified
      time limit, job_ptr->time_limit is equal to NO_VAL when it reaches
      this code.
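The fallback described in this commit message (job limit, else partition limit, else one year) can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `NO_LIMIT` is a hypothetical stand-in for the "unset" sentinel, and `effective_time_limit()` and `compute_end_time()` are hypothetical helper names. Limits are in minutes, matching the `orig_time_limit * 60` conversion in the quoted code.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for an "unset" sentinel such as NO_VAL. */
#define NO_LIMIT UINT32_MAX
/* One year expressed in minutes. */
#define ONE_YEAR_MINUTES (365u * 24u * 60u)

/*
 * Pick the time limit (in minutes) to apply when the job is started:
 * the job's own limit if set, else the partition's, else one year.
 */
uint32_t effective_time_limit(uint32_t job_limit, uint32_t part_limit)
{
	if (job_limit != NO_LIMIT)
		return job_limit;
	if (part_limit != NO_LIMIT)
		return part_limit;
	return ONE_YEAR_MINUTES;
}

/* Convert a start time and a limit in minutes to an end time. */
time_t compute_end_time(time_t start, uint32_t limit_minutes)
{
	return start + (time_t)limit_minutes * 60;
}
```

Resolving the sentinel before computing the end time keeps an unset limit from being multiplied by 60 and added to the start time, which is what produced the EndTime in the year 2149 shown above.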
    • Added "HealthCheckNodeState" configuration parameter · b139f654
      Morris Jette authored
      Identify node states on which HealthCheckProgram should be executed.
  13. 20 Dec, 2012 4 commits
  14. 19 Dec, 2012 5 commits
  15. 18 Dec, 2012 1 commit
  16. 17 Dec, 2012 1 commit