1. 23 Jan, 2013 1 commit
    • In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
      jette authored
      I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be
      allocated to jobs. This can be reproduced as follows:
      
      First, submit a job with --no-kill option (I have SLURM_EXCLUSIVE set to
      allocate nodes exclusively by default). Then set one of the nodes
      allocated to the job (cn2) to state DOWN:
      
      srun: error: Node failure on cn2
      srun: error: Node failure on cn2
      srun: error: cn2: task 0: Killed
      ^Csrun: interrupt (one more within 1 sec to abort)
      srun: task 1: running
      srun: task 0: exited abnormally
      ^Csrun: sending Ctrl-C to job 22605.0
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: Force Terminated job step 22605.0
      
      Then change the state of the node back to IDLE. It still cannot be
      allocated to jobs:
      
      srun: job 22606 queued and waiting for resources
      
        JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
        22606      work hostname     root  PD       0:00      1 (Resources)
        22604      work   sbatch     root   R       3:06      1 cn1
      
      NodeName=cn2 Arch=x86_64 CoresPerSocket=8
         CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
         Gres=(null)
         NodeAddr=cn2 NodeHostName=cn2
         OS=Linux RealMemory=30000 Sockets=2 Boards=1
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
         BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
         CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      
      I traced and located the problem in select/cons_res. The call sequence
      is:
      
      slurmctld/node_mgr.c: update_node() =>
      slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
      excise_node_from_job() =>
      plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
      _rm_job_from_one_node() => _build_row_bitmaps() =>
      common/job_resources: remove_job_from_cores()
      
      If there are other jobs running in the partition, the partition row
      bitmap will not be set correctly. In the example above, before
      _build_row_bitmaps(), output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63
      
      After setting the node down, the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47
      
      Cores of cn2 are not marked as available. Instead, cores of other nodes
      are released. When another job requires the node cn2, the following log
      message appears:
      
      [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
      
      I do not understand the design of select/cons_res well and I do not know
      how to fix this. But it seems that _build_row_bitmaps() should not be
      called, since the job is not removed totally, but only one of the nodes
      released.
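      The expected behaviour described above, where only the cores of the
      removed node are released, can be sketched as follows. This is an
      illustration and not the select/cons_res code: the helper names, the
      single 64-bit bitmap, and the fixed number of cores per node are all
      assumptions made for the example. The report does not say which node
      index cn2 occupies, so the index used in the note below is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not SLURM code: one bit per core, with nodes laid
 * out back to back and a fixed core count per node (64 cores at most). */
uint64_t node_core_mask(int node_idx, int cores_per_node)
{
	assert(cores_per_node > 0 && cores_per_node < 64);
	assert(node_idx >= 0 && (node_idx + 1) * cores_per_node <= 64);
	uint64_t width = (UINT64_C(1) << cores_per_node) - 1;
	return width << (node_idx * cores_per_node);
}

/* Release only the cores of the node that left the job and leave the
 * bits of every other node untouched. */
uint64_t release_node_cores(uint64_t row_bitmap, int node_idx,
			    int cores_per_node)
{
	return row_bitmap & ~node_core_mask(node_idx, cores_per_node);
}
```

      With 16 cores per node, releasing node index 2 from a row bitmap of
      16,32-63 leaves 16,48-63: the cores of that one node are cleared and
      nothing else changes.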
  2. 22 Jan, 2013 1 commit
  3. 18 Jan, 2013 3 commits
    • Fix topology/tree logic when nodes defined in slurm.conf get re-ordered · 29df4c83
      Morris Jette authored
      From Chris Holmes, HP:
      After several days of brainstorming and debugging, I have identified
      a bug in SLURM 2.5.0rc2, related to the 'tree' topology. It was so
      early in the execution of the whole SLURM machinery that it took me
      some time to figure it out (say, 100 or 200 jobs showing the issue,
      with more or less debugging levels increased and extra
      instrumentation, with sometimes an uncertain reliability)...
      
      For every “switch” a bitmap of nodes (seen down by the switch) is
      built as the topology is discovered through 'topology.conf'.
      
      There is code in read_config.c, executed when the SLURM control
      daemon starts, that reorders the nodes (according to their hostname
      by default), while the switches table (i.e. the bitmaps) has already
      been built. Reordering the nodes therefore makes the switch bitmaps wrong.
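
      To see why reordering invalidates the bitmaps, note that each bit
      identifies a node only by its position in the node table. The sketch
      below remaps a bitmap through the permutation applied to the table. It
      is an illustration and not the fix that was committed: the function
      name, the 64-bit bitmap, and the new_index array are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not SLURM code. new_index[i] is the position that
 * the node formerly at index i occupies after the node table is sorted. */
uint64_t remap_switch_bitmap(uint64_t old_bitmap, const int *new_index,
			     int node_cnt)
{
	uint64_t new_bitmap = 0;

	assert(node_cnt >= 0 && node_cnt <= 64);
	for (int i = 0; i < node_cnt; i++) {
		assert(new_index[i] >= 0 && new_index[i] < node_cnt);
		/* Carry each set bit over to the node's new position */
		if (old_bitmap & (UINT64_C(1) << i))
			new_bitmap |= UINT64_C(1) << new_index[i];
	}
	return new_bitmap;
}
```

      For example, if the table was read as cn3, cn1, cn2 and then sorted to
      cn1, cn2, cn3, a switch bitmap with bits 0 and 1 set (cn3 and cn1) must
      become bits 2 and 0. Left unchanged, it would wrongly name cn1 and cn2.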
    • Make more variables available to job_submit/lua plugin · 28740196
      Morris Jette authored
      slurm.MEM_PER_CPU, slurm.NO_VAL, etc.
    • Permit job with invalid QOS to run if QOS set by administrator · 7aef4f80
      Phil Eckert authored
      About a year ago I submitted a modification that you incorporated
      into SLURM 2.4, which was to allow an admin to modify a job to use
      a QOS even though the user did not have access to the QOS.
      
      However, I must have tested it without having the Accounting set
      to enforce QOS's. So, if an admin modifies a job to use a QOS the user
      doesn't have access to, it will be modified, but the job will end up
      in a state of InvalidQOS, which is reasonable, since this would
      handle the case where a user has their QOS removed. A problem,
      however, is that even though the scheduler won't schedule the job,
      backfill still will.
      
      One approach would be to fix backfill to be consistent with
      the scheduler (which should probably occur regardless), but
      my thought would be to modify the scheduler to allow the QOS
      as long as it was set by an admin, since that was the intent
      of the modification to begin with.
      
      I believe it would only take a single line to change, just
      adding a check on the job_ptr->limit_set_qos, to make sure
      it was set by an admin:
      
                      if (job_ptr->qos_id) {
                              slurmdb_association_rec_t *assoc_ptr;
                              assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr;
                              if (assoc_ptr &&
                                  !bit_test(assoc_ptr->usage->valid_qos,
                                            job_ptr->qos_id) &&
                                  !job_ptr->limit_set_qos) {
                                      info("sched: JobId=%u has invalid QOS",
                                              job_ptr->job_id);
                                      xfree(job_ptr->state_desc);
                                      job_ptr->state_reason = FAIL_QOS;
                                      continue;
                              } else if (job_ptr->state_reason == FAIL_QOS) {
                                      xfree(job_ptr->state_desc);
                                      job_ptr->state_reason = WAIT_NO_REASON;
                              }
                      }
      
      Phil
  4. 17 Jan, 2013 1 commit
    • Add support for configurable keep alive time for srun/slurmstep communications · 5752c6ce
      Morris Jette authored
      Added "KeepAliveTime" configuration parameter
      
      From Matthieu Hautreux:
      TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, potentially leaving the system deadlocked if a remote host
      disappears while the local host is waiting on a read (a write would result in
      EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a
      relatively large timeout value (5 minutes) could enhance the resilience of
      SLURM to unexpected packet/connection loss without too much impact on the
      scalability of the solution. The timeout could be configurable in case it is
      found too aggressive for particular configurations.
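
      Enabling keepalive on a connected TCP socket might look roughly like
      the sketch below. It is not taken from the SLURM sources:
      enable_keepalive() is a hypothetical helper, and TCP_KEEPIDLE is a
      Linux-specific option, so other platforms would need a different call.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch only, not SLURM code. Turns on keepalive probing for fd and sets
 * the idle time, in seconds, before the first probe is sent.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int fd, int idle_secs)
{
	int on = 1;

	assert(idle_secs > 0);
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	/* Linux-specific: idle time before keepalive probes start */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
		       sizeof(idle_secs)) < 0)
		return -1;
	return 0;
}
```

      Calling enable_keepalive(fd, 300) would correspond to the 5 minute
      value suggested above. A blocked read on a connection whose peer has
      vanished then fails once the kernel's probes go unanswered, instead of
      waiting forever.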
  5. 16 Jan, 2013 4 commits
  6. 15 Jan, 2013 2 commits
    • QoS limits enforcement: correct a bug with 0-valued per user used limits · 4136520d
      Matthieu Hautreux authored
      QoS limits enforcement on the controller side is based on a list of used_limits
      per user. When a user is not yet added to the list, which is common when the
      controller is restarted and the user has no running jobs, the current logic is
      to not check some of the "per user limits" and let the submission succeed.
      However, if one of these limits is a zero-valued limit, the check should
      fail, as it means that no job should be submitted at all: any submission
      would necessarily cross the limit.
      
      This patch ensures that even when a user is not yet present in the per user
      used_limits list, the 0-valued limits are correctly treated.
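
      The intended behaviour can be sketched as below. This is not the code
      from the patch: the structure, the function name, and the NO_LIMIT
      sentinel are hypothetical, and a single "submitted jobs" limit stands
      in for the several per user limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_LIMIT UINT32_MAX	/* hypothetical "limit not set" sentinel */

/* Hypothetical per user usage record, loosely modeled on used_limits */
struct user_usage {
	uint32_t submit_jobs;	/* jobs the user currently has submitted */
};

/* Return true if one more job may be submitted. A NULL record means the
 * user is not yet in the list; that is treated as zero usage rather than
 * as a reason to skip the check, so a 0-valued limit still rejects. */
bool submit_allowed(const struct user_usage *used, uint32_t max_submit_jobs)
{
	uint32_t current = used ? used->submit_jobs : 0;

	if (max_submit_jobs == NO_LIMIT)
		return true;
	return current < max_submit_jobs;
}
```

      Here submit_allowed(NULL, 0) is false, so a user with no entry in the
      list is still held to a zero-valued limit, while submit_allowed(NULL, 1)
      is true.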
    • Merge priority/multifactor2 plugin into priority/multifactor · 6596f17e
      David Bigagli authored
      Add PriorityFlags value of "TICKET_BASED".
  7. 14 Jan, 2013 3 commits
  8. 11 Jan, 2013 1 commit
  9. 10 Jan, 2013 3 commits
  10. 09 Jan, 2013 2 commits
  11. 08 Jan, 2013 3 commits
  12. 03 Jan, 2013 5 commits
  13. 28 Dec, 2012 1 commit
  14. 22 Dec, 2012 1 commit
  15. 21 Dec, 2012 3 commits
    • Correct job time limit for sched/backfill when job has QOS with NO_RESERVE flag · 4652e982
      Morris Jette authored
      If sched/backfill starts a job with a QOS having NO_RESERVE and no job
      time limit, start it with the partition time limit (or one year if the
      partition has no time limit) rather than NO_VAL (140 year time limit).
      
      If a standby job, which in this case has the NO_RESERVE flag set, is
      submitted without a time limit and is backfilled, it will get an
      EndTime waaayyyy into the future.
      
      JobId=99 Name=cmdll
         UserId=eckert(1043) GroupId=eckert(1043)
         Priority=12083 Account=sa QOS=standby
         JobState=RUNNING Reason=None Dependency=(null)
         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
         RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
         SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
         StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00
      
      so I looked at the code in /src/plugins/sched/backfill:
      
                      if (job_ptr->start_time <= now) {
                              int rc = _start_job(job_ptr, resv_bitmap);
                              if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
                                      job_ptr->time_limit = orig_time_limit;
                                      job_ptr->end_time = job_ptr->start_time +
                                                          (orig_time_limit * 60);
      
      Using the debugger I found that if the job does not have a specified
      time limit, the job_ptr->time_limit is equal to NO_VAL when it hits
      this code.
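
      The rule from the commit summary (use the partition time limit, or one
      year if the partition has none, instead of NO_VAL) can be sketched as
      follows. The helper is hypothetical and not the backfill code, and the
      two constants are defined locally on the assumption that they mirror
      SLURM's "not set" and "unlimited" values.

```c
#include <assert.h>
#include <stdint.h>

#define NO_VAL   0xfffffffeU	/* assumed to mirror SLURM's "not set" */
#define INFINITE 0xffffffffU	/* assumed to mirror SLURM's "unlimited" */
#define ONE_YEAR_MINUTES (365U * 24U * 60U)

/* Hypothetical helper, not SLURM code. All limits are in minutes.
 * Picks the limit to apply when backfill starts the job. */
uint32_t backfill_time_limit(uint32_t job_limit, uint32_t part_limit)
{
	if (job_limit != NO_VAL)
		return job_limit;	/* job specified its own limit */
	if (part_limit != NO_VAL && part_limit != INFINITE)
		return part_limit;	/* fall back to the partition limit */
	return ONE_YEAR_MINUTES;	/* partition has no limit either */
}
```

      A job submitted without a time limit in a partition limited to 720
      minutes would get 720, and in a partition with no limit it would get
      525600 minutes (one year).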
    • Added "HealthCheckNodeState" configuration parameter · b139f654
      Morris Jette authored
      Identify node states on which HealthCheckProgram should be executed.
  16. 20 Dec, 2012 4 commits
  17. 19 Dec, 2012 2 commits