1. 04 Jan, 2013 3 commits
    • mpi/mvapich: Don't set MPIRUN_PROCESSES by default · fd5b0e56
      Mark A. Grondona authored
      The MPIRUN_PROCESSES variable set by the mpi/mvapich plugin is probably
      not needed by most, if not all, recent versions of mvapich.
      This environment variable also negatively affects job scalability
      since its length is proportional to the number of tasks in a job.
      In fact, for very large jobs, the increased environment size can
      lead to failures in execve(2).
      
      Since MPIRUN_PROCESSES *might* be required by some older versions of
      mvapich, this patch sets that variable only if
      SLURM_NEED_MVAPICH_MPIRUN_PROCESSES is set in the job's environment.
      (Thus, by default MPIRUN_PROCESSES is no longer set, but the old
      behavior may be restored by setting the environment variable above.)
    • jette's avatar
      b196f153
    • Fix logic in hostset_create for invalid input · 33cb1e40
      jette authored
  2. 03 Jan, 2013 16 commits
  3. 02 Jan, 2013 1 commit
    • Revert commit b2c18ec1 · ac27d503
      Morris Jette authored
      The original patch works fine to avoid cancelling a job when all
      of its nodes go unresponsive, but I don't see any way to easily
      address nodes coming back into service. We want to cancel jobs
      that have some up nodes and some down nodes, but the nodes will
      come back into service individually rather than all at once.
  4. 31 Dec, 2012 1 commit
  5. 29 Dec, 2012 3 commits
  6. 28 Dec, 2012 8 commits
  7. 27 Dec, 2012 4 commits
  8. 22 Dec, 2012 2 commits
  9. 21 Dec, 2012 2 commits
    • Correct job time limit for sched/backfill when job has QOS with NO_RESERVE flag · 4652e982
      Morris Jette authored
      If sched/backfill starts a job with a QOS having NO_RESERVE and no job
      time limit, start it with the partition time limit (or one year if the
      partition has no time limit) rather than NO_VAL (140 year time limit).
      
      If a standby job, which in this case has the NO_RESERVE flag set, is
      submitted without a time limit, and is backfilled, it will get an
      EndTime waaayyyy into the future.
      
      JobId=99 Name=cmdll
         UserId=eckert(1043) GroupId=eckert(1043)
         Priority=12083 Account=sa QOS=standby
         JobState=RUNNING Reason=None Dependency=(null)
         Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
         RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
         SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
         StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00
      
      So I looked at the code in /src/plugins/sched/backfill:
      
                      if (job_ptr->start_time <= now) {
                              int rc = _start_job(job_ptr, resv_bitmap);
                              if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
                                      job_ptr->time_limit = orig_time_limit;
                                      job_ptr->end_time = job_ptr->start_time +
                                                          (orig_time_limit * 60);
      
      Using the debugger I found that if the job does not have a specified
      time limit, the job_ptr->time_limit is equal to NO_VAL when it hits
      this code.
    • Fix unused variable on frontend system · d934fe6e
      Danny Auble authored