1. 20 Mar, 2013 2 commits
  2. 19 Mar, 2013 8 commits
    • Don Lipari's avatar
    • Morris Jette's avatar
    • Hongjia Cao's avatar
      change select() to poll() when waiting for a socket to be readable · 3175cf91
      Hongjia Cao authored
      select()/FD_ISSET() does not work for file descriptors larger than 1023 (see the poll() sketch below).
      3175cf91
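      For context, a minimal sketch (not the actual patch) of waiting for a
      socket to become readable with poll(), which, unlike select()/FD_ISSET(),
      has no FD_SETSIZE (1023) limit; the function name and signature here are
      illustrative assumptions:

          #include <poll.h>

          /* Wait up to timeout_ms for fd to become readable.
           * Returns 1 if readable (or hung up / in error, so a read will
           * report it), 0 on timeout, -1 on poll() error (check errno). */
          static int wait_fd_readable(int fd, int timeout_ms)
          {
                  struct pollfd pfd = { .fd = fd, .events = POLLIN };
                  int rc = poll(&pfd, 1, timeout_ms);

                  if (rc > 0)
                          return (pfd.revents & (POLLIN | POLLERR | POLLHUP)) ? 1 : 0;
                  return rc;
          }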
    • Morris Jette's avatar
      Note nature of latest change · 8e038b5c
      Morris Jette authored
      8e038b5c
    • Hongjia Cao's avatar
      fix for idle nodes that cannot be allocated · 4ea9850a
      Hongjia Cao authored
      Avoid adding/removing a job's node resources if the node was lost by a resize.
      
      I found another case in which an idle node cannot be allocated. It can be
      reproduced as follows:
      
      1. run a job with the -k option:
      
          [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
          srun: error: Node failure on cn28
          srun: error: Node failure on cn28
          srun: error: cn28: task 10: Killed
          ^Csrun: interrupt (one more within 1 sec to abort)
          srun: tasks 0-9: running
          srun: task 10: exited abnormally
          ^Csrun: sending Ctrl-C to job 106120.0
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      
      2. set a node down and then set it idle:
      
          [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
          [root@mn0 ~]# scontrol update nodename=cn28 state=idle
      
      3. restart slurmctld
      
          [root@mn0 ~]# service slurm restart
          stopping slurmctld:                                        [  OK  ]
          slurmctld is stopped
          starting slurmctld:                                        [  OK  ]
      
      4. cancel the job
      
      Then, the node that was set down is left unavailable:
      
          [root@mn0 ~]# sinfo -n cn[18-28]
          PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
          work*        up   infinite     11   idle cn[18-28]
      
          [root@mn0 ~]# srun -w cn[18-28] hostname
          srun: job 106122 queued and waiting for resources
      
          [root@mn0 slurm]# grep cn28 slurmctld.log
          [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
          [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
      
      I made an attempt to fix this with the attached patch. Please review it.
      4ea9850a
    • Morris Jette's avatar
      Correction in logic issuing call to account for change in job time limit · 9f5a7a0e
      Morris Jette authored
      I don't believe save_time_limit was redundant.  At least in this case:
      
      if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
          if (orig_time_limit == NO_VAL)
              orig_time_limit = comp_time_limit;
          job_ptr->time_limit = orig_time_limit;
      [...]
      
      So later, when updating the db,
      
          if (save_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn,
                              job_ptr);
      will cause the db to be updated, while
      
          if (orig_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn,
                              job_ptr);
      
      will not, because job_ptr->time_limit now equals orig_time_limit.
      9f5a7a0e
    • Morris Jette's avatar
    • Don Lipari's avatar
      Record updated job time limit if modified by backfill · 46348f91
      Don Lipari authored
      Without this change, if the job's time limit is modified down
      toward --time-min by the backfill scheduler, the job's new time
      limit is not recorded in the database.
      46348f91
  3. 14 Mar, 2013 4 commits
  4. 13 Mar, 2013 5 commits
  5. 12 Mar, 2013 2 commits
    • Morris Jette's avatar
      Minor format changes from previous commit · f5a89755
      Morris Jette authored
      f5a89755
    • Magnus Jonsson's avatar
      Fix scheduling if node in more than one partition · fcef06b4
      Magnus Jonsson authored
      I found a bug in cons_res/select_p_select_nodeinfo_set_all.
      
      If a node is part of two (or more) partitions, the code only counts the number of cores/CPUs in the partition that has the most running jobs on that node.
      
      Patch attached to fix the problem.
      
      I also added a new function to bitstring to count the number of set bits in a range (bit_set_count_range) and made a minor improvement to bit_set_count while reviewing the range version (a rough sketch of the range-count idea follows below).
      
      Best regards,
      Magnus
      fcef06b4
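      As an aside, a minimal sketch of the idea behind counting the set bits in
      a range; this is illustrative only and does not use Slurm's bitstr_t type
      or the real bit_set_count_range() implementation:

          #include <stddef.h>

          /* Count the bits set in the word array "bits" over the index
           * range [start, end).  A real implementation would also scan
           * whole words at a time instead of going bit by bit. */
          static int popcount_range(const unsigned long *bits,
                                    size_t start, size_t end)
          {
                  const size_t bpw = 8 * sizeof(unsigned long);
                  int count = 0;
                  size_t i;

                  for (i = start; i < end; i++) {
                          if (bits[i / bpw] & (1UL << (i % bpw)))
                                  count++;
                  }
                  return count;
          }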
  6. 11 Mar, 2013 8 commits
  7. 08 Mar, 2013 7 commits
  8. 07 Mar, 2013 1 commit
    • jette's avatar
      GRES topology bug in core selection logic fixed. · 07eb5d24
      jette authored
      This problem would affect systems in which specific GRES are associated
      with specific CPUs.
      One possible result is that the CPUs identified as usable could be
      inappropriate, and the job would be held when trying to lay out the tasks
      on CPUs (all done as part of the job allocation process).
      The other problem is that if multiple GRES are linked to specific CPUs,
      there was a CPU bitmap OR which should have been an AND, resulting in
      some CPUs being identified as usable but not available to all GRES (see
      the sketch below).
      07eb5d24
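      To illustrate the OR-versus-AND point, a minimal sketch (not the actual
      fix; the function and mask representation are assumptions): when several
      GRES are each bound to specific CPUs, the CPUs usable by the job must be
      available to every GRES, so the per-GRES CPU masks are intersected:

          #include <stdint.h>

          /* Combine per-GRES CPU masks into the set of CPUs usable by the
           * job.  AND keeps only CPUs available to every GRES; the bug was
           * effectively an OR, which also kept CPUs usable by only some GRES. */
          static uint64_t usable_cpus(const uint64_t *gres_cpu_mask, int gres_cnt)
          {
                  uint64_t cpus = ~(uint64_t)0;   /* start with all CPUs */
                  int i;

                  for (i = 0; i < gres_cnt; i++)
                          cpus &= gres_cpu_mask[i];
                  return cpus;
          }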
  9. 06 Mar, 2013 3 commits