1. 26 Mar, 2013 3 commits
  2. 25 Mar, 2013 6 commits
  3. 24 Mar, 2013 1 commit
  4. 23 Mar, 2013 1 commit
  5. 22 Mar, 2013 2 commits
    • Add path for liblua · 9112d154
      Andy Wettstein authored
      On Red Hat 6-based distros the Lua library is named liblua-5.1.so.
      Installing the lua-devel package creates the liblua.so symlink, but
      if that package isn't installed, the lua job submit plugin will fail
      to load.
      I'm attaching a patch that adds liblua-5.1.so to the search path.
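      A minimal sketch of the fallback this patch describes, assuming a
      dlopen()-based loader; the helper name try_load_lua is hypothetical,
      and only the two library names come from the message above:

          #include <dlfcn.h>
          #include <stddef.h>
          #include <stdio.h>

          /* Try the generic name first, then the versioned name that
           * Red Hat 6 ships when lua-devel (and thus the liblua.so
           * symlink) is not installed. Link with -ldl. */
          static void *try_load_lua(void)
          {
              const char *names[] = { "liblua.so", "liblua-5.1.so" };
              for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
                  void *handle = dlopen(names[i], RTLD_NOW | RTLD_GLOBAL);
                  if (handle)
                      return handle;
              }
              fprintf(stderr, "could not load liblua: %s\n", dlerror());
              return NULL;
          }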
    • Select/cray - Modify build to enable direct use of libslurm library. · 7d4f145a
      Morris Jette authored
      These changes are required so that select/cray can load select/linear,
        which is a bit more complex than the other select plugin structures.
      Export plugin_context_create and plugin_context_destroy symbols from
        libslurm.so.
      Correct a typo in the exported hostlist_sort symbol name.
      Define some functions in select/cray to avoid undefined symbols if
        the plugin is loaded via libslurm rather than from a slurm command
        (which has all of the required symbols).
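      As background on why the exported symbols matter: a plugin opened
      with dlopen() can only resolve its undefined references against
      symbols already exported by the process or by libraries loaded with
      RTLD_GLOBAL. A generic illustration, not SLURM's actual plugin
      code; the .so paths here are hypothetical:

          #include <dlfcn.h>
          #include <stdio.h>

          int main(void)
          {
              /* Load the base library with RTLD_GLOBAL so that its
               * exported symbols (e.g. plugin_context_create in the
               * libslurm case) are visible to plugins loaded later. */
              void *base = dlopen("./libslurm.so", RTLD_NOW | RTLD_GLOBAL);
              if (!base) {
                  fprintf(stderr, "dlopen: %s\n", dlerror());
                  return 1;
              }
              /* If the needed symbols were not exported, this dlopen
               * would fail with an undefined-symbol error. */
              void *plug = dlopen("./select_cray.so", RTLD_NOW);
              if (!plug)
                  fprintf(stderr, "plugin: %s\n", dlerror());
              return plug ? 0 : 1;
          }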
  6. 21 Mar, 2013 1 commit
  7. 20 Mar, 2013 5 commits
  8. 19 Mar, 2013 8 commits
    • Don Lipari
    • Morris Jette
    • Change select() to poll() when waiting for a socket to be readable · 3175cf91
      Hongjia Cao authored
      select()/FD_ISSET() does not work for file descriptors larger than 1023.
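      A minimal sketch of the replacement pattern: select()'s fd_set is a
      fixed-size bitmap (FD_SETSIZE, normally 1024), so FD_SET() on a
      larger descriptor writes out of bounds, while poll() takes the
      descriptor value directly. The helper name is illustrative:

          #include <poll.h>

          /* Wait up to timeout_ms for fd to become readable; returns
           * 1 if readable (or hung up/errored), 0 on timeout, -1 on
           * error. Safe for any fd value, unlike select()/FD_ISSET(). */
          static int wait_readable(int fd, int timeout_ms)
          {
              struct pollfd pfd = { .fd = fd, .events = POLLIN };
              int rc = poll(&pfd, 1, timeout_ms);
              if (rc <= 0)
                  return rc;
              return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
          }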
    • Note nature of latest change · 8e038b5c
      Morris Jette authored
    • Fix idle nodes that cannot be allocated · 4ea9850a
      Hongjia Cao authored
      Avoid adding/removing a job's node resources if the node was lost by a resize.
      
      I found another case where an idle node cannot be allocated. It can
      be reproduced as follows:
      
      1. run a job with the -k option:
      
          [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
          srun: error: Node failure on cn28
          srun: error: Node failure on cn28
          srun: error: cn28: task 10: Killed
          ^Csrun: interrupt (one more within 1 sec to abort)
          srun: tasks 0-9: running
          srun: task 10: exited abnormally
          ^Csrun: sending Ctrl-C to job 106120.0
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      
      2. set a node down and then set it idle:
      
          [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
          [root@mn0 ~]# scontrol update nodename=cn28 state=idle
      
      3. restart slurmctld
      
          [root@mn0 ~]# service slurm restart
          stopping slurmctld:                                        [  OK  ]
          slurmctld is stopped
          starting slurmctld:                                        [  OK  ]
      
      4. cancel the job
      
      Then the node that was set down is left unavailable:
      
          [root@mn0 ~]# sinfo -n cn[18-28]
          PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
          work*        up   infinite     11   idle cn[18-28]
      
          [root@mn0 ~]# srun -w cn[18-28] hostname
          srun: job 106122 queued and waiting for resources
      
          [root@mn0 slurm]# grep cn28 slurmctld.log
          [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
          [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
      
      I made an attempt to fix this with the attached patch. Please review it.
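      A minimal sketch of the shape of the fix described in the first
      line of this message; all structure and function names here are
      hypothetical, not SLURM's actual code:

          #include <stdint.h>

          struct job_rec {
              uint8_t *node_bitmap;   /* bit set = job still holds node */
          };

          static int job_holds_node(const struct job_rec *job, int node_inx)
          {
              return (job->node_bitmap[node_inx / 8] >> (node_inx % 8)) & 1;
          }

          /* Skip resource accounting for a node the job no longer holds
           * (e.g. lost by a resize); adding or removing its resources
           * anyway leaves the node's allocation count wrong, so it can
           * look busy even while sinfo reports it idle. */
          static void account_node(struct job_rec *job, int node_inx,
                                   int add_resources)
          {
              if (!job_holds_node(job, node_inx))
                  return;
              (void)add_resources;  /* ... adjust CPUs/memory here ... */
          }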
    • Correction in logic issuing call to account for change in job time limit · 9f5a7a0e
      Morris Jette authored
      I don't believe save_time_limit was redundant.  At least in this case:
      
      if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
          if (orig_time_limit == NO_VAL)
              orig_time_limit = comp_time_limit;
          job_ptr->time_limit = orig_time_limit;
      [...]
      
      So later, when updating the db,
      
          if (save_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn, job_ptr);

      will cause the db to be updated, while

          if (orig_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn, job_ptr);

      will not, because job_ptr->time_limit now equals orig_time_limit.
    • Morris Jette
    • Record updated job time limit if modified by backfill · 46348f91
      Don Lipari authored
      Without this change, the job's time limit is not updated in the
      database when the backfill scheduler modifies the limit down toward
      --time-min.
  9. 14 Mar, 2013 4 commits
  10. 13 Mar, 2013 5 commits
  11. 12 Mar, 2013 2 commits
    • Minor format changes from previous commit · f5a89755
      Morris Jette authored
    • Fix scheduling if node in more than one partition · fcef06b4
      Magnus Jonsson authored
      I found a bug in cons_res/select_p_select_nodeinfo_set_all.
      
      If a node is part of two (or more) partitions, the code will only count the number of cores/CPUs in the partition that has the most running jobs on that node.
      
      Patch attached to fix the problem.
      
      I also added a new function to bitstring to count the number of bits in a range (bit_set_count_range) and made a minor improvement to bit_set_count while reviewing the range version.
      
      Best regards,
      Magnus
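      As an aside on the bit_set_count_range mentioned above, a minimal
      sketch of what such a helper does, assuming a bare word-array
      bitmap rather than SLURM's actual bitstr_t type (names and
      representation are illustrative only):

          #include <stdint.h>

          /* Count the set bits at positions [start, end). A tuned
           * version would mask the partial words at each end of the
           * range and use a word-at-a-time population count between
           * them instead of testing bit by bit. */
          static int count_bits_in_range(const uint64_t *words,
                                         int start, int end)
          {
              int count = 0;
              for (int i = start; i < end; i++) {
                  if (words[i / 64] & ((uint64_t)1 << (i % 64)))
                      count++;
              }
              return count;
          }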
  12. 11 Mar, 2013 2 commits