- 26 Mar, 2013 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
slurm.schedmd.com
-
Danny Auble authored
-
Danny Auble authored
a reservation when it has the "Ignore_Jobs" flag set. Since jobs could run on the reservation's nodes outside of the reservation, without this change the time could be counted twice.
-
- 25 Mar, 2013 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
This is not applicable with launch/aprun
-
Morris Jette authored
-
Hongjia Cao authored
-
Hongjia Cao authored
-
- 24 Mar, 2013 1 commit
-
-
jette authored
-
- 23 Mar, 2013 1 commit
-
-
Lipari, Don authored
-
- 22 Mar, 2013 2 commits
-
-
Andy Wettstein authored
On Red Hat 6 based distros the lua library name is liblua-5.1.so. Installing the lua-devel package creates the liblua.so symlink, but if that package isn't installed the lua job submit plugin fails to load. I'm attaching a patch that adds liblua-5.1.so to the search path.
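A minimal sketch of the idea (not the actual plugin code): try a list of candidate lua sonames with dlopen() and fall back until one loads. The helper and the candidate list below are illustrative.

    #include <dlfcn.h>
    #include <stdio.h>

    /* Illustrative fallback search: liblua.so exists only when lua-devel
     * is installed, so also try the versioned soname shipped by the
     * runtime package on Red Hat 6 based distros. */
    static void *open_liblua(void)
    {
        static const char *names[] = { "liblua.so", "liblua-5.1.so", NULL };
        void *handle = NULL;

        for (int i = 0; names[i] && !handle; i++)
            handle = dlopen(names[i], RTLD_LAZY | RTLD_GLOBAL);

        if (!handle)
            fprintf(stderr, "failed to load lua library: %s\n", dlerror());
        return handle;
    }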
-
Morris Jette authored
These changes are required so that select/cray can load select/linear, which is a bit more complex than the other select plugin structures:
- Export the plugin_context_create and plugin_context_destroy symbols from libslurm.so.
- Correct a typo in the exported hostlist_sort symbol name.
- Define some functions in select/cray to avoid undefined symbols if the plugin is loaded via libslurm rather than from a slurm command (which has all of the required symbols).
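As a rough illustration of why the exports matter (this is a generic sketch, not SLURM's plugin loader): when a plugin is opened with RTLD_NOW, every symbol it references must already be resolvable from the host program or from libraries loaded into the global namespace, otherwise dlopen() fails.

    #include <dlfcn.h>
    #include <stdio.h>

    /* Generic sketch: a plugin that calls symbols such as
     * plugin_context_create() can only load if those symbols are
     * exported by the host binary or a global shared library. */
    static void *load_plugin(const char *path)
    {
        void *handle = dlopen(path, RTLD_NOW);  /* resolve everything up front */

        if (!handle)
            fprintf(stderr, "plugin load failed: %s\n", dlerror());
        return handle;
    }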
-
- 21 Mar, 2013 1 commit
-
-
Morris Jette authored
-
- 20 Mar, 2013 5 commits
-
-
Hongjia Cao authored
-
Danny Auble authored
cluster.
-
jette authored
-
jette authored
-
Morris Jette authored
-
- 19 Mar, 2013 8 commits
-
-
Don Lipari authored
-
Morris Jette authored
-
Hongjia Cao authored
select()/FD_ISSET() does not work for file descriptors larger than 1023.
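The usual remedy, sketched below under the assumption that only readability of a single descriptor needs to be checked, is to use poll(), which is not bound by FD_SETSIZE:

    #include <poll.h>

    /* Wait for fd to become readable; works for descriptors >= 1024,
     * unlike select()/FD_ISSET() which are limited by FD_SETSIZE. */
    static int wait_readable(int fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc > 0)
            return (pfd.revents & POLLIN) ? 1 : 0;
        return rc;  /* 0 on timeout, -1 on error */
    }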
-
Morris Jette authored
-
Hongjia Cao authored
Avoid adding/removing the node resources of a job if the node is lost by resize.

I found another case in which an idle node cannot be allocated. It can be reproduced as follows:

1. Run a job with the -k option:
   [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
   srun: error: Node failure on cn28
   srun: error: Node failure on cn28
   srun: error: cn28: task 10: Killed
   ^Csrun: interrupt (one more within 1 sec to abort)
   srun: tasks 0-9: running
   srun: task 10: exited abnormally
   ^Csrun: sending Ctrl-C to job 106120.0
   srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

2. Set a node down and then set it idle:
   [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
   [root@mn0 ~]# scontrol update nodename=cn28 state=idle

3. Restart slurmctld:
   [root@mn0 ~]# service slurm restart
   stopping slurmctld:    [ OK ]
   slurmctld is stopped
   starting slurmctld:    [ OK ]

4. Cancel the job. The node that was set down will then be left unavailable:
   [root@mn0 ~]# sinfo -n cn[18-28]
   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
   work*        up  infinite    11  idle cn[18-28]
   [root@mn0 ~]# srun -w cn[18-28] hostname
   srun: job 106122 queued and waiting for resources
   [root@mn0 slurm]# grep cn28 slurmctld.log
   [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
   [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

I made an attempt to fix this with the attached patch. Please review it.
-
Morris Jette authored
I don't believe save_time_limit was redundant. At least in this case:

    if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
        if (orig_time_limit == NO_VAL)
            orig_time_limit = comp_time_limit;
        job_ptr->time_limit = orig_time_limit;
        [...]

So later, when updating the db,

    if (save_time_limit != job_ptr->time_limit)
        jobacct_storage_g_job_start(acct_db_conn, job_ptr);

will cause the db to be updated, while

    if (orig_time_limit != job_ptr->time_limit)
        jobacct_storage_g_job_start(acct_db_conn, job_ptr);

will not, because job_ptr->time_limit now equals orig_time_limit.
-
Morris Jette authored
-
Don Lipari authored
If the job's time limit is modified down toward --time-min by the backfill scheduler, update the job's time limit in the database. Without this change, the database would not reflect the reduced limit.
-
- 14 Mar, 2013 4 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- 13 Mar, 2013 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
If a step requests more CPUs than are possible within the specified node count of the job allocation, return ESLURM_TOO_MANY_REQUESTED_CPUS rather than returning ESLURM_NODES_BUSY and retrying.
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- 12 Mar, 2013 2 commits
-
-
Morris Jette authored
-
Magnus Jonsson authored
I found a bug in cons_res/select_p_select_nodeinfo_set_all. If a node is part of two (or more) partitions, the code will only count the number of cores/CPUs in the partition that has the most running jobs on that node. Patch attached to fix the problem. I also added a new function to bitstring to count the number of bits set in a range (bit_set_count_range) and made a minor improvement to bit_set_count while reviewing the range version. Best regards, Magnus
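A stand-alone sketch of the range-counting idea (SLURM's bitstr_t has its own header words and helpers, so this shows only the general technique, not the committed bit_set_count_range): mask off the partial words at each end of the range and popcount the rest.

    #include <stddef.h>
    #include <stdint.h>

    /* Count set bits in positions [start, end) of an array of 64-bit words.
     * Illustrative only; the function name and word size are assumptions. */
    int count_bits_in_range(const uint64_t *words, size_t start, size_t end)
    {
        int count = 0;

        if (end <= start)
            return 0;

        for (size_t w = start / 64; w <= (end - 1) / 64; w++) {
            uint64_t mask = ~0ULL;
            size_t lo = w * 64;

            if (start > lo)                 /* trim bits below start */
                mask &= ~0ULL << (start - lo);
            if (end < lo + 64)              /* trim bits at or above end */
                mask &= ~(~0ULL << (end - lo));
            count += __builtin_popcountll(words[w] & mask);
        }
        return count;
    }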
-
- 11 Mar, 2013 1 commit
-
-
Morris Jette authored
This permits default reservation names to be more easily managed
-