- 20 Mar, 2013 2 commits
-
-
jette authored
-
Morris Jette authored
-
- 19 Mar, 2013 8 commits
-
-
Don Lipari authored
-
Morris Jette authored
-
Hongjia Cao authored
select()/FD_ISSET() does not work for file descriptors larger than 1023.
-
Morris Jette authored
-
Hongjia Cao authored
avoid add/remove node resource of job if the node is lost by resize

I found another case where an idle node cannot be allocated. It can be reproduced as follows:

1. Run a job with the -k option:

   [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
   srun: error: Node failure on cn28
   srun: error: Node failure on cn28
   srun: error: cn28: task 10: Killed
   ^Csrun: interrupt (one more within 1 sec to abort)
   srun: tasks 0-9: running
   srun: task 10: exited abnormally
   ^Csrun: sending Ctrl-C to job 106120.0
   srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

2. Set the node down and then set it idle:

   [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
   [root@mn0 ~]# scontrol update nodename=cn28 state=idle

3. Restart slurmctld:

   [root@mn0 ~]# service slurm restart
   stopping slurmctld: [ OK ]
   slurmctld is stopped
   starting slurmctld: [ OK ]

4. Cancel the job.

The node that was set down will then be left unavailable:

   [root@mn0 ~]# sinfo -n cn[18-28]
   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
   work*        up  infinite    11  idle cn[18-28]
   [root@mn0 ~]# srun -w cn[18-28] hostname
   srun: job 106122 queued and waiting for resources
   [root@mn0 slurm]# grep cn28 slurmctld.log
   [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
   [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

I made an attempt to fix this by the attached patch. Please review it.
-
Morris Jette authored
I don't believe save_time_limit was redundant. At least in this case:

   if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
       if (orig_time_limit == NO_VAL)
           orig_time_limit = comp_time_limit;
       job_ptr->time_limit = orig_time_limit;
       [...]

So later, when updating the db,

   if (save_time_limit != job_ptr->time_limit)
       jobacct_storage_g_job_start(acct_db_conn, job_ptr);

will cause the db to be updated, while

   if (orig_time_limit != job_ptr->time_limit)
       jobacct_storage_g_job_start(acct_db_conn, job_ptr);

will not, because job_ptr->time_limit now equals orig_time_limit.
-
Morris Jette authored
-
Don Lipari authored
If the backfill scheduler modifies the job's time limit down toward --time-min, update the job's time limit in the database; without this change, the database would not reflect the reduced limit.
-
- 14 Mar, 2013 4 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- 13 Mar, 2013 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
If a step requests more CPUs than are possible in the specified node count of the job allocation, then return ESLURM_TOO_MANY_REQUESTED_CPUS rather than returning ESLURM_NODES_BUSY and retrying.
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- 12 Mar, 2013 2 commits
-
-
Morris Jette authored
-
Magnus Jonsson authored
I found a bug in cons_res/select_p_select_nodeinfo_set_all. If a node is part of two (or more) partitions, the code will only count the number of cores/cpus in the partition that has the most running jobs on that node. Patch attached to fix the problem.

I also added a new function to bitstring to count the number of bits in a range (bit_set_count_range) and made a minor improvement to bit_set_count while reviewing the range version.

Best regards,
Magnus
-
- 11 Mar, 2013 8 commits
-
-
Morris Jette authored
This permits default reservation names to be more easily managed
-
Andy Wettstein authored
-
Nathan Yee authored
Without this change, when the sbatch --export option is used, many Slurm environment variables are not set unless explicitly exported.
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
jette authored
-
Dmitri Gribenko authored
-
- 08 Mar, 2013 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Nathan Yee authored
-
Nathan Yee authored
-
jette authored
-
jette authored
-
Danny Auble authored
success
-
- 07 Mar, 2013 1 commit
-
-
jette authored
This problem would affect systems in which specific GRES are associated with specific CPUs. One possible result is that the CPUs identified as usable could be inappropriate, and the job would be held when trying to lay out the tasks on CPUs (all done as part of the job allocation process). The other problem is that if multiple GRES are linked to specific CPUs, there was a CPU bitmap OR which should have been an AND, resulting in some CPUs being identified as usable but not available to all GRES.
-
- 06 Mar, 2013 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-