- 28 Jan, 2013 5 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 26 Jan, 2013 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 25 Jan, 2013 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
No change in logic yet, just a variable rename
-
Morris Jette authored
-
Morris Jette authored
-
- 24 Jan, 2013 6 commits
-
-
Morris Jette authored
Add squeue options to print array_job_id and array_task_id.
Change the environment variables SLURM_ARRAY_JOBID to SLURM_ARRAY_JOB_ID and SLURM_ARRAY_ID to SLURM_ARRAY_TASK_ID.
Substantial updates to web page.
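Scripts that have to run both before and after this rename could look the variables up under their new names first and fall back to the old ones. A minimal sketch, assuming only the four variable names given in the commit message; the helper `array_ids` is hypothetical and not part of Slurm:

```python
import os

# (new name, old name) pairs, taken from the commit message above.
_JOB_ID_NAMES = ("SLURM_ARRAY_JOB_ID", "SLURM_ARRAY_JOBID")
_TASK_ID_NAMES = ("SLURM_ARRAY_TASK_ID", "SLURM_ARRAY_ID")


def _first_set(env, names):
    """Return the first non-empty value found under any of `names`, else None."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def array_ids(env=None):
    """Hypothetical helper: return (array_job_id, array_task_id) as strings or None."""
    if env is None:
        env = os.environ
    return _first_set(env, _JOB_ID_NAMES), _first_set(env, _TASK_ID_NAMES)


if __name__ == "__main__":
    job_id, task_id = array_ids()
    print(f"array job: {job_id}, array task: {task_id}")
```

Passing the environment in as a mapping keeps the lookup testable outside of a running job.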
-
Morris Jette authored
-
Morris Jette authored
Put "switches" in alphabetic order.
Remove "\n" from switches output, which adds extra space in the display.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 23 Jan, 2013 9 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be allocated to jobs. It can be reproduced as follows:

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to allocate nodes exclusively by default). Then set one of the nodes allocated to the job (cn2) to state DOWN:

    srun: error: Node failure on cn2
    srun: error: Node failure on cn2
    srun: error: cn2: task 0: Killed
    ^Csrun: interrupt (one more within 1 sec to abort)
    srun: task 1: running
    srun: task 0: exited abnormally
    ^Csrun: sending Ctrl-C to job 22605.0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
    srun: Force Terminated job step 22605.0

Then change the state of the node to IDLE again. It still cannot be allocated to jobs:

    srun: job 22606 queued and waiting for resources

    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    22606 work hostname root PD 0:00 1 (Resources)
    22604 work sbatch root R 3:06 1 cn1

    NodeName=cn2 Arch=x86_64 CoresPerSocket=8
    CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
    Gres=(null)
    NodeAddr=cn2 NodeHostName=cn2
    OS=Linux RealMemory=30000 Sockets=2 Boards=1
    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
    BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced the problem to select/cons_res. The call sequence is:

    slurmctld/node_mgr.c: update_node()
    => slurmctld/job_mgr.c: kill_running_job_by_node_name()
    => excise_node_from_job()
    => plugins/select/cons_res/select_cons_res.c: select_p_job_resized()
    => _rm_job_from_one_node()
    => _build_row_bitmaps()
    => common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row bitmap will not be set correctly.

In the example above, before _build_row_bitmaps(), the output of _dump_part() is:

    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-47

The cores of cn2 are not marked as available. Instead, cores of other nodes are released. When another job requires node cn2, the following log message appears:

    [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy

I do not understand the design of select/cons_res well and I do not know how to fix this. But it seems that _build_row_bitmaps() should not be called, since the job is not removed entirely; only one of its nodes is released.
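The two _dump_part() bitmaps quoted in the report can be compared mechanically to see which cores the rebuild released. A minimal sketch, assuming the bitmap is printed as comma-separated indices and inclusive ranges, as it appears in the log lines; `parse_bitmap` and `format_bitmap` are hypothetical helpers, not Slurm functions:

```python
def parse_bitmap(text):
    """Expand a string such as "16,32-63" into a set of bit indices."""
    bits = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            bits.update(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            bits.add(int(part))
    return bits


def format_bitmap(bits):
    """Collapse a set of bit indices back into the "a,b-c" notation."""
    ordered = sorted(bits)
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next index is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


# Bitmap strings copied from the log output in the report.
before = parse_bitmap("16,32-63")
after = parse_bitmap("16,32-47")

released = before - after
print(format_bitmap(released))  # prints "48-63"
```

The difference is the set of cores cleared between the two dumps; according to the report, these belong to a node other than cn2, whose cores remain marked as in use.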
-
Morris Jette authored
-
Morris Jette authored
-
- 22 Jan, 2013 6 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
jette authored
Conflicts:
    doc/html/Makefile.am
    doc/html/Makefile.in
-
Magnus Jonsson authored
-
jette authored
Correction to CPU allocation logic for cores without hyperthreading.
Backport of https://github.com/SchedMD/slurm/commit/1ef41ac9590e018e631eaefb31254622984b7d2d
-
jette authored
-
- 19 Jan, 2013 2 commits
- 18 Jan, 2013 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
-