Commit f288e4eb authored by Dorian Krause, committed by Morris Jette

Job step memory allocation logic fix

This commit fixes a bug we observed when combining select/linear with
gres. If an allocation was requested with a --gres argument, an srun
execution within that allocation would stall indefinitely:

-bash-4.1$ salloc -N 1 --gres=gpfs:100
salloc: Granted job allocation 384049
bash-4.1$ srun -w j3c017 -n 1 hostname
srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
debug3:    cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
debug3:    host=j3l02 port=33608 name=hostname network=(null) exclusive=0
debug3:    checkpoint-dir=/home/user checkpoint_int=0
debug3:    mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
debug3:    overcommit=0 time_limit=0 gres=(null) constraints=(null)
debug:  Configuration for job 384049 complete
_pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
_slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine.
The reason is that in exclusive mode the code properly checks whether memory
is a reserved resource in the _pick_step_nodes() function.
This commit modifies the alternate code path to do the same.
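
The idea behind the fix can be illustrated with a minimal sketch. This is not
the slurmctld code: the struct, the function name node_usable_for_step() and
the mem_is_reserved flag are hypothetical stand-ins, and the sketch assumes
that a node should only be rejected for memory reasons when memory is actually
tracked as a reserved resource.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-node bookkeeping; not a Slurm structure. */
struct node_mem {
    uint64_t allocated_mb; /* memory granted to the job on this node */
    uint64_t used_mb;      /* memory already accounted to other steps */
};

/*
 * Returns true if a step requesting req_mb may be placed on the node.
 * mem_is_reserved is an assumed flag standing in for "memory is a
 * reserved resource under the current configuration".
 */
bool node_usable_for_step(const struct node_mem *n, uint64_t req_mb,
                          bool mem_is_reserved)
{
    /* When memory is not reserved, its accounting must not block the step. */
    if (!mem_is_reserved || req_mb == 0)
        return true;

    /* Guard against underflow before computing the free amount. */
    if (n->used_mb > n->allocated_mb)
        return false;

    return req_mb <= n->allocated_mb - n->used_mb;
}
```

In this model, a node whose memory is fully accounted to other steps is still
usable when mem_is_reserved is false, which mirrors the behaviour the commit
message describes for the exclusive code path.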
parent b7cf0a28