Correct partition MaxCPUsPerNode enforcement
    Marco Ehlert authored
    I would like to report a problem which seems to be a calculation bug in
    used_cores in Slurm version 15.08.7.
    
    If a node is divided into two partitions using MaxCPUsPerNode, as in this
    slurm.conf configuration:
    
        NodeName=n1 CPUs=20
        PartitionName=cpu NodeName=n1    MaxCPUsPerNode=16
        PartitionName=gpu NodeName=n1    MaxCPUsPerNode=4
    
    then I run into a strange scheduling situation. The intent is that cpu
    jobs use at most 16 of the node's 20 CPUs and gpu jobs at most the
    remaining 4. The situation occurs after a fresh restart of the slurmctld
    daemon.
    
    I start jobs one by one:
    
    case 1
        systemctl restart slurmctld.service
        sbatch -n 16 -p cpu cpu.sh
        sbatch -n 1  -p gpu gpu.sh
        sbatch -n 1  -p gpu gpu.sh
        sbatch -n 1  -p gpu gpu.sh
        sbatch -n 1  -p gpu gpu.sh
    
        => Problem: the gpu jobs are kept in PENDING state, even though
           4 CPUs are still free for the gpu partition.
    
    The picture changes if I start the jobs in this order:
    
    case 2
        systemctl restart slurmctld.service
        sbatch -n 1  -p gpu gpu.sh
        scancel <gpu job_id>
        sbatch -n 16 -p cpu cpu.sh
        sbatch -n 1  -p gpu gpu.sh
        sbatch -n 1  -p gpu gpu.sh
        sbatch -n 1  -p gpu gpu.sh
        sbatch -n 1  -p gpu gpu.sh
    
    and now all jobs run fine.
    
    Looking into the code, I found an incorrect calculation of 'used_cores'
    in the function _allocate_sc() in

    plugins/select/cons_res/job_test.c
    
    _allocate_sc(...)
    ...
         for (c = core_begin; c < core_end; c++) {
                 i = (uint16_t) (c - core_begin) / cores_per_socket;

                 if (bit_test(core_map, c)) {
                         /* core is still free */
                         free_cores[i]++;
                         free_core_count++;
                 } else {
                         /* core is busy, no matter which partition
                          * is actually using it */
                         used_cores[i]++;
                 }
                 if (part_core_map && bit_test(part_core_map, c))
                         used_cpu_array[i]++;
         }
    
    This part of the code only works correctly if a part_core_map already
    exists for the partition, or on a completely free node. But in case 1 no
    part_core_map has been created for gpu yet. When the first gpu job
    starts, the core_map contains only the 4 cores left over from the cpu
    job. Now all non-free cores of the cpu partition are counted as used
    cores in the gpu partition, and this condition in the subsequent code
    matches
    
        free_cpu_count + used_cpu_count > job_ptr->part_ptr->max_cpus_per_node
    
    which is definitely wrong.
    
    As soon as a part_core_map exists, meaning a gpu job was started on a
    free node (case 2), there is no problem at all.
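
    To make the arithmetic concrete, here is a small stand-alone sketch
    (hypothetical code, not part of Slurm) that replays the case 1 numbers
    with the original counting; the 20-core node, the 16-core cpu job and
    the still missing part_core_map are taken from the setup above:

        #include <stdio.h>
        #include <stdbool.h>

        #define NCORES 20                   /* NodeName=n1 CPUs=20 */

        int main(void)
        {
            bool core_map[NCORES];          /* true = core is still free */
            int free_cpu_count = 0, used_cpu_count = 0;
            int max_cpus_per_node = 4;      /* MaxCPUsPerNode of gpu */
            int c;

            /* cores 0..15 are held by the 16-task cpu job; no gpu job
             * has run yet, so no part_core_map exists for gpu (case 1) */
            for (c = 0; c < NCORES; c++)
                core_map[c] = (c >= 16);

            /* original counting: every busy core is charged to the gpu
             * partition, no matter which partition is actually using it */
            for (c = 0; c < NCORES; c++) {
                if (core_map[c])
                    free_cpu_count++;
                else
                    used_cpu_count++;
            }

            /* prints "4 + 16 > 4 -> job stays PENDING" */
            printf("%d + %d %s %d -> job %s\n",
                   free_cpu_count, used_cpu_count,
                   free_cpu_count + used_cpu_count > max_cpus_per_node ?
                   ">" : "<=", max_cpus_per_node,
                   free_cpu_count + used_cpu_count > max_cpus_per_node ?
                   "stays PENDING" : "can run");
            return 0;
        }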
    
    To make case 1 work, I changed the code above to the following, and
    everything works fine:
    
         for (c = core_begin; c < core_end; c++) {
                 i = (uint16_t) (c - core_begin) / cores_per_socket;

                 if (bit_test(core_map, c)) {
                         free_cores[i]++;
                         free_core_count++;
                 } else {
                         /* charge a busy core to this partition only if
                          * its own part_core_map marks it as used */
                         if (part_core_map && bit_test(part_core_map, c)) {
                                 used_cpu_array[i]++;
                                 used_cores[i]++;
                         }
                 }
         }
    
    I am not sure this code change is really good, but it fixes my problem.
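
    For comparison, here is the earlier sketch again (still hypothetical,
    not Slurm code) with only the counting loop swapped for the changed
    version; since no part_core_map exists for the gpu partition, no busy
    core is charged to it and the limit check now passes:

        #include <stdio.h>
        #include <stdbool.h>

        #define NCORES 20

        int main(void)
        {
            bool core_map[NCORES];          /* true = core is still free */
            bool *part_core_map = NULL;     /* still no gpu part_core_map */
            int free_cpu_count = 0, used_cpu_count = 0;
            int max_cpus_per_node = 4;
            int c;

            for (c = 0; c < NCORES; c++)
                core_map[c] = (c >= 16);    /* cpu job holds cores 0..15 */

            /* changed counting: a busy core is charged to this partition
             * only if its own part_core_map marks it as used */
            for (c = 0; c < NCORES; c++) {
                if (core_map[c])
                    free_cpu_count++;
                else if (part_core_map && part_core_map[c])
                    used_cpu_count++;
            }

            /* prints "4 + 0 <= 4 -> job can run" */
            printf("%d + %d %s %d -> job %s\n",
                   free_cpu_count, used_cpu_count,
                   free_cpu_count + used_cpu_count > max_cpus_per_node ?
                   ">" : "<=", max_cpus_per_node,
                   free_cpu_count + used_cpu_count > max_cpus_per_node ?
                   "stays PENDING" : "can run");
            return 0;
        }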