    Fix to GRES NoConsume logic · 33c48ac5
    Dorian Krause authored
    We came across the following error message in the slurmctld logs when
    using non-consumable resources:
    
    error: gres/potion: job 39 dealloc of node node1 bad node_offset 0 count is 0
    
    The error comes from _job_dealloc():
    
    node_gres_data=0x7f8a18000b70, node_offset=0, gres_name=0x1999e00
    "potion", job_id=46,
        node_name=0x1987ab0 "node1") at gres.c:3980
    (job_gres_list=0x199b7c0, node_gres_list=0x199bc38, node_offset=0,
    job_id=46,
        node_name=0x1987ab0 "node1") at gres.c:4190
    job_ptr=0x19e9d50, pre_err=0x7f8a31353cb0 "_will_run_test", remove_all=true)
        at select_linear.c:2091
    bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, max_share=1, req_nodes=1,
        preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40) at
    select_linear.c:3176
    bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
        preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
    exc_core_bitmap=0x0) at select_linear.c:3390
    bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
        preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
    exc_core_bitmap=0x0) at node_select.c:588
    avail_bitmap=0x7f8a2f910d38, min_nodes=1, max_nodes=1, req_nodes=1,
    exc_core_bitmap=0x0)
        at backfill.c:367
    
    The cause of this problem is that _node_state_dup() in gres.c does not
    duplicate the no_consume flag.
    The cr_ptr passed to _rm_job_from_nodes() is created with _dup_cr(),
    which calls _node_state_dup(), so the duplicated node state loses the
    flag.
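
    The failure mode can be illustrated with a minimal, self-contained
    sketch. Everything in it is hypothetical: toy_gres_state_t is a
    simplified stand-in, not Slurm's real node state structure, and
    toy_dealloc() is only a toy accounting check, not the logic of
    _job_dealloc(). It shows how a field-by-field duplicate that skips a
    flag makes the copy behave like a consumable resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a per-node GRES state record. */
typedef struct {
	uint64_t cnt_avail;  /* resources configured on the node */
	uint64_t cnt_alloc;  /* resources currently accounted as allocated */
	bool no_consume;     /* true: allocations are never accounted */
} toy_gres_state_t;

/* Field-by-field duplicate that forgets no_consume (the failure mode). */
toy_gres_state_t *toy_dup_buggy(const toy_gres_state_t *src)
{
	assert(src != NULL);
	toy_gres_state_t *dst = calloc(1, sizeof(*dst));
	if (!dst)
		return NULL;
	dst->cnt_avail = src->cnt_avail;
	dst->cnt_alloc = src->cnt_alloc;
	/* no_consume is not copied, so it keeps calloc()'s zero (false). */
	return dst;
}

/* Same duplicate with the missing assignment added. */
toy_gres_state_t *toy_dup_fixed(const toy_gres_state_t *src)
{
	toy_gres_state_t *dst = toy_dup_buggy(src);
	if (dst)
		dst->no_consume = src->no_consume;
	return dst;
}

/*
 * Toy deallocation check: returns 0 on success and -1 when asked to
 * release more than is accounted on a consumable resource.
 */
int toy_dealloc(toy_gres_state_t *st, uint64_t count)
{
	assert(st != NULL);
	if (st->no_consume)
		return 0;  /* nothing was accounted, nothing to release */
	if (st->cnt_alloc < count)
		return -1; /* accounting underflow */
	st->cnt_alloc -= count;
	return 0;
}
```

    With a source record that has no_consume set and cnt_alloc equal to 0,
    toy_dealloc() on the buggy copy reports an underflow, while the same
    call on the fixed copy succeeds.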
    
    Below is a simple patch to fix the problem. A "future-proof" alternative
    might be to memcpy() from gres_ptr to new_gres and
    only handle pointers separately.
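
    A rough sketch of the memcpy() alternative follows. It is not the
    patch in this commit, and the names in it (toy_gres_node_t,
    topo_alloc, toy_node_dup_memcpy) are hypothetical placeholders rather
    than Slurm identifiers. The idea is that scalar fields, including any
    added later, are copied wholesale, and only owned pointers need
    explicit handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record with scalar fields and one owned heap array. */
typedef struct {
	uint64_t cnt_avail;
	uint64_t cnt_alloc;
	bool no_consume;
	uint32_t topo_cnt;     /* number of entries in topo_alloc */
	uint64_t *topo_alloc;  /* owned by this record */
} toy_gres_node_t;

/* Release a record and the array it owns. */
void toy_node_free(toy_gres_node_t *st)
{
	if (st) {
		free(st->topo_alloc);
		free(st);
	}
}

/*
 * Duplicate by copying the whole struct first, then replacing each
 * owned pointer with a deep copy so the two records share no memory.
 */
toy_gres_node_t *toy_node_dup_memcpy(const toy_gres_node_t *src)
{
	assert(src != NULL);
	toy_gres_node_t *dst = malloc(sizeof(*dst));
	if (!dst)
		return NULL;

	/* Copies every scalar field, including no_consume. */
	memcpy(dst, src, sizeof(*dst));

	/* Pointers are handled separately to avoid aliasing src. */
	dst->topo_alloc = NULL;
	if (src->topo_alloc && src->topo_cnt > 0) {
		size_t bytes = src->topo_cnt * sizeof(*src->topo_alloc);
		dst->topo_alloc = malloc(bytes);
		if (!dst->topo_alloc) {
			free(dst);
			return NULL;
		}
		memcpy(dst->topo_alloc, src->topo_alloc, bytes);
	}
	return dst;
}
```

    The trade-off is that every pointer member must be remembered
    instead of every scalar member: a forgotten pointer would leave the
    two records sharing memory and could lead to a double free.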