    Modify the step completion RPC between slurmd and slurmstepd · ed31e6c7
    Morris Jette authored
    In the tightly coupled functions slurmd:stepd_completion and
    slurmstepd:_handle_completion, a jobacct structure is
    sent from the main daemon to the step daemon to provide
    the statistics of the child slurmstepd processes for aggregation.
    
    The structure is sent using
    jobacct_gather_g_{setinfo,getinfo} over a pipe (JOBACCT_DATA_PIPE).
    Because {setinfo,getinfo} share a common internal lock, and reading
    or writing to a pipe is equivalent to holding a lock, slurmd and
    slurmstepd must each avoid using both setinfo and getinfo over a
    pipe, or deadlocks can occur. For example:
    slurmd(lockforread,write)/slurmstepd(write,lockforread).
    
    This patch removes the call to jobacct_gather_g_setinfo in slurmd
    and the call to jobacct_gather_g_getinfo in slurmstepd, ensuring
    that slurmd only does getinfo operations over a pipe and slurmstepd
    only does setinfo over a pipe. Instead, jobacct_gather_g_{pack,unpack}
    are used to marshal/unmarshal the data for transmission over the
    pipe.
    Patch by Matthieu Hautreux, CEA.
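
    The pack/unpack approach can be illustrated with a minimal sketch. It
    does not use Slurm's API: struct acct_sample, pack_sample,
    unpack_sample, write_all, read_all and roundtrip are hypothetical names
    chosen for illustration, and the two-field record is an assumption. The
    point is that the data is serialized into a private buffer first, so no
    shared lock needs to be held while the process is blocked on the pipe.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for an accounting record; not Slurm's jobacctinfo. */
struct acct_sample {
    uint32_t max_rss_kb;
    uint32_t cpu_sec;
};

#define SAMPLE_WIRE_SIZE 8 /* two 32-bit fields */

/* Serialize into a caller-owned buffer in network byte order. */
static void pack_sample(const struct acct_sample *s, unsigned char *buf)
{
    uint32_t v;

    assert(s != NULL && buf != NULL);
    v = htonl(s->max_rss_kb);
    memcpy(buf, &v, 4);
    v = htonl(s->cpu_sec);
    memcpy(buf + 4, &v, 4);
}

/* Inverse of pack_sample. */
static void unpack_sample(const unsigned char *buf, struct acct_sample *s)
{
    uint32_t v;

    memcpy(&v, buf, 4);
    s->max_rss_kb = ntohl(v);
    memcpy(&v, buf + 4, 4);
    s->cpu_sec = ntohl(v);
}

/* Write exactly len bytes, retrying on short writes. */
static int write_all(int fd, const unsigned char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on short reads. */
static int read_all(int fd, unsigned char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = read(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Pack, send through a pipe, read back and unpack. */
static int roundtrip(const struct acct_sample *in, struct acct_sample *out)
{
    int fds[2];
    unsigned char wire[SAMPLE_WIRE_SIZE], back[SAMPLE_WIRE_SIZE];
    int rc;

    if (pipe(fds) != 0)
        return -1;
    pack_sample(in, wire); /* any lock would be released before this point */
    rc = write_all(fds[1], wire, sizeof(wire));
    if (rc == 0)
        rc = read_all(fds[0], back, sizeof(back));
    close(fds[0]);
    close(fds[1]);
    if (rc != 0)
        return -1;
    unpack_sample(back, out);
    return 0;
}
```

    In this sketch both ends of the pipe live in one process, which is safe
    only because the 8-byte record fits in the pipe buffer; in the real
    daemons the writer and reader are separate processes.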
    
    The patch committed here is a variation on the work by Matthieu.
    Specifically, logic is added to slurmstepd to read a new RPC
    format that includes an RPC version number and a buffer with the data
    structure. The slurmd, however, will not send the RPC in the new format
    until SLURM version 2.5.
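
    A version-tagged message of this kind might look like the following
    sketch. The layout (2-byte version, 4-byte payload length, then the
    payload) and the names FRAME_VERSION, frame_encode and frame_decode are
    assumptions made for illustration; the actual wire format used by
    slurmd and slurmstepd is not described in this commit message.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_VERSION 1u     /* hypothetical version number */
#define FRAME_HEADER_SIZE 6u /* 2-byte version + 4-byte payload length */

/* Build a frame in out; returns the frame size, or 0 if out is too small. */
static size_t frame_encode(const unsigned char *payload, uint32_t len,
                           unsigned char *out, size_t cap)
{
    uint16_t v = htons(FRAME_VERSION);
    uint32_t l = htonl(len);

    assert(out != NULL);
    if (cap < FRAME_HEADER_SIZE || cap - FRAME_HEADER_SIZE < len)
        return 0;
    memcpy(out, &v, 2);
    memcpy(out + 2, &l, 4);
    if (len > 0)
        memcpy(out + FRAME_HEADER_SIZE, payload, len);
    return (size_t)FRAME_HEADER_SIZE + len;
}

/*
 * Parse a frame. Returns 0 on success and points *payload into in,
 * -1 if the input is truncated, -2 if the version is not recognized.
 */
static int frame_decode(const unsigned char *in, size_t in_len,
                        const unsigned char **payload, uint32_t *len)
{
    uint16_t v;
    uint32_t l;

    if (in_len < FRAME_HEADER_SIZE)
        return -1;
    memcpy(&v, in, 2);
    memcpy(&l, in + 2, 4);
    if (ntohs(v) != FRAME_VERSION)
        return -2;
    l = ntohl(l);
    if (in_len - FRAME_HEADER_SIZE < l)
        return -1;
    *payload = in + FRAME_HEADER_SIZE;
    *len = l;
    return 0;
}
```

    Checking the version before trusting the length field lets a reader
    reject or special-case messages from a sender that uses a different
    format, which is the property a staged rollout relies on.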