    task/cgroup - Fix task binding to CPUs bug · ddf6d9a4
    Morris Jette authored
    There was a subtle bug in how tasks were bound to CPUs which could result
    in an "infinite loop" error. The problem was that various socket/core/thread
    calculations were based upon the resources allocated to a step rather than
    all resources on the node, so rounding errors could occur. Consider for
    example a node with 2 sockets, 6 cores per socket and 2 threads per core.
    On the idle node, a job requesting 14 CPUs is submitted. That job would
    be allocated 4 cores on the first socket and 3 cores on the second socket.
    The old logic would compute the number of sockets for the job as 2 and the
    number of cores as 7, then calculate the number of cores per socket as
    7/2 or 3 (rounding down to an integer). The logic laying out tasks
    would bind the first 3 cores on each socket to the job then not find any
    remaining cores, report the "infinite loop" error to the user, and run
    the job without one of the expected cores. The problem gets even worse
    when some cores on the node are already allocated. In a more extreme case,
    a job might be allocated 6 cores on one socket and 1 core on a second
    socket. In that case, 3 of that job's cores would be unused.
    bug 2502
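
The rounding arithmetic described above can be reproduced with a small sketch. This is not Slurm's code: the function names and the `NODE_CORES_PER_SOCKET` constant are hypothetical, and the model assumes the binder takes at most "cores per socket" cores from each socket. It compares the per-socket limit derived from the job's allocation with the limit taken from the node's real layout (6 cores per socket in the example).

```python
# Illustrative model only; names are hypothetical, not taken from Slurm.

def old_cores_per_socket(alloc_per_socket):
    """Per-socket limit derived from the job's allocation (integer division)."""
    sockets = sum(1 for n in alloc_per_socket if n > 0)
    cores = sum(alloc_per_socket)
    return cores // sockets  # 7 // 2 == 3: the rounding that loses a core


def bound_cores(alloc_per_socket, cores_per_socket):
    """Cores actually bound when at most cores_per_socket are used per socket."""
    return sum(min(n, cores_per_socket) for n in alloc_per_socket)


NODE_CORES_PER_SOCKET = 6  # node layout from the example: 2 sockets x 6 cores

for alloc in ([4, 3], [6, 1]):
    old = bound_cores(alloc, old_cores_per_socket(alloc))
    new = bound_cores(alloc, NODE_CORES_PER_SOCKET)
    print(alloc, "allocation-based:", old, "node-based:", new, "of", sum(alloc))
```

Under these assumptions the 4+3 allocation binds only 6 of 7 cores with the allocation-based limit, and the 6+1 allocation binds only 4 of 7, matching the one and three unused cores described in the message. With the node-based limit all 7 cores are bound in both cases.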