    Correct -mem-per-cpu logic for multiple threads per core · 6a103f2e
    Morris Jette authored
    See bugzilla bug 132
    
    When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be
    overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD
    (2.5.0-pre3).
    
    Conditions:
    -----------
    * SelectType=select/cons_res
    * SelectTypeParameters=CR_Core_Memory
    * Using threads
      - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"
    
    Description:
    ------------
    In the cons_res plugin, _verify_node_state() in job_test.c checks whether a node
    has sufficient memory for a job. However, the job's per-CPU memory request is
    later scaled by the number of threads per core, and this scaled value may exceed
    the memory available on the node. Worse, once a node's memory is overcommitted,
    every subsequent memory check in _verify_node_state() succeeds.
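
    A minimal C sketch of the mismatch, using hypothetical names and field widths
    rather than the actual cons_res code: the admission check compares only the
    unscaled per-CPU request against the node's free memory, while the memory
    ultimately charged is that request multiplied by the thread-scaled CPU count.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical stand-ins for the node and job fields involved. */
        struct node { uint32_t real_mem, alloc_mem; uint16_t threads; };
        struct job  { uint32_t mem_per_cpu, ncpus; };

        /* Admission check as described above: only the unscaled per-CPU
         * request is compared against free node memory.                */
        static bool node_has_mem(const struct node *n, const struct job *j)
        {
            return (n->real_mem - n->alloc_mem) >= j->mem_per_cpu * j->ncpus;
        }

        /* Memory actually charged once the CPU count is scaled by
         * ThreadsPerCore ("scaling CPU count by factor of 2").          */
        static uint32_t charged_mem(const struct node *n, const struct job *j)
        {
            return j->mem_per_cpu * j->ncpus * n->threads;
        }

        int main(void)
        {
            struct node linux0 = { .real_mem = 400, .alloc_mem = 0, .threads = 2 };
            struct job  job1   = { .mem_per_cpu = 250, .ncpus = 1 };

            printf("check passes: %d\n", node_has_mem(&linux0, &job1)); /* 1   */
            linux0.alloc_mem += charged_mem(&linux0, &job1);            /* 500 */
            printf("allocated %u of %u MB\n", linux0.alloc_mem, linux0.real_mem);
            return 0;
        }

    The check succeeds (400 - 0 >= 250), yet after thread scaling the node carries
    500 MB of allocations against 400 MB of RealMemory.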
    
    Scenario to reproduce:
    ----------------------
    With the example node linux0, we run a single-core job with 250 MB/core:
        srun --mem-per-cpu=250 sleep 60
    
    cons_res checks that it will fit: ((real - alloc) >= job mem)
        ((400 - 0) >= 250) and the job starts
    
    Then, the memory requirement is doubled:
        "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for
    job X"
        "slurmd: scaling CPU count by factor of 2"
    
    This job should not have started.
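
    The arithmetic behind the overallocation message: 250 MB per CPU x 2 threads
    per core = 500 MB charged, but linux0 has only RealMemory=400 MB, so the
    request could never have fit once the thread scaling is applied.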
    
    While the first job is still running, we submit a second, identical job:
        srun --mem-per-cpu=250 sleep 60
    
    cons_res checks that it will fit:
        ((400 - 500) >= 250), the unsigned int wraps, the test passes, and the job starts
    
    This second job also should not have started.
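
    The wrap itself is easy to reproduce in isolation, assuming 32-bit unsigned
    memory fields (as "the unsigned int wraps" above suggests):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t real_mem  = 400;  /* node RealMemory (MB)           */
            uint32_t alloc_mem = 500;  /* overcommitted by the first job */
            uint32_t job_mem   = 250;  /* second job's request (MB)      */

            /* 400 - 500 underflows to 4294967196, so the test passes. */
            printf("free = %u\n", real_mem - alloc_mem);
            printf("fits = %d\n", (real_mem - alloc_mem) >= job_mem);
            return 0;
        }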