To understand Slurm's hashing logic and see where it limits scalability, I ran some tests with cluster sizes varying from 64 to 64k nodes. I saw the same kind of performance issues that you saw at large sizes (more than a few thousand nodes). Looking at the code, I noticed that the hash table of the node records was not built before the topology information is constructed in slurmctld! Making sure that the hash table is present greatly reduces the build time. This patch ensures that, and brings the build time for topology information down to under a second in scenarios that used to require a minute. I think that with this patch, the load time of Slurm configurations is less problematic.
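To illustrate why the ordering matters, here is a minimal, self-contained sketch. It is not Slurm's code: `node_rec_t`, `node_table_t`, `build_hash()`, `find_node()` and `lookup_all()` are hypothetical names, and it assumes that a by-name lookup falls back to a linear scan when no hash table exists yet. `lookup_all()` stands in for the topology construction, which has to resolve many node names.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified node record; not Slurm's actual struct. */
typedef struct node_rec {
    char name[32];
    struct node_rec *hash_next;   /* chain inside one hash bucket */
} node_rec_t;

typedef struct {
    node_rec_t *nodes;        /* flat array of all node records */
    size_t count;
    node_rec_t **buckets;     /* NULL until build_hash() has run */
    size_t nbuckets;
    unsigned long compares;   /* name comparisons, to measure the work done */
} node_table_t;

static size_t hash_name(const char *s, size_t nbuckets)
{
    unsigned long h = 5381;   /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % nbuckets;
}

/* Create nodes named n0 .. n<count-1>; the hash table is NOT built here. */
static void init_nodes(node_table_t *t, size_t count)
{
    memset(t, 0, sizeof(*t));
    t->nodes = calloc(count, sizeof(node_rec_t));
    assert(t->nodes != NULL);
    t->count = count;
    for (size_t i = 0; i < count; i++)
        snprintf(t->nodes[i].name, sizeof(t->nodes[i].name), "n%zu", i);
}

/* Build the name -> record hash table (one bucket per node, chained). */
static void build_hash(node_table_t *t)
{
    t->nbuckets = t->count ? t->count : 1;
    t->buckets = calloc(t->nbuckets, sizeof(node_rec_t *));
    assert(t->buckets != NULL);
    for (size_t i = 0; i < t->count; i++) {
        size_t b = hash_name(t->nodes[i].name, t->nbuckets);
        t->nodes[i].hash_next = t->buckets[b];
        t->buckets[b] = &t->nodes[i];
    }
}

/* Look a node up by name: hashed if the table exists, linear scan if not. */
static node_rec_t *find_node(node_table_t *t, const char *name)
{
    if (t->buckets) {
        node_rec_t *n = t->buckets[hash_name(name, t->nbuckets)];
        for (; n != NULL; n = n->hash_next) {
            t->compares++;
            if (strcmp(n->name, name) == 0)
                return n;
        }
        return NULL;
    }
    for (size_t i = 0; i < t->count; i++) {   /* fallback: O(n) per lookup */
        t->compares++;
        if (strcmp(t->nodes[i].name, name) == 0)
            return &t->nodes[i];
    }
    return NULL;
}

/* Stand-in for the topology build: resolve every node name once.
 * Returns the number of name comparisons that were needed. */
static unsigned long lookup_all(node_table_t *t)
{
    char name[32];
    t->compares = 0;
    for (size_t i = 0; i < t->count; i++) {
        snprintf(name, sizeof(name), "n%zu", i);
        node_rec_t *n = find_node(t, name);
        assert(n != NULL);
        (void)n;
    }
    return t->compares;
}

static void free_nodes(node_table_t *t)
{
    free(t->buckets);
    free(t->nodes);
    memset(t, 0, sizeof(*t));
}
```

In this sketch, calling `lookup_all()` before `build_hash()` on a table of n nodes costs n(n+1)/2 name comparisons (2,001,000 for 2,000 nodes), while calling it after `build_hash()` costs roughly one comparison per lookup plus a few for collisions. That quadratic-versus-linear gap is the kind of effect that would turn a minute into less than a second at large node counts.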