Retry reading slurm.conf file
Add logic to sleep and retry if slurm.conf can't be read. Without this, the slurmd daemons may die; once SlurmdTimeout is reached, their nodes will be marked DOWN and the jobs on them will be killed. In the long term, it would be better to exit only if the read fails on program startup, and to have the daemons keep running with the old configuration if the read fails on reconfiguration, but I don't have time to do that work now.
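For illustration only, the sleep-and-retry idea could be sketched roughly as below. This is not the code in this commit: the helper name open_conf_with_retry, the bounded attempt count, and the delay parameter are all assumptions made for the example.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's actual implementation): try to open
 * a configuration file for reading, sleeping between failed attempts.
 * Returns an open FILE * on success, or NULL once max_tries attempts
 * have all failed. The caller is responsible for fclose().
 */
static FILE *open_conf_with_retry(const char *path, int max_tries,
                                  unsigned int delay_sec)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        FILE *fp = fopen(path, "r");
        if (fp)
            return fp;

        /* Report why this attempt failed (errno is set by fopen). */
        perror(path);

        /* Do not sleep after the final failed attempt. */
        if (attempt < max_tries)
            sleep(delay_sec);
    }
    return NULL;
}
```

A caller might use something like open_conf_with_retry("/etc/slurm/slurm.conf", 5, 10) and exit only if it returns NULL; the path and numbers here are placeholders, not values taken from this change.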