Energy use collection logic
Attached is the energy accounting patch that Martin and Yiannis have been working on. The framework is there, but the functionality is currently not working. They are both on vacation this week and will be back a week before the conference. I thought it would be better to send it now, so that the framework and the data structures are in place for an official 2.5.0, instead of waiting. If you disagree, just let us know and we can send it again once the low-level functionality is working.

Here is a short summary of our test results.

1. jobacct_gather/none + energy_accounting/none
   Looks OK. Did not find any errors.

2. jobacct_gather/linux or cgroup + energy_accounting/none
   Looks OK. Did not find any errors.

3. jobacct_gather/linux or cgroup + energy_accounting/rapl
   Slurmd aborts when you run a job that uses a node that does not support RAPL. This appears to be caused by the error()/pexit() calls at lines 150/151 in energy_accounting_rapl.c. We need to change this code to just issue a debug message and return. For now, energy_accounting must not be configured if the cluster includes any nodes that do not support RAPL.
   The CPU frequency values reported by jobacct_gather are not correct.

Again, there are obviously some problems, so if it would be better to wait for full functionality, just let us know. It may be three weeks before they are able to spend time fixing these problems, which is why I thought you might prefer to have something with the correct data structures in place sooner rather than later.
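As a rough illustration of the fix proposed in item 3, here is a minimal sketch of replacing an abort with a debug message and an error return. It is not the code from energy_accounting_rapl.c: it assumes the failing step is an open() on a per-CPU device file, and the names sketch_debug, sketch_open_msr, and sketch_energy_supported are hypothetical stand-ins invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the daemon's debug logging call. */
#define sketch_debug(...) \
	do { \
		fprintf(stderr, "debug: " __VA_ARGS__); \
		fputc('\n', stderr); \
	} while (0)

/*
 * Hypothetical helper: open the device file used for energy readings.
 * Assumption: the abort described above happens when this open fails
 * on a node without RAPL support.
 * Returns an open file descriptor, or -1 if the file cannot be opened.
 */
static int sketch_open_msr(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		/* Instead of logging an error and exiting the daemon,
		 * log at debug level and let the caller carry on. */
		sketch_debug("cannot open %s: %s; no energy data on this node",
			     path, strerror(errno));
		return -1;
	}
	return fd;
}

/*
 * Hypothetical caller: reports whether energy readings are available.
 * A failed open means "unsupported" rather than a fatal condition.
 * Returns 1 if the device file can be opened, 0 otherwise.
 */
static int sketch_energy_supported(const char *path)
{
	int fd = sketch_open_msr(path);

	if (fd < 0)
		return 0;
	close(fd);
	return 1;
}
```

With this shape, a node that lacks RAPL would simply report no energy data for the job, and slurmd would keep running.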