Wrong job disposition with horizontal-vertical wrapper on auto-monarch
Hi @dbeltran
After running several tests with the horizontal-vertical wrapper for the SIM & DA jobs on auto-monarch, I found some problems that I would like to check with you. The setup is 2 startdates, a SIM job with 12 members and a DA job with only one member. For all chunks, we want the 12 SIM members of one startdate to run at the same time as the DA job of the other startdate (13 jobs in total). The issues I found (they may be related to each other):
- For the simple case of running 2 chunks for each startdate, the wrapper seems to work well only when the total number of processors is less than or equal to the maximum number of cores available (19200 for the PRACE accounts). An example here (experiment a1q7):
The SIM job processors value is set to 288 and the DA value is set to 256. The actual number of processes that will run at the same time is 12x288+256 = 3712, and the total number of processors for all the jobs in the wrapper is 36x288+256x4 = 11392. Both values are below the 19200 maximum cores allowed, so the wrapper is built correctly and contains all the jobs: 288cores_2chunks.pdf
But if we increase the number of processors for the SIM job to 528, the number of processors needed at the same time is now 12x528+256 = 6592, and the total for the whole wrapper is 36x528+256x4 = 20032. The second value is greater than the maximum allowed cores but, if I'm not wrong, it should still work because we never need more than the maximum number of cores at the same time. Instead, some jobs are left outside the wrapper (it only contains 29 jobs): 528cores_2chunks.pdf
Despite this, the wrapper job seems to be submitted to MareNostrum with the correct number of cores: a1q7_ASThread_15505623484889_6592_29.cmd
Could it be that the wrapper is limiting the number of jobs inside it by the number of processors needed for the whole wrapper, instead of only taking into account the number of processes that run at the same time?
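To make the two quantities explicit, here is a small Python sketch of the arithmetic above. It is not taken from the Autosubmit code: the function name and its parameters are hypothetical, and the job counts (12 SIM members plus 1 DA job running together, 36 SIM and 4 DA jobs in the whole wrapper) are just the numbers from my a1q7 test.

```python
# Sketch of the core counts discussed above (not Autosubmit code).
MAX_CORES = 19200  # maximum cores for the PRACE accounts, as quoted above


def wrapper_cores(sim_procs, da_procs,
                  sim_concurrent=12, da_concurrent=1,
                  sim_jobs=36, da_jobs=4):
    """Return (cores needed at the same time, cores summed over all jobs)."""
    concurrent = sim_concurrent * sim_procs + da_concurrent * da_procs
    total = sim_jobs * sim_procs + da_jobs * da_procs
    return concurrent, total


if __name__ == "__main__":
    for sim_procs in (288, 528):
        concurrent, total = wrapper_cores(sim_procs, da_procs=256)
        print(
            f"SIM={sim_procs}: concurrent={concurrent} "
            f"(fits: {concurrent <= MAX_CORES}), "
            f"total={total} (fits: {total <= MAX_CORES})"
        )
```

With SIM at 528 processors the concurrent value still fits under the limit while the summed value does not, which matches the case where jobs are left outside the wrapper.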
- Second, I found that when increasing the number of chunks the jobs are also misplaced in the wrapper. In this case I also increased the MAXWAITINGJOBS and TOTALJOBS autosubmit variables (I don't know whether both are necessary, but I set both to 1000), as we submit a lot more jobs at the same time. Here is an example with 10 chunks (experiment a1oe, SIM processors set to 192 and DA processors set to 144): a1oe_20190218_1229.pdf
Here the wrapper contains 228 jobs (210 SIM and 18 DA), but for some reason some other SIM and DA jobs are left outside the wrapper. There also seems to be a problem with the order and the dependencies of the jobs: according to the plot, it will start with chunks 5 and 9 instead of chunk 1.
I'm not sure I have explained it very clearly, as it's a bit confusing. If anything is unclear, just ask me.