vertical-horizontal wrapper not working properly
There are cases in which the horizontal-vertical wrapper starts to behave strangely (it is not yet clear to me whether this happens after a job failure or systematically).
Here are three issues (a, b, c) that I found while running 2 start dates (with the crossing-date strategy), 1 Jan and 1 Jul, for 1 month each in experiment a1yw.
- first problematic wrapper:
It created this wrapper job with only the Jul SIMs and DA (nothing from Jan):
ls -l /gpfs/scratch/pr1ehu00/pr1ehu15/a1yw/LOG_a1yw/a1yw_ASThread_15600045392833_15552_195.cmd
-rwxrwxr-x 1 pr1ehu15 pr1ehu00 11049 jun 8 16:45 /gpfs/scratch/pr1ehu00/pr1ehu15/a1yw/LOG_a1yw/a1yw_ASThread_15600045392833_15552_195.cmd
or
/esarchive/autosubmit/a1yw/tmp> vi a1yw_ASThread_15600045392833_15552_195.cmd
I think the log is this one:
/esarchive/autosubmit/a1yw/tmp> vi 20190606_175701_run.log
issue a) Why has it not included Jan? 17 Jan was free to run, since I can see that all the SIMs before it were completed:
pr1ehu15@login1:/gpfs/scratch/pr1ehu00/pr1ehu15/a1yw/LOG_a1yw> ls -lrt *201201*_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_007_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_001_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_011_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_009_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_006_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_004_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_010_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_002_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_005_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_003_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 13:56 a1yw_20120101_008_17_SIM_COMPLETED
-rw-r--r-- 1 pr1ehu15 pr1ehu00 0 jun 8 14:26 a1yw_20120101_000_17_SIM_COMPLETED
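The check done by eye on the listing above could be scripted roughly as follows. This is only a sketch: the file-name layout `<expid>_<startdate>_<member>_<chunk>_SIM_COMPLETED` is inferred from the names in the listing, and the function name `completed_members` is made up for illustration (it is not part of Autosubmit).

```python
import re

# Assumed layout, inferred from the listing: <expid>_<startdate>_<member>_<chunk>_SIM_COMPLETED
PATTERN = re.compile(r"^(\w{4})_(\d{8})_(\d{3})_(\d+)_SIM_COMPLETED$")

def completed_members(filenames, startdate, chunk):
    """Return the set of member ids whose SIM for this start date and chunk is completed."""
    members = set()
    for name in filenames:
        m = PATTERN.match(name)
        if m and m.group(2) == startdate and int(m.group(4)) == chunk:
            members.add(m.group(3))
    return members

# The 12 file names shown in the listing (members 000 to 011, chunk 17).
listing = [f"a1yw_20120101_{m:03d}_17_SIM_COMPLETED" for m in range(12)]

done = completed_members(listing, "20120101", 17)
missing = {f"{m:03d}" for m in range(12)} - done
print(sorted(missing))  # an empty list means all 12 members are done
```

In practice the `listing` would come from the LOG_a1yw directory (for example via `os.listdir`) rather than being built by hand.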
issue b) This wastes a lot of resources each time the DA is run: the first chunk is made of 12 SIMs, while the second chunk is only a DA job (which needs only 2500 cores instead of the 15552 reserved by the wrapper).
As a result, it finishes running the whole of July and only then prepares a wrapper for the remaining Jan dates (i.e. from 17 Jan), which is where problem no. 2 comes in.
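To put a number on the waste, here is a small back-of-the-envelope calculation. It uses only figures quoted in this issue (12 SIMs, 1296 cores per SIM as reported for the second wrapper, 2500 cores for the DA); the variable names are just illustrative.

```python
# Figures quoted in this issue; the names are illustrative only.
SIM_CORES = 1296   # cores needed by each SIM job
N_SIMS = 12        # SIMs that run side by side in one chunk
DA_CORES = 2500    # cores needed by the DA job

reserved = N_SIMS * SIM_CORES          # size of the wrapper reservation
idle_during_da = reserved - DA_CORES   # cores held but unused while the DA runs
idle_fraction = idle_during_da / reserved

print(f"reserved={reserved} idle={idle_during_da} ({idle_fraction:.0%})")
```

So, under these numbers, roughly five cores out of six sit idle for the whole duration of every DA step inside the wrapper.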
- second problematic wrapper:
it prepares a wrapper asking for 2500 cores:
/gpfs/scratch/pr1ehu00/pr1ehu15/a1yw/LOG_a1yw> vi a1yw_ASThread_15600668827226_2500_183.cmd
or
/esarchive/autosubmit/a1yw/tmp> vi a1yw_ASThread_15600668827226_2500_183.cmd
I think the run.log is this one:
/esarchive/autosubmit/a1yw/tmp> vi 20190609_095333_run.log
The first chunk is a DA job that needs 2500 cores; the second chunk is 12 SIMs, where each SIM needs 1296 cores (i.e. 15552 cores in total)!
issue c) So this wrapper will never work.
(It was one of those wrappers that got stuck because of the recent module issue #410 (closed), but I ran the wrapper manually on the HPC just to prove that it would run a1yw_20120701_17_DA.cmd and then fail.)
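The mismatch in both wrappers can be expressed as a simple sanity check: the reservation has to cover the largest chunk, not just the first one. The sketch below is not Autosubmit code; `layer_demand`, `check_wrapper` and the job labels are hypothetical, and only the core counts come from this issue.

```python
def layer_demand(layer):
    """Total cores needed when all jobs of one chunk run at the same time."""
    return sum(cores for _name, cores in layer)

def check_wrapper(reserved_cores, layers):
    """Return the indices of the chunks that do not fit in the reservation."""
    return [i for i, layer in enumerate(layers)
            if layer_demand(layer) > reserved_cores]

# Hypothetical job labels; core counts as quoted above.
da_layer = [("DA", 2500)]
sim_layer = [(f"SIM_{m:03d}", 1296) for m in range(12)]

# First wrapper: SIMs first, DA second, 15552 cores reserved. It fits but wastes cores.
first = check_wrapper(15552, [sim_layer, da_layer])
# Second wrapper: DA first, SIMs second, only 2500 cores reserved. The SIMs cannot fit.
second = check_wrapper(2500, [da_layer, sim_layer])
# Reservation that would be needed to run either ordering.
needed = max(layer_demand(da_layer), layer_demand(sim_layer))

print(first, second, needed)
```

It looks as if the wrapper size is taken from the first chunk only, which would explain both b) and c), but that is a guess from the two .cmd file names and not something I have verified in the code.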
To summarize:
a) The main problem is that it loses the ability to cross dates. I am not sure whether this happens as soon as there is a problem (I could not even find out whether there was a failure) or systematically.
b) It creates wrappers that waste a lot of resources (if the SIMs go in the first chunk).
c) It creates wrappers that will never work (if the DA goes in the first chunk).
Is there a fix that could be applied to force the crossing of dates in the first place?