Multi-wrapper setup does not take the policy into account (+ wrong error message)
Hello @dbeltran and @bdepaula,
Autosubmit Version
v3.15.0b
Expid affected
a5ls
Which task has issues? Where is the log?
The issue is with the wrappers, and also with the submission of the individual jobs (that part was already reported in #1062 (closed)).
Summary
I'm currently running a multi-member, multi-startdate (2 of each) experiment on MeluXina to test how the horizontal wrappers work. I set up a multi wrapper that combines a horizontal wrapper with a vertical one for the SIM jobs, to mimic the expected general use.
The issue arises when the first SIM job finishes (there are 4 running in parallel in their respective vertical wrappers thanks to the fast queuing on this machine, but this won't be the usual behavior), at which point 2 new jobs (both belonging to the horizontal wrapper) become available. The wrappers have these settings:
[wrapper]
TYPE = multi
WRAPPER_LIST = SIM_WRAPPER,CMOR_WRAPPER
[SIM_WRAPPER]
TYPE = vertical
JOBS_IN_WRAPPER = SIM
MIN_WRAPPED = 3
MAX_WRAPPED = 10
POLICY = flexible #default is flexible. Values: flexible,strict,mixed
[CMOR_WRAPPER]
TYPE = horizontal
JOBS_IN_WRAPPER = CMORATM&CMOROCE&SAVEIC&POST
MIN_WRAPPED = 4
MAX_WRAPPED = 20
POLICY = mixed #default is flexible. Values: flexible,strict,mixed
This should make these 2 jobs wait for another 2 before being submitted together in a wrapper, since they have the mixed policy. But for some reason the log says that the wrapper has the flexible policy and submits them independently. This crashes Autosubmit, since the SAVEIC job has the following header:
###############################################################################
# SAVEIC a5ls EXPERIMENT
###############################################################################
#
#SBATCH --qos=default
#SBATCH -A p200125
#
#SBATCH --cpus-per-task=12
#SBATCH --tasks-per-node=128
#
#SBATCH -n 1
#SBATCH -t 00:40:00
#SBATCH -J a5ls_19900201_fc00_1_SAVEIC
#SBATCH --output=/project/home/p200125/u100498/a5ls/LOG_a5ls/a5ls_19900201_fc00_1_SAVEIC.cmd.out.0
#SBATCH --error=/project/home/p200125/u100498/a5ls/LOG_a5ls/a5ls_19900201_fc00_1_SAVEIC.cmd.err.0
#SBATCH -N 1
#SBATCH --hint=nomultithread
#
#
###############################################################################
This header conflicts in the --cpus-per-task, the --tasks-per-node and the reserved nodes: with --cpus-per-task=12 and --tasks-per-node=128 on a single node, presumably far more CPUs are requested than one node provides, hence the rejected node configuration. The crash comes from the issue mentioned above (#1062 (closed)), which introduces into the job header some wrong lines that were meant for the wrapper settings.
But the main objective of this issue is to discern why the horizontal wrapper is being treated as having the flexible policy.
Steps to reproduce
Copying and running the a5ls experiment should show the issue (but it requires access to MeluXina). Another way would be to generate a multi-member, multi-startdate experiment with a multi-wrapper configuration (one flexible and one mixed wrapper) in which the mixed-policy wrapper ends up with fewer ready jobs than its MIN_WRAPPED. Then, if it behaves as in this case, it will report a deadlock and, citing the flexible policy, submit the jobs independently; a minimal configuration sketch is shown below.
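For reference, this is a minimal sketch of such a configuration (the wrapper and job names are generic placeholders, not taken from a5ls); the key point is that at some moment fewer than MIN_WRAPPED jobs of the mixed wrapper are ready:
[wrapper]
TYPE = multi
WRAPPER_LIST = VERTICAL_WRAPPER,HORIZONTAL_WRAPPER
[VERTICAL_WRAPPER]
TYPE = vertical
JOBS_IN_WRAPPER = SIM
MIN_WRAPPED = 2
MAX_WRAPPED = 10
POLICY = flexible
[HORIZONTAL_WRAPPER]
TYPE = horizontal
JOBS_IN_WRAPPER = POST
MIN_WRAPPED = 4
MAX_WRAPPED = 20
POLICY = mixed
If the bug reproduces, the log will claim the flexible policy for the horizontal wrapper and submit its jobs independently instead of holding them back.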
What is the current bug behavior?
Here I add the output of the autosubmit run command. The crash, as I mentioned, is due to another issue; the problem at hand is the line regarding the flexible policy.
717 of 724 jobs remaining (12:45)
Job a5ls_20000201_fc00_1_SIM is RUNNING
Job a5ls_20000201_fc01_1_SIM is RUNNING
Job a5ls_19900201_fc01_1_SIM is RUNNING
Job a5ls_19900201_fc00_1_SIM is COMPLETED
Job a5ls_19900201_fc00_1_SIM finished at 2023-07-21 12:45:59
Wrapper policy is set to flexible and there is a deadlock, As will submit the jobs sequentally
[ERROR] Trace: Command /project/home/p200125/u100498/a5ls/LOG_a5ls/submit_meluxina.sh in meluxina_autosubmit warning: sbatch: error: Batch job submission failed: Requested node configuration is not available
[CRITICAL] Check that meluxina platform has set the correct scheduler. Sections that could be affected: SAVEIC&CMOROCE [eCode=7014]
More info at https://autosubmit.readthedocs.io/en/v3.15.0/faq.html
What is the expected correct behavior?
As I mentioned, the crash is not (fully) related to this issue, only the part affecting the wrapper policies.
The expected behavior would be for the CMOROCE and SAVEIC jobs to wait until a wrapper of the required dimensions can be submitted.
Relevant logs and/or screenshots
Everything is already included in the other parts of the issue, but if you think any other log could be useful I can add it later.
Any other relevant information
After talking about this with Dani, we think the problem comes from the multi-wrapper setup, which for some reason is not taking the policy into account. It still needs to be determined whether it is taking the default (flexible) for both wrappers or not, since one of them already uses it. So he mentioned it could be worth testing whether, with both wrappers set to mixed, that policy is detected for the vertical one. Or maybe (my thought) only the first policy set in the conf file is taken into account and then applied to all the wrappers; a small illustrative sketch of these two possibilities follows.
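To make those two hypotheses concrete, here is a small illustrative Python sketch (hypothetical helper names, not Autosubmit's actual internals) showing how each suspected lookup would resolve the POLICY of the sub-wrappers:
# Illustrative sketch only: hypothetical helpers, not Autosubmit's real code.
import configparser

CONF = """
[wrapper]
TYPE = multi
WRAPPER_LIST = SIM_WRAPPER,CMOR_WRAPPER
[SIM_WRAPPER]
POLICY = flexible
[CMOR_WRAPPER]
POLICY = mixed
"""

parser = configparser.ConfigParser()
parser.read_string(CONF)
sub_wrappers = parser["wrapper"]["WRAPPER_LIST"].split(",")

# Correct behavior: each sub-wrapper keeps its own policy.
def policy_per_wrapper(name):
    return parser[name].get("POLICY", "flexible")

# Hypothesis 1: the per-wrapper section is never consulted, so every
# sub-wrapper falls back to the default ("flexible").
def policy_default_only(name):
    return parser["wrapper"].get("POLICY", "flexible")

# Hypothesis 2: the first policy found is reused for all sub-wrappers.
def policy_first_only(name):
    return parser[sub_wrappers[0]].get("POLICY", "flexible")

for name in sub_wrappers:
    print(name, policy_per_wrapper(name),
          policy_default_only(name), policy_first_only(name))
With this configuration both hypotheses yield flexible for CMOR_WRAPPER, which is what the log shows. Setting POLICY = mixed in both sections would tell them apart: hypothesis 1 would still report flexible, while hypothesis 2 would then report mixed for both wrappers.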