Time limit failure after a failure of the platform
Hello @dbeltran and @bdepaula,
Autosubmit Version
4.0.104 (taken from the minimal.yml, Jeisson to confirm)
Expid affected (If applicable)
a12l
Which task has issues? Where is the log (If applicable)
- Full_name: a12l_20091024_fc0_4_1_OPA
- Log_Path: /appl/AS/AUTOSUBMIT_DATA/a12l/tmp/LOG_a12l/a12l_20091024_fc0_4_1_OPA.20240311105238.err
Summary
The job was killed for exceeding its time limit after a platform failure. The error reported by Autosubmit was: [ERROR] Submission failed, this can be due a failure on the platform [eCode=6015]
Steps to reproduce
- Apparently an error on the machine side; not reproducible.
What is the current bug behavior?
After that error, Autosubmit checks the configuration files again, but the experiments that were already submitted no longer run. They stay that way until they reach the wallclock limit, at which point they are set to failed.
What is the expected correct behavior?
According to the documentation, this error comes from a failed submission, but the jobs had apparently already been submitted. I expected AS to recover from this automatically.
Relevant logs and/or screenshots (if applicable)
[ERROR] Trace: Jobs_id []
[ERROR] Submission failed, this can be due a failure on the platform [eCode=6015]
Storing failed job count...
Storing failed job count...done
Waiting 15 seconds before continue
Checking configuration files...
Platforms sections: OK
Jobs sections OK
Autosubmit general sections OK
and then
+ load_environment_gsv /projappl/project_465000454/ a0fe
+ set +xuve
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) ModuleLabel/label 6) craype-accel-host
2) lumi-tools/23.11 7) libfabric/1.15.2.0
3) init-lumi/0.2 8) craype-network-ofi
4) LUMI/23.09 9) xpmem/2.5.2-2.4_3.50__gd0f7936.shasta
5) craype-x86-milan 10) partition/C
The following sticky modules could not be reloaded:
1) lumi-tools
slurmstepd: error: *** JOB 6428756 ON nid002595 CANCELLED AT 2024-03-11T11:11:54 DUE TO TIME LIMIT ***
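To help spot the same pattern in other runs, here is a minimal sketch that scans log lines for the two messages shown above: the submission failure with its eCode, and the Slurm time-limit cancellation. The function name `scan_log` and both regular expressions are my own assumptions, derived only from the excerpts in this report; they are not part of Autosubmit, and other versions may word these messages differently.

```python
import re

# Patterns assumed from the log excerpts in this report (not an Autosubmit API).
SUBMIT_FAIL = re.compile(r"\[ERROR\] Submission failed.*\[eCode=(\d+)\]")
TIME_LIMIT = re.compile(
    r"slurmstepd: error: \*\*\* JOB (\d+) ON (\S+) "
    r"CANCELLED AT (\S+) DUE TO TIME LIMIT \*\*\*"
)


def scan_log(lines):
    """Collect submission-failure codes and time-limit kills from log lines."""
    found = {"submission_failures": [], "time_limit_kills": []}
    for line in lines:
        match = SUBMIT_FAIL.search(line)
        if match:
            found["submission_failures"].append(int(match.group(1)))
        match = TIME_LIMIT.search(line)
        if match:
            found["time_limit_kills"].append(
                {
                    "job_id": int(match.group(1)),
                    "node": match.group(2),
                    "cancelled_at": match.group(3),
                }
            )
    return found
```

In this report the two messages come from different files (the Autosubmit run log and the job's `.err` log), so the lines of both would need to be passed in to see the failure and the later kill together.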
Any other relevant information (if applicable)
Full report from @jjavier here