added --no-requeue
-
Maintainer
👏 Thanks @dbeltran! #TIL about that flag. I think Cylc just runs `sbatch` on the file, with no extra flags like `--no-requeue` or `-D`. So in theory they would run into the same issue on LUMI, I think, unless the workflow dev or site admin configured the env var. Interesting! Ditto for ECMWF's Troika, https://github.com/ecmwf/troika/blob/7e19148e6a72f587b7030e39cb83b7a17a2b970d/src/troika/sites/slurm.py#L226-L234.
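For illustration, here is a minimal sketch of the difference between a bare `sbatch` call and one that opts out of requeueing. The helper names are hypothetical and not taken from Cylc, Troika, or Autosubmit. The environment-variable variant assumes the variable in question is `SBATCH_NO_REQUEUE`, which the sbatch documentation lists as equivalent to `--no-requeue`; the value `"1"` is my assumption.

```python
import os


def build_sbatch_command(script_path, no_requeue=True, chdir=None):
    """Return the argv list for submitting script_path with sbatch.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = ["sbatch"]
    if no_requeue:
        # Ask Slurm never to requeue this batch job.
        cmd.append("--no-requeue")
    if chdir is not None:
        # -D is the short form of --chdir (working directory of the job).
        cmd.extend(["-D", chdir])
    cmd.append(script_path)
    return cmd


def env_with_no_requeue(base_env=None):
    """Return a copy of the environment with SBATCH_NO_REQUEUE set.

    Assumption: setting the variable to "1" is enough for sbatch to
    treat it like the --no-requeue flag.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SBATCH_NO_REQUEUE"] = "1"
    return env
```

The returned list and environment could then be handed to `subprocess.run(cmd, env=env)` on a machine where Slurm is installed.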
cc @mcastril
☝ -
Owner
Is there an issue in the Autosubmit project for this problem? I couldn't find one, so I'll comment here.
Reading the option's description in the docs again:
Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job.
I had doubts about which approach is correct:
- Always use --no-requeue
or
- Find if there's a requeued job and track it
I think the first is the safest (otherwise the user cannot control the number of retries), but the second would work in cases where AS has been switched off or the machine loses connectivity.
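To make the second option more concrete, here is a rough sketch of how a requeued job could be spotted. It assumes the `Restarts=` field that `scontrol show job` prints, and it parses text that has already been captured, so the function names and the sample text in the usage note are illustrative only, not Autosubmit code.

```python
import re

# Matches the "Restarts=<n>" field in `scontrol show job` output.
_RESTARTS = re.compile(r"\bRestarts=(\d+)")


def restart_count(scontrol_output):
    """Return the restart count found in the text, or None if absent."""
    match = _RESTARTS.search(scontrol_output)
    return int(match.group(1)) if match else None


def was_requeued(scontrol_output):
    """True when the job reports at least one restart."""
    count = restart_count(scontrol_output)
    return count is not None and count > 0
```

For example, `was_requeued("JobId=123 JobName=a000_SIM\n   Requeue=1 Restarts=2 BatchFlag=1")` is true, while text without a `Restarts=` field yields `None` from `restart_count`, so the caller can tell "not requeued" from "unknown".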
-
Author Maintainer
Hello @mcastril,
It is in the !395 (merged) merge branch and was already added in 4.1 (beta-dev11). The `--no-requeue` parameter was added there.
About the second: before I knew about the `--no-requeue` parameter, I added a way to detect duplicate jobs in the submission process. Once Autosubmit submits a job, it looks the job up by name among both failed and successful jobs to check whether it is duplicated.
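As a rough illustration of the idea (not the actual Autosubmit implementation), duplicate detection by name can be sketched as grouping queue entries by job name and flagging any name that maps to more than one job ID. The `(job_id, name, state)` tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def find_duplicate_jobs(queue_entries):
    """Group entries by job name and return the names seen more than once.

    queue_entries: iterable of (job_id, name, state) tuples, covering
    both failed and successful jobs (hypothetical layout).
    Returns a dict mapping each duplicated name to its sorted job IDs.
    """
    ids_by_name = defaultdict(list)
    for job_id, name, _state in queue_entries:
        ids_by_name[name].append(job_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }
```

With entries `(101, "a000_SIM", "FAILED")`, `(102, "a000_SIM", "COMPLETED")` and `(103, "a000_POST", "COMPLETED")`, only `a000_SIM` is reported, with IDs 101 and 102.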
-
Owner
Thanks @dbeltran, that's helpful
The issue is that, by using `--no-requeue`, we should eliminate any possibility of resubmission, which I think is good because the user will have more control over the retries. On the other hand, we miss the opportunity for retries after maintenance. But I think it's OK not to have jobs retried on their own without AS in the loop.