added --no-requeue (89b8d21a) · Commits · Earth Sciences / autosubmit

👏

Thanks @dbeltran! #TIL about that flag. I think Cylc just runs sbatch on the file, with no extra flags like --no-requeue or -D. So in theory they would run into the same issue on LUMI, I think, unless the workflow dev or site admin configured the env var. Interesting!

https://github.com/cylc/cylc-flow/blob/7a813eb3c47248490557882279988bf7c8303583/cylc/flow/job_runner_handlers/slurm.py#L130

Ditto for ECMWF's Troika, https://github.com/ecmwf/troika/blob/7e19148e6a72f587b7030e39cb83b7a17a2b970d/src/troika/sites/slurm.py#L226-L234.

cc @mcastril ☝
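To make the two options above concrete, here is a minimal sketch of passing the flag explicitly versus setting the environment variable. It is not taken from Autosubmit, Cylc, or Troika: the helper names `build_sbatch_command` and `sbatch_env` are hypothetical, and the value `"1"` for `SBATCH_NO_REQUEUE` is an assumption (the sbatch man page lists the variable as the equivalent of `--no-requeue`, so check what your site's Slurm version expects).

```python
import os
import shlex


def build_sbatch_command(script_path, no_requeue=True):
    """Return the sbatch argv for a job script (hypothetical helper).

    When no_requeue is True, --no-requeue is placed before the script path,
    so sbatch treats it as its own option and not as a script argument.
    """
    cmd = ["sbatch"]
    if no_requeue:
        cmd.append("--no-requeue")
    cmd.append(script_path)
    return cmd


def sbatch_env(base_env=None, no_requeue=True):
    """Return a copy of the environment for sbatch (hypothetical helper).

    Assumption: setting SBATCH_NO_REQUEUE to "1" has the same effect as
    passing --no-requeue on the command line.
    """
    env = dict(os.environ if base_env is None else base_env)
    if no_requeue:
        env["SBATCH_NO_REQUEUE"] = "1"
    return env


if __name__ == "__main__":
    # Show the command line that would be executed; nothing is submitted here.
    print(shlex.join(build_sbatch_command("job.cmd")))
```

The flag keeps the behaviour visible in the submission command itself, while the environment variable lets a site admin or workflow developer change it without touching the submitting tool.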

mentioned in issue #897

Is there an issue in the Autosubmit project for this problem? I didn't find one, so I'll comment here.

Reading again the option in the docs:

Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job.

I had doubts about which approach is correct:

Always use --no-requeue

or

Find if there's a requeued job and track it

I think the first is the safest (otherwise the user cannot control the number of retries), but the second would work in cases where AS has been switched off or the machine loses connectivity.
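For the second approach, one possible way to notice that Slurm requeued a job is to look at the restart counter in the `scontrol show job <id>` output. The sketch below only parses text that has already been captured. It assumes the output contains a `Restarts=<n>` field, as it does on the Slurm versions I have seen; the function name `restart_count` and the sample text are illustrative, not part of Autosubmit.

```python
import re


def restart_count(scontrol_output):
    """Return the Restarts value from `scontrol show job` text, or None.

    Assumption: the output contains a field of the form `Restarts=<n>`.
    """
    match = re.search(r"\bRestarts=(\d+)", scontrol_output)
    return int(match.group(1)) if match else None


def was_requeued(scontrol_output):
    """True if the job has been restarted at least once."""
    count = restart_count(scontrol_output)
    return count is not None and count > 0


if __name__ == "__main__":
    # Illustrative sample, not real output from any particular cluster.
    sample = (
        "JobId=123 JobName=a000_SIM\n"
        "   Requeue=1 Restarts=2 BatchFlag=1 Reboot=0 ExitCode=0:0\n"
    )
    print(restart_count(sample), was_requeued(sample))
```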

Hello @mcastril ,

It is in merge request !395 (merged) and was already added in 4.1 (beta-dev11). The --no-requeue parameter was added there.

About the second: before I knew about the --no-requeue parameter, I added a way to detect duplicate jobs in the submission process. Once Autosubmit submits a job, it looks the job up by name, in both the failure and the success case, to check whether it is duplicated.
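As a rough illustration of detecting duplicates by name (not the actual Autosubmit code), the sketch below groups job IDs by job name from the text produced by something like `squeue --noheader --format="%i %j"`, where `%i` is the job ID and `%j` the job name. The function name `find_duplicate_jobs` and the sample job names are hypothetical.

```python
from collections import defaultdict


def find_duplicate_jobs(squeue_output):
    """Map each job name that appears more than once to its job IDs.

    Expects one job per line in the form "<job_id> <job_name>".
    Lines that do not have both fields are skipped.
    """
    ids_by_name = defaultdict(list)
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        job_id, name = parts
        ids_by_name[name.strip()].append(job_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


if __name__ == "__main__":
    # Illustrative sample: a000_SIM was queued twice.
    sample = "101 a000_SIM\n102 a000_POST\n103 a000_SIM\n"
    print(find_duplicate_jobs(sample))
```

A check like this only helps while the submitting tool is running and can reach the scheduler, which is why it complements --no-requeue and does not replace it.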

Thanks @dbeltran, that's helpful

The point is that by using --no-requeue we should eliminate any possibility of resubmission, which I think is good because the user will have more control over the retries. On the other hand, we lose the retries that Slurm would otherwise do after maintenance. But I think it's OK not to have jobs retried on their own without AS in the loop.