Commit 89b8d21a authored by dbeltran

added --no-requeue

parent da2b04f6
  • 👏

    Thanks @dbeltran! #TIL about that flag. I think Cylc just runs sbatch on the file, with no extra flags like --no-requeue or -D. So in theory Cylc users would run into the same issue on LUMI, unless the workflow developer or site admin configured the env var. Interesting!

    https://github.com/cylc/cylc-flow/blob/7a813eb3c47248490557882279988bf7c8303583/cylc/flow/job_runner_handlers/slurm.py#L130

    Ditto for ECMWF's Troika, https://github.com/ecmwf/troika/blob/7e19148e6a72f587b7030e39cb83b7a17a2b970d/src/troika/sites/slurm.py#L226-L234.

    cc @mcastril
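To make the difference concrete, here is a minimal sketch of a submission wrapper that adds `--no-requeue` explicitly instead of relying on site configuration. The function names are hypothetical and this is not the code used by Autosubmit, Cylc, or Troika. The environment variable check assumes the variable meant above is `SBATCH_NO_REQUEUE`, which the sbatch documentation lists as the input environment variable equivalent to `--no-requeue`.

```python
import os


def build_sbatch_command(script_path, no_requeue=True, extra_flags=()):
    """Return the argv list for submitting script_path with sbatch.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = ["sbatch"]
    if no_requeue:
        # Ask Slurm never to requeue this job on its own.
        cmd.append("--no-requeue")
    cmd.extend(extra_flags)
    cmd.append(script_path)
    return cmd


def requeue_disabled_by_env(environ=None):
    """Report whether SBATCH_NO_REQUEUE is set in the given environment.

    Assumption: a workflow developer or site admin exports this variable
    so that plain `sbatch script` calls behave like `--no-requeue`.
    """
    environ = os.environ if environ is None else environ
    return "SBATCH_NO_REQUEUE" in environ
```

A workflow manager that only runs `sbatch job.cmd` would correspond to `build_sbatch_command("job.cmd", no_requeue=False)`, so its behaviour depends entirely on the cluster default or on the environment variable.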

  • mentioned in issue #897

  • Is there an issue in the Autosubmit project for this problem? I didn't find one, so I'll comment here.

    Reading the option's description in the docs again:

    Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job.

    I had doubts about which approach is correct:

    • Always use --no-requeue

    or

    • Find if there's a requeued job and track it

    I think the first is the safest (otherwise the user cannot control the number of retries), but the second would work in cases where AS has been switched off or the machine loses connectivity.
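As a rough illustration of the second option, the sketch below reads the restart counter from the text printed by `scontrol show job <id>`. It assumes that output contains a `Restarts=N` field and that a non-zero value means Slurm requeued the job. The function name and the sample text are hypothetical, and this is not how Autosubmit tracks jobs.

```python
import re


def restart_count(scontrol_output):
    """Return the value of the Restarts= field, or None if it is absent.

    Assumption: scontrol_output is the text of `scontrol show job <id>`.
    """
    match = re.search(r"\bRestarts=(\d+)", scontrol_output)
    return int(match.group(1)) if match else None


# Hypothetical excerpt of scontrol output for a job that was requeued twice.
sample = (
    "JobId=123 JobName=a000_SIM\n"
    "   Requeue=1 Restarts=2 BatchFlag=1 Reboot=0 ExitCode=0:0\n"
)

if restart_count(sample):
    print("job was requeued by Slurm, retries happened outside AS control")
```

Tracking this way only detects the requeue after the fact. It does not let the user limit the number of retries, which is the argument for the first option.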

  • Author Maintainer

    Hello @mcastril ,

    It is in the !395 (merged) merge request branch, already added in 4.1 (beta-dev11). The --no-requeue parameter was added there.

    About the second, before I knew about the --no-requeue parameter, I added a way to detect duplicate jobs in the submission process. Once Autosubmit submits a job, it looks the job up by name among both failed and successful jobs to check whether it is duplicated.
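The name-based duplicate check described above could look roughly like the following. This is a sketch under stated assumptions, not the Autosubmit implementation: the record layout `(job_id, job_name, state)` and the function name are hypothetical, and the records are assumed to come from the scheduler's accounting of both failed and completed jobs.

```python
from collections import defaultdict


def find_duplicate_jobs(records):
    """Group job IDs by job name and return the names seen under several IDs.

    records: iterable of (job_id, job_name, state) tuples covering both
    failed and successful jobs (hypothetical layout).
    """
    ids_by_name = defaultdict(list)
    for job_id, job_name, _state in records:
        # Count each job ID once, even if it appears in several records.
        if job_id not in ids_by_name[job_name]:
            ids_by_name[job_name].append(job_id)
    # A name that maps to more than one job ID is treated as duplicated.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


records = [
    ("101", "a000_SIM", "COMPLETED"),
    ("102", "a000_SIM", "FAILED"),
    ("103", "a000_POST", "COMPLETED"),
]
duplicates = find_duplicate_jobs(records)
```

Here `duplicates` maps `a000_SIM` to both of its job IDs, while `a000_POST` is not reported because it was submitted only once.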

  • Thanks @dbeltran, that's helpful

    The issue is that by using --no-requeue we eliminate any possibility of resubmission, which I think is good because the user will have more control over the retries. On the other hand, we miss the opportunity for retries across maintenance windows. But I think it's OK not to have jobs retried on their own without AS in the loop.
