Autosubmit docker container

Autosubmit Version

Simon Lyobard (slyobard@mercator-ocean.fr) and I created a Docker image based on the Autosubmit Docker image autosubmit/autosubmit:4.1.7 and ran the container on the EDITO platform (here).

Summary

Starting from the autosubmit/autosubmit:4.1.7 image, we add a script that creates an experiment from a minimal configuration stored in a Git repository. The Git repo and branch are passed as two ENV variables, so we can build different experiments just by changing these two variables.
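The actual script is in the attached zip. As a rough illustration of the idea only, here is a minimal sketch of how the two variables could be turned into an `autosubmit expid` call. The variable names `AS_GIT_REPO` and `AS_GIT_BRANCH`, the default branch and the platform name are hypothetical, and the `-min`, `-repo` and `-b` flags are written as I understand them from the Autosubmit 4.x documentation, so please check them against `autosubmit expid -h`.

```python
import shlex


def build_expid_command(env, platform="LUMI"):
    """Build the `autosubmit expid` command line from two environment variables.

    AS_GIT_REPO and AS_GIT_BRANCH are hypothetical names used for illustration;
    the real image may use different ones.
    """
    repo = env["AS_GIT_REPO"]                  # required: URL of the configuration repo
    branch = env.get("AS_GIT_BRANCH", "main")  # assumed default branch
    return [
        "autosubmit", "expid",
        "-min",              # minimal configuration (flag assumed from the 4.x docs)
        "-repo", repo,       # Git repository holding the configuration
        "-b", branch,        # Git branch to use
        "-H", platform,      # target platform name (hypothetical default)
        "-d", f"experiment built from {repo}@{branch}",
    ]


# Demo with an explicit dict instead of os.environ, so the sketch runs anywhere.
demo_env = {"AS_GIT_REPO": "https://example.org/my-conf.git", "AS_GIT_BRANCH": "dev"}
print(shlex.join(build_expid_command(demo_env)))
```

In the Dockerfile the command would be executed at build time (for example through `subprocess.run(cmd, check=True)` with `os.environ` as input), which is why the resulting experiment is baked into the image.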

The image contains the created experiment (always a000). At the entrypoint we run the experiment (autosubmit run a000), plus some additional commands that inject the user, the project number and the scratch path into the platform file.
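To make the injection step concrete without opening the zip, here is a minimal sketch of what such a step could look like. It assumes the platform file uses the usual Autosubmit keys `USER`, `PROJECT` and `SCRATCH_DIR` with placeholder values; the sample file content, the placeholder `CHANGE_ME` and the function name are hypothetical, and the real entrypoint may do this differently.

```python
import re


def inject_platform_values(yaml_text, values):
    """Replace the value of each given key in a platform file, line by line.

    Note: every line that defines the key is rewritten, so a file describing
    several platforms would get the same value in all of them.
    """
    patched = yaml_text
    for key, value in values.items():
        pattern = re.compile(rf"^([ \t]*{re.escape(key)}:)[^\n]*$", re.MULTILINE)
        # A function replacement avoids backslash/group surprises in `value`.
        patched, count = pattern.subn(lambda m: f'{m.group(1)} "{value}"', patched)
        if count == 0:
            raise KeyError(f"{key} not found in platform file")
    return patched


# Hypothetical platform file content, for illustration only.
sample = """PLATFORMS:
  LUMI:
    TYPE: slurm
    HOST: lumi
    USER: CHANGE_ME
    PROJECT: CHANGE_ME
    SCRATCH_DIR: CHANGE_ME
"""

patched = inject_platform_values(
    sample,
    {"USER": "jdoe", "PROJECT": "project_000", "SCRATCH_DIR": "/scratch"},
)
print(patched)
```

After a step like this, the entrypoint would go on to call `autosubmit run a000`.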

Steps to reproduce

I attach here a zip with all the files needed to build the Docker image: image-content.zip. The Git repo we are using as an example is this one.

What is the current bug behavior?

At run time we get a different behaviour every time, which I can't explain. Some of the errors we have seen:

It does not start: it fails with an error about the platform file, saying it could not connect, even though the platform files are OK.
It connects to LUMI without any problem, but it stays indefinitely on the local setup without even creating the logs (nor the tar file it is expected to create).
The local setup completes and creates the log files as expected, but the remote setup never starts (and produces no logs).

Each run showed one of these behaviours. We were never able to get beyond the local setup.

I tried running "autosubmit inspect a000" inside the container, but I can't see any files in a000/tmp, where I expected to find all the .cmd files. The plot folder is also empty (I am not sure whether this is expected; it could be because we don't have X11 in the image).
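For anyone trying to reproduce the check, a minimal sketch of how the .cmd files can be listed from inside the container is shown below. The function name is hypothetical, and the demo uses a temporary directory instead of the real experiment folder, whose location depends on how Autosubmit is configured in the image.

```python
import tempfile
from pathlib import Path


def list_cmd_files(experiment_dir):
    """Return the sorted names of the .cmd files found in <experiment_dir>/tmp."""
    tmp_dir = Path(experiment_dir) / "tmp"
    if not tmp_dir.is_dir():
        return []  # no tmp folder at all
    return sorted(path.name for path in tmp_dir.glob("*.cmd"))


# Demo on a throwaway directory that mimics an experiment layout.
with tempfile.TemporaryDirectory() as root:
    exp_dir = Path(root) / "a000"
    (exp_dir / "tmp").mkdir(parents=True)
    (exp_dir / "tmp" / "a000_LOCAL_SETUP.cmd").write_text("#!/bin/bash\n")
    print(list_cmd_files(exp_dir))
```

In our container the equivalent check on the real a000 folder returns nothing after the inspect.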

Any other relevant information (if applicable)

Everything after the local setup runs on LUMI, so I think the problem is that Autosubmit (because of the interaction with the container? with Kubernetes?) can't retrieve the status of the job and cannot trigger the remote script.

The entire workflow is well tested with my user on my account with a normal Autosubmit installation, so I would tend to rule out problems with the workflow/configuration files.

Apart from the last job, the workflow does very basic things: downloading data, modifying config files, etc.