Autosubmit does not recognize jobs as completed when running in Docker
Blocking #822 (closed), #364 (closed) (cc @mcastril, @dbeltran)
Autosubmit Version
4.0.84 (latest in PyPI)
Expid affected (If applicable)
Any dummy
Which task has issues? Where is the log (If applicable)
- Full_name: a000_LOCAL_SETUP
- Log_Path: NA
Summary
I have a container recipe baked in !364 (merged). autosubmit configure, autosubmit install, autosubmit describe, autosubmit expid, and autosubmit create all work.
However, autosubmit run loops forever with the following output:
...
8 of 8 jobs remaining (20:25)
Sleep: 10
Number of retrials: 0
Checking jobs for platform=local
Checking 1 jobs for platform local
Command 'nohup kill -0 735 > /dev/null 2>&1; echo $?': 0
Successful check job command: nohup kill -0 735 > /dev/null 2>&1; echo $?
Job a000_LOCAL_SETUP is RUNNING
Updating FAILED jobs
Updating WAITING jobs
Update finished
Saving JobList: /app/autosubmit/experiments/a000/pkl/job_list_a000.pkl
Job list saved
Reloading parameters...
Loading parameters...
Parameters load.
8 of 8 jobs remaining (20:26)
Sleep: 10
Number of retrials: 0
Checking jobs for platform=local
Checking 1 jobs for platform local
Command 'nohup kill -0 735 > /dev/null 2>&1; echo $?': 0
...
Steps to reproduce
Check out the branch, then
$ cd dockerfiles
$ docker run --rm -it --entrypoint /keygen.sh linuxserver/openssh-server
# Copy keys as per dockerfiles/README.md instructions
$ docker compose up --force-recreate --build --scale computing_node=2
$ docker exec -ti autosubmit /bin/bash
$ autosubmit expid -H local -d test --dummy
$ autosubmit create a000
$ autosubmit run a000
Before running, you can try ssh autosubmit@localhost, and that should work fine.
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit$ ssh autosubmit@localhost
Linux 5d6f8ab4d0e0 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sun Aug 6 20:41:28 2023 from 127.0.0.1
What is the current bug behavior?
I had a look at the source code, in ParamikoPlatform, and I think Autosubmit is running the command nohup kill -0 735 > /dev/null 2>&1; echo $?, where 735 is a PID.
When I run this command in another session in the container, I don't see any error, and the exit code is 0.
This is the output of ps:
(base) autosubmit@5d6f8ab4d0e0:~$ ps aux | grep 735
autosub+ 735 0.0 0.0 0 0 pts/0 Z 20:17 0:00 [bash] <defunct>
autosub+ 974 0.0 0.0 3240 644 pts/2 S+ 20:46 0:00 grep 735
But modifying it to nohup kill -0 735 2>&1; echo $?
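The ps output above lists PID 735 in state Z (defunct), which would explain why the check keeps returning 0: signal 0 can still be delivered to a zombie, because its process-table entry exists until the parent reaps it. The following is a minimal, Linux-only sketch that reproduces this outside Autosubmit. The helper names (proc_state, signal_zero_succeeds, zombie_demo) are made up for illustration and are not part of Autosubmit.

```python
import os
import subprocess
import sys
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        stat = handle.read()
    # The state is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def signal_zero_succeeds(pid):
    """Mimic `kill -0 <pid>`: True if signal 0 can be sent to the PID."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def zombie_demo():
    # Start a child that exits at once, and deliberately do not wait() on it.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # Poll until the kernel reports the child as a zombie (or we give up).
    deadline = time.monotonic() + 10
    while proc_state(child.pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    state = proc_state(child.pid)
    visible_as_zombie = signal_zero_succeeds(child.pid)
    child.wait()  # reap the child, which removes its process-table entry
    visible_after_reap = signal_zero_succeeds(child.pid)
    return state, visible_as_zombie, visible_after_reap


if __name__ == "__main__":
    print(zombie_demo())
```

On a Linux host the first two values should show a zombie that still answers signal 0, and the third should show that the PID disappears once the parent reaps it. In the container, nothing seems to reap PID 735, so the check would never fail.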
What is the expected correct behavior?
The experiment runs successfully.
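One way a liveness check could get there is to treat a defunct process as finished. The sketch below is only an illustration under that assumption: is_really_running is a hypothetical helper, it reads /proc directly (Linux only), and it is not how Autosubmit implements its check, which runs a shell command on the platform.

```python
import os


def is_really_running(pid):
    """Return True only if the PID exists and is not a zombie (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as handle:
            stat = handle.read()
    except (FileNotFoundError, ProcessLookupError):
        return False  # no such process
    # The state is the first field after the parenthesised command name.
    state = stat.rsplit(")", 1)[1].split()[0]
    return state != "Z"


# The current process is alive and not a zombie, so this prints True.
print(is_really_running(os.getpid()))
```

With a check like this, a job whose process is in state Z would be reported as no longer running, and the scheduler could go on to look for the job's COMPLETED file.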
Relevant logs and/or screenshots (if applicable)
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ ls
a000_LOCAL_SETUP.cmd a000_LOCAL_SETUP.cmd.out.0 a000_LOCAL_SETUP_STAT
a000_LOCAL_SETUP.cmd.err.0 a000_LOCAL_SETUP_COMPLETED
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP.cmd.err.0
job_name_ptrn='/app/autosubmit/experiments/a000/tmp/LOG_a000/a000_LOCAL_SETUP'
+ job_name_ptrn=/app/autosubmit/experiments/a000/tmp/LOG_a000/a000_LOCAL_SETUP
echo $(date +%s) > ${job_name_ptrn}_STAT
++ date +%s
+ echo 1691353033
###################
# Autosubmit job
###################
sleep 5
+ sleep 5
###################
# Autosubmit tailer
###################
set -xuve
+ set -xuve
echo $(date +%s) >> ${job_name_ptrn}_STAT
++ date +%s
+ echo 1691353038
touch ${job_name_ptrn}_COMPLETED
+ touch /app/autosubmit/experiments/a000/tmp/LOG_a000/a000_LOCAL_SETUP_COMPLETED
exit 0
+ exit 0
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP.cmd.out.0
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP_COMPLETED
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP_STAT
1691353033
1691353038
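The two timestamps in the STAT file can be used to confirm that the job itself ran to the end. A small sketch, assuming the first line is the start time and the second is the end time (which matches the two echo $(date +%s) calls in the job script above); job_duration_seconds is a hypothetical helper name:

```python
def job_duration_seconds(stat_text):
    """Return end minus start from the body of a two-line STAT file."""
    stamps = [int(token) for token in stat_text.split()]
    if len(stamps) < 2:
        raise ValueError("expected a start and an end timestamp")
    return stamps[1] - stamps[0]


stat_body = "1691353033\n1691353038\n"
print(job_duration_seconds(stat_body))  # prints 5
```

The 5 second difference is consistent with the sleep 5 in the dummy job, so the job finished normally and only the status check is stuck.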
Any other relevant information (if applicable)
I can run bash a000_LOCAL_SETUP.cmd and it executes successfully, with exit 0. But I couldn't find the output of the dummy job, which is weird.
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ ls
a000_LOCAL_SETUP.cmd a000_LOCAL_SETUP.cmd.out.0 a000_LOCAL_SETUP_STAT
a000_LOCAL_SETUP.cmd.err.0 a000_LOCAL_SETUP_COMPLETED
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP.cmd.out.0
()