Closed
Open
Issue created Aug 06, 2023 by Bruno de Paula Kinoshita (@bdepaula, Maintainer)

Autosubmit does not recognize jobs as completed when running in Docker

Blocking #822 (closed), #364 (closed) (cc @mcastril, @dbeltran )

Autosubmit Version

4.0.84 (latest in PyPI)

Expid affected (If applicable)

Any dummy

Which task has issues? Where is the log (If applicable)

  • Full_name: a000_LOCAL_SETUP

  • Log_Path: NA

Summary

I have a container recipe baked in !364 (merged)

autosubmit configure, autosubmit install, autosubmit describe, autosubmit expid, autosubmit create, all work.

However, autosubmit run loops forever, repeating the following output:

...
8 of 8 jobs remaining (20:25)
Sleep: 10
Number of retrials: 0
Checking jobs for platform=local
Checking 1 jobs for platform local
Command 'nohup kill -0 735 > /dev/null 2>&1; echo $?': 0

Successful check job command: nohup kill -0 735 > /dev/null 2>&1; echo $?
Job a000_LOCAL_SETUP is RUNNING
Updating FAILED jobs
Updating WAITING jobs
Update finished
Saving JobList: /app/autosubmit/experiments/a000/pkl/job_list_a000.pkl
Job list saved
Reloading parameters...
Loading parameters...
Parameters load.


8 of 8 jobs remaining (20:26)
Sleep: 10
Number of retrials: 0
Checking jobs for platform=local
Checking 1 jobs for platform local
Command 'nohup kill -0 735 > /dev/null 2>&1; echo $?': 0
...

Steps to reproduce

Check out the branch, then run:

$ cd dockerfiles
$ docker run --rm -it --entrypoint /keygen.sh linuxserver/openssh-server
# Copy keys as per dockerfiles/README.md instructions
$ docker compose up --force-recreate --build --scale computing_node=2
$ docker exec -ti autosubmit /bin/bash
$ autosubmit expid -H local -d test --dummy
$ autosubmit create a000
$ autosubmit run a000

Before running, you can check that ssh autosubmit@localhost works; it should connect without problems:

(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit$ ssh autosubmit@localhost
Linux 5d6f8ab4d0e0 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sun Aug  6 20:41:28 2023 from 127.0.0.1

What is the current bug behavior?

I had a look at the source code, in the ParamikoPlatform, and I think Autosubmit checks the job by running the command nohup kill -0 735 > /dev/null 2>&1; echo $?, where 735 is the PID of the job.

When I run this command in another session in the container, I don't see any error, and the exit code is 0.

This is the output of ps:

(base) autosubmit@5d6f8ab4d0e0:~$ ps aux | grep 735
autosub+     735  0.0  0.0      0     0 pts/0    Z    20:17   0:00 [bash] <defunct>
autosub+     974  0.0  0.0   3240   644 pts/2    S+   20:46   0:00 grep 735

But modifying it to nohup kill -0 750 2>&1; echo $?
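The ps output above lists PID 735 in state Z (defunct). On Linux, a zombie process stays in the process table until its parent reaps it, so a signal-0 check such as kill -0 still succeeds for it. That is consistent with the job being reported as RUNNING, although whether it is the actual cause here is an assumption. The following is a minimal Python sketch of the effect; the helper names (pid_exists, process_state, is_really_running, make_zombie) are hypothetical and are not part of Autosubmit.

```python
import os
import time
from typing import Optional


def pid_exists(pid: int) -> bool:
    """Rough equivalent of `kill -0 PID`: True if the PID is in the process table."""
    try:
        os.kill(pid, 0)  # signal 0 performs only the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def process_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/stat (Linux only), or None."""
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            data = stat_file.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Layout is "pid (comm) state ..."; comm may contain spaces or parentheses,
    # so split on the last ')' before reading the state field.
    return data.rsplit(")", 1)[1].split()[0]


def is_really_running(pid: int) -> bool:
    """Treat a zombie ('Z') as finished even though the signal-0 check succeeds."""
    return pid_exists(pid) and process_state(pid) != "Z"


def make_zombie(timeout: float = 5.0) -> int:
    """Fork a child that exits at once and is not reaped, leaving a zombie."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    deadline = time.monotonic() + timeout
    while process_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    return pid  # the caller should os.waitpid(pid, 0) to reap the child
```

For a PID returned by make_zombie(), pid_exists reports the process as present while is_really_running does not; after os.waitpid(pid, 0) the signal-0 check fails as well.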

What is the expected correct behavior?

The experiment runs successfully.

Relevant logs and/or screenshots (if applicable)

(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ ls
a000_LOCAL_SETUP.cmd        a000_LOCAL_SETUP.cmd.out.0  a000_LOCAL_SETUP_STAT
a000_LOCAL_SETUP.cmd.err.0  a000_LOCAL_SETUP_COMPLETED
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP.cmd.err.0
job_name_ptrn='/app/autosubmit/experiments/a000/tmp/LOG_a000/a000_LOCAL_SETUP'
+ job_name_ptrn=/app/autosubmit/experiments/a000/tmp/LOG_a000/a000_LOCAL_SETUP
echo $(date +%s) > ${job_name_ptrn}_STAT
++ date +%s
+ echo 1691353033

###################
# Autosubmit job
###################

sleep 5
+ sleep 5
###################
# Autosubmit tailer
###################
set -xuve
+ set -xuve
echo $(date +%s) >> ${job_name_ptrn}_STAT
++ date +%s
+ echo 1691353038
touch ${job_name_ptrn}_COMPLETED
+ touch /app/autosubmit/experiments/a000/tmp/LOG_a000/a000_LOCAL_SETUP_COMPLETED
exit 0
+ exit 0
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP.cmd.out.0
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP_COMPLETED
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP_STAT
1691353033
1691353038
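The listing above shows that the job wrote both of its marker files: an empty a000_LOCAL_SETUP_COMPLETED file and an a000_LOCAL_SETUP_STAT file with a start and an end timestamp. As a rough illustration of how those files can be read independently of the PID check, here is a small Python sketch. The function name read_job_markers is hypothetical, and this is not how Autosubmit itself inspects these files.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_job_markers(log_dir: str, job_name: str) -> Tuple[bool, Optional[int]]:
    """Return (completed, duration_in_seconds) from the job's marker files.

    completed is True when <job_name>_COMPLETED exists in log_dir.
    duration is last minus first timestamp in <job_name>_STAT, or None
    when the file is missing or holds fewer than two timestamps.
    """
    base = Path(log_dir)
    completed = (base / f"{job_name}_COMPLETED").exists()

    duration = None
    stat_file = base / f"{job_name}_STAT"
    if stat_file.exists():
        # One epoch timestamp per line, as written by `echo $(date +%s)`.
        stamps = [int(token) for token in stat_file.read_text().split()]
        if len(stamps) >= 2:
            duration = stamps[-1] - stamps[0]
    return completed, duration
```

With the two timestamps shown above (1691353033 and 1691353038), the duration is 5 seconds, which matches the sleep 5 in the dummy job.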

Any other relevant information (if applicable)

I can run bash a000_LOCAL_SETUP.cmd and it executes successfully, with exit 0. But I couldn't find the output of the dummy job, which is weird.

(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ ls
a000_LOCAL_SETUP.cmd        a000_LOCAL_SETUP.cmd.out.0  a000_LOCAL_SETUP_STAT
a000_LOCAL_SETUP.cmd.err.0  a000_LOCAL_SETUP_COMPLETED
(base) autosubmit@5d6f8ab4d0e0:/app/autosubmit/experiments/a000/tmp/LOG_a000$ cat a000_LOCAL_SETUP.cmd.out.0


Edited Aug 06, 2023 by Bruno de Paula Kinoshita