Job tagged FAILED although it completed on the remote
Hey,
I'm using autosubmit 3.3.1 and run it on bscesautosubmit01 as follows:
expid=a065; export SAGA_PTY_SSH_TIMEOUT=120; nohup autosubmit run ${expid} >& ~/logs/autosubmit/log_${expid} &
It happens regularly that one of my jobs finishes successfully on the HPC but is marked as failed by autosubmit, which causes everything to stop.
Example, experiment a064: the job a064_19930101_2_ENKF finished successfully (the file /gpfs/scratch/cns96/cns96177/a064/LOG_a064/a064_19930101_2_ENKF_COMPLETED was created as expected), but autosubmit stopped (see the autosubmit log file: /home/Earth/fmassonn/logs/autosubmit/log_a064).
This looks related to SSH connection issues. The end of the log file reads:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/adaptors/shell/shell_file.py", line 274, in _shell_creator
    return sups.PTYShell (url, self.get_session(), self._logger)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/utils/pty_shell.py", line 243, in __init__
    posix=self.posix)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/utils/pty_shell_factory.py", line 209, in initialize
    "Lost shell connection to %s" % info['host_str'])
IncorrectState: Lost shell connection to mn-cns96 (/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/utils/pty_shell_factory.py +209 (initialize) : "Lost shell connection to %s" % info['host_str']))
No more jobs to run.
Some jobs have failed and reached maximun retrials
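To make clear what behaviour I would expect instead, here is a minimal sketch. It is not autosubmit's or SAGA's actual code: `resolve_status`, `ConnectionLost` and the `completed_exists` callable are hypothetical names I made up for illustration. The idea is only that a lost SSH connection should lead to a retry, or at worst to an "unknown" state, and never directly to FAILED.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a transient SSH error (e.g. SAGA's IncorrectState)."""


def resolve_status(completed_exists, retries=3, delay=30.0, sleep=time.sleep):
    """Return 'COMPLETED', 'FAILED' or 'UNKNOWN' for one remote job.

    completed_exists is a hypothetical callable that returns True when the
    <job>_COMPLETED file is present on the HPC and raises ConnectionLost
    when the SSH session drops.
    """
    for attempt in range(retries):
        try:
            return "COMPLETED" if completed_exists() else "FAILED"
        except ConnectionLost:
            # A dropped connection says nothing about the job itself,
            # so wait a bit longer each time and ask again.
            if attempt < retries - 1:
                sleep(delay * (attempt + 1))
    # Every attempt lost the connection: the state is unknown, not failed.
    return "UNKNOWN"
```

With something along these lines, the experiment would only stop on a real job failure, and a job whose state could not be checked would be left for a later check.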
I have to fix this manually each time, so I would appreciate any help.
François