Earth Sciences / autosubmit / Issues / #116
Issue created May 10, 2016 by François Massonnet@fmassonnet

Job tagged FAILED while it is completed on the remote

Hey,

I'm using autosubmit 3.3.1 and running my autosubmit command on bscesautosubmit01 as follows:

expid=a065; export SAGA_PTY_SSH_TIMEOUT=120; nohup autosubmit run ${expid} >& ~/logs/autosubmit/log_${expid} &

It happens regularly that one of my jobs finishes successfully on the HPC but is marked as failed by autosubmit, which causes everything to stop.

Example, experiment a064: the job a064_19930101_2_ENKF finished successfully (the file /gpfs/scratch/cns96/cns96177/a064/LOG_a064/a064_19930101_2_ENKF_COMPLETED was created as expected), but autosubmit stopped (see the autosubmit log file: /home/Earth/fmassonn/logs/autosubmit/log_a064).

This looks related to SSH connection issues. The end of the log file reads:

  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/adaptors/shell/shell_file.py", line 274, in _shell_creator
    return sups.PTYShell (url, self.get_session(), self._logger)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/utils/pty_shell.py", line 243, in __init__
    posix=self.posix)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/utils/pty_shell_factory.py", line 209, in initialize
    "Lost shell connection to %s" % info['host_str'])
IncorrectState: Lost shell connection to mn-cns96 (/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/site-packages/saga_python-0.40-py2.7.egg/saga/utils/pty_shell_factory.py +209 (initialize) : "Lost shell connection to %s" % info['host_str']))
No more jobs to run.
Some jobs have failed and reached maximun retrials

I have to fix this manually each time, so I would appreciate any help.

François
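
To make the expected behaviour concrete, here is a minimal sketch of a status check that treats a lost SSH connection as "unknown" and retries, instead of marking the job as failed. It is not autosubmit's or saga's actual code: `LostConnection`, `completed_marker_path`, `check_job_status` and the `remote_file_exists` callable are hypothetical names introduced only for illustration. The only thing taken from the report above is the marker naming, `<job name>_COMPLETED` inside the experiment's LOG directory.

```python
import time


class LostConnection(Exception):
    """Hypothetical stand-in for a transient SSH failure
    such as saga's 'Lost shell connection to ...'."""


def completed_marker_path(log_dir, job_name):
    # Marker naming as in the example above: <job_name>_COMPLETED in the LOG dir.
    return "{0}/{1}_COMPLETED".format(log_dir.rstrip("/"), job_name)


def check_job_status(remote_file_exists, log_dir, job_name, retries=3, delay=0.0):
    """Return 'COMPLETED', 'FAILED' or 'UNKNOWN'.

    remote_file_exists is a hypothetical callable (path -> bool) that may
    raise LostConnection when the SSH session drops.
    """
    marker = completed_marker_path(log_dir, job_name)
    for attempt in range(retries):
        try:
            # The check went through, so the marker's presence decides the status.
            return "COMPLETED" if remote_file_exists(marker) else "FAILED"
        except LostConnection:
            # Transient failure: wait and try again unless this was the last attempt.
            if attempt < retries - 1:
                time.sleep(delay)
    # The connection never recovered: the job state is unknown, not failed.
    return "UNKNOWN"
```

With a check like this, a job would only be tagged FAILED when the remote was actually reachable and the COMPLETED file was missing. An 'UNKNOWN' result could be re-checked later instead of counting against the retrial limit.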
