Earth Sciences / autosubmit / Issues / #812
Issue created Apr 26, 2022 by dbeltran (@dbeltran, Maintainer)

Fallback mechanism in the check_job_status

It seems that the information we obtain when checking job status can hang an experiment workflow if the status command reports incorrect information. My proposal is to add one or both of the following:

  • Implement a check_wallclock step in the check_job process that detects when a job has been running for longer than its wallclock. (This already works in wrappers; it is a matter of extending it to normal jobs.)

  • Implement a second status command (where available) as a fallback. The difficulty here is that each platform can have its own commands.
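
The two proposals above could be sketched roughly as follows. This is a minimal illustration, not Autosubmit's actual code: `parse_wallclock`, `exceeded_wallclock`, `status_with_fallback`, the `HH:MM` wallclock format, and the `margin` parameter are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional, Sequence


def parse_wallclock(wallclock: str) -> timedelta:
    """Convert an 'HH:MM' wallclock string (assumed format) into a timedelta."""
    hours, minutes = wallclock.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def exceeded_wallclock(start: datetime, wallclock: str, now: datetime,
                       margin: float = 1.0) -> bool:
    """Return True if the job has run longer than wallclock * margin.

    `margin` is a hypothetical tolerance factor, e.g. 1.1 allows 10% extra.
    """
    return now - start > parse_wallclock(wallclock) * margin


def status_with_fallback(
    checkers: Sequence[Callable[[], Optional[str]]]
) -> Optional[str]:
    """Try each status checker in order and return the first usable answer.

    Each checker is a hypothetical platform-specific callable that returns a
    status string, returns None when it has no answer, or raises on failure.
    """
    for check in checkers:
        try:
            status = check()
        except Exception:
            # A failing command should not hang or abort the workflow;
            # move on to the next (fallback) command instead.
            continue
        if status is not None:
            return status
    return None
```

Because each platform supplies its own list of checkers, the platform-specific commands stay outside the generic fallback loop. A job whose status cannot be obtained, or whose wallclock is exceeded, could then be handled explicitly by the caller rather than waiting indefinitely.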

Also, yesterday, while the platform was being recovered, an experiment hung while checking whether the interaction with the platform was okay. I thought that command had a timeout by default (since it had worked properly in all cases until yesterday), but it seems that is not the case.
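
One way to make sure such a check cannot block forever is to give the command an explicit timeout. The sketch below uses the standard library's `subprocess.run` on a local command. The helper name `run_with_timeout` and the 30-second default are assumptions for illustration, and a remote check over SSH would need the equivalent timeout on its own connection layer.

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(command: Sequence[str],
                     timeout_seconds: float = 30.0) -> Optional[str]:
    """Run a command and return its stdout, or None on timeout or failure."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # raises TimeoutExpired if exceeded
            check=False,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

Returning `None` instead of blocking lets the caller decide whether to retry, try a fallback command, or mark the platform as unreachable.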

Edited Jul 26, 2022 by dbeltran