Earth Sciences / autosubmit · Merge request !531

Fixed DB test: fixes the failed jobs not being written to the DB because their logs were not retrieved

Open dbeltran requested to merge 4.1.12-dev-branch-failed-jobs into 4.1.12-dev-branch Dec 09, 2024
Overview 26 · Commits 7 · Pipelines 0 · Changes 6

This fixes:

  • !527 (merged): in line with that merge, a job can no longer be put in the UniqueQueue if it has no ID.

  • Fixed an issue with failed jobs when chunks > 3, retrials > 5, and the job goes through all 3 retrials.

The fix consists of adding a copy(job) call in the put method.

I was under the impression that whatever you put in a multiprocessing.Queue() is always a copy of the actual instance. However, I saw that it was affecting the jobs in the Queue.

Since I have to add a copy (not a deep one, though, so it shouldn't be expensive), an alternative would be to pass only job.id, fail_count, out, err, and status, and create a new job in the log_recovery process.
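The put-side copy could look roughly like this minimal sketch. The UniqueQueue name and the no-ID guard come from this MR; the internals here are assumed, and a plain queue.Queue stands in for the multiprocessing queue:

```python
import copy
import queue


class UniqueQueue(queue.Queue):
    """Sketch: queue that rejects duplicates and jobs without an ID."""

    def __init__(self):
        super().__init__()
        self._seen_ids = set()

    def put(self, job, block=True, timeout=None):
        # In line with !527: a job without an ID is never queued.
        if not getattr(job, "id", None):
            return
        if job.id in self._seen_ids:
            return
        self._seen_ids.add(job.id)
        # Shallow copy: cheap, but it detaches the queued snapshot from
        # the live job object that the main process keeps mutating.
        super().put(copy.copy(job), block, timeout)
```

Since the copy is shallow, nested mutable attributes are still shared with the original job, which is why it should stay inexpensive compared to a deepcopy.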

  • Reorganized the log retrieval in case of failure

Now, instead of reconnecting to the platform and not going through each job in the queue, AS gets all the jobs from the queue first and then reconnects if any job has an issue.

This avoids the issue of not retrieving the remaining jobs that do have err and out files at the end of the main process.
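The reorganized loop described above can be sketched roughly as follows. retrieve_logs, get_logs, and restore_connection are hypothetical names; the point is draining the queue before deciding whether to reconnect:

```python
def retrieve_logs(job_queue, platform):
    # Drain the whole queue first so no job is skipped.
    jobs, pending = [], []
    while not job_queue.empty():
        jobs.append(job_queue.get())

    # First pass: fetch logs for every job we can reach.
    for job in jobs:
        try:
            platform.get_logs(job)  # assumed helper name
        except ConnectionError:
            pending.append(job)

    # Reconnect once, and only if some job actually had an issue.
    if pending:
        platform.restore_connection()
        for job in pending:
            platform.get_logs(job)
```

A transient failure on one job no longer blocks the jobs behind it in the queue; they are processed first and the reconnect happens at most once per pass.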

  • The check of whether a job is recovered is now done only at the start and at the end, instead of in the middle.

Tomorrow, I'll add a check for the log retrieval names to the regression test.

  • Removes unused code

  • The pipeline should now always work; I believe the nature of the issue was tied to the CPU clock.

Edited Dec 09, 2024 by dbeltran
Source branch: 4.1.12-dev-branch-failed-jobs