Earth Sciences / autosubmit / Issues / #1389
Issue created Aug 12, 2024 by Bruno de Paula Kinoshita (@bdepaula, Maintainer)

Autosubmit 4.1.9 recover_job_logs function performance downgrade

Hello @dbeltran,

Autosubmit Version

4.1.9

Expid affected (if applicable)

Any

Which task has issues? Where is the log? (if applicable)

Any with remote logs to be retrieved

Summary

The CSC team identified that Autosubmit was launching a large number of ps processes, effectively DoS'ing the machine.

They identified the part of the code where this is coming from.

It appears to come from our fix for the orphaned processes in 4.1.8 (which itself contained a fix for another log issue in 4.1.7, and likewise for 4.1.6...).

One possible solution is to add a sleep period, so that the ps processes are launched further apart.
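A minimal sketch of the sleep idea, not the actual recover_job_logs code (which is not shown in this issue). The names `parent_alive_via_ps` and `poll_with_interval` and the 5-second interval are hypothetical, chosen only for illustration. The point is that each check still costs one ps subprocess, but the rate is bounded by the interval.

```python
import subprocess
import time


def parent_alive_via_ps(pid: int) -> bool:
    """Return True if `ps -p <pid>` finds the process.

    Each call spawns one `ps` subprocess, so the call rate matters.
    """
    result = subprocess.run(
        ["ps", "-p", str(pid)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def poll_with_interval(check, interval_s, max_checks, sleep=time.sleep):
    """Call `check()` at most `max_checks` times, sleeping between calls.

    Stops early as soon as `check()` returns False.
    Returns the number of checks that were performed.
    """
    performed = 0
    for _ in range(max_checks):
        performed += 1
        if not check():
            break
        # Space the checks out instead of looping as fast as possible.
        sleep(interval_s)
    return performed
```

With an assumed interval of 5 seconds, something like `poll_with_interval(lambda: parent_alive_via_ps(main_pid), 5.0, 100)` would issue at most one ps call every 5 seconds per retrieval process, where `main_pid` stands for the PID of the main Autosubmit process.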

Another possible solution is to use psutil or another Python lib/code to verify that the main process is still alive. Alternatively, we could rethink the design (yet again, I am afraid) of these threads/processes so that the spawned thread/process keeps working, searching for logs, until the parent process tells it to stop. I think we use _exit, which I suspect would kill the main process and leave orphan processes... so maybe that would need to change for a redesign like this.
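A rough sketch of both ideas combined, assuming a POSIX system. It checks the parent with `os.kill(pid, 0)` from the standard library instead of spawning ps (with psutil installed, `psutil.pid_exists(pid)` would serve the same purpose), and the worker runs until the parent sets a stop event. The names `pid_alive`, `log_worker` and `retrieve_once` are hypothetical and do not come from the Autosubmit code base.

```python
import os
import threading


def pid_alive(pid: int) -> bool:
    """POSIX-only: check that a PID exists without spawning `ps`."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only performs error checking
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def log_worker(parent_pid, stop_event, retrieve_once, interval_s=1.0):
    """Retrieve logs until the parent asks us to stop or disappears.

    `retrieve_once` is a placeholder callable for one log-retrieval pass.
    Returns the number of passes performed.
    """
    rounds = 0
    while not stop_event.is_set() and pid_alive(parent_pid):
        retrieve_once()
        rounds += 1
        # Waits up to interval_s, but wakes immediately when stop is requested.
        stop_event.wait(interval_s)
    return rounds
```

Here the parent would create a `threading.Event` (or a `multiprocessing.Event` for a separate process), pass it to the worker, and call `set()` on it during shutdown. The liveness check then only acts as a safety net against orphans if the parent dies without signalling.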

Anyway, it would be nice to have 4.1.10 out with a fix for this, and avoid a new issue caused by a change in our releases.

I haven't worked on this part of the code yet, but either @egarriga or I can help you if you need, @dbeltran.

Steps to reproduce

Launch several experiments, which will launch several log retrieval processes, which in turn will launch even more ps commands.

What is the current bug behavior?

The performance indicators of computers running Autosubmit show a clear increase in CPU usage after 4.1.9.

What is the expected correct behavior?

No orphan/zombie processes, and performance that is the same as or better than in previous versions.

Relevant logs and/or screenshots (if applicable)

This is from the message on Mattermost, to show how severe this change was:

We are seeing something like 20-30 ps calls per second on the machine which is hammering the machine.

Any other relevant information (if applicable)

IMHO, the merge request that fixes this must go through code review and thorough testing. In past merge requests we skipped this due to time/deadlines, but that is risky, as the testing suite only spots specific, isolated issues in the workflow. Issues like this one are caught only by running Autosubmit in an environment like ClimateDT, or by scrutinizing the code in a code review.
