autosubmit (Earth Sciences), issue #726
Issue created Aug 17, 2021 by Miguel Castrillo (@mcastril, Owner)

Issues with a3w5

Hi @dbeltran. @rfernand reported an issue with stopping (to change the status of some tasks) and resuming this big experiment. According to the description, it seemed like Autosubmit was forgetting the status of RUNNING wrappers.

This is the status before the pause:

20210816_112254_run.log

2021-08-16 18:02:26,417 a3w5_20001101_fc00_6_SIM           17018865       RUNNING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_7_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_8_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_9_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_10_SIM          17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,418 a3w5_20001101_fc00_11_SIM          17018865       QUEUING        marenostrum4        bsc_es

This is the status after the pause:

20210816_175704_run.log

2021-08-16 18:02:26,417 a3w5_20001101_fc00_6_SIM           17018865       RUNNING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_7_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_8_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_9_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_10_SIM          17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:02:26,418 a3w5_20001101_fc00_11_SIM          17018865       QUEUING        marenostrum4        bsc_es

But shortly afterwards, in the same log:

2021-08-16 18:02:34,379 Successful check job command: sacct -n -X --jobs 17018865 -o "State"
2021-08-16 18:02:34,379 Wrapper job a3w5_ASThread_16281658676437_768_6 changed from SUBMITTED to status RUNNING
2021-08-16 18:02:34,379 Checking inner jobs status
2021-08-16 18:02:35,418 Job a3w5_20001101_fc00_1_SIM started at 2021-08-16 11:18:51
2021-08-16 18:02:35,418 Job a3w5_20001101_fc00_1_SIM is RUNNING
2021-08-16 18:02:35,433 a3w5_20001101_fc00_1_SIM_STAT file have been transfered
2021-08-16 18:02:36,555 a3w5_20001101_fc00_1_SIM job seems to have completed: checking...
2021-08-16 18:02:36,568 Job a3w5_20001101_fc00_1_SIM is COMPLETED
2021-08-16 18:02:36,585 a3w5_20001101_fc00_1_SIM_STAT file have been transfered
2021-08-16 18:02:36,701 Job a3w5_20001101_fc00_1_SIM finished at 2021-08-16 12:31:23

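Autosubmit "forgets" about the running wrapper and resets its status: the log shows the wrapper going from SUBMITTED to RUNNING again. It then re-checks the inner jobs starting from the first one, realizes that they are already COMPLETED, and starts setting their correct status.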
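To follow how the inner jobs of one wrapper evolve across a long run log, the status-table lines can be extracted and grouped by job id. The following is only a minimal sketch: the regular expression assumes the column layout seen in the excerpts in this issue (timestamp, job name, job id, status, platform, queue), and `parse_status_line` and `latest_status` are hypothetical helper names, not part of Autosubmit.

```python
import re

# Assumed layout, copied from the log excerpts in this issue:
# timestamp, job name, job id, status, platform, queue.
STATUS_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<name>\S+)\s+(?P<job_id>\d+)\s+(?P<status>[A-Z_]+)\s+"
    r"(?P<platform>\S+)\s+(?P<queue>\S+)$"
)


def parse_status_line(line):
    """Return the fields of a status-table line as a dict, or None for any other log line."""
    match = STATUS_LINE.match(line.strip())
    return match.groupdict() if match else None


def latest_status(lines, job_id):
    """Return {job name: last status seen} for all jobs that share the given job id."""
    statuses = {}
    for line in lines:
        record = parse_status_line(line)
        if record is not None and record["job_id"] == job_id:
            statuses[record["name"]] = record["status"]
    return statuses
```

Calling `latest_status(open(path), "17018865")` on a run log (with `path` pointing to the log file) would give the last status reported for each SIM inside the wrapper. Free-text lines such as "Checking inner jobs status" are skipped because they do not match the table layout.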

This happened for every SIM up to the fifth, after which the status table was again:

2021-08-16 18:04:17,942 a3w5_20001101_fc00_6_SIM           17018865       RUNNING        marenostrum4        bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_7_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_8_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_9_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_10_SIM          17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:04:17,943 a3w5_20001101_fc00_11_SIM          17018865       QUEUING        marenostrum4        bsc_es

However, after SIM_6 is COMPLETED, Autosubmit does not go on to check the running status of SIM_7:

2021-08-16 18:32:31,189 Saving JobList: /esarchive/autosubmit/a3w5/pkl/job_list_a3w5_backup.pkl
2021-08-16 18:32:31,894 Job list saved
2021-08-16 18:32:32,205 Checking jobs for platform=marenostrum4
2021-08-16 18:32:32,206 Checking Wrapper 17018865
2021-08-16 18:32:35,278 Successful check job command: sacct -n -X --jobs 17018865 -o "State"
2021-08-16 18:32:35,278 Checking inner jobs status
2021-08-16 18:32:37,320 a3w5_20001101_fc00_6_SIM job seems to have completed: checking...
2021-08-16 18:32:37,334 Job a3w5_20001101_fc00_6_SIM is COMPLETED
2021-08-16 18:32:37,393 a3w5_20001101_fc00_6_SIM_STAT file have been transfered
2021-08-16 18:32:37,505 Job a3w5_20001101_fc00_6_SIM finished at 2021-08-16 18:32:21
2021-08-16 18:32:38,316 a3w5_20001101_fc00_7_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_8_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_9_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_10_SIM          17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_11_SIM          17018865       QUEUING        marenostrum4        bsc_es

The wrapper status keeps being checked, but SIM_7 never becomes RUNNING:

2021-08-16 18:32:54,576 Checking Wrapper 17018865
2021-08-16 18:32:57,798 Successful check job command: sacct -n -X --jobs 17018865 -o "State"
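For cross-checking on the platform, the same `sacct` query that appears in the log can be issued by hand or from a small script. This is a sketch under assumptions: `build_sacct_command`, `parse_sacct_state` and `query_wrapper_state` are hypothetical names, and the script has to run where `sacct` is available. Note that with `-X` the query reports only the state of the wrapper allocation, so it cannot tell which inner job is currently executing.

```python
import subprocess


def build_sacct_command(job_id):
    """Build the same query seen in the run log: sacct -n -X --jobs <id> -o State."""
    return ["sacct", "-n", "-X", "--jobs", str(job_id), "-o", "State"]


def parse_sacct_state(output):
    """Return the first state token in the sacct output, or None if the output is empty."""
    for line in output.splitlines():
        tokens = line.split()
        if tokens:
            # States such as "CANCELLED by <uid>" are reduced to their first word.
            return tokens[0]
    return None


def query_wrapper_state(job_id):
    """Run sacct for one job id and return its state (requires Slurm on this host)."""
    result = subprocess.run(
        build_sacct_command(job_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sacct_state(result.stdout)
```

If `query_wrapper_state(17018865)` still returns RUNNING while no inner job is marked RUNNING by Autosubmit, the mismatch is on the Autosubmit side rather than in the scheduler.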

During the stop there was a setstatus and an interruption in the process:

20210816_175624_setstatus.log

Job Name                           Job Id         Job Status     Job Platform        Job Queue
2021-08-16 17:56:47,849 a3w5_19951101_fc00_11_NCTIME       558440         COMPLETED      nord3               bsc_es
2021-08-16 17:56:47,849 a3w5_20051101_fc00_11_NCTIME       556771         COMPLETED      nord3               bsc_es
2021-08-16 17:56:47,850 a3w5_20101101_fc00_11_NCTIME       556777         COMPLETED      nord3               bsc_es
2021-08-16 17:56:47,850 a3w5_20151101_fc00_11_NCTIME       558445         COMPLETED      nord3               bsc_es
2021-08-16 17:56:47,850 a3w5_20201101_fc00_11_NCTIME       558851         COMPLETED      nord3               bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_6_SIM           17018865       RUNNING        marenostrum4        bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_7_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_8_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_9_SIM           17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_10_SIM          17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 17:56:47,851 a3w5_20001101_fc00_11_SIM          17018865       QUEUING        marenostrum4        bsc_es
2021-08-16 17:56:47,854 Saving JobList: /esarchive/autosubmit/a3w5/pkl/job_list_a3w5.pkl
2021-08-16 17:56:48,563 Job list saved
2021-08-16 17:56:48,829 Traceback (most recent call last):
  File "/shared/earth/software/autosubmit/3.13.0-foss-2015a-Python-2.7.9/lib/python2.7/site-packages/autosubmit-3.13.0-py2.7.egg/autosubmit/database/db_jobdata.py", line 915, in process_status_changes
    self.update_jobs_from_change_status(tracking_dictionary)
  File "/shared/earth/software/autosubmit/3.13.0-foss-2015a-Python-2.7.9/lib/python2.7/site-packages/autosubmit-3.13.0-py2.7.egg/autosubmit/database/db_jobdata.py", line 1573, in update_jobs_from_change_status
    status_code, final_status_code = tracking_dictionary[job_name]
ValueError: too many values to unpack

[WARNING] 2021-08-16 17:56:48,830 Autosubmit couldn't process status changes validate_current_run too many values to unpack

I don't know whether this is related, since it doesn't seem to have affected the status of the jobs or of the wrapper they belong to.
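For reference, the `ValueError` in the traceback is what Python raises when a value with more than two items is unpacked into two names. The sketch below reproduces that class of error and shows one defensive alternative. The actual contents of `tracking_dictionary` in Autosubmit 3.13.0 are not shown in this issue, so the entries and the `unpack_status_change` helper are hypothetical illustrations only.

```python
def unpack_status_change(entry):
    """Return (status_code, final_status_code), or None if the entry is not a 2-item sequence."""
    if isinstance(entry, (tuple, list)) and len(entry) == 2:
        return entry[0], entry[1]
    return None


# Hypothetical entries: the real structure is not shown in the issue.
tracking_dictionary = {
    "a3w5_20001101_fc00_6_SIM": ("RUNNING", "COMPLETED"),
    "a3w5_20001101_fc00_7_SIM": ("QUEUING", "RUNNING", "COMPLETED"),  # one item too many
}

# Direct unpacking, as in the line quoted by the traceback, fails on the 3-item entry.
try:
    status_code, final_status_code = tracking_dictionary["a3w5_20001101_fc00_7_SIM"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# The defensive helper returns None for the malformed entry instead of raising.
good = unpack_status_change(tracking_dictionary["a3w5_20001101_fc00_6_SIM"])
bad = unpack_status_change(tracking_dictionary["a3w5_20001101_fc00_7_SIM"])
```

Whether skipping malformed entries is acceptable depends on what produced them; logging the offending job name and value would be the first step to find out.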

Edited Aug 20, 2021 by Roberto Bilbao