Issues with a3w5
Hi @dbeltran. @rfernand reported an issue with stopping (to change the status of some tasks) and resuming this big experiment. According to the description, it seemed like Autosubmit was forgetting the status of RUNNING wrappers.
This is the status before the pause:
20210816_112254_run.log
2021-08-16 18:02:26,417 a3w5_20001101_fc00_6_SIM 17018865 RUNNING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_8_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_9_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_10_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,418 a3w5_20001101_fc00_11_SIM 17018865 QUEUING marenostrum4 bsc_es
This is the status after the pause:
20210816_175704_run.log
2021-08-16 18:02:26,417 a3w5_20001101_fc00_6_SIM 17018865 RUNNING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_8_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_9_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,417 a3w5_20001101_fc00_10_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:02:26,418 a3w5_20001101_fc00_11_SIM 17018865 QUEUING marenostrum4 bsc_es
but shortly afterwards, in the same log:
2021-08-16 18:02:34,379 Successful check job command: sacct -n -X --jobs 17018865 -o "State"
2021-08-16 18:02:34,379 Wrapper job a3w5_ASThread_16281658676437_768_6 changed from SUBMITTED to status RUNNING
2021-08-16 18:02:34,379 Checking inner jobs status
2021-08-16 18:02:35,418 Job a3w5_20001101_fc00_1_SIM started at 2021-08-16 11:18:51
2021-08-16 18:02:35,418 Job a3w5_20001101_fc00_1_SIM is RUNNING
2021-08-16 18:02:35,433 a3w5_20001101_fc00_1_SIM_STAT file have been transfered
2021-08-16 18:02:36,555 a3w5_20001101_fc00_1_SIM job seems to have completed: checking...
2021-08-16 18:02:36,568 Job a3w5_20001101_fc00_1_SIM is COMPLETED
2021-08-16 18:02:36,585 a3w5_20001101_fc00_1_SIM_STAT file have been transfered
2021-08-16 18:02:36,701 Job a3w5_20001101_fc00_1_SIM finished at 2021-08-16 12:31:23
Autosubmit "forgets" about the running wrapper and resets its status to zero. Then it realizes that the inner jobs are already COMPLETED and starts setting their correct status.
The same thing happened for every SIM up to the fifth, and then the status table was printed again:
2021-08-16 18:04:17,942 a3w5_20001101_fc00_6_SIM 17018865 RUNNING marenostrum4 bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_8_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_9_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:04:17,942 a3w5_20001101_fc00_10_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:04:17,943 a3w5_20001101_fc00_11_SIM 17018865 QUEUING marenostrum4 bsc_es
However, after SIM_6 is COMPLETED, Autosubmit does not go on to check the running status of SIM_7:
2021-08-16 18:32:31,189 Saving JobList: /esarchive/autosubmit/a3w5/pkl/job_list_a3w5_backup.pkl
2021-08-16 18:32:31,894 Job list saved
2021-08-16 18:32:32,205 Checking jobs for platform=marenostrum4
2021-08-16 18:32:32,206 Checking Wrapper 17018865
2021-08-16 18:32:35,278 Successful check job command: sacct -n -X --jobs 17018865 -o "State"
2021-08-16 18:32:35,278 Checking inner jobs status
2021-08-16 18:32:37,320 a3w5_20001101_fc00_6_SIM job seems to have completed: checking...
2021-08-16 18:32:37,334 Job a3w5_20001101_fc00_6_SIM is COMPLETED
2021-08-16 18:32:37,393 a3w5_20001101_fc00_6_SIM_STAT file have been transfered
2021-08-16 18:32:37,505 Job a3w5_20001101_fc00_6_SIM finished at 2021-08-16 18:32:21
2021-08-16 18:32:38,316 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_8_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_9_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_10_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 18:32:38,317 a3w5_20001101_fc00_11_SIM 17018865 QUEUING marenostrum4 bsc_es
The wrapper status is still checked, but SIM_7 never changes to RUNNING:
2021-08-16 18:32:54,576 Checking Wrapper 17018865
2021-08-16 18:32:57,798 Successful check job command: sacct -n -X --jobs 17018865 -o "State"
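One way to confirm this from a run log is to collect every status that the status-table lines report for a given job. The following is a minimal sketch, assuming the table lines keep the layout seen in the excerpts above (timestamp, job name, job id, status, platform, queue). The `ROW` pattern and the `status_history` helper are illustrative names, not part of Autosubmit.

```python
import re

# Matches status-table lines such as:
# 2021-08-16 18:32:38,316 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es
# The layout is assumed from the excerpts in this issue.
ROW = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<name>\S+) (?P<job_id>\d+) (?P<status>[A-Z_]+) "
    r"(?P<platform>\S+) (?P<queue>\S+)$"
)


def status_history(lines, job_name):
    """Return (timestamp, status) pairs for job_name, one per status change."""
    history = []
    for line in lines:
        match = ROW.match(line.strip())
        if match is None or match.group("name") != job_name:
            continue
        status = match.group("status")
        # Record a row only when the status differs from the last one seen.
        if not history or history[-1][1] != status:
            history.append((match.group("ts"), status))
    return history


# Lines copied from the run log excerpts in this issue.
sample = [
    "2021-08-16 18:04:17,942 a3w5_20001101_fc00_6_SIM 17018865 RUNNING marenostrum4 bsc_es",
    "2021-08-16 18:04:17,942 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es",
    "2021-08-16 18:32:37,334 Job a3w5_20001101_fc00_6_SIM is COMPLETED",
    "2021-08-16 18:32:38,316 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es",
]

sim7 = status_history(sample, "a3w5_20001101_fc00_7_SIM")
sim6 = status_history(sample, "a3w5_20001101_fc00_6_SIM")
print(sim7)
```

Running it over the whole run log (for example with `status_history(open(path), job_name)`) would show whether SIM_7 ever leaves QUEUING in the status table. Free-form lines such as "Job ... is COMPLETED" are skipped by design, because they do not follow the table layout.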
During the stop there was a setstatus and an interruption in the process:
20210816_175624_setstatus.log
Job Name Job Id Job Status Job Platform Job Queue
2021-08-16 17:56:47,849 a3w5_19951101_fc00_11_NCTIME 558440 COMPLETED nord3 bsc_es
2021-08-16 17:56:47,849 a3w5_20051101_fc00_11_NCTIME 556771 COMPLETED nord3 bsc_es
2021-08-16 17:56:47,850 a3w5_20101101_fc00_11_NCTIME 556777 COMPLETED nord3 bsc_es
2021-08-16 17:56:47,850 a3w5_20151101_fc00_11_NCTIME 558445 COMPLETED nord3 bsc_es
2021-08-16 17:56:47,850 a3w5_20201101_fc00_11_NCTIME 558851 COMPLETED nord3 bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_6_SIM 17018865 RUNNING marenostrum4 bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_7_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_8_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_9_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 17:56:47,850 a3w5_20001101_fc00_10_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 17:56:47,851 a3w5_20001101_fc00_11_SIM 17018865 QUEUING marenostrum4 bsc_es
2021-08-16 17:56:47,854 Saving JobList: /esarchive/autosubmit/a3w5/pkl/job_list_a3w5.pkl
2021-08-16 17:56:48,563 Job list saved
2021-08-16 17:56:48,829 Traceback (most recent call last):
File "/shared/earth/software/autosubmit/3.13.0-foss-2015a-Python-2.7.9/lib/python2.7/site-packages/autosubmit-3.13.0-py2.7.egg/autosubmit/database/db_jobdata.py", line 915, in process_status_changes
self.update_jobs_from_change_status(tracking_dictionary)
File "/shared/earth/software/autosubmit/3.13.0-foss-2015a-Python-2.7.9/lib/python2.7/site-packages/autosubmit-3.13.0-py2.7.egg/autosubmit/database/db_jobdata.py", line 1573, in update_jobs_from_change_status
status_code, final_status_code = tracking_dictionary[job_name]
ValueError: too many values to unpack
[WARNING] 2021-08-16 17:56:48,830 Autosubmit couldn't process status changes validate_current_run too many values to unpack
I don't know whether this is related, since it doesn't seem to have affected the status of the jobs or of the wrapper they belong to.
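For reference, the `ValueError` in the traceback is what Python raises when a two-name unpacking receives a value with more than two elements. The sketch below reproduces that condition and shows a defensive way to skip such entries. It is only an illustration: the real contents of `tracking_dictionary` are not shown in the log, so the entries, the integer codes, and the `unpack_status_change` helper are all hypothetical.

```python
# Hypothetical entries: the real structure of tracking_dictionary is unknown.
# The integer codes are placeholders, not Autosubmit status codes.
tracking_dictionary = {
    "a3w5_20001101_fc00_6_SIM": (5, 4),      # a pair: unpacks fine
    "a3w5_20001101_fc00_7_SIM": (3, 4, 5),   # assumed malformed entry: three values
}


def unpack_status_change(entry):
    """Return (status_code, final_status_code), or None if entry is not a pair."""
    try:
        # Same unpacking pattern as the line shown in the traceback.
        status_code, final_status_code = entry
    except (TypeError, ValueError):
        # ValueError: wrong number of values; TypeError: entry is not iterable.
        return None
    return status_code, final_status_code


results = {
    job_name: unpack_status_change(entry)
    for job_name, entry in tracking_dictionary.items()
}
print(results)
```

If something like this is the cause, logging the offending `job_name` and its entry before unpacking would show which job produced the extra value during the setstatus.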