[enhancement] monitor and other read-only commands must not call functions that load backup
This issue should not affect the state of jobs. It is a separate issue, but it is related to what was reported in the production runs issue.
Autosubmit Version
master as of this morning.
$ git show
commit a957048e894de8848d746ab3805592ccf66734ac (HEAD -> master, upstream/master, upstream/HEAD)
Merge: db907829 a081ff73
Author: Daniel Beltran Mora <daniel.beltran@bsc.es>
Date: Thu Mar 7 15:47:41 2024 +0100
Merge branch 'het-job-message' into 'master'
Fixes "attempt to cancel an het job" failure message
Closes #1245
See merge request es/autosubmit!412
Summary
autosubmit monitor loads a backup of the jobs to display in the plot or text file, which can be misleading.
I guess users would prefer to be alerted that the command failed, and then use the GUI or ask for
help (worse, they may not notice the warning and get confused by the statuses displayed).
Steps to reproduce
- Simulate an IO error on LUMI
diff --git a/autosubmit/job/job_list.py b/autosubmit/job/job_list.py
index 9f5fbfcc..befad310 100644
--- a/autosubmit/job/job_list.py
+++ b/autosubmit/job/job_list.py
@@ -2321,6 +2321,7 @@ class JobList(object):
         """
         Log.info("Loading JobList")
         try:
+            raise IOError('Something IO happened on LUMI')
             return self._persistence.load(self._persistence_path, self._persistence_file)
         except AutosubmitCritical:
             raise
- Pretend you have a backup with some old jobs, maybe several chunks ago?...
$ cp ~/autosubmit/a002/pkl/job_list_a002.pkl ~/autosubmit/a002/pkl/job_list_a002_backup.pkl
- Run the command they reported in the production runs issue.
$ autosubmit monitor a002 -o txt
Autosubmit is running with 4.1.0
Getting job list...
Checking configuration files...
Expdef config file is correct
Platforms sections: OK
Jobs sections OK
Autosubmit general sections OK
Configuration files OK
Loading JobList
[ERROR] Autosubmit will use a backup for recover the job_list[eCode=6010]
Loading backup JobList
Creating jobs...
Load finished
Adding dependencies to the graph..
Adding dependencies to the job..
Transitive reduction...
Looking for edgeless jobs...
Writing status txt...
Status txt created at <_io.TextIOWrapper name='/home/kinow/autosubmit/a002/status/a002_20240308_0856.txt' mode='w+' encoding='UTF-8'>
The message is very similar to what they had, with the difference that now we do not replace the pickle file, so the jobs should not be in WAITING, but the output of monitor might be inconsistent.
What is the current bug behavior?
autosubmit monitor $expid may load the backup and display it to users, but it is probably better to show an error instead.
What is the expected correct behavior?
An error, with our AS error code, and maybe a stack trace so users know which error it was (FileExistsError, PermissionError, FileNotFoundError, etc.), as that would help them (and us) to troubleshoot it.
Relevant logs and/or screenshots(if applicable)
Here's what happens with autosubmit monitor -o txt $expid right now.
bin/autosubmit.py
-> autosubmit/autosubmit.py, function monitor()
-> autosubmit/autosubmit.py, function load_job_list()
-> autosubmit/autosubmit.py, function generate()
-> autosubmit/job/job_list.py, function load() # Where "Loading JobList" message is printed...
The code of the function above is exactly this on master right now:
Log.info("Loading JobList")
try:
    return self._persistence.load(self._persistence_path, self._persistence_file)
except AutosubmitCritical:
    raise
except:
    Log.printlog(
        "Autosubmit will use a backup for recover the job_list", 6010)
    return self.backup_load()
So we print some logs, and then try to load the JobList. If that fails and the error was raised by Autosubmit itself (an AutosubmitCritical), then we do not load the backup, and instead re-raise the error.
If the error is a BlockingIOError, BrokenPipeError, or FileExistsError
(all I/O errors that could be raised on LUMI), then we go into the
bare except part and load the backup.
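To make that control flow concrete, here is a minimal sketch that mimics the snippet above with plain callables. StandInCritical and load_with_fallback are hypothetical names used only for illustration; they are not part of Autosubmit.

```python
# The three exception types named above are all subclasses of OSError,
# and IOError is an alias of OSError in Python 3.
for exc_type in (BlockingIOError, BrokenPipeError, FileExistsError):
    assert issubclass(exc_type, OSError)
assert IOError is OSError


class StandInCritical(Exception):
    """Hypothetical stand-in for AutosubmitCritical, for illustration only."""


def load_with_fallback(loader):
    """Mimic the try/except structure of load() using a plain callable."""
    try:
        return loader()
    except StandInCritical:
        # Errors raised on purpose are passed through untouched.
        raise
    except Exception:  # the code on master uses a bare `except:` here
        # Any other failure, including I/O errors, ends in the backup path.
        return "backup"


def failing_loader():
    raise BrokenPipeError("simulated I/O failure")


print(load_with_fallback(failing_loader))  # prints "backup"
```

So any OSError raised while reading the pickle file is silently turned into a backup load, regardless of which command triggered it.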
I think it's alright to load the backup here. But for the monitor
command, I think it does not make sense to load any backup. Instead
we must fail with a message like "Failed to load the job list: here's your error,
here's your stack trace, and a code to search our docs", etc.
And instead of trying to handle the exceptions caught/raised, or
work around the I/O errors, it's much safer to have a flag to not
load the backup at all, or a separate function, used by monitor
and any other read-only commands.
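
A rough sketch of the flag idea, under stated assumptions: SketchJobList, JobListLoadError, and the allow_backup parameter are all hypothetical names, not the existing Autosubmit API, and the code 6010 is only reused from the current log message as a placeholder. In Autosubmit itself the error would presumably be an AutosubmitCritical.

```python
import traceback


class JobListLoadError(Exception):
    """Hypothetical error type carrying an error code and a stack trace."""

    def __init__(self, message, code, trace):
        super().__init__(message)
        self.code = code
        self.trace = trace


class SketchJobList:
    """Hypothetical, simplified stand-in for JobList; not the real class."""

    def __init__(self, load_primary, load_backup):
        self._load_primary = load_primary
        self._load_backup = load_backup

    def load(self, allow_backup=True):
        """Load the job list; fall back to the backup only when allowed."""
        try:
            return self._load_primary()
        except JobListLoadError:
            raise
        except Exception as exc:
            if allow_backup:
                return self._load_backup()
            # Read-only commands fail loudly and keep the original cause.
            raise JobListLoadError(
                f"Failed to load the job list: {type(exc).__name__}: {exc}",
                code=6010,  # placeholder code, reused from the existing log
                trace=traceback.format_exc(),
            ) from exc


def broken_primary():
    raise PermissionError("simulated I/O failure")


job_list = SketchJobList(broken_primary, lambda: ["old job"])
print(job_list.load())  # prints ['old job']

try:
    job_list.load(allow_backup=False)  # what monitor would call
except JobListLoadError as err:
    print(err)  # Failed to load the job list: PermissionError: simulated I/O failure
```

With a default of allow_backup=True, existing callers would keep their current behavior, and only monitor and other read-only commands would need to opt out.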
Cheers
Any other relevant information(if applicable)