[enhancement] monitor and other read-only commands must not call functions that load backup

Hello @dbeltran , @mcastril ,

This issue should not affect the state of jobs. It is a separate problem, but related to what was reported in the production runs issue.

Autosubmit Version

master as of this morning.

$ git show
commit a957048e894de8848d746ab3805592ccf66734ac (HEAD -> master, upstream/master, upstream/HEAD)
Merge: db907829 a081ff73
Author: Daniel Beltran Mora <daniel.beltran@bsc.es>
Date:   Thu Mar 7 15:47:41 2024 +0100

    Merge branch 'het-job-message' into 'master'
    
    Fixes "attempt to cancel an het job" failure message
    
    Closes #1245
    
    See merge request es/autosubmit!412

Summary

autosubmit monitor loads the backup of the job list to display in the plot or text file, which can be misleading. I guess users would prefer to be alerted that the command failed, and then use the GUI or ask for help. Worse, they may not notice the warning and get confused by the statuses displayed.

Steps to reproduce

Simulate an IO error on LUMI

diff --git a/autosubmit/job/job_list.py b/autosubmit/job/job_list.py
index 9f5fbfcc..befad310 100644
--- a/autosubmit/job/job_list.py
+++ b/autosubmit/job/job_list.py
@@ -2321,6 +2321,7 @@ class JobList(object):
         """
         Log.info("Loading JobList")
         try:
+            raise IOError('Something IO happened on LUMI')
             return self._persistence.load(self._persistence_path, self._persistence_file)
         except AutosubmitCritical:
             raise

Pretend you have a backup with some old jobs, maybe from several chunks ago.

$ cp ~/autosubmit/a002/pkl/job_list_a002.pkl ~/autosubmit/a002/pkl/job_list_a002_backup.pkl

Run the command they reported in the production runs issue.

$ autosubmit monitor a002 -o txt
Autosubmit is running with 4.1.0
Getting job list...

Checking configuration files...
Expdef config file is correct
Platforms sections: OK
Jobs sections OK
Autosubmit general sections OK
Configuration files OK

Loading JobList
[ERROR] Autosubmit will use a backup for recover the job_list[eCode=6010]
Loading backup JobList
Creating jobs...
Load finished
Adding dependencies to the graph..
Adding dependencies to the job..
Transitive reduction...
Looking for edgeless jobs...
Writing status txt...
Status txt created at <_io.TextIOWrapper name='/home/kinow/autosubmit/a002/status/a002_20240308_0856.txt' mode='w+' encoding='UTF-8'>

The message is very similar to what they had, with the difference that now we do not replace the pickle file, so the jobs should not be in WAITING, but the output of monitor might be inconsistent.

What is the current bug behavior?

autosubmit monitor $expid may load the backup and display it to users, when it would probably be better to show them an error.

What is the expected correct behavior?

An error, with our AS error code, and maybe a stack trace so users know which error it was (FileExistsError, PermissionError, FileNotFoundError, etc.), as that would help them (and us) to troubleshoot it.

Relevant logs and/or screenshots(if applicable)

Here's what happens with autosubmit monitor -o txt $expid right now.

bin/autosubmit.py
 -> autosubmit/autosubmit.py, function monitor()
   -> autosubmit/autosubmit.py, function load_job_list()
     -> autosubmit/autosubmit.py, function generate()
       -> autosubmit/job/job_list.py, function load() # Where "Loading JobList" message is printed...

The code of the function above is exactly this on master right now:

        Log.info("Loading JobList")
        try:
            return self._persistence.load(self._persistence_path, self._persistence_file)
        except AutosubmitCritical:
            raise
        except:
            Log.printlog(
                "Autosubmit will use a backup for recover the job_list", 6010)
            return self.backup_load()

So we print a log message, and then try to load the JobList. If that fails with an error raised by Autosubmit itself (an AutosubmitCritical), we do not load the backup and re-raise the error instead.

If the error is a BlockingIOError, BrokenPipeError, or FileExistsError (all I/O errors that could be raised on LUMI), or any other exception, we go into the bare except branch and load the backup.
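
A quick check of the point above, in plain Python with no Autosubmit imports: the exception types mentioned are all built-in subclasses of OSError. Being built-ins, they cannot be subclasses of AutosubmitCritical, so they skip that clause and land in the bare except.

```python
# Built-in I/O exceptions named above. In Python 3, IOError is an alias of OSError.
io_errors = (BlockingIOError, BrokenPipeError, FileExistsError)

# Map each exception name to whether it derives from OSError.
is_os_error = {exc.__name__: issubclass(exc, OSError) for exc in io_errors}

print(is_os_error)
print(IOError is OSError)
```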

I think it's alright to load the backup here. But for the monitor command, I think it does not make sense to load any backup. Instead we must fail with a message like "Failed to load the job list: here's your error, here's your stack trace, and a code to search our docs", etc.

And instead of trying to handle each exception caught or raised, or to work around the I/O errors, it's much safer to have a flag to not load the backup at all, or a separate function, used by monitor and any other read-only commands.
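
A minimal sketch of the flag idea, not a patch against master. Everything here is an assumption for illustration: the `allow_backup` parameter is hypothetical, `AutosubmitCritical` and `FailingPersistence` are stand-ins so the snippet runs on its own, and reusing code 6010 for the failure is just a placeholder. It also catches `Exception` rather than using the bare `except:` from master.

```python
import traceback


class AutosubmitCritical(Exception):
    """Stand-in for Autosubmit's own exception (assumed shape, not the real class)."""

    def __init__(self, message, code=7000, trace=None):
        super().__init__(message)
        self.message = message
        self.code = code
        self.trace = trace


class FailingPersistence:
    """Hypothetical persistence layer that simulates the I/O error from the diff."""

    def load(self, path, name):
        raise IOError("Something IO happened on LUMI")


class JobList:
    def __init__(self, persistence, path, name):
        self._persistence = persistence
        self._persistence_path = path
        self._persistence_file = name

    def backup_load(self):
        # Placeholder: the real method reads the backup pickle.
        return ["job-from-backup"]

    def load(self, allow_backup=True):
        """Load the job list; read-only commands would pass allow_backup=False."""
        try:
            return self._persistence.load(self._persistence_path, self._persistence_file)
        except AutosubmitCritical:
            raise
        except Exception as e:
            if not allow_backup:
                # Fail loudly, keeping the original error type and stack trace.
                raise AutosubmitCritical(
                    f"Failed to load the job list: {type(e).__name__}: {e}",
                    6010,  # placeholder code, an assumption
                    traceback.format_exc(),
                ) from e
            return self.backup_load()


job_list = JobList(FailingPersistence(), "/tmp", "job_list_a002")
print(job_list.load())  # default behaviour: falls back to the backup
try:
    job_list.load(allow_backup=False)  # what monitor could do
except AutosubmitCritical as err:
    print(err.code, err.message)
```

With the default, existing callers keep today's behaviour; only read-only commands such as monitor would opt out of the backup.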

Cheers

Any other relevant information(if applicable)

()