Autosubmit doesn't update the status and keeps running
Hi @dbeltran, I have an experiment where Autosubmit didn't update the status of a job and kept it as QUEUING even though it had already run (and failed):
4 of 7 jobs remaining (12:07)
Job a3n3_19500101_fc0_1_SIM is QUEUING
4 of 7 jobs remaining (12:08)
Job a3n3_19500101_fc0_1_SIM is QUEUING
(···)
4 of 7 jobs remaining (12:18)
Job a3n3_19500101_fc0_1_SIM is QUEUING
4 of 7 jobs remaining (12:19)
Command squeue -j 23552014, -o %A,%R in mn1.bsc.es warning: slurm_load_jobs error: Invalid job id specified
Job a3n3_19500101_fc0_1_SIM is QUEUING
But when looking at LOG_a3n3 on the remote, the logs of the run were there (I had to move them manually so they would not be overwritten), and the last update was from 11 minutes before the squeue command error:
-rw-r--r-- 1 bsc32627 bsc32 307063 Jun 15 10:11 a3n3_19500101_fc0_1_SIM.20220615100956.err
-rw-r--r-- 1 bsc32627 bsc32 408195 Jun 15 10:11 a3n3_19500101_fc0_1_SIM.20220615100956.out
-rw-r--r-- 1 bsc32627 bsc32 693938 Jun 16 12:07 a3n3_19500101_fc0_1_SIM.cmd.out_bckp
-rw-r--r-- 1 bsc32627 bsc32 295969 Jun 16 12:07 a3n3_19500101_fc0_1_SIM.cmd.err_bckp
-rwxr-xr-x 1 bsc32627 bsc32 27648 Jun 16 12:33 a3n3_19500101_fc0_1_SIM.cmd
From the looks of it, AS wasn't able to get the RUNNING status from the remote, so it kept polling the same job ID while the job was running. Once the job failed (or finished), the ID could no longer be found, but autosubmit run kept going because it still thought the job was in QUEUE, until I stopped it manually. After that, using the autosubmit setstatus and autosubmit run commands, it created the job again with a new ID (I didn't test using the run command without first setting the job back to WAITING).
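
To make the expected behaviour concrete, here is a minimal sketch of the check I would expect after each poll. It is not Autosubmit's actual code: the function name `next_status` and the use of `UNKNOWN` as the fallback status are assumptions for illustration only. The one thing taken from the log above is the Slurm message "Invalid job id specified", which squeue prints once the job is no longer known to the scheduler.

```python
# Hypothetical sketch, not Autosubmit's real implementation.
# The message below is the one reported by squeue in the log above.
INVALID_JOB_ID = "Invalid job id specified"


def next_status(current_status: str, squeue_stderr: str) -> str:
    """Return the status to record after one squeue poll.

    current_status: status currently stored for the job (e.g. "QUEUING").
    squeue_stderr:  error text returned by the squeue command, if any.
    """
    if INVALID_JOB_ID in squeue_stderr:
        # The scheduler no longer knows this job ID, so the job cannot
        # still be queuing. Flag it so the caller can inspect the remote
        # logs or accounting data instead of polling forever.
        return "UNKNOWN"  # assumed fallback status name
    # No error: keep whatever status was already recorded.
    return current_status


if __name__ == "__main__":
    err = "slurm_load_jobs error: Invalid job id specified"
    print(next_status("QUEUING", err))
    print(next_status("QUEUING", ""))
```

With something like this, the poll at 12:19 would have moved the job out of QUEUING instead of leaving the run looping until it was stopped manually.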