Fallback mechanism in check_job_status
It seems that the information we obtain to check the status of the jobs can hang an experiment workflow if the command reports incorrect info. My proposal is to add one or both of the following:
- Implement a check_wallclock in the check_job process for when a job has been running for longer than its wallclock. (This is already working in wrappers; it is a matter of extending it to normal jobs.)
- Implement a second command (if available) as a fallback. The issue with this is that each platform can have its own commands.
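To make the two points concrete, here is a rough sketch of what they could look like. All names (`parse_wallclock`, `exceeded_wallclock`, `query_status`), the `HH:MM` wallclock format, the 5-minute grace period and the `"UNKNOWN"` status are assumptions for illustration, not the existing code; the real check would have to hook into the current check_job process and the per-platform commands.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


def parse_wallclock(wallclock: str) -> timedelta:
    """Convert an 'HH:MM' wallclock string into a timedelta (assumed format)."""
    hours, minutes = wallclock.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def exceeded_wallclock(start: datetime, wallclock: str, now: datetime,
                       grace: timedelta = timedelta(minutes=5)) -> bool:
    """True if the job has run longer than its wallclock plus a grace period."""
    return now - start > parse_wallclock(wallclock) + grace


def query_status(primary: Callable[[], str],
                 fallback: Optional[Callable[[], str]] = None) -> str:
    """Ask the primary command for the status; try the fallback if the primary
    raises or returns an empty answer. 'UNKNOWN' means neither one answered."""
    for command in (primary, fallback):
        if command is None:
            continue
        try:
            status = command().strip()
        except Exception:
            # A failing command should not hang or crash the workflow.
            continue
        if status:
            return status
    return "UNKNOWN"
```

With something like this, a job whose reported status is still running but for which `exceeded_wallclock(...)` is true could be flagged for a closer look instead of being polled forever, and each platform would only need to provide a fallback callable where a second command exists.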
Also, yesterday, while the platform was being recovered, an experiment hung while checking whether the interaction with the platform was okay. I thought that command had a timeout by default (since it had worked properly in all cases until yesterday), but it seems that is not the case.
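For the missing timeout, a minimal sketch of the idea, assuming the check can be run as a local subprocess. The helper name `run_with_timeout` and the 30-second default are made up for illustration; if the check goes through an SSH session instead, the timeout would have to be set on that connection rather than here.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(argv: List[str], timeout_s: float = 30.0) -> Optional[str]:
    """Run a command and return its stdout, or None if it fails or times out."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

Returning `None` lets the caller treat a hung command the same way as a failed one, so the workflow can retry or mark the platform as unavailable instead of blocking.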