Launch jobs on MN5 (and CTE-AMD?)
This is an issue to track the changes needed to run jobs on HPC machines (particularly those that do not have /esarchive mounted e.g. MN5).
Running without Autosubmit
This case is simple because the launcher is run directly from the HPC machine. On MN5, it is mandatory to specify the group (bsc32) and the QoS (gp_bsces) in sbatch calls. Therefore I added the following lines to launch_SUNSET.sh:
# Is machine MN5?
if [[ $BSC_MACHINE == "mn5" ]]; then
    platform_params="-A bsc32 -q gp_bsces"
else
    platform_params=""
fi
I then added the platform_params variable to the sbatch calls.
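As a rough sketch of how the variable could end up in an sbatch call, the snippet below wraps the check in a function and prints the resulting command instead of submitting it. The function name get_platform_params, the --job-name value and job_script.sh are hypothetical and are not taken from launch_SUNSET.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the extra sbatch flags for a given machine name.
get_platform_params() {
    local machine="${1:-}"
    if [[ "$machine" == "mn5" ]]; then
        # MN5 requires the group (-A) and the QoS (-q)
        echo "-A bsc32 -q gp_bsces"
    else
        echo ""
    fi
}

platform_params="$(get_platform_params "${BSC_MACHINE:-}")"

# Dry run: print the command rather than submitting a job.
# $platform_params is left unquoted on purpose so it splits into separate flags.
echo sbatch $platform_params --job-name=example job_script.sh
```

On any machine other than MN5 the variable is empty, so the sbatch call is unchanged.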
Running with Autosubmit
Running with Autosubmit is more complicated because the code runs on two different platforms: one with access to esarchive but not to gpfs (Workstation, Hub), and another with access to gpfs but not to esarchive (the HPC machine).
Regarding the user-specified parameters:
- The output directory should be in esarchive
- The filesystem should be the filesystem of the HPC platform (i.e. gpfs)
- The platform should be specified in the auto_conf section
Run:
  Loglevel: INFO
  Terminal: yes
  filesystem: gpfs
  output_dir: /esarchive/scratch/vagudets/auto-s2s-outputs/
  code_dir: /home/Earth/vagudets/git/auto-s2s/
  autosubmit: yes
  # fill only if using autosubmit
  auto_conf:
    script: use_cases/ex1_2_autosubmit_scorecards/ex1_2-script.R # replace with the path to your script
    expid: a7rn # replace with your EXPID
    hpc_user: bsc032762 # replace with your hpc username
    wallclock: 03:00 # hh:mm
    processors_per_job: 16
    platform: mn5
    custom_directives: # ['#SBATCH --exclusive']
    email_notifications: yes # enable/disable email notifications. Change it if you want to.
    email_address: victoria.agudetse@bsc.es # replace with your email address
    notify_completed: yes # notify me by email when a job finishes
    notify_failed: no # notify me by email when a job fails
The general idea is that the atomic recipes and logs are first saved in esarchive. When autosubmit is launched, the first job transfer_recipes copies this data to a temporary directory in gpfs. It also copies the SUNSET repository, so that the code can be accessed from gpfs. After the last job runs, another job transfer_results copies the results from gpfs to esarchive, and removes the temporary directory.
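The two transfer steps can be illustrated with a minimal sketch. It assumes plain recursive copies and uses temporary directories as stand-ins for esarchive and gpfs. The file names are hypothetical, the copy of the SUNSET repository is left out, and the real jobs may use a different transfer tool.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins: temporary directories play the roles of the
# esarchive output directory and the gpfs temporary directory.
esarchive_out="$(mktemp -d)"
gpfs_tmp="$(mktemp -d)/run_tmp"

# Pretend the atomic recipes have already been saved in esarchive.
mkdir -p "$esarchive_out/logs/recipes"
echo "atomic recipe" > "$esarchive_out/logs/recipes/atomic_recipe_01.yml"

# First job: copy recipes and logs to the temporary directory.
transfer_recipes() {
    local src="$1" dest="$2"
    mkdir -p "$dest"
    cp -r "$src/." "$dest/"
}

# Last job: copy everything back, then remove the temporary directory.
transfer_results() {
    local src="$1" dest="$2"
    cp -r "$src/." "$dest/"
    rm -rf "$src"
}

transfer_recipes "$esarchive_out" "$gpfs_tmp"

# Stand-in for the verification job writing its output on gpfs.
mkdir -p "$gpfs_tmp/outputs"
echo "skill scores" > "$gpfs_tmp/outputs/result.txt"

transfer_results "$gpfs_tmp" "$esarchive_out"
ls -R "$esarchive_out"
```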
In the conf/autosubmit.yml configuration file, I have added a new field tmp_dir, which specifies the structure of the temporary directory for each filesystem. For example:
gpfs:
  platform: mn5
  module_version: autosubmit/4.0.98-foss-2015a-Python-3.7.3
  auto_version: 4.0.98
  conf_format: yaml
  experiment_dir: /esarchive/autosubmit/
  userID: bsc032
  tmp_dir: /gpfs/scratch/bsc32/$user$/tmp/
Here, $user$ will be replaced with the username specified in the recipe. prepare_outputs() parses the structure of this tmp_dir and adds the information to the recipe. Then, when write_autosubmit_conf() is called, this information is also added to the autosubmit experiment configuration files, and the platform information and dependencies between the jobs are specified according to what the user has requested.
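To make the $user$ substitution concrete, here is a small shell illustration of the idea. It is not the code in prepare_outputs(), and the function name expand_tmp_dir is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replaces every literal "$user$" in a template
# with the given username.
expand_tmp_dir() {
    local template="$1" user="$2"
    # \$ matches a literal dollar sign inside the pattern
    printf '%s\n' "${template//\$user\$/$user}"
}

# Single quotes keep "$user$" literal, as it appears in conf/autosubmit.yml.
expand_tmp_dir '/gpfs/scratch/bsc32/$user$/tmp/' 'bsc032762'
```

With the hpc_user from the recipe above, the template resolves to /gpfs/scratch/bsc32/bsc032762/tmp/.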
transfer_recipes is run from one of the data transfer machines. Once the recipes have been transferred, the verification job runs as usual, launched on MN5. read_atomic_recipe() replaces the output_dir in the recipe with tmp_dir so that all the files are saved there in gpfs.
transfer_results has a dependency on the last job of the workflow, so it runs once all the other jobs have finished. If only verification has been requested, it will run after all the verification jobs have finished. If the Scorecards have also been requested, it will run after the Scorecards.
For gpfs, a new autosubmit configuration template has been created under autosubmit/conf_gpfs/. Here, I modified:
- platforms.yml: To add the information for MN5 and for the transfer machine. Technically, other platforms like nord3v2 and CTE-AMD could also be used.
- jobs.yml: To add the information for the transfer_recipes and transfer_results jobs.
An additional note: I had to add the command set +eu before source MODULES in the autosubmit/auto-verification.sh and autosubmit/auto-scorecards.sh scripts. This is because of a bug in conda that causes problems when activating a conda environment from inside a shell script.
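The effect of set +eu can be sketched without conda. The example assumes the failure comes from the sourced script reading an unset variable while the shell is in strict mode, which is a guess about the conda bug and not something confirmed here. The file and variable names are hypothetical stand-ins for MODULES.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the MODULES file: it reads a variable that is unset.
unset DEMO_UNSET_VARIABLE
modules_file="$(mktemp)"
cat > "$modules_file" <<'EOF'
demo_value="${DEMO_UNSET_VARIABLE}"
demo_loaded="yes"
EOF

# Strict mode: the subshell aborts as soon as the unset variable is read.
strict_result="$( (set -eu; source "$modules_file"; echo "$demo_loaded") 2>/dev/null || echo "aborted" )"

# Relaxed mode: "set +eu" right before "source" lets the file load.
relaxed_result="$( (set -eu; set +eu; source "$modules_file"; echo "$demo_loaded") )"

echo "strict:  $strict_result"
echo "relaxed: $relaxed_result"
```

Each case runs in a subshell so the options do not leak into the calling script.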
I am currently testing this and will push the development to the dev-launch_on_MN5 branch when I have a working version.
FYI @nperez, let me know if you have any additional comments or requests.
Cheers,
Victòria