Launch jobs on MN5 (and CTE-AMD?)

This issue tracks the changes needed to run jobs on HPC machines, particularly those that do not have /esarchive mounted (e.g. MN5).

Running without Autosubmit

This case is simple because the launcher is run directly from the HPC machine. On MN5, it is mandatory to specify the group (bsc32) and the QoS (gp_bsces) in sbatch calls. Therefore, I added the following lines to launch_SUNSET.sh:

  # Is machine MN5? If so, set the account (-A) and QoS (-q) required by sbatch
  if [[ "$BSC_MACHINE" == "mn5" ]]; then
    platform_params="-A bsc32 -q gp_bsces"
  else
    platform_params=""
  fi

I then added the platform_params variable to the sbatch calls.
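As a minimal sketch of how the variable could be passed to sbatch: the helper get_platform_params, the job name and the script name example_job.sh are hypothetical and are not taken from launch_SUNSET.sh. The command is printed rather than submitted, so the snippet can be tried on any machine.

```shell
# Hypothetical helper wrapping the machine check shown above.
# Prints the extra sbatch options for the machine name given as $1.
get_platform_params() {
  if [ "$1" = "mn5" ]; then
    echo "-A bsc32 -q gp_bsces"
  fi
}

platform_params=$(get_platform_params "${BSC_MACHINE:-}")

# Printed instead of executed. $platform_params is deliberately left
# unquoted so that the options are split into separate words.
# example_job and example_job.sh are placeholder names.
echo sbatch $platform_params --job-name=example_job example_job.sh
```

On machines other than MN5 the variable is empty, so the sbatch call is unchanged.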

Running with Autosubmit

The Autosubmit case is more complicated because the code runs on two different platforms: one with access to esarchive but not to gpfs (Workstation, Hub), and another with access to gpfs but not to esarchive (the HPC machine).

As for the user-specified parameters:

The output directory should be in esarchive
The filesystem should be the filesystem of the HPC platform (i.e. gpfs)
The platform should be specified in the auto_conf section

Run:
  Loglevel: INFO
  Terminal: yes
  filesystem: gpfs
  output_dir: /esarchive/scratch/vagudets/auto-s2s-outputs/
  code_dir: /home/Earth/vagudets/git/auto-s2s/
  autosubmit: yes
  # fill only if using autosubmit
  auto_conf:
    script: use_cases/ex1_2_autosubmit_scorecards/ex1_2-script.R # replace with the path to your script
    expid: a7rn # replace with your EXPID
    hpc_user: bsc032762 # replace with your hpc username
    wallclock: 03:00 # hh:mm
    processors_per_job: 16
    platform: mn5
    custom_directives: # ['#SBATCH --exclusive'] 
    email_notifications: yes # enable/disable email notifications. Change it if you want to.
    email_address: victoria.agudetse@bsc.es # replace with your email address
    notify_completed: yes # notify me by email when a job finishes
    notify_failed: no # notify me by email when a job fails

The general idea is that the atomic recipes and logs are first saved in esarchive. When Autosubmit is launched, the first job, transfer_recipes, copies this data to a temporary directory in gpfs. It also copies the SUNSET repository so that the code can be accessed from gpfs. After the last job runs, another job, transfer_results, copies the results from gpfs to esarchive and removes the temporary directory.
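The stage-in and stage-out steps can be sketched roughly as follows. This is only an illustration under assumptions: the function bodies, the outputs/ and code/ subdirectories and the use of cp are hypothetical, and the actual jobs run from a transfer machine and may use a different copy tool and layout. Local temporary directories stand in for esarchive and gpfs.

```shell
# Hypothetical stand-in for the transfer_recipes job.
# $1: output directory (esarchive), $2: code directory, $3: temporary directory (gpfs)
transfer_recipes() {
  mkdir -p "$3/outputs" "$3/code"
  cp -r "$1/." "$3/outputs/"   # atomic recipes and logs
  cp -r "$2/." "$3/code/"      # copy of the repository
}

# Hypothetical stand-in for the transfer_results job.
# $1: temporary directory (gpfs), $2: output directory (esarchive)
transfer_results() {
  cp -r "$1/outputs/." "$2/"   # copy the results back
  rm -rf "$1"                  # remove the temporary directory
}

# Demo with local directories standing in for the two filesystems.
esarchive_dir=$(mktemp -d)
code_dir=$(mktemp -d)
tmp_root=$(mktemp -d)
echo "recipe" > "$esarchive_dir/atomic_recipe_01.yml"
echo "code" > "$code_dir/script.R"

transfer_recipes "$esarchive_dir" "$code_dir" "$tmp_root/run"
echo "result" > "$tmp_root/run/outputs/result.txt"   # stands in for the verification job
transfer_results "$tmp_root/run" "$esarchive_dir"
```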

In the conf/autosubmit.yml configuration file, I have added a new field tmp_dir, which specifies the structure of the temporary directory for each filesystem. For example:

gpfs:
  platform: mn5
  module_version: autosubmit/4.0.98-foss-2015a-Python-3.7.3
  auto_version: 4.0.98
  conf_format: yaml
  experiment_dir: /esarchive/autosubmit/
  userID: bsc032
  tmp_dir: /gpfs/scratch/bsc32/$user$/tmp/

Here, $user$ is replaced with the username specified in the recipe. prepare_outputs() parses the structure of this tmp_dir and adds the information to the recipe. Then, when write_autosubmit_conf() is called, this information is also added to the Autosubmit experiment configuration files, and the platform information and the dependencies between jobs are set according to what the user has requested.
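To illustrate the placeholder substitution, here is a small shell sketch using the tmp_dir template and the hpc_user value from the examples above. It is not the code in prepare_outputs(); the variable names are hypothetical and sed is used only for illustration.

```shell
# Values taken from the examples above.
tmp_dir_template='/gpfs/scratch/bsc32/$user$/tmp/'
hpc_user='bsc032762'

# Replace the literal $user$ token with the username.
# In the sed pattern, \$ matches a literal dollar sign.
tmp_dir=$(printf '%s\n' "$tmp_dir_template" | sed 's|\$user\$|'"$hpc_user"'|g')

echo "$tmp_dir"   # prints /gpfs/scratch/bsc32/bsc032762/tmp/
```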

transfer_recipes is run from one of the data transfer machines. Once the recipes have been transferred, the verification job runs as usual, launched on MN5. read_atomic_recipe() replaces the output_dir in the recipe with tmp_dir so that all the files are saved in gpfs.

transfer_results depends on the last job of the workflow, so it runs only once all the other jobs have finished. This means that if only verification has been requested, it will run after all the verification jobs have finished. If the Scorecards have also been requested, it will run after the Scorecards.

For gpfs, a new Autosubmit configuration template has been created under autosubmit/conf_gpfs/. Here, I modified:

platforms.yml: To add the information for MN5 and for the transfer machine. Technically, other platforms like nord3v2 and CTE-AMD could also be used.
jobs.yml: To add the information for the transfer_recipes and transfer_results jobs.

Additionally, I had to add the command set +eu before source MODULES in the autosubmit/auto-verification.sh and autosubmit/auto-scorecards.sh scripts. This works around a bug in conda that causes problems when activating a conda environment from inside a shell script.
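The pattern looks roughly like the sketch below, assuming strict mode (set -eu) is active when the job scripts run. The temporary file is a hypothetical stand-in for MODULES; it reads an unset variable, which is one way a sourced script can abort under set -u, though whether that is the exact failure in this case is an assumption. Re-enabling strict mode afterwards is also an assumption and not something stated above.

```shell
set -eu   # strict mode, assumed to be active in the job scripts

# Hypothetical stand-in for the MODULES file.
modules_file=$(mktemp)
cat > "$modules_file" <<'EOF'
# Reading an unset variable aborts the script under `set -u`.
: "${SUNSET_EXAMPLE_UNSET_VAR}"
modules_loaded=yes
EOF

set +eu            # relax strict mode while the environment is loaded
. "$modules_file"  # equivalent to `source MODULES` in the actual scripts
set -eu            # assumption: strict mode is restored afterwards

echo "modules loaded: $modules_loaded"
```

Without the set +eu line, the sourced file in this sketch would stop the script at the unset variable.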

I am currently testing this and will push the development to the dev-launch_on_MN5 branch when I have a working version.

FYI @nperez, let me know if you have any additional comments or requests.

Cheers,

Victòria

Edited Sep 20, 2024 by vagudets