|
|
Autosubmit Wiki

Objective
---------
|
|
|
|
|
|
Autosubmit launches and monitors experiments on any platform used at
CFU. This page provides a general description of a typical climate
forecast experiment and of the goal of Autosubmit, a more technical
description of its architecture and how it works, instructions for
installing it on your computer, the user's manual and documentation, and
the Autosubmit developers' page.
|
|
|
|
|
|
- [List of experiments](http://enterprise:8000/autosubmit_v2) (only
  accessible from inside the CFU network)
|
|
|
|
|
|

Description
-----------
|
|
|
|
|
|
### General Description
|
|
|
|
|
|
#### Introduction
|
|
|
|
|
|
A typical climate forecast experiment is a run of a climate model on a
supercomputer, with a forecast length ranging from a few months to a few
years. An experiment may have one or more start-dates, and every
start-date may comprise one or many members. The full forecast period of
the experiment can be divided into a number of chunks of fixed forecast
length by exploiting the restart options of the model. In terms of
computing operations, every chunk has two main sections: a parallel
section, where the actual model run is performed on the computing cores
of the supercomputer, and one or more serial sections for other
necessary operations such as post-processing the model output, archiving
it and cleaning the disk space so that the experiment can proceed
smoothly.
|
|
|
![](Experiment_new.png "fig:Experiment_new.png")
|
|
|
|
|
|
The sample experiment in the figure consists of 10 start-dates from 1960
to 2005, one every 5 years, each independent of the others. Each
start-date comprises 5 members. Every member is also independent and is
divided into 10 chunks, which depend on each other. Suppose that the
forecast length of each chunk is one year and that every chunk comprises
three types of jobs: a simulation (Sim), a post-processing job (Post)
and an archiving and cleaning job (Clean). In this example, one member
of one start-date comprises 30 jobs, and 1500 jobs have to be run in
total to complete the experiment. A system is therefore needed to
automate experiments of this type and to optimize the use of resources.
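
The job count in this example follows directly from the experiment layout. A minimal Python sketch of the arithmetic is given here; the variable names are illustrative only and are not part of Autosubmit.

```python
# Layout of the sample experiment (names are illustrative, not Autosubmit's).
start_dates = list(range(1960, 2006, 5))  # 1960, 1965, ..., 2005
members_per_start_date = 5
chunks_per_member = 10
job_types = ["Sim", "Post", "Clean"]  # one job of each type per chunk

# Jobs needed for one member of one start-date.
jobs_per_member = chunks_per_member * len(job_types)

# Jobs needed for the whole experiment.
total_jobs = len(start_dates) * members_per_start_date * jobs_per_member

print(len(start_dates), jobs_per_member, total_jobs)
```

Changing any of the four inputs shows how quickly the number of jobs grows, which is the motivation for automating their submission.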
|
|
|
|
|
|
#### Goal
|
|
|
|
|
|
Autosubmit is a tool to manage and monitor climate forecasting
experiments that run remotely on supercomputers. It aims to achieve the
following goals:

- Efficient handling of highly dependent jobs
- Optimum utilization of available computing resources
- Ease of starting, stopping and live monitoring of experiments
- Automatic restart of the experiment, or of part of it, in case of
  failure
- Use of a database for experiment creation and automatic assignment of
  experiment identifiers
- Ability to reproduce completed experiments fully or partially
|
|
|
|
|
|
![](Autosubmit24.png "Autosubmit24.png")
|
|
|
|
|
|
### Technical Description
|
|
|
|
|
|
#### Introduction
|
|
|
|
|
|
Originally, Autosubmit consisted of a single Perl script (written by
Xavi Abellan\*) that could submit a sequence of jobs with different
parameters to the queue. All the jobs shared a common template;
Autosubmit filled this template with different parameter values and
submitted the jobs to the queue. Autosubmit acted as a wrapper around
the scheduler: it monitored the number of jobs submitted or queuing and
submitted a new one as soon as a slot in the queue became free, until
the entire sequence of jobs had been submitted.
|
|
|
|
|
|
This concept has been kept in the current Python version of Autosubmit,
with a few capabilities added. The most interesting one is that
Autosubmit can now deal with dependencies between jobs, i.e. it can wait
for a particular job to finish before launching the next one. Autosubmit
can manage different types of jobs with different templates. It can also
restart a failed job, and stop the submission process and resume it
where it left off. The Python code has been refactored with a new
object-oriented design, and a new module creates experiments from
scratch and stores a small amount of information in an SQLite database.
Thanks to this, it is also possible to create, manage and monitor
different types of experiments (currently EC-Earth, NEMO and IFS) and to
work with different queue schedulers (such as PBS, SGE and SLURM).
|
|
|
|
|
|
![](Scheduler.png "Scheduler.png")
|
|
|
|
|
|
#### What is a Job?
|
|
|
|
|
|
A job, in HPC jargon, is a program submitted to the queue system. It can
be serial or multi-threaded, use different types of queue and carry any
of the directives that the scheduler of the HPC system provides. Within
Autosubmit a Job class has been created, and in the rest of the
documentation the term "Job" refers to the Python object of that class.
A job has several attributes:

- job.name: This name must be unique if several jobs are created.
- job.id: This job ID is 0 by construction and is set by the scheduler,
  hence it is only unique once the job has been submitted.
- job.status: The status is updated regularly and tells Autosubmit
  whether a Job is ready to be submitted, completed, queuing, etc.
- job.type: Each job type has a different template, so multi-processor
  and serial jobs, for example, can be treated differently.
- job.failcount: This counter keeps track of the number of times that a
  job has failed. At the moment, if a job fails more than 4 times it is
  cancelled and not resubmitted.
|
|
|
|
|
|
The dependency between jobs is handled through the concept of
inheritance. Each Job has two more attributes:

- job.Children: The list of dependent jobs. Those children can only be
  launched once this job is completed.
- job.Parents: The list of jobs whose completion this job has to wait
  for. Only when this list is empty can the job be submitted.
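
The attributes listed in this section can be pictured with a small Python sketch. It is not Autosubmit's actual Job class: the method names (`add_child`, `is_ready`, `complete`, `fail`) and the status labels are assumptions made for illustration, while the attribute names and the limit of 4 failures come from the description.

```python
class Job:
    """Illustrative stand-in for the Job class described here (not Autosubmit's code)."""

    MAX_FAILURES = 4  # a job that fails more than 4 times is not resubmitted

    def __init__(self, name, job_type):
        self.name = name         # must be unique among the jobs created
        self.id = 0              # 0 by construction, set by the scheduler later
        self.status = "WAITING"  # hypothetical status label
        self.type = job_type     # selects the template, e.g. "Sim" or "Post"
        self.failcount = 0
        self.parents = []        # jobs this job has to wait for
        self.children = []       # jobs that depend on this job

    def add_child(self, child):
        """Declare that `child` may only start once this job has completed."""
        self.children.append(child)
        child.parents.append(self)

    def is_ready(self):
        """A job can be submitted only when its list of parents is empty."""
        return not self.parents

    def complete(self):
        """Mark the job completed and release its children."""
        self.status = "COMPLETED"
        for child in self.children:
            child.parents.remove(self)

    def fail(self):
        """Record a failure; return True while the job may still be resubmitted."""
        self.failcount += 1
        return self.failcount <= self.MAX_FAILURES


sim = Job("chunk1_sim", "Sim")
post = Job("chunk1_post", "Post")
sim.add_child(post)
print(post.is_ready())  # False: still waiting for sim
sim.complete()
print(post.is_ready())  # True: no parents left
```

Removing a completed job from its children's parent lists is one simple way to obtain the "empty list means ready" rule; other bookkeeping schemes would work equally well.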
|
|
|
|
|
|
#### What is a JobList?
|
|
|
|
|
|
The JobList module groups all the functions necessary for managing a
list of jobs. A joblist object can be sorted by status, type, jobid or
name, and sublists can be created from it. The updateJobList() function
is called at every loop of Autosubmit and updates the status of the jobs
in the list. The status of a job is therefore only guaranteed to be
current directly after a call to that function. The SaveJobList()
function saves the joblist in a pickle file, which can then be reloaded,
for example for a restart. Other functions such as updateGenealogy() are
called only once, after a joblist is created. When the joblist is
created, the dependencies between jobs can only be expressed with job
names. The updateGenealogy() function replaces the children and parent
names with job objects.
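
The name-to-object replacement and the pickle-based save can be sketched as follows. This is a simplified illustration under stated assumptions, not the real JobList module: the snake_case method names, the minimal `Job` class and the "WAITING" status label are all hypothetical.

```python
import pickle


class Job:
    """Minimal hypothetical job: dependencies are first given as names."""

    def __init__(self, name, parents=()):
        self.name = name
        self.status = "WAITING"       # hypothetical status label
        self.parents = list(parents)  # names at first, Job objects later
        self.children = []


class JobList:
    """Illustrative container for jobs (not Autosubmit's JobList)."""

    def __init__(self, jobs):
        self.jobs = list(jobs)

    def update_genealogy(self):
        """Replace parent names with Job objects and fill in the children."""
        by_name = {job.name: job for job in self.jobs}
        for job in self.jobs:
            job.parents = [by_name[name] for name in job.parents]
            for parent in job.parents:
                parent.children.append(job)

    def by_status(self, status):
        """Return the sublist of jobs that currently have the given status."""
        return [job for job in self.jobs if job.status == status]

    def sorted_by_name(self):
        return sorted(self.jobs, key=lambda job: job.name)

    def save(self, path):
        """Store the whole list in a pickle file, e.g. for a later restart."""
        with open(path, "wb") as handle:
            pickle.dump(self, handle)

    @staticmethod
    def load(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
```

Like `updateGenealogy()`, `update_genealogy()` is meant to be called once, right after the list is built; calling it a second time would fail because the parents are no longer names.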
|
|
|
|
|
|
#### General HPCQueue
|
|
|
|
|
|
Autosubmit needs to interact with the queue system regularly to know how
many jobs are in the queue and thus how many jobs can be submitted. The
HPCQueue abstract class provides all the functions necessary to
communicate with the scheduler, so that a job can be checked, cancelled
or submitted and the state of the queue assessed at any time.
|
|
|
|
|
|
#### Concrete HPCQueue
|
|
|
|
|
|
A concrete queue is a specialization of HPCQueue that inherits all the
functions common to a general queue and adds the attributes and methods
specific to each queue system. Autosubmit currently has concrete modules
that wrap the queue commands of the MareNostrum machines, the Ithaca
cluster and the Lindgren machines (MnQueue, ItQueue and LgQueue). A
concrete queue has several attributes:

- queue.host: The host name or IP address used to set up connections.
- queue.job\_status: Each job status has a different code depending on
  the queue scheduler, so the responses of each concrete HPCQueue can be
  treated differently.
- queue.submit\_cmd: The concrete command to submit jobs.
- queue.checkjob\_cmd: The concrete command to check a job status.
- queue.cancel\_cmd: The concrete command to cancel jobs.
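
The split between the general and the concrete queue can be sketched in Python as an abstract base class plus one subclass. This is an illustration only: `SlurmQueue`, the method names, the host name and the common status labels are assumptions, and the real MnQueue, ItQueue and LgQueue modules may be organised differently. The SLURM commands (`sbatch`, `squeue`, `scancel`) and state codes are standard, but the sketch only builds the command lines and does not run them.

```python
from abc import ABC, abstractmethod


class HPCQueue(ABC):
    """Hypothetical general queue: builds the commands to run on the remote host."""

    submit_cmd = None
    checkjob_cmd = None
    cancel_cmd = None
    job_status = {}  # scheduler-specific code -> common status name

    def __init__(self, host):
        self.host = host  # host name or IP address used for the ssh connection

    def _remote(self, command, argument):
        return ["ssh", self.host, f"{command} {argument}"]

    def submit_command(self, script):
        return self._remote(self.submit_cmd, script)

    def check_command(self, job_id):
        return self._remote(self.checkjob_cmd, job_id)

    def cancel_command(self, job_id):
        return self._remote(self.cancel_cmd, job_id)

    @abstractmethod
    def parse_status(self, scheduler_output):
        """Translate raw scheduler output into a common status name."""


class SlurmQueue(HPCQueue):
    """Hypothetical concrete queue for a SLURM scheduler."""

    submit_cmd = "sbatch"
    checkjob_cmd = "squeue -h -o %t -j"  # print only the compact job state
    cancel_cmd = "scancel"
    job_status = {"PD": "QUEUING", "R": "RUNNING", "CD": "COMPLETED", "F": "FAILED"}

    def parse_status(self, scheduler_output):
        return self.job_status.get(scheduler_output.strip(), "UNKNOWN")


queue = SlurmQueue("login.example.org")  # placeholder host name
print(queue.submit_command("chunk1_sim.cmd"))
print(queue.parse_status("PD\n"))
```

Keeping the scheduler-specific pieces in class attributes means the rest of the code only ever talks to the `HPCQueue` interface, which is the point of the abstract/concrete split described here.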
|
|
|
|
|
|
![](Queues.png "Queues.png")
|
|
|
|
|
|
#### Monitoring the experiment
|
|
|
|
|
|
Functionality to monitor an experiment has also been added to
Autosubmit. From the joblist, it is possible to create a "tree" to
visualize the status of the jobs. Each status has a different color:
green = running, red = failed, etc.
|
|
|
|
|
|
![](JobListTree.png "JobListTree.png")
|
|
|
|
|
|
#### Job Wrapper
|
|
|
|
|
|
Supercomputers are rapidly increasing their number of cores, but the
rules for using them are also becoming stricter (e.g. a minimum of 2000
cores per job). Such a minimum is not feasible with the current state of
EC-Earth, which is difficult to scale beyond a few hundred cores.
|
|
|
|
|
|
In order to provide a solution to the climate community, we have been
testing a job wrapper. The idea is to run several ensemble members at
the same time under the control of a Python script. We upload the script
for each ensemble member we want to run. The wrapper has to allocate
resources for every script it runs (e.g. if each script requires 45 CPUs
and we want to run 10 of them, that is 450 CPUs). The wrapping Python
script creates a thread for every ensemble member and runs them.
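
The thread-per-member idea can be sketched with the Python standard library. This is not the wrapper used in the tests reported here: the function name `run_members` and the demo commands are hypothetical, and a real wrapper would launch the member scripts inside the scheduler allocation rather than short Python one-liners.

```python
import subprocess
import sys
import threading


def run_members(commands):
    """Run one command per ensemble member, each in its own thread.

    Returns the exit codes in the same order as `commands`.
    """
    return_codes = [None] * len(commands)

    def run_one(index, command):
        # Each thread blocks on its own member script until it finishes.
        return_codes[index] = subprocess.call(command)

    threads = [
        threading.Thread(target=run_one, args=(index, command))
        for index, command in enumerate(commands)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return return_codes


# Resources the wrapper has to request for the example in the text.
cores_per_member = 45
n_members = 10
total_cores = cores_per_member * n_members

if __name__ == "__main__":
    # Stand-in "member scripts": three trivial Python processes.
    demo = [[sys.executable, "-c", "print('member %d')" % i] for i in range(3)]
    print(run_members(demo), total_cores)
```

Threads are sufficient here because each one only waits for an external process; the heavy computation happens in the member scripts themselves.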
|
|
|
|
|
|
Further information:
|
|
|
|
|
|
1. International Conference on Computational Science (Cairns,
|
|
|
Australia, June 10 - 12, 2014), Impact of I/O and Data Management in
|
|
|
Ensemble Large Scale Climate Forecasting Using EC-Earth3.
|
|
|
![]( Poster_Masif_ICCS_2014.pdf "fig: Poster_Masif_ICCS_2014.pdf ")
|
|
|
2. [Asif](:File: masif_procs_2014.pdf "wikilink"), M., A. Cencerrado,
|
|
|
O. Mula-Valls, D. Manubens, F.J. Doblas-Reyes and A. Cortés (2014).
|
|
|
Impact of I/O and data management in ensemble large scale climate
|
|
|
forecasting using EC-Earth3. [Procedia Computer Science, 29,
|
|
|
2370-2379,
|
|
|
10.1016/j.procs.2014.05.221](http://www.sciencedirect.com/science/article/pii/S1877050914003986)
|
|
|
(SPECS, IS-ENES2, INCITE).
|
|
|
|
|
|
|
|
|
|
|
|
#### IS-ENES 2
|
|
|
|
|
|
##### A CNRM-CM6 monitoring using Autosubmit
|
|
|
|
|
|
A few members of a seasonal forecast experiment using CNRM-CM6 on the
ECMWF IBM Power 7 have been run under Autosubmit monitoring. A
collaboration of a few days at IC3 was sufficient to adapt the existing
CNRM workflow script to Autosubmit's non-intrusive requirements.
Nevertheless, more comprehensive work would be necessary to fully
exploit Autosubmit's capabilities to monitor and control the full
workflow (from compiling) on any kind of supercomputer platform.
|
|
|
|
|
|
The technical report describing the work is available here:
|
|
|
<http://www.cerfacs.fr/globc/publication/technicalreport/2014/autosubmit_cnrm-cm.pdf>
|
|
|
|
|
|

Requirements
------------
|
|
|
|
|
|
### How to deploy/setup Autosubmit (v2)
|
|
|
|
|
|
Autosubmit has been tested with the following operating systems:
|
|
|
|
|
|
- Linux Debian
|
|
|
|
|
|
and on the following HPCs/clusters:
|
|
|
|
|
|
- Ithaca (IC3 machine)
|
|
|
- MareNostrum (BSC machine)
|
|
|
- MareNostrum3 (BSC machine)
|
|
|
- HECToR (EPCC machine)
|
|
|
- Lindgren (PDC machine)
|
|
|
- C2A (ECMWF machine)
|
|
|
- ARCHER (EPCC machine)
|
|
|
|
|
|
Pre-requisites: These packages (python2, python-argparse,
python-dateutil, python-pydot, python-matplotlib, sqlite3) must be
available on the local machine. The machine must also be able to access
the HPCs/clusters via password-less ssh.
|
|
|
|
|
|
Create a repository for experiments, for example "/cfu/autosubmit", then
edit the repository path in src/dir\_config.py, src/expid.py and
conf/autosubmit.conf.

Create a blank database, for example "autosubmit.db", in the repository
created above:
|
|
|
|
|
|
`> cd /cfu/autosubmit`\
|
|
|
`> sqlite3 autosubmit.db`\
|
|
|
`sqlite> .read ../../src/autosubmit.sql`\
|
|
|
`> chmod 777 autosubmit.db`
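
The same database can also be created from Python with the standard `sqlite3` module. This is a sketch rather than part of Autosubmit: the function name `create_database` is hypothetical, and the paths in the commented call are taken from the example and should be adapted to your installation.

```python
import sqlite3


def create_database(db_path, schema_path):
    """Create (or open) the SQLite file and run the SQL statements of the schema."""
    with open(schema_path) as handle:
        schema = handle.read()
    connection = sqlite3.connect(db_path)
    try:
        # executescript runs every statement in the file, like `.read` in the shell.
        connection.executescript(schema)
        connection.commit()
    finally:
        connection.close()


# Example call (paths are assumptions based on the text):
# create_database("/cfu/autosubmit/autosubmit.db", "src/autosubmit.sql")
```

File permissions still have to be set separately, as in the `chmod` step of the shell version.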
|
|
|
|
|
|

Use
---
|
|
|
|
|
|
- Autosubmit 2.4.1 [documentation](http://autosubmit.ic3.cat)
|
|
|
- Autosubmit 2.4.1 CFU presentation ![](AS241.pdf "fig:AS241.pdf"),
  --[Dmanubens](User:Dmanubens "wikilink")
  ([talk](User talk:Dmanubens "wikilink")) 17:27, 4 July 2014 (CEST)
|
|
|
- Autosubmit 2.4.0
|
|
|
[documentation](http://autosubmit.ic3.cat/autosubmit2.4.0)
|
|
|
- Autosubmit 2.3
|
|
|
[documentation](http://autosubmit.ic3.cat/autosubmit2.3)
|
|
|
- Autosubmit 2.2
|
|
|
[documentation](http://autosubmit.ic3.cat/autosubmit2.2)
|
|
|
- Autosubmit 2.1
|
|
|
[documentation](http://autosubmit.ic3.cat/autosubmit2.1)
|
|
|
|
|
|

Repository
----------
|
|
|
|
|
|
To check out a working copy of Autosubmit from the CFU network, run:
`git clone https://dev.cfu.local/autosubmit.git your_path_to_working_copy`
|
|
|
|
|
|

Contact
-------
|
|
|
|
|
|
The coordinator of this project is Domingo Manubens Gil
|
|
|
\<domingo.manubens@ic3.cat\>
|
|
|
Domingo Manubens Gil \<domingo.manubens@ic3.cat\>, Oriol Mula-Valls
\<oriol.mula-valls@ic3.cat\>, Muhammad Asif \<muhammad.asif@ic3.cat\>,
Pierre-Antoine Bretonnière \<pierre-antoine.bretonniere@ic3.cat\>

As a new user, please register to this mailing list:
<http://autosubmit-users.ic3.cat/mailman/listinfo/autosubmit-users>
You'll then have access to the history of all the emails sent to the
users, which present the functions and their available options.

Development
-----------

### SCRUM Framework

- [ SCRUM Framework](Tools/SCRUM "wikilink")

### GIT branching scheme

- Since Autosubmit 2.2, templates and postp have been moved to new GIT
  projects. See the following presentations for better understanding:
- Autosubmit and GIT: new projects
  ![](ASandGIT.pdf "fig:ASandGIT.pdf")
- Autosubmit 2.3 and GIT ![](AS23andGIT.pdf "fig:AS23andGIT.pdf")

See the following page to check the current branching scheme used within
the GIT project 'autosubmit': [ Git branching
scheme](Computing/Git#GIT_branching_scheme "wikilink")

Style Guide
-----------

You can check the style guide for Autosubmit [ here
](Tools/StyleGuides/Python "wikilink")

Autosubmit launches and monitors experiments on any platform used at the Earth Sciences Department. A general description of the goal of Autosubmit, how it works, how to install it on your computer, the user's manual and documentation are available here.

### Description

Autosubmit is a **python** tool to create, manage and monitor experiments by using computing resources available at Computing Clusters, **HPC** and Supercomputers. It offers support for **experiments** running on more than one supercomputing platform and for different **workflow** configurations. Autosubmit manages the submission of jobs to the queue scheduler remotely via ssh, until there is no ***job*** left to be run. It also provides features to suspend, resume, restart and extend similar experiments at a later stage.

Autosubmit is currently used at the Barcelona Supercomputing Center (BSC) to run EC-Earth, NEMO and NMMB air quality models. Autosubmit has been used to manage models running on supercomputers at IC3, BSC, ECMWF, EPCC, PDC and OLCF.

Autosubmit is the only existing tool that satisfies the following requirements from the weather and climate community:

- **Automatisation**: Job submission to machines and dependencies between jobs are managed by Autosubmit. No user intervention is needed.
- **Data provenance**: Assigns unique identifiers for each experiment and stores information about model version, experiment configuration and computing facilities used in the whole process.
- **Failure tolerance**: Automatic retrials and ability to rerun chunks in case of corrupted or missing data.
- **Resource management**: Autosubmit manages supercomputer particularities, allowing users to run their experiments on the available machine without having to adapt the code. Autosubmit also allows tasks from the same experiment to be submitted to different platforms.

For example, a new experiment is registered with:

```bash
$ autosubmit expid -H HPCname -d Description
```

### How to cite

|
|
* D. Manubens-Gil, J. Vegas-Regidor, C. Prodhomme, O. Mula-Valls and F. J. Doblas-Reyes, “Seamless management of ensemble climate prediction experiments on HPC platforms,” 2016 International Conference on High Performance Computing & Simulation (HPCS), Innsbruck, 2016, pp. 895-900. doi: 10.1109/HPCSim.2016.7568429 ([PDF](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=publications:dmanubens_hpcs_2016.pdf))
|
|
|
|
|
|
* BibTeX citation
|
|
|
|
|
|
### Contact persons
|
|
|
|
|
|
Code developed at [Barcelona Supercomputing Center](www.bsc.es) (BSC-CNS).
|
|
|
|
|
|
Developers:
|
|
|
* Domingo Manubens Gil - domingo.manubens@bsc.es (coordinator)
|
|
|
* Javier Vegas-Regidor - javier.vegas@bsc.es
|
|
|
* Larissa Batista Leite - larissa.batista@bsc.es
|
|
|
|
|
|
### Requirements
|
|
|
|
|
|
Autosubmit has been tested with the following Operating Systems:
|
|
|
* Linux Debian
|
|
|
* Linux OpenSUSE
|
|
|
|
|
|
**Pre-requisites**:
|
|
|
|
|
|
- These packages (bash, python2, sqlite3, git-scm > 1.8.2, subversion, dialog*) must be available on the local machine.
- These packages (argparse, dateutil, pyparsing, numpy, pydotplus, matplotlib, paramiko, saga-python, python2-pythondialog*, mock, portalocker) must be available in the Python runtime. The machine needs to be able to access HPC platforms via password-less ssh.
|
|
|
|
|
|
*optional
|
|
|
|
|
|
### [Short examples](http://autosubmit.readthedocs.io/en/latest/introduction.html)
|
|
|
|
|
|
### User guide
|
|
|
|
|
|
* [Introduction](http://autosubmit.readthedocs.io/en/latest/introduction.html)
|
|
|
* [Tutorial](http://autosubmit.readthedocs.io/en/latest/tutorial.html)
|
|
|
* [Installation](http://autosubmit.readthedocs.io/en/latest/installation.html)
|
|
|
* [Usage](http://autosubmit.readthedocs.io/en/latest/usage.html)
|
|
|
* [Defining the workflow](http://autosubmit.readthedocs.io/en/latest/workflows.html)
|
|
|
|
|
|
### Change log
|
|
|
|
|
|
### Dissemination
|
|
|
|
|
|
* Publications
|
|
|
* Lectures
|
|
|
  - [Techniques to improve the experiment throughput with Autosubmit](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=library:seminars:techniques_to_improve_the_throughput.pptx) - Domingo Manubens, 04/04/2017
|
|
|
* Tutorials
|
|
|
- https://earth.bsc.es/wiki/doku.php?id=tools:autosubmit:tutorials
|
|
|
- https://earth.bsc.es/wiki/doku.php?id=tools:autosubmit:past_tutorials
|
|
|
* Presentations
|
|
|
  - [Call for Autosubmit users](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=tools:call_for_autosubmit_users_26_05_.pdf) - Domingo Manubens, 26/05/2016
|
|
|
- [Autosubmit 3.0.0 CFU presentation](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=file:as300.pdf) - Domingo Manubens, 03/12/2014
|
|
|
  - [Autosubmit 3.0.0 training](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=file:as3_training.pdf) - Domingo Manubens, 03/12/2014
  - [Assessment report on Autosubmit, Cylc and ecFlow](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=tools:is-enes2_d93_v1.0_mp.pdf)
|
|
|
|
|
|
### Development
|
|
|
|
|
|
Package available at https://pypi.python.org/pypi/autosubmit
|
|
|
|
|
|
To install the latest Autosubmit development version in a virtual environment:
|
|
|
|
|
|
```
> ssh -X bscesautosubmit01
> module purge
> mkdir -p ~/venvs/as_dev
> virtualenv ~/venvs/as_dev
> source ~/venvs/as_dev/bin/activate
> pip install https://earth.bsc.es/gitlab/es/autosubmit/repository/archive.zip?ref=develop
> autosubmit -v
```
|
|
|
|
|
|
If Autosubmit has been developed further in the meantime, update the virtual environment to the latest version:
|
|
|
|
|
|
```
> ssh -X bscesautosubmit01
> module purge
> source ~/venvs/as_dev/bin/activate
> pip install --upgrade https://earth.bsc.es/gitlab/es/autosubmit/repository/archive.zip?ref=develop
> autosubmit -v
```
|
|
|
|
|
|
### [Branching conventions](https://earth.bsc.es/wiki/lib/exe/fetch.php?media=library:internal:20160711_jcuadrad_nmanuben_jlopez_common_git_branching_strategy.pdf)
|
|
|
|
|
|
### [Style guide](https://earth.bsc.es/wiki/doku.php?id=tools:style_guides:python)