Objective
---------

Autosubmit launches and monitors experiments on any platform used at
CFU. This page gives a general description of a typical climate forecast
experiment and of the goals of Autosubmit, a more technical description
of its architecture and how it works, installation instructions, the
user's manual and documentation, and the Autosubmit developers' page.

- [List of experiments](http://enterprise:8000/autosubmit_v2) (only
  accessible from inside the CFU network)
- [Autosubmit-triptic-01-2016.pdf](https://earth.bsc.es/gitlab/es/autosubmit/uploads/c3c51d514d45406efeb06a388de88f89/Autosubmit-triptic-01-2016.pdf)

Description
-----------

### General Description

#### Introduction

A typical climate forecast experiment is a run of a climate model on a
supercomputer, with a forecast length ranging from a few months to a few
years. An experiment may have one or more start-dates, and every
start-date may comprise one or many members. The full forecast period of
the experiment can be divided into a number of chunks of fixed forecast
length by exploiting the restart options of the model. Furthermore, in
terms of computing operations, every chunk can have two main sections: a
parallel section, where the actual model run is performed on the
computing cores of the supercomputer, and one or more serial sections
that perform other necessary operations, such as post-processing the
model output, archiving it, and cleaning the disk space so that the
experiment can proceed smoothly.

![](Experiment_new.png "fig:Experiment_new.png")

The sample experiment above consists of 10 start-dates from 1960 to
2005, one every 5 years, each independent of the others, and each
start-date comprises 5 members. Every member is also independent and is
divided into 10 chunks, which depend on each other. Now suppose that the
forecast length of each chunk is one year and that every chunk comprises
three types of jobs: a simulation (Sim), a post-processing job (Post)
and an archiving and cleaning job (Clean). In this example, one
start-date with one member comprises 30 jobs, and 1500 jobs will be run
in total to complete the experiment. In short, a system is needed to
automate this type of experiment and to optimize the use of resources.

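The job count in this example can be checked with a short calculation.
The sketch below is illustrative only: the job naming scheme is
hypothetical and is not taken from Autosubmit.

```python
from itertools import product

# Values taken from the sample experiment described above.
start_dates = range(1960, 2006, 5)  # 1960, 1965, ..., 2005
members = range(5)
chunks = range(1, 11)
job_types = ("Sim", "Post", "Clean")

# Hypothetical naming scheme, used here only to enumerate the jobs.
job_names = [
    f"{date}_m{member}_c{chunk}_{job_type}"
    for date, member, chunk, job_type in product(
        start_dates, members, chunks, job_types
    )
]

jobs_per_member = len(chunks) * len(job_types)
print(len(start_dates), jobs_per_member, len(job_names))  # prints: 10 30 1500
```
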
#### Goal

Autosubmit is a tool to manage and monitor climate forecasting
experiments that run remotely on supercomputers. It aims to achieve the
following goals:

- Efficient handling of highly dependent jobs
- Optimum utilization of available computing resources
- Ease of starting, stopping and live monitoring of experiments
- Automatic restart of the experiment, or of some part of it, in case of
  failure
- Use of a database for experiment creation and automatic assignment of
  experiment identifiers
- Ability to reproduce completed experiments fully or partially.

![](Autosubmit24.png "Autosubmit24.png")

### Technical Description

#### Introduction

Originally, Autosubmit consisted of a single Perl script (written by
Xavi Abellan\*) that could submit a sequence of jobs with different
parameters to the queue. All the jobs shared a common template, which
Autosubmit filled with different parameter values before submitting the
jobs to the queue. Autosubmit acted as a wrapper around the scheduler:
it monitored the number of jobs submitted or queuing and submitted a new
one as soon as a slot in the queue became available, until the entire
sequence of jobs had been submitted.

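The original mechanism (one template, many parameter sets, a bounded
number of jobs in the queue) can be sketched as follows. This is an
illustration in Python, not the original Perl script: the template text,
`FakeScheduler` and `max_in_queue` are all hypothetical, and the
scheduler is simulated in memory.

```python
from collections import deque
from string import Template

# Hypothetical job template; real templates are scheduler scripts.
TEMPLATE = Template("#!/bin/sh\nrun_model --start $start --member $member\n")


class FakeScheduler:
    """In-memory stand-in for a real queue system (assumption)."""

    def __init__(self):
        self.queued = deque()
        self.finished = []

    def submit(self, script):
        self.queued.append(script)

    def step(self):
        # Pretend that the oldest queued job finishes.
        if self.queued:
            self.finished.append(self.queued.popleft())


def submit_all(parameter_sets, scheduler, max_in_queue=2):
    """Fill the template for each parameter set, keeping the queue bounded."""
    pending = deque(TEMPLATE.substitute(p) for p in parameter_sets)
    while pending:
        # Submit new jobs while there is a free slot in the queue.
        while pending and len(scheduler.queued) < max_in_queue:
            scheduler.submit(pending.popleft())
        scheduler.step()  # a real wrapper would sleep and poll instead
    while scheduler.queued:
        scheduler.step()


params = [{"start": 1960 + 5 * i, "member": 0} for i in range(4)]
scheduler = FakeScheduler()
submit_all(params, scheduler)
print(len(scheduler.finished))
```
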
This concept has been kept in the current Python version of Autosubmit,
with a few capabilities added. The most interesting one is that
Autosubmit can now handle dependencies between jobs (i.e. it can wait
for a particular job to finish before launching the next one).
Autosubmit can manage different types of jobs with different templates.
It can also restart a failed job, and stop the submission process and
later resume where it left off. The Python code has been refactored into
a new object-oriented design, and there is now a new module to create
experiments from scratch and to store a small amount of information in a
SQLite database. Thanks to this, it is also possible to create, manage
and monitor different types of experiments (currently EC-Earth, NEMO and
IFS) and to work with different queue schedulers (such as PBS, SGE and
SLURM).

![](Scheduler.png "Scheduler.png")

#### What is a Job?

A job, in HPC jargon, is a program submitted to the queue system. It can
be serial or multi-threaded, use different types of queues, and carry
any of the directives that the scheduler of the HPC system provides.
Within Autosubmit a Job class has been created, and in the rest of the
documentation the term "Job" refers to the Python object of that class.
A job has several attributes:

- job.name: This name must be unique if several jobs are created.
- job.id: This job id is 0 by construction and is set by the scheduler,
  hence it is only unique once the job has been submitted.
- job.status: The status is updated regularly and tells Autosubmit
  whether a Job is ready to be submitted, completed, queuing, etc.
- job.type: Each job type has a different template, so multi-processor
  and serial jobs, for example, can be treated differently.
- job.failcount: This counter keeps track of the number of times a job
  has failed. At the moment, if a job fails more than 4 times, it is
  cancelled and not resubmitted.

The dependency between jobs is handled with the concept of inheritance.
Each Job has two more attributes:

- job.Children: This is the list of dependent jobs. Those children can
  only be launched once this job is completed.
- job.Parents: This is the list of jobs whose completion this job has to
  wait for. Only when this list is empty can the job be submitted.

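A minimal sketch of such a Job object is shown below. It is not
Autosubmit's actual class: the status names, the `MAX_FAILS` constant
and the method names are assumptions chosen to mirror the attributes
described in this section, and the attribute names are lower-cased to
follow Python conventions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):  # hypothetical status values
    WAITING = "waiting"
    READY = "ready"
    QUEUING = "queuing"
    COMPLETED = "completed"
    FAILED = "failed"


MAX_FAILS = 4  # a job failing more than 4 times is not resubmitted


@dataclass(eq=False)  # compare jobs by identity, not by field values
class Job:
    name: str
    type: str
    id: int = 0  # 0 until the scheduler assigns a real id
    status: Status = Status.WAITING
    failcount: int = 0
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)

    def complete(self):
        """Mark this job completed and release children with no parents left."""
        self.status = Status.COMPLETED
        for child in self.children:
            child.parents.remove(self)
            if not child.parents:
                child.status = Status.READY

    def fail(self):
        """Count a failure; give up after more than MAX_FAILS failures."""
        self.failcount += 1
        self.status = Status.FAILED if self.failcount > MAX_FAILS else Status.READY


sim = Job("sim_1", "Sim", status=Status.READY)
post = Job("post_1", "Post")
sim.add_child(post)
sim.complete()
print(post.status)
```

Removing a finished parent from each child's list reproduces the rule
that a job can only be submitted once its list of parents is empty.
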
#### What is a JobList?

The JobList module groups all the functions necessary for managing a
list of jobs. A joblist object can be sorted by status, type, jobid or
name, and sublists can also be created from it. The updateJobList()
function is called on every loop of Autosubmit and, as its name says,
updates the job list. The status of a job is therefore only guaranteed
to be accurate directly after a call to that function. The SaveJobList()
function saves the joblist in a pickle file, which can then be reloaded,
for example for a restart. Other functions, like updateGenealogy(), are
only called once, after a joblist is created. When the joblist is
created, the dependencies (inheritance) between jobs can only be
expressed with job names. The updateGenealogy() function replaces the
children and parent names with job objects.

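The name-to-object resolution and the pickle round trip described above
might look like the following sketch. The function names echo
updateGenealogy() and SaveJobList(), but the bodies, the `make_job`
helper and the file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def make_job(name, parents=()):
    """Hypothetical job record; parents holds names until resolved."""
    return SimpleNamespace(name=name, parents=list(parents), children=[])


def update_genealogy(jobs):
    """Replace parent names with job objects and fill in the children lists."""
    by_name = {job.name: job for job in jobs}
    for job in jobs:
        job.parents = [by_name[name] for name in job.parents]
        for parent in job.parents:
            parent.children.append(job)


def save_job_list(jobs, path):
    with open(path, "wb") as handle:
        pickle.dump(jobs, handle)


def load_job_list(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


jobs = [
    make_job("sim_1"),
    make_job("post_1", ["sim_1"]),
    make_job("clean_1", ["post_1"]),
]
update_genealogy(jobs)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "job_list.pkl")
    save_job_list(jobs, path)
    restored = load_job_list(path)

print([child.name for child in restored[0].children])
```

Because the whole list is pickled in one call, the parent and child
references between the restored jobs still point at each other.
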
#### General HPCQueue

Autosubmit needs to interact with the queue system regularly to know how
many jobs are in the queue and thus how many jobs can be submitted. The
HPCQueue abstract class provides all the functions necessary to
communicate with the scheduler, so that a job can be checked, cancelled
or submitted at any time and the state of the queue assessed.

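The role of such an abstract class can be sketched as follows. The
method names, the `max_jobs` limit and the `InMemoryQueue` toy subclass
are assumptions made for this example and do not reproduce Autosubmit's
actual HPCQueue interface.

```python
from abc import ABC, abstractmethod


class HPCQueue(ABC):
    """Hypothetical abstract interface to a queue scheduler."""

    def __init__(self, max_jobs):
        self.max_jobs = max_jobs  # assumed limit on jobs allowed in the queue

    @abstractmethod
    def submit_job(self, script):
        """Submit a job script and return its scheduler id."""

    @abstractmethod
    def check_job(self, job_id):
        """Return the current status of a job."""

    @abstractmethod
    def cancel_job(self, job_id):
        """Cancel a job."""

    @abstractmethod
    def jobs_in_queue(self):
        """Return how many jobs are currently queuing."""

    def free_slots(self):
        """Number of jobs that can still be submitted."""
        return max(0, self.max_jobs - self.jobs_in_queue())


class InMemoryQueue(HPCQueue):
    """Toy scheduler kept in a dict, so the sketch runs anywhere."""

    def __init__(self, max_jobs):
        super().__init__(max_jobs)
        self._jobs = {}

    def submit_job(self, script):
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = "queuing"
        return job_id

    def check_job(self, job_id):
        return self._jobs[job_id]

    def cancel_job(self, job_id):
        self._jobs[job_id] = "cancelled"

    def jobs_in_queue(self):
        return sum(status == "queuing" for status in self._jobs.values())


queue = InMemoryQueue(max_jobs=3)
first = queue.submit_job("sim_1.cmd")
queue.submit_job("sim_2.cmd")
print(queue.free_slots())  # prints: 1
```
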
#### Concrete HPCQueue

A concrete queue is a specialization of HPCQueue that inherits all the
functions common to a general queue and adds the attributes and methods
specific to each queue system. Autosubmit currently has concrete modules
to wrap the queue commands of the MareNostrum machines, the Ithaca
cluster and the Lindgren machines (MnQueue, ItQueue and LgQueue). A
concrete queue has several attributes:

- queue.host: This is the host name or IP address used to set up
  connections.
- queue.job\_status: Each job status has a different code depending on
  the queue scheduler, so the responses of each concrete HPCQueue can be
  treated differently.
- queue.submit\_cmd: This is the concrete command to submit jobs.
- queue.checkjob\_cmd: This is the concrete command to check a job
  status.
- queue.cancel\_cmd: This is the concrete command to cancel jobs.

![](Queues.png "Queues.png")

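As an illustration of the attributes of a concrete queue, the sketch
below describes a queue with a few command strings and a status-code
table. It is hypothetical: the class name `ExampleSlurmQueue`, the host
name and the generic status names are assumptions, although `sbatch`,
`squeue` and `scancel` are the standard SLURM commands. Nothing is
executed; the code only builds the ssh command lines.

```python
import shlex


class ExampleSlurmQueue:
    """Hypothetical concrete queue for a SLURM machine."""

    host = "hpc.example.org"  # assumed host name
    submit_cmd = "sbatch"
    checkjob_cmd = "squeue -h -o %t -j"
    cancel_cmd = "scancel"
    # Compact SLURM state codes mapped to generic status names (assumed names).
    job_status = {
        "PD": "queuing",
        "R": "running",
        "CD": "completed",
        "F": "failed",
    }

    def remote(self, command, argument):
        """Build the ssh command line that would run the command on the host."""
        return ["ssh", self.host, f"{command} {shlex.quote(str(argument))}"]

    def parse_status(self, code):
        """Translate a scheduler-specific code into a generic status name."""
        return self.job_status.get(code.strip(), "unknown")


queue = ExampleSlurmQueue()
print(queue.remote(queue.submit_cmd, "sim_1.cmd"))
print(queue.parse_status("PD\n"))
```

Keeping the commands and the status table as plain attributes means
that another scheduler only needs a different set of strings.
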
#### Monitoring the experiment

Additional functionality to monitor an experiment has been added to
Autosubmit. From the joblist, it is possible to create a "tree" to
visualize the status of the joblist. Each status has a different color:
green = running, red = failed, etc.

![](JobListTree.png "JobListTree.png")

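One way to produce such a tree is to emit a Graphviz DOT description in
which every job is a node filled with the color of its status. The
sketch below builds the DOT text by hand and is only an illustration:
apart from green for running and red for failed, the colors and status
names are assumptions, and Autosubmit's own plotting code may differ.

```python
STATUS_COLORS = {
    "running": "green",     # from the text above
    "failed": "red",        # from the text above
    "completed": "yellow",  # assumed
    "queuing": "pink",      # assumed
    "waiting": "gray",      # assumed
}


def to_dot(statuses, dependencies):
    """Return DOT text for jobs (name -> status) and (parent, child) edges."""
    lines = ["digraph joblist {", "  node [style=filled];"]
    for name, status in sorted(statuses.items()):
        color = STATUS_COLORS.get(status, "white")
        lines.append(f'  "{name}" [fillcolor={color}];')
    for parent, child in dependencies:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(
    {"sim_1": "completed", "post_1": "running", "clean_1": "waiting"},
    [("sim_1", "post_1"), ("post_1", "clean_1")],
)
print(dot)
```

The resulting text can be rendered to an image with Graphviz, for
example with `dot -Tpng`.
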
#### Job Wrapper

Supercomputers are currently increasing their number of cores rapidly,
but the rules for using them are also becoming stricter (e.g. a minimum
of 2000 cores per job). This is not feasible with the current state of
EC-Earth, which is difficult to scale beyond a few hundred cores.

In order to provide a solution to the climate community, we have been
running some tests with a job wrapper. The idea is to run several
ensemble members at the same time under the control of a Python script.
We upload the script for each ensemble member we want to run. The
wrapper has to allocate resources for every script to run (e.g. if each
script requires 45 CPUs and we want to run 10 of them, that is 450
CPUs). The wrapping Python script creates a thread for every ensemble
member and runs them.

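The wrapper idea can be sketched with Python threads, one per ensemble
member. This is a simplified illustration, not the wrapper used in the
tests: the member commands are placeholders and `run_member` is a
hypothetical helper. The CPU figure only reproduces the arithmetic
above.

```python
import subprocess
import sys
import threading

CPUS_PER_MEMBER = 45
N_MEMBERS = 10
total_cpus = CPUS_PER_MEMBER * N_MEMBERS  # resources the wrapper must request

results = {}


def run_member(member):
    # Placeholder command; a real wrapper would launch the member's job script.
    command = [sys.executable, "-c", f"print('member {member} done')"]
    completed = subprocess.run(command, capture_output=True, text=True)
    results[member] = completed.returncode


# One thread per ensemble member, all started before any is joined.
threads = [
    threading.Thread(target=run_member, args=(member,))
    for member in range(N_MEMBERS)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(total_cpus, sorted(results))
```

Threads are sufficient here because each one only waits for an external
process, so the members run concurrently.
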
Further information:

1. International Conference on Computational Science (Cairns,
   Australia, June 10 - 12, 2014), Impact of I/O and Data Management in
   Ensemble Large Scale Climate Forecasting Using EC-Earth3.
   ![](Poster_Masif_ICCS_2014.pdf "fig:Poster_Masif_ICCS_2014.pdf")
2. [Asif](:File:masif_procs_2014.pdf "wikilink"), M., A. Cencerrado,
   O. Mula-Valls, D. Manubens, F.J. Doblas-Reyes and A. Cortés (2014).
   Impact of I/O and data management in ensemble large scale climate
   forecasting using EC-Earth3. [Procedia Computer Science, 29,
   2370-2379,
   10.1016/j.procs.2014.05.221](http://www.sciencedirect.com/science/article/pii/S1877050914003986)
   (SPECS, IS-ENES2, INCITE).

<!--
##### Lindgren

![](lindgren-test1-1.png "fig:lindgren-test1-1.png")
![](lindgren-test1-2.png "fig:lindgren-test1-2.png")
![](lindgren-test1-3.png "fig:lindgren-test1-3.png")
![](lindgren-test1-4.png "fig:lindgren-test1-4.png")

##### Jaguar

![](jaguar-test1-1.png "fig:jaguar-test1-1.png")
![](jaguar-test1-2.png "fig:jaguar-test1-2.png")
![](jaguar-test1-3.png "fig:jaguar-test1-3.png")
![](jaguar-test1-4.png "fig:jaguar-test1-4.png")

![](jaguar-test2-1.png "fig:jaguar-test2-1.png")
![](jaguar-test2-2.png "fig:jaguar-test2-2.png")
![](jaguar-test2-3.png "fig:jaguar-test2-3.png")
![](jaguar-test2-4.png "fig:jaguar-test2-4.png")
-->

#### IS-ENES 2

##### CNRM-CM6 monitoring using Autosubmit

A few members of a seasonal forecast experiment using CNRM-CM6 on the
ECMWF IBM Power 7 have been run under Autosubmit monitoring. A
collaboration of a few days at IC3 was sufficient to adapt the existing
CNRM workflow script to the non-intrusive requirements of Autosubmit.
Nevertheless, more comprehensive work would be necessary to fully
exploit the capabilities of Autosubmit to monitor and control the full
workflow (from compilation onwards) on any kind of supercomputer
platform.

The technical report describing the work is available here:
<http://www.cerfacs.fr/globc/publication/technicalreport/2014/autosubmit_cnrm-cm.pdf>

Requirements
------------

### How to deploy/setup Autosubmit (v2)

Autosubmit has been tested with the following operating systems:

- Linux Debian

and on the following HPCs/clusters:

- Ithaca (IC3 machine)
- MareNostrum (BSC machine)
- MareNostrum3 (BSC machine)
- HECToR (EPCC machine)
- Lindgren (PDC machine)
- C2A (ECMWF machine)
- ARCHER (EPCC machine)

Prerequisites: The following packages must be available on the local
machine: python2, python-argparse, python-dateutil, python-pydot,
python-matplotlib and sqlite3. The machine must also be able to access
the HPCs/clusters via password-less ssh.

Create a repository for experiments, for example "/cfu/autosubmit", then
set the repository path in src/dir\_config.py, src/expid.py and
conf/autosubmit.conf.

Create a blank database, for example "autosubmit.db", in the repository
created above, and then run:

`> cd /cfu/autosubmit`\
`> sqlite3 autosubmit.db`\
`sqlite> .read ../../src/autosubmit.sql`\
`> chmod 777 autosubmit.db`

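The same database can also be created from Python with the standard
`sqlite3` module. The sketch below is only an approximate equivalent of
the shell commands above: `create_database` is a hypothetical helper,
and the demonstration uses a temporary directory and a made-up
one-table schema in place of the real src/autosubmit.sql.

```python
import os
import sqlite3
import stat
import tempfile


def create_database(db_path, schema_path):
    """Create db_path, run the SQL script in schema_path, open permissions."""
    with open(schema_path) as handle:
        schema = handle.read()
    connection = sqlite3.connect(db_path)
    try:
        connection.executescript(schema)
        connection.commit()
    finally:
        connection.close()
    os.chmod(db_path, 0o777)  # same effect as `chmod 777`


with tempfile.TemporaryDirectory() as repo:
    schema_path = os.path.join(repo, "autosubmit.sql")
    with open(schema_path, "w") as handle:
        # Made-up schema, for the demonstration only.
        handle.write("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT);")
    db_path = os.path.join(repo, "autosubmit.db")
    create_database(db_path, schema_path)

    # Check that the table exists and that the file mode was changed.
    connection = sqlite3.connect(db_path)
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        )
    ]
    connection.close()
    mode = stat.S_IMODE(os.stat(db_path).st_mode)

print(tables, oct(mode))
```
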
Use
---

- Autosubmit 2.4.1 [documentation](http://autosubmit.ic3.cat)
- Autosubmit 2.4.1 CFU presentation ![](AS241.pdf "fig:AS241.pdf")
  (signed: [Dmanubens](User:Dmanubens "wikilink"), 17:27, 4 July 2014
  (CEST))
- Autosubmit 2.4.0
  [documentation](http://autosubmit.ic3.cat/autosubmit2.4.0)
- Autosubmit 2.3
  [documentation](http://autosubmit.ic3.cat/autosubmit2.3)
- Autosubmit 2.2
  [documentation](http://autosubmit.ic3.cat/autosubmit2.2)
- Autosubmit 2.1
  [documentation](http://autosubmit.ic3.cat/autosubmit2.1)

Repository
----------

To check out a working copy of Autosubmit from within the CFU network:

`> git clone https://dev.cfu.local/autosubmit.git your_path_to_working_copy`

Contact
-------

The coordinator of this project is Domingo Manubens Gil
\<domingo.manubens@ic3.cat\>

Domingo Manubens Gil \<domingo.manubens@ic3.cat\>, Oriol Mula-Valls
\<oriol.mula-valls@ic3.cat\>, Muhammad Asif \<muhammad.asif@ic3.cat\>,
Pierre-Antoine Bretonnière \<pierre-antoine.bretonniere@ic3.cat\>

As a new user, please subscribe to this mailing list:
<http://autosubmit-users.ic3.cat/mailman/listinfo/autosubmit-users>.
You will then have access to the archive of all the emails sent to
users, which present the functions and their available options.

Development
-----------

### SCRUM Framework

- [SCRUM Framework](Tools/SCRUM "wikilink")

### GIT branching scheme

- Since Autosubmit 2.2, templates and postp have been moved to new GIT
  projects. See the following presentations for a better understanding:
  - Autosubmit and GIT: new projects
    ![](ASandGIT.pdf "fig:ASandGIT.pdf")
  - Autosubmit 2.3 and GIT ![](AS23andGIT.pdf "fig:AS23andGIT.pdf")

See the following page for the current branching scheme used within the
GIT project 'autosubmit': [Git branching
scheme](Computing/Git#GIT_branching_scheme "wikilink")

Style Guide
-----------

You can check the style guide for Autosubmit
[here](Tools/StyleGuides/Python "wikilink").

```bash
$ autosubmit expid -H HPCname -d Description
```