Parallelization of EarthDiags jobs
Hi @jvegas !
As you may remember, I need to CMORize and compute diagnostics for an experiment (a0dc) that comprises many members (25) and many chunks (504). Since it is currently not possible to start the CMORization at a given chunk (issues #17 (closed) and #14 (closed)), I'm trying to find other workarounds to run diagnostics on this experiment. Indeed, I have tried CMORizing chunks up to 260, and after three days it has still not finished...
So my idea was (actually, it was @pabretonniere's idea, so I think it's a good one :-) ) to launch 25 instances of earthdiags.py simultaneously, each with a different namelist whose MEMBERS variable is set to 1, then 2, then 3, ... up to 25.
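To make the setup concrete, here is a minimal Python sketch of the idea (it is not the actual process_a0dc.sh script). It assumes the namelist is a text file containing a single line of the form `MEMBERS = ...`; the helper names `write_member_configs` and `run_all`, and the output file names, are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Assumption: the namelist has exactly one "MEMBERS = ..." line.
MEMBERS_LINE = re.compile(r"^([ \t]*MEMBERS[ \t]*=[ \t]*).*$", re.MULTILINE)


def write_member_configs(template_text, members, out_dir):
    """Write one namelist per member, each with MEMBERS set to that single member."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for member in members:
        # Keep the "MEMBERS =" prefix and replace only the value.
        text, count = MEMBERS_LINE.subn(
            lambda m: m.group(1) + str(member), template_text
        )
        if count != 1:
            raise ValueError("expected exactly one MEMBERS line in the template")
        path = out_dir / "diags_member_{0}.conf".format(member)  # hypothetical name
        path.write_text(text)
        paths.append(path)
    return paths


def run_all(commands):
    """Start every command at once, then wait; return the exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Each generated file would then be handed to its own earthdiags.py process through `run_all`. How the namelist is passed on the command line is deliberately left out here, since it depends on the options earthdiags.py accepts.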
However, when I do this, it looks like EarthDiags can only CMORize one member at a time. Indeed, one member is being CMORized correctly (the last one submitted), but the others fail with the following error:
a0dc_1d_19740101_19740131_grid_T.nc.gz
a0dc_1d_19740101_19740131_icemod.nc.gz
a0dc_1m_19740101_19740131_grid_T.nc.gz
a0dc_1m_19740101_19740131_grid_U.nc.gz
a0dc_1m_19740101_19740131_grid_V.nc.gz
a0dc_1m_19740101_19740131_icemod.nc.gz
Unzipping /scratch/Earth/fmassonn/diags/a0dc/CMOR/a0dc_1m_19740101_19740131_grid_V.nc.gz
gzip: /scratch/Earth/fmassonn/diags/a0dc/CMOR/a0dc_1m_19740101_19740131_grid_V.nc.gz: No such file or directory
Traceback (most recent call last):
  File "./earthdiags.py", line 340, in <module>
    main()
  File "./earthdiags.py", line 336, in main
    EarthDiags.parse_args()
  File "./earthdiags.py", line 113, in parse_args
    diags.run()
  File "./earthdiags.py", line 147, in run
    self.data_manager.prepare()
  File "/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/cmormanager.py", line 300, in prepare
    self._cmorize_member(startdate, member)
  File "/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/cmormanager.py", line 321, in _cmorize_member
    cmorizer.cmorize_ocean()
  File "/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/cmorizer.py", line 62, in cmorize_ocean
    self._cmorize_ocean_files('MMO')
  File "/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/cmorizer.py", line 73, in _cmorize_ocean_files
    self._unpack_tar_file(tarfile)
  File "/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/cmorizer.py", line 98, in _unpack_tar_file
    Utils.unzip(glob.glob(os.path.join(self.cmor_scratch, '*.gz')))
  File "/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/utils.py", line 583, in unzip
    raise Utils.UnzipException('Can not unzip {0}: {1}'.format(filepath, ex))
earthdiagnostics.utils.UnzipException: Can not unzip /scratch/Earth/fmassonn/diags/a0dc/CMOR/a0dc_1m_19740101_19740131_grid_V.nc.gz: ('Error executing {0}\n Return code: {1}', 'gunzip /scratch/Earth/fmassonn/diags/a0dc/CMOR/a0dc_1m_19740101_19740131_grid_V.nc.gz', 2)
I think this is related to the fact that the output file names don't carry any information about the member (see the end of the message above), so files are probably overwritten or removed by one member while another is still reading them.
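The suspected collision can be reproduced outside EarthDiags with a small Python sketch. It relies only on what the traceback shows (a `*.gz` glob over one scratch directory) and on the fact that gunzip removes the `.gz` file once it has decompressed it. The `member_scratch` helper and the `member_NN` directory layout are hypothetical, shown as one possible way to keep members apart, not something EarthDiags is known to provide.

```python
import glob
import os
import tempfile


def member_scratch(base_scratch, member):
    """Return (and create) a scratch subdirectory private to one member (hypothetical layout)."""
    path = os.path.join(base_scratch, "member_{0:02d}".format(member))
    os.makedirs(path, exist_ok=True)
    return path


def pending_archives(scratch_dir):
    """List the .gz files a process would try to unzip in scratch_dir."""
    return sorted(glob.glob(os.path.join(scratch_dir, "*.gz")))


def demo():
    name = "a0dc_1m_19740101_19740131_grid_V.nc.gz"
    base = tempfile.mkdtemp()

    # Shared directory: both processes see the same single file.
    open(os.path.join(base, name), "wb").close()
    seen_by_a = pending_archives(base)       # process A lists the archives
    os.remove(os.path.join(base, name))      # process B unzips first; the .gz disappears
    shared_ok = all(os.path.exists(p) for p in seen_by_a)

    # Private directories: each process only touches its own copy.
    dir_a, dir_b = member_scratch(base, 1), member_scratch(base, 2)
    for directory in (dir_a, dir_b):
        open(os.path.join(directory, name), "wb").close()
    seen_by_a = pending_archives(dir_a)
    os.remove(os.path.join(dir_b, name))     # process B only removes its own file
    private_ok = all(os.path.exists(p) for p in seen_by_a)

    return shared_ok, private_ok


if __name__ == "__main__":
    print(demo())
```

In the shared case, process A's file list goes stale as soon as process B finishes with the same file name, which would match the "No such file or directory" message from gzip above.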
Do you have any advice on how I could proceed to parallelize my jobs?
The script I'm using is here: /home/Earth/fmassonn/git-stuff/earthdiagnostics/process_a0dc.sh
The logs (per member) are here:
/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/log_1 (for member 1, crashed)
/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/log_2 (for member 2, crashed)
/home/Earth/fmassonn/git-stuff/earthdiagnostics/earthdiagnostics/log_3 (for member 3, success)
Many thanks for your help,
François