Download data with a dry-run simulation

Follow the steps below to perform a HEMCO standalone dry-run simulation:

Complete preliminary setup

Make sure that you have done the following steps:

Run the executable with the `--dryrun` flag

Run the HEMCO standalone executable file at the command line with the --dryrun command-line argument as shown below:

$ ./hemco_standalone -c HEMCO_sa_Config.rc --dryrun | tee log.dryrun

The tee command sends the dry-run output to the screen and also writes it to a file named log.dryrun.

The log.dryrun file will look somewhat like a regular HEMCO standalone log file but will also contain a list of data files and whether each file was found on disk or not. This information will be used by the download_data.py script in the next step.

You may use whatever name you like for the dry-run output log file (but we prefer log.dryrun).
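
If you would rather launch the dry-run from a Python script than from the shell, the following is a minimal sketch of the same "screen plus log file" behavior. The function name `run_and_tee` is hypothetical and is not part of HEMCO. Unlike a plain `| tee`, this sketch also captures standard error in the log.

```python
import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, echoing each output line to the screen and to log_path.

    Hypothetical helper (not part of HEMCO). Returns the exit status.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        with subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
        ) as proc:
            for line in proc.stdout:
                sys.stdout.write(line)  # to the screen
                log.write(line)         # to the log file
            return proc.wait()
```

From the HEMCO run directory, the command shown above would then correspond to `run_and_tee(["./hemco_standalone", "-c", "HEMCO_sa_Config.rc", "--dryrun"], "log.dryrun")`.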

Run the `download_data.py` script on the dryrun log file

Once you have successfully executed a HEMCO standalone dry-run, you can use the output from the dry-run (contained in the log.dryrun file) to download the data files that the HEMCO standalone will need to perform the corresponding “production” simulation. You will download data from the GEOS-Chem Input Data portal.

Initialize the GCPy Python environment

You will need to activate a Python environment before you can start downloading data. We recommend using the Python environment for GCPy, as it has all of the relevant packages installed. If you installed GCPy from PyPI, then no further action is needed. On the other hand, if you installed GCPy from conda-forge, you will need to activate the GCPy Python environment with this command:

$ conda activate gcpy_env
(gcpy_env) $

Activating the environment adds the prefix (gcpy_env) to the command prompt. This is a visual cue to remind you that the environment is active.
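
Besides the prompt prefix, a script can check which conda environment is active. The sketch below relies on the `CONDA_DEFAULT_ENV` environment variable, which `conda activate` sets to the name of the activated environment. The function name `active_conda_env` is hypothetical, and the check does not apply if you installed GCPy from PyPI without conda.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    Hypothetical helper: reads CONDA_DEFAULT_ENV, which `conda activate`
    sets. Pass a dict as `environ` to test without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return env.get("CONDA_DEFAULT_ENV") or None
```

With the environment from this guide active, `active_conda_env()` should return `"gcpy_env"`.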

Run the download_data.py script

Navigate to the HEMCO run directory where you executed the dry-run simulation. You will use the download_data.py script to transfer data to your machine. The command you will use takes this form:

(gcpy_env) $ ./download_data.py log.dryrun PORTAL-NAME

where:

download_data.py is the dry-run data download program (written in Python). It is included in each HEMCO standalone run directory that you create.
log.dryrun is the log file from your HEMCO standalone dry-run simulation.

PORTAL-NAME specifies the data portal that you wish to download from. Allowed values are:

Allowed values for the `PORTAL-NAME` argument to `download_data.py`
Value	Downloads from portal	With this command	Via this method
geoschem+aws	GEOS-Chem Input Data	aws s3 cp	AWS CLI
geoschem+http	GEOS-Chem Input Data	wget	HTTP
rochester	GCAP 2.0 met data @ Rochester	wget	HTTP
skip-download	Skips data download altogether	N/A	N/A
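
For scripting purposes, the table above can be written as a small lookup structure. This is only an illustrative sketch: the names `PORTALS` and `describe_portal` are hypothetical, and nothing here is meant to reflect how `download_data.py` stores these values internally.

```python
# Values copied from the table above: name -> (portal, command, method).
PORTALS = {
    "geoschem+aws": ("GEOS-Chem Input Data", "aws s3 cp", "AWS CLI"),
    "geoschem+http": ("GEOS-Chem Input Data", "wget", "HTTP"),
    "rochester": ("GCAP 2.0 met data @ Rochester", "wget", "HTTP"),
    "skip-download": ("Skips data download altogether", None, None),
}


def describe_portal(name):
    """Return a one-line description of a PORTAL-NAME value.

    Raises ValueError for a value that is not listed in the table.
    """
    if name not in PORTALS:
        allowed = ", ".join(sorted(PORTALS))
        raise ValueError(f"unknown portal {name!r}; allowed values: {allowed}")
    portal, command, method = PORTALS[name]
    if command is None:
        return f"{name}: no download"
    return f"{name}: {portal} via {method} ({command})"
```

Validating the argument this way before calling the script catches a mistyped portal name early.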

For example, to download data from the GEOS-Chem Input Data portal over HTTP, use this command:

(gcpy_env) $ ./download_data.py log.dryrun geoschem+http

But if you have AWS CLI (command-line interface) set up on your machine, use this command instead:

(gcpy_env) $ ./download_data.py log.dryrun geoschem+aws

This results in a much faster data transfer than HTTP. It is also the command to use if you are running the HEMCO standalone on an AWS EC2 cloud instance.
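
The choice between the two portal names can be sketched in Python as follows. The function names `pick_portal` and `download_command` are hypothetical. Note the assumption: finding an `aws` executable on the search path only shows that the AWS CLI is installed, not that it is configured correctly.

```python
import shutil


def pick_portal(which=shutil.which):
    """Return "geoschem+aws" if an `aws` executable is found, else "geoschem+http".

    Assumption: an `aws` executable on PATH is taken to mean the AWS CLI
    is usable. `which` can be replaced with a stub for testing.
    """
    return "geoschem+aws" if which("aws") else "geoschem+http"


def download_command(log_file="log.dryrun", portal=None):
    """Build the argument list for the download command described above."""
    return ["./download_data.py", log_file, portal or pick_portal()]
```

The resulting list can be passed to `subprocess.run` from the HEMCO run directory.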

(Optional) Examine the log of unique data files

The download_data.py script will generate a log of unique data files (i.e. with all duplicate listings removed), which looks similar to this:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! LIST OF (UNIQUE) FILES REQUIRED FOR THE SIMULATION
!!! Start Date       : 20190701 000000
!!! End Date         : 20190701 010000
!!! Simulation       : fullchem
!!! Meteorology      : MERRA2
!!! Grid Resolution  : 4.0x5.0
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
./HEMCO_Config.rc
./HEMCO_Config.rc.gmao_metfields
./HEMCO_Diagn.rc
./HISTORY.rc
./Restarts/GEOSChem.Restart.20190701_0000z.nc4 --> /home/ubuntu/ExtData/GEOSCHEM_RESTARTS/GC_14.5.0/GEOSChem.Restart.fullchem.20190701_0000z.nc4
./Restarts/HEMCO_restart.201907010000.nc
./geoschem_config.yml
/path/to/ExtData/CHEM_INPUTS/CLOUD_J/v2024-09/FJX_j2j.dat
/path/to/ExtData/CHEM_INPUTS/CLOUD_J/v2024-09/FJX_scat-aer.dat
/path/to/ExtData/CHEM_INPUTS/CLOUD_J/v2024-09/FJX_scat-cld.dat
/path/to/ExtData/CHEM_INPUTS/CLOUD_J/v2024-09/FJX_scat-ssa.dat
/path/to/ExtData/CHEM_INPUTS/CLOUD_J/v2024-09/FJX_spec.dat
/path/to/ExtData/CHEM_INPUTS/FastJ_201204/fastj.jv_atms_dat.nc
/path/to/ExtData/CHEM_INPUTS/Linoz_200910/Linoz_March2007.dat
/path/to/ExtData/CHEM_INPUTS/Olson_Land_Map_201203/Olson_2001_Drydep_Inputs.nc
/path/to/ExtData/CHEM_INPUTS/UCX_201403/NoonTime/Grid4x5/InitCFC_JN2O_01.dat

 ... etc ...

The name of this “unique” log file will be the same as the log file with dry-run output, with .unique appended. In our above example, we passed log.dryrun to download_data.py, so the “unique” log file will be named log.dryrun.unique. This “unique” log file can be very useful for documentation purposes.
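
The naming rule and the idea of removing duplicate listings can be illustrated with a short sketch. Both function names are hypothetical. Returning the paths in sorted order is an assumption based on the sample listing above, not on the script's source. Extracting the paths from the dry-run log is left out because the log format is not described here.

```python
def unique_log_name(dryrun_log):
    """Return the "unique" log name: the dry-run log name plus ".unique"."""
    return dryrun_log + ".unique"


def unique_paths(paths):
    """Drop duplicate entries and return the remaining paths sorted.

    Assumption: sorted order, as the sample listing appears to be sorted.
    """
    return sorted(set(paths))
```

For example, `unique_log_name("log.dryrun")` gives `"log.dryrun.unique"`, which matches the file name described above.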

If you wish to produce only the log of unique data files without downloading any data, then use skip-download in place of the PORTAL-NAME when running download_data.py:

(gcpy_env) $ ./download_data.py log.dryrun skip-download

You can also abbreviate the command to:

(gcpy_env) $ ./download_data.py log.dryrun skip

This can be useful if you already have the necessary data downloaded to your system but wish to create the log of unique files for documentation purposes.
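
If you want to confirm that the listed files are present on disk, the “unique” log can be read back in with a few lines of Python. This is a sketch based only on the sample listing shown earlier: it assumes that header lines start with `!!!`, that every other non-blank line is a file path, and that some entries take the form `local --> target`. The function names are hypothetical.

```python
import os


def read_unique_log(path):
    """Return (local_path, target_or_None) pairs from a "unique" log file."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines and the "!!!" header banner.
            if not line or line.startswith("!!!"):
                continue
            # Some entries have the form "local --> target".
            local, sep, target = line.partition(" --> ")
            entries.append((local.strip(), target.strip() if sep else None))
    return entries


def missing_files(entries):
    """Return the local paths that do not exist on disk."""
    return [local for local, _ in entries if not os.path.exists(local)]
```

Relative paths such as `./HEMCO_Config.rc` are resolved against the current directory, so run `missing_files(read_unique_log("log.dryrun.unique"))` from the HEMCO run directory.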

Deactivate the GCPy Python environment

Once you have downloaded all of the data needed for your HEMCO standalone simulation, you can deactivate the GCPy Python environment.

(gcpy_env) $ conda deactivate
$

This will remove the (gcpy_env) prefix from the command prompt.