Compute Canada is awesome! Seriously, it is! As explained on its website:
> Compute Canada, in partnership with regional organizations ACENET, Calcul Québec, Compute Ontario and WestGrid, leads the acceleration of research and innovation by deploying state-of-the-art advanced research computing (ARC) systems, storage and software solutions. Together we provide essential ARC services and infrastructure for Canadian researchers and their collaborators in all academic and industrial sectors. Our world-class team of more than 200 experts employed by 37 partner universities and research institutions across the country provide direct support to research teams. Compute Canada is a proud ambassador for Canadian excellence in advanced research computing nationally and internationally.1
A couple of years ago, while I was doing my PhD, I had to run hundreds of models and some of the simulations took days! I couldn’t have done this without Compute Canada’s resources. It was a very positive experience for several reasons:
- Compute Canada’s hardware and software are top-notch;
- the documentation is exhaustive and very clear (check out the wiki);
- Compute Canada’s staff offers great support;
- I learned a lot.
At that time, I was in Quebec (Calcul Québec was the regional partner) and had access to Mammoth parallel II (see the wiki for more details about Mp2), which ran under CentOS and used TORQUE as its resource manager. For the record, below is a minimal sketch of the kind of bash script I wrote to deploy all my models (the resource values and file names are illustrative):
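```bash
#!/bin/bash
#PBS -l nodes=2:ppn=24        # nodes and cores per node (illustrative values)
#PBS -l walltime=48:00:00     # maximum run time (illustrative)
#PBS -N simulations
cd "$PBS_O_WORKDIR"           # start from the directory the job was submitted from
module load gnu-parallel      # exact module name may differ
# Hand TORQUE's node list to GNU Parallel so that tasks are spread over all
# allocated nodes and CPUs; run_model.sh and the range 1..100 are hypothetical
# placeholders for my actual script and its arguments.
parallel --sshloginfile "$PBS_NODEFILE" --workdir "$PBS_O_WORKDIR" \
  ./run_model.sh {} ::: $(seq 1 100)
```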
The `#PBS` directives indicate the resources I needed (see this page of the wiki for more details), and I used GNU Parallel to run my script with the right arguments on the different nodes and CPUs.
Now that I am at the University of Guelph, I have access to Graham, which is described on the wiki as follows:
> GRAHAM is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after Wes Graham, the first director of the Computing Centre at Waterloo.
>
> The parallel filesystem and external persistent storage (NDC-Waterloo) are similar to Cedar’s. The interconnect is different and there is a slightly different mix of compute nodes.
>
> The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.2
Graham’s OS is also CentOS, as a quick check on a login node shows (the exact point release below is illustrative):
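```console
$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
```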
The resource manager installed is
Slurm instead of TORQUE, so I had
to use a different script to run my jobs. Before presenting the script I wrote,
I’d like to show how I loaded the software I frequently use. Basically, I first
used `module spider`
to identify the module name and the versions available, for
example:
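```console
$ module spider r        # looking for R; the versions shown are illustrative

------------------------------------------------------------------------
  r:
------------------------------------------------------------------------
     Versions:
        r/3.3.3
        r/3.4.0
        r/3.4.3
```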
Once all the modules I needed were identified, I loaded them all:
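```console
$ module load gcc/5.4.0 r/3.4.3    # module names and versions are illustrative
```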
and then saved them so that I don’t have to reload them every time I log in:
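```console
$ module save            # save the currently loaded modules as the default collection
$ module restore         # bring them back, e.g. at the next login
```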
Two other commands are pretty useful: `module list`, which lists the loaded modules, and `module unload`, which unloads a loaded module. For example:
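```console
$ module list                # show the currently loaded modules
$ module unload r/3.4.3      # unload one of them (version illustrative)
```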
That being done, and after installing specific R packages, I read the wiki page “Running jobs” to write a bash script along these lines (`launch.sh`; the account name and resource values are illustrative):
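```bash
#!/bin/bash
#SBATCH --account=def-someuser    # replace by your own account
#SBATCH --time=24:00:00           # walltime (illustrative)
#SBATCH --mem-per-cpu=4G          # memory per CPU (illustrative)
#SBATCH --array=1-12              # twelve jobs, each receiving one value of 1..12
module load gcc/5.4.0 r/3.4.3     # illustrative module versions
Rscript launcher.R $SLURM_ARRAY_TASK_ID
```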
And then typed `sbatch launch.sh` to send the job to the scheduler. Note that `squeue -u username` lets you monitor your job(s). Worked like a charm 😄! The `#SBATCH` directives describe the resources required, and `$SLURM_ARRAY_TASK_ID` distributes the values 1, 2, …, 12 across the jobs of the array, so that each job (and thus each CPU) gets a unique value; that value is the input of the R script `launcher.R`, which determines which simulation is run. A day of computation and it was done!
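In practice, submitting and monitoring looked something like this (the job ID is illustrative):

```console
$ sbatch launch.sh
Submitted batch job 123456
$ squeue -u username
```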
To conclude this note, I’d like to mention a very helpful table that maps the main commands across the different resource managers used on Compute Canada’s servers: https://slurm.schedmd.com/rosetta.pdf. I found it very useful since I had some knowledge of TORQUE but not of Slurm. I’d also like to point the reader to the wiki page https://docs.computecanada.ca/wiki/R, which explains how R users can run parallel jobs.