Providing a Unified Software Environment for Canada’s National Advanced Computing Centers
Time: Thursday, August 1, 10:30am - 11am
Description: Exploiting an advanced computing platform consisting of several clusters distributed across the second-largest country in the world is challenging. Each cluster may run a different operating system, use a different generation of CPU, GPU, or network fabric, or be managed by a different team of system administrators. Presenting a unified software environment can greatly ease the task of supporting researchers, but it is difficult to implement. This is nevertheless what Compute Canada set out to do in 2016, in the midst of deploying a new generation of large clusters.
We had to find software solutions to the challenges this goal involved. Distribution, portability, and performance were our three main technical criteria. We also had to consider how practical each approach would be for our users, and how reproducible software installations would be when performed by staff located at various sites across Canada.
In this paper, we present the solution that we created, which has allowed Compute Canada to serve the needs of over 10,000 researchers across the country. This solution is used on over 20 different clusters with heterogeneous configurations: on processor architectures ranging from AMD's 2010 Magny-Cours to Intel's 2017 Skylake SP, with or without GPUs, with InfiniBand, Ethernet or OmniPath as the network fabric, and with Slurm or Torque/Moab as the scheduler. The stack gives users a unified software environment with over 600 different scientific applications, available in over 4,000 different combinations of version, compiler and CPU architecture.