Managing a Heterogeneous Cluster
Time: Tuesday, July 30, 1:30pm - 2pm
Description: Most HPC clusters are purchased as a large quantity of identical hardware, which is maintained through its lifecycle until another HPC cluster takes its place. However, some clusters, like ours, grow by frequently adding new hardware, which is then integrated into the existing system. Over the years, the cluster has grown to include 339 compute nodes with 8220 cores from 6 vendors, spanning 4 generations of CPUs; 7 network technologies from 6 switch vendors (1GbE to 100Gb OPA); 102 GPUs (3 different GPU models); 28 storage nodes (3.15 PB raw storage); and 7 virtualization nodes hosting 65 VMs.
Having such a diverse system has significant advantages, although it makes management more difficult. This paper outlines our strategy for managing this very heterogeneous and complex system. Topics covered include software optimization, consistency of operating system updates, identity management, resource prioritization, network infrastructure, storage, and management of non-compute-intensive resources. The combination of open source and internally developed software we use to manage this cluster can serve as a model for other heterogeneous systems and for smaller clusters that have not expanded because of management worries.