A Resource Utilization Analytics Platform Using Grafana and Telegraf for the Savio Supercluster
Time: Wednesday, July 31, 12pm - 12:30pm
Description: Understanding utilization patterns on a high performance computing cluster is key to decision making and efficient resource allocation. Cluster utilization statistics are useful to both administrators and users: they help identify potential issues, plan jobs to minimize time spent in the queue, and identify constrained resources where future funding might be focused. Much of the existing data is accessible from the command line, but presenting the statistics visually allows for easier identification of trends and interactive exploration. Programs such as XDMoD perform a similar function, but are not as well suited to the use cases of smaller clusters. Instead, we use a stack of open source software, which allows for a high degree of flexibility, including multiple database backends and more user-friendly querying and visualization. Because the stack is built on open source software, similar stacks can be deployed at other clusters to fit their existing infrastructure and needs. We present the workflow used by Berkeley Research Computing to consolidate existing data, collect additional utilization information, and display the relevant charts for different use cases on the Savio supercluster. We have begun collecting, integrating, and displaying job information from Slurm, account association information, and CPU metrics collected with Telegraf. With this data, the Grafana visualization framework can present everything from broad summary statistics, such as aggregated usage by campus department, down to the CPU usage on a single node over the course of a job, with flexibility for a wide variety of possible needs and use cases.
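As an illustration of the kind of aggregation described above, the following is a minimal sketch, not the Berkeley Research Computing implementation. It assumes job records arrive as pipe-delimited lines of account, allocated CPUs, and elapsed time (the shape produced by something like `sacct --parsable2 --noheader --format=Account,AllocCPUS,Elapsed`), and that a separate mapping from account to department is available. The account names, department names, and function names are hypothetical.

```python
from collections import defaultdict


def parse_elapsed(elapsed: str) -> int:
    """Convert a Slurm-style elapsed time, [D-]HH:MM:SS, to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_hours_by_department(job_lines, account_to_department):
    """Sum CPU-hours per department from 'account|cpus|elapsed' lines."""
    totals = defaultdict(float)
    for line in job_lines:
        account, alloc_cpus, elapsed = line.strip().split("|")
        # Accounts missing from the mapping are grouped under "unknown".
        department = account_to_department.get(account, "unknown")
        totals[department] += int(alloc_cpus) * parse_elapsed(elapsed) / 3600
    return dict(totals)


# Hypothetical sample data, for illustration only.
jobs = [
    "fc_alpha|20|01:30:00",
    "fc_alpha|4|1-00:00:00",
    "fc_beta|8|00:15:00",
]
departments = {"fc_alpha": "Physics", "fc_beta": "Statistics"}

usage = cpu_hours_by_department(jobs, departments)
print(usage)
```

In a deployed stack, totals like these would more likely be computed by a query against the database backend and charted in Grafana; the sketch only shows the join between job records and account associations that such a summary relies on.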