Harmony: A Harness Monitoring System for the Oak Ridge Leadership Computing Facility
Event Type: Cluster Management
Time: Wednesday, July 31, 11am - 11:30am
Location: Regency AB
Description: Acceptance of a new system requires extensive testing and often comprises hundreds of tests. Summit, the latest flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), and the number one system in the Top500 list, completed its acceptance testing in 2018. To execute acceptance, the acceptance test (AT) team utilizes the OLCF test harness, a tool developed at the OLCF that automates the launch and verification of all acceptance tests. Acceptance requires analysis of test results and classification of all test failures. The sheer number of tests involved makes performing these tasks challenging. To complete these tasks more efficiently and to lessen the personnel burden during acceptance testing, we developed a harness monitoring system for the OLCF test harness called Harmony.

Harmony consists of three distinct modules: monitoring, recording, and reporting.

Harmony’s monitoring module ensures that tests launched by the harness are progressing in the job queue and are restarted correctly after any failure. The recording module ingests results generated by the test harness into a database, and the reporting module provides a Django-based website for filtering tests, allowing staff to analyze, describe, and categorize any test failure.
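The record-then-categorize flow described above can be illustrated with a minimal sketch. It is not Harmony's actual code: the table layout, column names, and function names are all hypothetical, and the standard-library `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: the abstract does not describe Harmony's real tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_results (
    id INTEGER PRIMARY KEY,
    app TEXT NOT NULL,
    test TEXT NOT NULL,
    job_id TEXT,
    status TEXT NOT NULL,          -- 'pass' or 'fail' in this sketch
    failure_category TEXT          -- filled in later by staff
)
"""


def open_db(path=":memory:"):
    """Open the database (SQLite here, standing in for MySQL) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_result(conn, app, test, job_id, status):
    """Recording step: store one harness result."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO test_results (app, test, job_id, status) "
            "VALUES (?, ?, ?, ?)",
            (app, test, job_id, status),
        )


def uncategorized_failures(conn):
    """Reporting step: list failed tests that nobody has classified yet."""
    rows = conn.execute(
        "SELECT id, app, test FROM test_results "
        "WHERE status = 'fail' AND failure_category IS NULL ORDER BY id"
    )
    return rows.fetchall()


def categorize_failure(conn, result_id, category):
    """Attach a category to a failed test; returns False if no failed row matched."""
    with conn:
        cur = conn.execute(
            "UPDATE test_results SET failure_category = ? "
            "WHERE id = ? AND status = 'fail'",
            (category, result_id),
        )
    return cur.rowcount == 1
```

In this sketch, a staff member would call `uncategorized_failures` to find work remaining and `categorize_failure` to classify each entry; in Harmony itself, the equivalent filtering is done through the Django-based website.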

Harmony is developed mainly in Python, uses MySQL and Django, and leverages the Slack and LSF APIs. The code is open source and publicly available. This paper presents Harmony’s design and describes the modules it provides. Harmony's modular design allows it to be customized for other testing purposes and to be used at different HPC centers.