Leveraging XDMoD Job Statistics Data to Predict Performance
Time: Tuesday, July 30, 6:30pm - 8:30pm
Location: Crystal Foyer and Crystal B
Description: A collection of HPC resources, XSEDE was responsible for 24.8 million jobs in 2018, accounting for 7.8 billion CPU hours at an average of 72 processors per job. User allocations on XSEDE High Performance Computing resources are made based on code performance and on projections for the planned simulations in a project. Metrics on XSEDE jobs are collected via XDMoD, an open-source tool leveraging the SUPReMM job performance module, and are available to aid users in analyzing their performance. The SUPReMM data collection is extensive, with CPU (various performance counters), memory, IO (block device statistics, Lustre and NFS filesystem info), and system info collected at 10-minute intervals. Typically, users do not delve into the details of their jobs, which can number in the thousands. The goal of this work is to propose and demonstrate the potential use of this data for predicting job performance, and to analyze the feasibility of predicting the overall CPU utilization of a job from a fraction of the temporal data collected early in a run. Potentially, this can flag slow-running jobs in their early stages and make resource usage more efficient. The analysis is performed on a dataset of Comet jobs longer than 1 hour of wall clock time (to provide enough time samples). Given the extensive list of parameters, a PCA analysis is performed for feature reduction. Several regression approaches and deep-neural-network-based models are considered. The TensorFlow and sklearn machine learning frameworks are used in this work.
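The PCA-plus-regression pipeline described above can be sketched in sklearn. This is a minimal illustration, not the authors' actual code: the data here is synthetic (hypothetical latent "job behavior" factors standing in for SUPReMM metrics sampled early in a run), and Ridge regression is used as one representative of the several regression approaches mentioned.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for early-run SUPReMM data: 500 jobs described by
# 40 correlated raw metrics (CPU counters, memory, IO stats) that are
# driven by 5 hypothetical latent factors.
Z = rng.normal(size=(500, 5))              # latent job-behavior factors
W = rng.normal(size=(5, 40))               # loadings onto 40 raw metrics
X = Z @ W + 0.1 * rng.normal(size=(500, 40))

# Hypothetical target: overall CPU utilization of the completed job.
y = Z[:, 0] + 0.05 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# PCA reduces the extensive parameter list to a few components,
# then a regressor predicts final utilization from them.
model = make_pipeline(PCA(n_components=5), Ridge())
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
print(f"held-out R^2: {r2:.2f}")
```

On real SUPReMM data the feature matrix would instead hold per-job summaries of the first fraction of each time series, and the number of retained components would be chosen from the explained-variance curve rather than fixed in advance.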