Tools and Best Practices for Distributed Deep Learning with High Performance Computing
Time: Monday, July 29, 1:30pm - 5pm
Description: This tutorial is a practical guide to running distributed deep learning effectively across distributed compute nodes. Deep Learning (DL) has emerged as an effective analysis method and has been adopted quickly across many scientific domains in recent years. Domain scientists are embracing DL both as a standalone data science method and as an effective approach to reducing dimensionality in traditional simulations. However, due to its inherently high computational requirements, the application of DL is limited by the available computational resources. Recently, we have seen the fusion of DL and high-performance computing (HPC): supercomputers show an unparalleled capacity to reduce DL training time from days to minutes, and HPC techniques have been used to speed up parallel DL training. Distributed deep learning therefore has great potential to augment DL applications by leveraging existing high-performance computing clusters. This tutorial consists of three sessions. First, we will give an overview of state-of-the-art approaches to enabling deep learning at scale. The second session is an interactive hands-on session to help attendees run distributed deep learning with resources at the Texas Advanced Computing Center. In the last session, we will focus on best practices for evaluating and tuning performance.
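The core idea behind the data-parallel training approach covered in tutorials like this one can be sketched without any framework: each worker computes a gradient on its own shard of the data, the gradients are averaged across workers (an allreduce), and every worker applies the identical update, keeping model replicas in sync. Below is a minimal pure-Python sketch of this pattern under assumed placeholders (the linear model, data, and worker count are illustrative, not the tutorial's actual material; in practice one would use a framework such as Horovod or PyTorch DistributedDataParallel over MPI/NCCL):

```python
# Sketch of synchronous data-parallel SGD for a 1-D linear model y = w * x
# trained with squared loss. Workers, shards, and the model are illustrative
# placeholders only.

def local_gradient(w, shard):
    """Gradient of mean squared error over one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    """Stand-in for an MPI/NCCL allreduce: average gradients across workers."""
    return sum(grads) / len(grads)

def train(shards, w=0.0, lr=0.01, steps=200):
    for _ in range(steps):
        # Each worker computes its local gradient (in parallel on a real cluster).
        grads = [local_gradient(w, shard) for shard in shards]
        g = allreduce_mean(grads)  # every worker sees the same averaged gradient
        w -= lr * g                # identical update keeps replicas consistent
    return w

# Data drawn from y = 3x, split round-robin across 4 "workers".
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
print(round(train(shards), 3))  # converges toward w = 3.0
```

Because every worker applies the same averaged gradient, the result is mathematically equivalent to single-node SGD on the full batch, which is why this scheme scales training across nodes without changing the optimization being performed.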