How to Accelerate Your Big Data and Associated Deep Learning Applications with Hadoop and Spark?
Event Type: Tutorial
Time: Monday, July 29, 8:30am - 12pm
Description: Apache Hadoop and Spark are gaining prominence in handling Big Data analytics. Recent studies have shown that default Hadoop and Spark cannot efficiently leverage high-performance networking and storage architectures, such as Remote Direct Memory Access (RDMA)-enabled interconnects and heterogeneous storage systems (e.g., HDD, SSD, NVMe-SSD, and Lustre). These middleware stacks are traditionally written with sockets and do not deliver the best performance on modern high-performance networks.

In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, etc.) and Spark. We will examine the challenges in redesigning the networking and I/O components of these middleware stacks for modern interconnects and protocols (such as InfiniBand and RoCE) with RDMA, as well as for modern storage architectures. Using the publicly available software packages from the High-Performance Big Data (HiBD) project, we will present case studies of the new designs for several Hadoop/Spark components and their associated benefits. Through these case studies, we will also examine the interplay among high-performance interconnects, high-speed storage systems, and multi-core platforms in achieving the best solutions for these components, Big Data processing, and Deep Learning applications on modern HPC clusters.

This tutorial will include hands-on sessions with Hadoop and Spark on the SDSC Comet supercomputer.
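For context on what the hands-on sessions involve, the sketch below shows a minimal stock (non-RDMA) HDFS configuration of the kind attendees typically start from. Only standard Apache Hadoop properties are used; the hostname, port, and directory path are placeholders, not values from this tutorial. The RDMA-enhanced HiBD packages layer their own tuning parameters on top of a baseline configuration like this (those package-specific parameters are not reproduced here; see the HiBD documentation).

```xml
<!-- core-site.xml: tells Hadoop clients and daemons where the HDFS NameNode runs -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- placeholder hostname and port -->
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: basic HDFS storage settings -->
<configuration>
  <property>
    <!-- number of replicas per block; 1 is common for single-node test setups -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- placeholder local directory where DataNodes store block data -->
    <name>dfs.datanode.data.dir</name>
    <value>/data/hdfs/datanode</value>
  </property>
</configuration>
```

With a configuration along these lines in place, formatting the NameNode and starting HDFS gives the storage layer that MapReduce and Spark jobs in the hands-on session would read from and write to.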