Shared Memory High-Throughput Computing with Apache Arrow (TM)
Time: Tuesday, July 30, 6:30pm - 8:30pm
Location: Crystal Foyer and Crystal B
Description: As languages like Python and their libraries have lowered the barriers to entry in scientific computing, demand has grown in step for more sophisticated and capable frameworks that provide almost turn-key functionality (see Keras [1]).

For researchers who use the pilot-job [2], many-task [3] design pattern for high-throughput computing [4] on modern systems, the Apache Arrow project (Apache Software Foundation, 2019), with its now-included Plasma in-memory object store, provides a high-level interface for sharing data structures between processes without serializing or copying the data.

This poster outlines a common scenario in which the way memory is managed imposes a constraint on the workflow. Sharing memory directly through a common middle data layer allows accelerated workflows that operate at a more rapid cadence. A real-world example is provided; the goal is to raise awareness among facilitators at other major research computing centers who work with users to architect similar high-throughput data pipelines.
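
The idea of sharing data between processes without serialization or copying can be sketched with Python's standard-library multiprocessing.shared_memory module. This is a stand-in chosen for illustration, not the Plasma object store and not the poster's implementation; the function names publish and consume_sum are hypothetical. A producer writes values into a named segment once, and any worker that knows the name maps the same bytes and reads them in place.

```python
from multiprocessing import shared_memory


def publish(values):
    """Producer side (hypothetical helper): copy a non-empty list of floats
    into a new named shared-memory segment and return its handle."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    view = shm.buf.cast("d")  # reinterpret the raw bytes as C doubles, no copy
    try:
        for i, v in enumerate(values):
            view[i] = v
    finally:
        view.release()  # every derived view must be released before close()
    return shm


def consume_sum(name, count):
    """Worker side (hypothetical helper): attach to an existing segment by
    name and sum its first `count` doubles, reading them in place."""
    shm = shared_memory.SharedMemory(name=name)  # attach; nothing is copied
    view = shm.buf.cast("d")
    try:
        return sum(view[i] for i in range(count))
    finally:
        view.release()
        shm.close()  # detach only; the owner is responsible for unlink()


if __name__ == "__main__":
    segment = publish([1.0, 2.0, 3.5])
    try:
        # In a pilot-job setting, segment.name would be handed to worker
        # processes, each of which would call consume_sum on its own.
        print(consume_sum(segment.name, 3))
    finally:
        segment.close()
        segment.unlink()  # free the segment once all workers are done
```

Only the short segment name crosses the process boundary, so the cost of handing data to a worker does not grow with the size of the data. An object store such as Plasma adds object identifiers, lifetime management, and typed Arrow data on top of this kind of mapping.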