CHURP: A lightweight CLI framework to enable novice users to analyze sequencing datasets in parallel
Time: Tuesday, July 30, 6:30pm - 8:30pm
Location: Crystal Foyer and Crystal B
Description: Progressive decreases in the cost of DNA sequencing have contributed to a decades-long exponential increase in the production of new sequencing datasets. The processing of these datasets has in turn led biology, a field that has traditionally relied on local ‘lab’ servers to address its computational needs, to become increasingly reliant on High Performance Computing (HPC) resources. Though many operations on sequencing datasets are trivially parallelizable on multiple levels, the lack of an HPC tradition in biological research has hampered fully parallelized deployments.

Here we present a lightweight, flexible framework for performing parallelized processing of raw gene expression data. The framework uses a Python 3-based frontend for specifying analysis options, data paths, and reference datasets. This frontend sanitizes and resolves the options, providing verbose error checking before writing a human-readable configuration file and basic scripts for batch submission. The submission scripts leverage the scheduler to implement a scatter-gather approach, submitting potentially hundreds of individual jobs via a job array, each small enough to take advantage of backfill in a high-contention HPC environment. The gather component is handled through a script submitted with an ‘after-okay’ dependency.
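
The scatter-gather submission described above can be illustrated with a minimal sketch. This is not CHURP's own code: the abstract does not name the scheduler, so the sketch assumes Slurm's sbatch and its --array, --parsable, and --dependency=afterok options, and the function and script names are hypothetical. It only builds the command lines and does not submit anything.

```python
from typing import List, Optional


def scatter_command(n_samples: int, worker_script: str,
                    max_concurrent: Optional[int] = None) -> List[str]:
    """Build an sbatch command that submits one array task per sample.

    Assumes Slurm; `worker_script` is a hypothetical per-sample script.
    """
    if n_samples < 1:
        raise ValueError("need at least one sample to submit")
    array_spec = f"1-{n_samples}"
    if max_concurrent is not None:
        # Slurm's '%N' suffix caps how many array tasks run at once.
        array_spec += f"%{max_concurrent}"
    # --parsable makes sbatch print the job ID in a machine-readable form,
    # which the gather step needs for its dependency.
    return ["sbatch", "--parsable", f"--array={array_spec}", worker_script]


def gather_command(array_job_id: str, gather_script: str) -> List[str]:
    """Build an sbatch command for the gather step.

    The afterok dependency holds the job until the array job has
    completed successfully (exit code 0).
    """
    return ["sbatch", f"--dependency=afterok:{array_job_id}", gather_script]
```

In use, the scatter command would be run first (for example with subprocess.run), the job ID read from its output, and that ID passed to gather_command. Keeping each array task to a single sample is what makes the individual jobs small enough to fit into backfill windows.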