Laura Richards
Data Scientist, Quilt Data Inc.
Laura is a computational biologist specializing in cancer genomics and bioinformatics. She holds a PhD in Medical Biophysics from the University of Toronto and has conducted genomics research in both academia (Princess Margaret Cancer Centre, The Hospital for Sick Children, Ontario Institute for Cancer Research) and industry (Celsius Therapeutics, Caris Life Sciences). Her primary research interests are identifying novel therapeutic targets and biomarkers for cancer diagnosis to advance personalized medicine. New discoveries need data, however, which sparked Laura’s venture into data management with Quilt packages to optimize research workflows.
Unifying Nextflow pipeline outputs and biological metadata with SQL and schema-on-read databases
Bioinformatics teams face challenges aggregating results and sample metadata across experiments and next-generation sequencing (NGS) runs, which leads to unnecessary time spent on record-keeping and data wrangling. Disparate data sources, inconsistent naming conventions, and diverse file formats complicate locating and linking NGS results with metadata. Consequently, fragmented datasets can obscure biological patterns and batch effects that become visible only when data are unified and analyzed at scale.
Despite its widespread use in data science, SQL is underused by the bioinformatics community. Familiar relational databases require users to predefine tables, which slows pipeline development. Schema-on-read databases, like Amazon Athena and Google BigQuery, allow bioinformaticians to query pipeline outputs directly in cloud storage, but only if the output files adhere to specific folder structures.
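To make the folder-structure requirement concrete, here is a minimal sketch of one common convention, Hive-style `key=value` partition folders, which Athena can use to expose folder names as queryable columns. The function name, the table name `salmon_quant`, and the example path are hypothetical and chosen only for illustration; the session's actual layout may differ.

```python
from pathlib import PurePosixPath


def to_partitioned_key(run_id: str, sample_id: str, source_path: str, table: str) -> str:
    """Build a Hive-style object key (key=value folders) for one pipeline output file."""
    filename = PurePosixPath(source_path).name
    return f"{table}/run_id={run_id}/sample_id={sample_id}/{filename}"


# Hypothetical per-sample output path from an RNA-seq pipeline run
key = to_partitioned_key(
    run_id="run01",
    sample_id="SAMPLE_A",
    source_path="results/star_salmon/SAMPLE_A/quant.sf",
    table="salmon_quant",
)
print(key)  # salmon_quant/run_id=run01/sample_id=SAMPLE_A/quant.sf
```

Copying or publishing files under keys like this puts all files of one type beneath a single table prefix, with the run and sample encoded in the path rather than in the file name.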
In our session, we illustrate how SQL and schema-on-read databases can unify metadata with NGS results across runs to simplify data accessibility. We address two main implementation bottlenecks experienced by the community: (1) a lack of familiarity with, and tools for, creating table definitions for NGS data, and (2) output folder structures of nf-core pipelines that are typically incompatible with schema-on-read databases or inefficient to query.
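As an illustration of the first bottleneck, a table definition for tab-delimited pipeline output can be generated from a list of column names and types. The sketch below renders an Athena/Hive-style `CREATE EXTERNAL TABLE` statement; the helper `build_external_table_ddl`, the table name, the chosen column types, and the bucket location are assumptions made for this example, not the tooling presented in the session. The columns follow the header of a Salmon `quant.sf` file.

```python
def build_external_table_ddl(table, columns, partitions, location):
    """Render an Athena/Hive-style CREATE EXTERNAL TABLE statement for tab-delimited files."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    parts = ", ".join(f"`{name}` {sql_type}" for name, sql_type in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


ddl = build_external_table_ddl(
    table="salmon_quant",  # hypothetical table name
    columns=[
        ("Name", "string"),
        ("Length", "int"),
        ("EffectiveLength", "double"),
        ("TPM", "double"),
        ("NumReads", "double"),
    ],
    partitions=[("run_id", "string"), ("sample_id", "string")],
    location="s3://example-bucket/salmon_quant/",  # hypothetical bucket
)
print(ddl)
```

With a definition like this, the partition folders still have to be registered before queries return rows, for example with `MSCK REPAIR TABLE` in Athena.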
We provide examples of constructing table definitions and database views from common nf-core pipeline outputs (fetchngs, rnaseq), alongside queries that eliminate manual file wrangling. For example, we processed RNA-seq data with metadata across multiple runs from the Cancer Cell Line Encyclopedia (CCLE) and ran queries that revealed scientific insights, such as target gene expression across cancer types, rapidly and with minimal code.
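The kind of view and query described above can be sketched locally. The example below uses an in-memory SQLite database as a stand-in for a schema-on-read engine such as Athena or BigQuery (SQLite is not schema-on-read; it is used here only so the SQL is runnable). The table names, column names, and all rows are invented for illustration and are not CCLE data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_metadata (sample_id TEXT, run_id TEXT, cancer_type TEXT);
CREATE TABLE gene_expression (sample_id TEXT, run_id TEXT, gene TEXT, tpm REAL);

-- Toy rows, invented purely for illustration (not CCLE data)
INSERT INTO sample_metadata VALUES
  ('S1', 'run01', 'lung'),
  ('S2', 'run01', 'breast'),
  ('S3', 'run02', 'lung');
INSERT INTO gene_expression VALUES
  ('S1', 'run01', 'GENE_X', 10.0),
  ('S2', 'run01', 'GENE_X', 4.0),
  ('S3', 'run02', 'GENE_X', 20.0);

-- A view that unifies expression values with sample metadata across runs
CREATE VIEW expression_with_metadata AS
SELECT e.gene, e.tpm, m.sample_id, m.run_id, m.cancer_type
FROM gene_expression AS e
JOIN sample_metadata AS m
  ON e.sample_id = m.sample_id AND e.run_id = m.run_id;
""")

# Mean expression of one gene per cancer type, across all runs
rows = conn.execute(
    """
    SELECT cancer_type, COUNT(*) AS n_samples, AVG(tpm) AS mean_tpm
    FROM expression_with_metadata
    WHERE gene = ?
    GROUP BY cancer_type
    ORDER BY cancer_type
    """,
    ("GENE_X",),
).fetchall()

for row in rows:
    print(row)
# ('breast', 1, 4.0)
# ('lung', 2, 15.0)
```

Once the view exists, the per-cancer-type summary is a single query rather than a loop over per-sample files, which is the wrangling step the abstract aims to remove.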
These techniques integrate into existing Nextflow infrastructures, streamlining bioinformaticians' access to unified datasets.