Jaclyn N. Taroni

Data Scientist at the Childhood Cancer Data Lab (CCDL)


Jaclyn Taroni recently joined the Childhood Cancer Data Lab, an Alex’s Lemonade Stand Foundation initiative dedicated to funding research for a cure for childhood cancer. As a Postdoctoral Fellow at the University of Pennsylvania’s Perelman School of Medicine, she worked at the Greene Lab, a team of scientists working to develop methods for analyzing big data to understand complex biological systems and provide these methods and data to every biologist.


At Dartmouth College’s Geisel School of Medicine, she studied systemic sclerosis, a rare autoimmune disease with no FDA-approved treatments, and developed novel frameworks for analyzing high-throughput molecular data. Dr. Taroni’s goal is to use unsupervised machine learning techniques to study the intersection of autoimmune and fibrotic disorders.

The CCDL’s mission is to empower scientists and doctors by creating tools that make data and analysis widely available, easily mineable and broadly reusable. The lab also trains scientists in putting these powerful tools to their best use. The CCDL is currently building the open-source refine.bio project to provide a constantly updated set of harmonized data on childhood cancer to the research community.

What are your goals at the Childhood Cancer Data Lab?

The 10,000-foot overview is to build new tools and services to process gene-expression data, not limited to childhood cancer; provide analysis to create pipelines; and run workshops to teach researchers how to analyze data and use machine learning.

You have worked on rare diseases as well as cancer. Is there a relationship?

Some childhood cancers are rare, so they present the same constraints and problems. You have a set of samples and learn where they fit in, at what stage of development.

What will you talk about at InnovationWell?

I will talk about processing. Everybody has done experiments in a particular way, asking particular questions related to their research, and there is a need to harmonize. We are building models across the board to handle ‘shovel-ready’ data.

What is one of the challenges?

The low number of samples is a big problem. Transcriptomic data is noisy, and some of the noise has nothing to do with biology; it may come from the technology used, how the sample was prepped, or even the institution. You have to overcome the noise in a particular data set, separate biology from sample preparation, download the raw data, and process it in a uniform manner. That means processing one sample at a time, constantly adding to the compendia. And every time you add a new sample, you don’t want to reprocess, so we need to pick processes with general use, multi-sample methods. But you need to be programming savvy to get this in place.
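
To make the "one sample at a time" idea concrete, here is a minimal Python sketch. It is an illustration only, not the CCDL's or refine.bio's actual pipeline: the names rank_normalize and Compendium are hypothetical, and rank scaling is swapped in as a simple example of a transformation that depends only on the sample being processed. Because of that property, adding a new sample never changes samples already stored.

```python
from typing import Dict


def rank_normalize(expression: Dict[str, float]) -> Dict[str, float]:
    """Scale one sample's gene values to ranks in [0, 1].

    Uses only this sample's own values, so the result does not depend on
    any other sample. Ties are broken by input order (a simplification).
    """
    genes = sorted(expression, key=lambda gene: expression[gene])
    n = len(genes)
    if n == 1:
        return {genes[0]: 0.0}
    return {gene: rank / (n - 1) for rank, gene in enumerate(genes)}


class Compendium:
    """Hypothetical container that grows one processed sample at a time."""

    def __init__(self) -> None:
        self.samples: Dict[str, Dict[str, float]] = {}

    def add(self, sample_id: str, expression: Dict[str, float]) -> None:
        # Only the incoming sample is processed; stored samples are untouched.
        self.samples[sample_id] = rank_normalize(expression)


compendium = Compendium()
compendium.add("sample_1", {"GENE_A": 5.0, "GENE_B": 1.0, "GENE_C": 3.0})
before = dict(compendium.samples["sample_1"])
compendium.add("sample_2", {"GENE_A": 0.2, "GENE_B": 9.0, "GENE_C": 4.0})
# sample_1 is identical before and after sample_2 was added.
print(compendium.samples["sample_1"] == before)
```

A method that needs every sample at once, such as normalizing against the mean of the whole collection, would not have this property: each new sample would shift the reference and force the earlier samples to be reprocessed.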