Q&A: A 10,000-Genome Milestone for Shared Pediatric Cancer Data  

Computational biologist Jinghui Zhang of St. Jude realized scientists could work more efficiently with tools and genomic data shared on the cloud.

Apr 1, 2019
Carolyn Wilke
ST. JUDE CHILDREN’S RESEARCH HOSPITAL / PETER BARTA

Jinghui Zhang is a driving force behind the St. Jude Cloud, a platform that hosts pediatric cancer genomes as well as data analysis tools for researchers. At the American Association for Cancer Research 2019 meeting this week in Atlanta, Georgia, Zhang and her colleagues from St. Jude Children’s Research Hospital are celebrating reaching 10,000 genomes uploaded to the platform. They are also promoting a new genome browser called GenomePaint designed to integrate genomic, transcriptomic, and epigenomic data from patients. 

The Scientist spoke with Zhang to hear about her vision for the St. Jude Cloud and how the platform has changed over time to become the database with the largest number of whole genomes collected from pediatric cancer patients and survivors.  

The Scientist: What was the inspiration behind the St. Jude Cloud?

Jinghui Zhang: I actually had to download data from the public repository and what we recognized is the effort and the challenges to download public data sets. If we can upload data on the cloud and then scientists can bring their tools on the cloud, then we don’t need to do the data download anymore. 

I’m also a tools developer, so I have to share my tools with the global research community. We can upload our tools on the cloud and you just do one installation and everyone can use it directly. 

The third component is the data visualization. If we put the data in the form that people can visualize, then researchers who do not have computational skills can start to analyze the data.

TS: How has the St. Jude Cloud evolved over the years? 

JZ: We started our prototype in 2016. Initially, we were only focusing on putting tools there. From 2017 is where the major evolution comes. It’s where Microsoft decided to collaborate with us, so we were able to gain the cloud storage space for hosting these datasets. 

The first milestone came in 2018, [when we] were able to put 5,000 whole genomes on the cloud. The AACR this year will be the second milestone of the 10,000 whole genomes in place.

TS: Who uses the St. Jude Cloud?

JZ: There are three separate research entities. Most obvious are the research institutions involved in pediatric cancer research across the globe. The second group [are those] involved in genomic research in general because our resource is a very genomic dataset. A third group is really more interested in our tools. There are groups not even involved in cancer research but generic human disease research [that] are interested in utilizing our tools for their specific disease. 

TS: What have been the effects of the St. Jude Cloud on pediatric cancer research? 

JZ: We have a total of 800 registered users from 400 institutes globally having accessed the data. I already know from one group in Australia, they mentioned they are using our visualization tools every day for their clinical work. 

For our own clinical genomics program at St. Jude, we have developed this rapid RNA-seq tool that we’re using for fusion gene detection. [Editor’s note: Fusion genes are hybrids of genes that can cause cancer.] And this tool is used to actually determine whether the cancer patient is going to be put in one branch of the clinical trial versus the other, and it requires a very quick turnaround time. With cloud computing, it really cuts down data analysis from initially up to one week to several hours. That enables us to meet the timeline for making a decision on the clinical trial.

TS: What’s next for the platform?

JZ: There are a couple of things. One is expanding [to] more patients, more samples, expanding the data resource. The second thing is to incorporate additional epigenetic data that we generated from pediatric cancer cell lines so that we can facilitate the interpretation of the noncoding variants. The third aspect of this is . . . we are discussing with the National Cancer Institute about establishing a federated data system, so our database can communicate with the resources built by NCI [like their data portal], so that we don’t build up the databases in silos.

Editor’s note: This interview was edited for brevity.