In the face of unprecedented amounts of sequencing data generated by the genetic and genomic revolution, researchers have established a dizzying number of databases aiming to harness this new information. Sharing and pooling of such data have accelerated scientific progress, but questions remain about how the results will impact medical practice. Each of these databases has a different purpose and different structure—and different data. It is vital that oncologists and other health-care providers, who increasingly rely on genetic databases for information to predict patients’ inherited cancer risk, understand that these databases are not equivalent, and that only some contain clinical-grade data.
To understand the differences in databases, it is first necessary to know how gene variants correlate with pathogenicity. In general, variants are classified into one of five categories: known benign, likely benign, variant of uncertain clinical significance, likely pathogenic, or known pathogenic.
The most rudimentary variant databases support pathogenicity assessment based on a variant's prevalence in a given patient population (e.g., women with breast cancer) compared with its prevalence in an unaffected control population. This approach requires that disease-linked variants occur with sufficient frequency to allow meaningful statistical comparisons. While it is suitable for common variants, it is less useful for rare variants, such as the many missense mutations, where one amino acid is substituted for another, in the highly variable genes BRCA1 and BRCA2.
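To make the statistical point concrete, here is a minimal sketch of a prevalence comparison, assuming a one-sided Fisher exact test is used to compare carriers among patients and controls. The function name and all counts are hypothetical and chosen only for illustration; they do not come from any real database.

```python
from math import comb


def one_sided_fisher(case_carriers, case_total, control_carriers, control_total):
    """Probability that at least `case_carriers` carriers fall among the cases
    if carrying the variant were unrelated to disease (the null hypothesis)."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denominator = comb(total, case_total)
    most_extreme = min(carriers, case_total)
    p_value = 0.0
    for k in range(case_carriers, most_extreme + 1):
        # Hypergeometric probability of exactly k carriers among the cases.
        ways = comb(carriers, k) * comb(total - carriers, case_total - k)
        p_value += ways / denominator
    return p_value


# Illustrative counts only, not real data.
common = one_sided_fisher(40, 1000, 10, 1000)  # 40 of 1,000 patients vs 10 of 1,000 controls
rare = one_sided_fisher(2, 1000, 0, 1000)      # 2 of 1,000 patients vs 0 of 1,000 controls
print(f"common variant: p = {common:.2g}")
print(f"rare variant:   p = {rare:.2g}")
```

With these made-up counts, the rare variant yields a p-value of about 0.25 even though it appears only in patients. Two carriers are simply too few to separate a real association from chance, which is why prevalence alone cannot classify rare variants.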
To detect rare variants, robust databases must include information beyond prevalence, such as family segregation studies, coinheritance patterns, in silico prediction, and functional studies. The databases must also be updated regularly, for example when research reveals that a variant of uncertain significance is more likely to be benign or pathogenic. And they should account for differences in how laboratories classify a variant.
Curation differences are another concern. Curation is the process of monitoring data quality, completeness, and consistency. Curation can be prospective or retrospective, automated or manual. With prospective curation, contributors upload data into a quarantined area, where curation occurs before the data are allowed into the database. Retrospective curation allows open data uploads, after which attempts are made to identify duplicate or incorrect entries and make other necessary changes. The word curation is sometimes used loosely, so anyone considering a database should first ascertain how much curation is actually performed.
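The difference between the two curation models can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how any real database works: the `Submission` record, the `passes_review` rule (a recognized classification, and no repeat of the same variant from the same family), and the placeholder variant names are all hypothetical.

```python
from dataclasses import dataclass

# The five standard classification categories.
CATEGORIES = {
    "known benign",
    "likely benign",
    "variant of uncertain clinical significance",
    "likely pathogenic",
    "known pathogenic",
}


@dataclass(frozen=True)
class Submission:
    variant: str          # placeholder name, not a real variant
    classification: str
    family_id: str


def passes_review(sub, accepted):
    """Hypothetical curation rule: the classification must be a recognized
    category, and the same variant from the same family must not already exist."""
    if sub.classification not in CATEGORIES:
        return False
    return all((sub.variant, sub.family_id) != (a.variant, a.family_id) for a in accepted)


def prospective(submissions):
    """Hold submissions in quarantine; only reviewed entries reach the database."""
    database, rejected = [], []
    for sub in submissions:
        (database if passes_review(sub, database) else rejected).append(sub)
    return database, rejected


def retrospective(submissions):
    """Publish every submission immediately, then clean up afterward."""
    published = list(submissions)  # visible to users before any review
    cleaned = []
    for sub in published:
        if passes_review(sub, cleaned):
            cleaned.append(sub)
    return published, cleaned


uploads = [
    Submission("variant-A", "likely pathogenic", "family-1"),
    Submission("variant-A", "likely pathogenic", "family-1"),  # duplicate entry
    Submission("variant-B", "probably harmful", "family-2"),   # unrecognized category
]
database, rejected = prospective(uploads)
published, cleaned = retrospective(uploads)
print(len(database), len(rejected), len(published), len(cleaned))
```

Both paths end with the same clean record here, but under the retrospective model users could see all three uploads, including the duplicate and the unrecognized classification, until the cleanup pass ran.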
One final hurdle to taking advantage of clinical-grade variant databases is ease of use. Identifying relevant information on a variant can require considerable expertise when a database consists of thousands or millions of variant entries.
Two archetypical variant databases are ClinVar, sponsored and maintained by the National Center for Biotechnology Information, and the Leiden Open Variation Database (LOVD). ClinVar and LOVD are open databases that allow an individual to enter a variant. The submission may be as simple as a representation of an allele and its interpretation. It may also include the pathogenicity assessment assigned to that variant by the contributor, along with literature references and other information used for the assessment. (See “The Genetic Components of Rare Disease,” The Scientist, July 2016.) Other recent BRCA databases, such as that championed by the American Society of Human Genetics and the University of Washington, may also contribute to the understanding of BRCA science, if appropriate curation is in place.
A concern about open databases is that curation is minimal or nonexistent prior to database entry. In these cases, curation must be done retrospectively. If a mistake is identified, it is the contributor’s responsibility to correct the entry. For instance, ClinVar staff do not review all submissions, and the interpretation of the relationship of variation to health is provided only by the submitter. ClinVar and LOVD are laudable efforts to advance genetic research, but their utility as references, particularly for clinical use, is currently limited.
BRCA Share, which I helped launch in 2014, is designed to overcome some traditional limitations of variant databases. The database focuses on BRCA1 and BRCA2, genes that contain tens of thousands of unique variants. New and rare variants continue to be discovered, and the link between each and a woman’s risk of breast and ovarian cancers is still being resolved.
BRCA Share is based on the Universal Mutation Database (UMD) developed by INSERM, the French National Institute of Health and Medical Research. For more than a decade—while BRCA testing in the United States was limited to a single commercial laboratory thanks to a patent on the gene variants and testing technology—more than a dozen laboratories in France were performing BRCA testing and contributing their data to the UMD. Manual prospective curation is performed by INSERM’s curation team prior to uploading data to the UMD. Supporting data, such as family studies, prevalence in international databases, and occurrence in UMD (curated to eliminate duplicate entries from the same family), are also available.
BRCA Share is free to scientists, physicians, and others with a research focus; commercial laboratories must pay a fee on a sliding scale. After little more than a year of existence, BRCA Share has led to the identification of 334 pathogenic or likely pathogenic variants and classifications of 375 variants whose role in cancer risk was previously uncertain (The BRCA Share Consortium, 6th Biennial Meeting of The Human Variome Project, June 2016).
Beyond BRCA Share, other genetic databases show promise for clinical use, including BRCA Challenge and ClinGen. How well these initiatives evolve will shape the course of genetics research—and the quality of genetic patient testing—for years to come.
Charles M. Strom was one of the principal architects of BRCA Share. He is a full-time employee of Quest Diagnostics, and he owns stock in Quest Diagnostics. Quest Diagnostics is a founding member of BRCA Share.