Researchers at UC Santa Cruz have developed a novel data model that allows institutions to securely share private genomic information. During testing, they also discovered previously unknown variants associated with breast cancer risk and helped address the lack of diversity in today’s genetic databases.
To say the research team led by James Casaletto, a Ph.D. candidate at UCSC’s Baskin School of Engineering and the paper’s lead author, was busy would be an understatement.
Understanding the clinical significance of rare genetic variants requires analyzing large amounts of genomic and clinical data. Privacy policies, however, restrict the sharing of this information between institutions, and no single institution is likely to have all the resources needed for a robust analysis. Thus, while an incredible amount of human health data exists, it lives in millions of silos, leaving scientists unable to piece it together.
Additionally, engineering the software needed to execute genomic analyses is complex and usually cannot be undertaken by the average geneticist.
To solve this problem—and more—Casaletto and his team created and successfully employed an approach called federated analysis. In this approach, researchers “bring the code to the data,” avoiding the need to export sensitive data at all.
UCSC Genomics Institute software is sent in a “container” to any collaborating institution around the world that is home to a valuable but protected set of genomic data. The collaborating institution then uses the software to analyze their data within their institution’s secure environment, generating summary data that does not reveal personal information about individual patients.
The “container” ensures that patient-level data stays subject to the strict privacy rules that exist to protect patients, while allowing researchers to draw on a much wider pool of genomic data, which can lead to better clinical conclusions. The federated analysis method also eliminates the need to upload, download and relocate huge data sets, which can be logistically prohibitive.
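As a rough illustration of the “bring the code to the data” idea, the minimal Python sketch below shows one way summary data could be produced locally and combined centrally. It is not the UCSC Genomics Institute software: the record layout, the function names `summarize_site` and `combine_summaries`, the `min_count` threshold and the variant labels are all hypothetical, chosen only to show that patient-level rows never leave the institution and that only per-variant counts are shared.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical patient-level record: (variant label, whether the patient had cancer).
# Real genomic and clinical records are far richer; this layout is an assumption.
PatientRecord = Tuple[str, bool]
Summary = Dict[str, Dict[str, int]]


def summarize_site(records: Iterable[PatientRecord], min_count: int = 1) -> Summary:
    """Runs inside the collaborating institution's secure environment.

    Reduces patient-level rows to per-variant counts. Variants seen in fewer
    than `min_count` patients are left out (an illustrative safeguard, not a
    rule taken from the article).
    """
    affected: Counter = Counter()
    unaffected: Counter = Counter()
    for variant_id, has_cancer in records:
        if has_cancer:
            affected[variant_id] += 1
        else:
            unaffected[variant_id] += 1

    summary: Summary = {}
    for variant_id in set(affected) | set(unaffected):
        total = affected[variant_id] + unaffected[variant_id]
        if total >= min_count:
            summary[variant_id] = {
                "affected": affected[variant_id],
                "unaffected": unaffected[variant_id],
            }
    return summary


def combine_summaries(summaries: List[Summary]) -> Summary:
    """Runs at the coordinating site, which only ever sees the shared counts."""
    combined: Summary = {}
    for summary in summaries:
        for variant_id, counts in summary.items():
            entry = combined.setdefault(variant_id, {"affected": 0, "unaffected": 0})
            entry["affected"] += counts["affected"]
            entry["unaffected"] += counts["unaffected"]
    return combined


# Made-up data for two hypothetical institutions; no real patients or variants.
site_a = [("variant_A", True), ("variant_A", False), ("variant_B", False)]
site_b = [("variant_A", True), ("variant_B", False), ("variant_B", False)]

combined = combine_summaries([summarize_site(site_a), summarize_site(site_b)])
for variant_id in sorted(combined):
    print(variant_id, combined[variant_id])
```

In this sketch each institution would run `summarize_site` on its own data, and only the resulting counts would be passed to `combine_summaries`. How the real software packages its analysis and what it reports is not described here, so the sketch should be read as an analogy rather than a specification.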

This graphical abstraction of the process of federated analysis shows how the process moves from software development at the UC Santa Cruz Genomics Institute to variant classification. Credit: James Casaletto
For this project, UCSC researchers collaborated with the RIKEN Center for Integrative Medical Sciences in Japan to analyze their biobank of BRCA1 and BRCA2 genomic data. When mutated, the inherited BRCA1 and BRCA2 genes are known to lead to an increased risk of breast, ovarian and other cancers.
Using the federated analysis method, the researchers made multiple discoveries about which specific variants in the BRCA1 and BRCA2 genes lead to cancer and which leave patients unaffected, moving the needle on a number of previously uncertain variants. This is the first application of federated analysis to enable classification of previously unclassified genetic variants.
“[The paper is] a proof of concept that we have this container technology, we’ve leveraged it for BRCA1 and BRCA2, we’ve also demonstrated in the research that it can be used for other genes—genotypes and phenotypes,” said Casaletto.
In addition, by partnering with RIKEN and integrating more Japanese genomic data, the research team was able to diversify the mostly white database.
“The genetics of white people are highly over-represented, the genetics of non-white people are much more of a mystery, due to a lot of historical biases in data collection,” said Melissa Cline, a research scientist at the UCSC Genomics Institute. “We were able to add a little more knowledge on Japanese genetics than was previously available.”
While Cline and Casaletto’s research had an immediate effect on the diversity of data, the widespread use of federated analysis in the future could make a monumental difference.
“Further collaboration using federated analysis with institutes worldwide could similarly do much to address the lack of representation of non-white people and empower institutions that may be resource-poor to contribute to the global genomic data pool,” said Cline. “What's been done in the past is basically a lot less data sharing, so the name of the game is really global data sharing.”