January 16, 2021


The new data ‘sanitization’ technique obscures regions of a participant’s genome in a dataset to secure her privacy, and may encourage more people to participate in genetic studies, says lead investigator Mark Gerstein, professor of biomedical informatics at Yale University.

“If someone hacks into your email, you can get a new email address; or if someone hacks your credit card, you can get a new credit card,” Gerstein says. “If someone hacks your genome, you can’t get a new one.”

To determine which information and how much of it should remain private to prevent a linkage attack, Gerstein and his colleagues performed linkage attacks on existing genetic datasets. In one sample attack, they compared two publicly available databases and RNA sequencing results to successfully identify 421 individuals.

In another linkage attack, Gerstein’s team sequenced the RNA of two volunteers and shuffled these data into a larger dataset. They then obtained DNA samples from the volunteers’ used coffee cups and sequenced their genomes. Again, they could link the two individuals to their genomes with a high degree of certainty.

Based on what they learned from the mock linkage attacks, Gerstein’s team developed a technique to mask some variants in a person’s genetic data while preserving where those variants are located in the genome. To do this, they replace the genetic variant of concern with the corresponding one from a reference genome. Which variants are removed depends on the genetic conditions or predispositions that a person’s genetic data reveals.
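The masking step can be illustrated with a minimal Python sketch. This is not the team's actual software: the function name `mask_variants`, the dictionary layout (genome position mapped to a base), and the sample positions are all hypothetical, chosen only to show a sensitive variant being swapped for the reference base while its position is kept.

```python
def mask_variants(genotype, reference, sensitive_positions):
    """Return a copy of `genotype` in which every sensitive position
    carries the reference base instead of the participant's own base.

    All names and the data layout here are illustrative assumptions.
    """
    masked = dict(genotype)  # copy, so the original data is left untouched
    for pos in sensitive_positions:
        if pos in masked:
            # Keep the position, replace only the base found there.
            masked[pos] = reference[pos]
    return masked


# Hypothetical toy data: genome position -> base.
reference = {101: "A", 205: "C", 309: "T"}
participant = {101: "G", 205: "C", 309: "G"}

# Suppose only position 101 is considered identifying or sensitive.
sanitized = mask_variants(participant, reference, {101})
print(sanitized)  # {101: 'A', 205: 'C', 309: 'G'}
```

In this toy example the variant at position 309 is left in place, which reflects the trade-off in the article: the more positions are masked, the less of the original signal remains for researchers.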

Masking too many variants in this way can decrease the usefulness of the data. But Gerstein’s team struck a balance that still lets researchers obtain gene-expression values while letting study participants decide how much of their genetic information they wish to keep hidden.


Lab Chat: Cleaning up genetic data to protect privacy but get maximum use

As research using genetic data has accelerated in recent decades, scientists are trying to find ways to get the most out of the data while still preserving individuals’ privacy. In a new study, experts describe a program that allows them to “sanitize” and blur out any identifying genetic variants in available data. I spoke with Mark Gerstein and Gamze Gursoy, two authors of the study and bioinformatics researchers at Yale, to learn more:

What is the current problem with privacy, and which datasets have this problem?
Gerstein: There is this binary view of privacy: either the data is locked or not locked. It’s hard to aggregate data when it’s locked down, and what we’re trying to do in the paper is measure the amount of private information in there so we can just remove that [and access the rest].
Gursoy: Some examples of databases with this problem are the Cancer Genome Atlas, and even the one we manage, called PsychENCODE [for understanding the genetics of psychiatric disorders].

How did you cover up the private data?
Gursoy: We have a reference genome that represents everyone — but there is 1% that’s unique to each of us. So, if you see an “A” in the genetic code of the reference genome, but a “G” in the data you have, you change it to the “A.”
Gerstein: When you use Google’s Street View, the people on the street are unimportant to the information you’re trying to get about stores, etc. The Google car takes pictures of people’s faces, but then finds people in the images and blurs them out. Our process is similar.

Responsible, practical genomic data sharing that accelerates research

July 27, 2020

IMDB paper- revisited

January 20, 2020

IEEE awarded the IMDB paper “Test of Time” award and invited authors to write a revisit paper.

Can be found here:

