Covid-19 vaccine distribution agreement with pharmacies, measles cases hit 23-year high, & participation in cancer trials

November 15, 2020

Lab Chat: Cleaning up genetic data to protect privacy but get maximum use

As research using genetic data has accelerated in recent decades, scientists are trying to find ways to get the most out of the data while still preserving individuals’ privacy. In a new study, experts describe a program that allows them to “sanitize” and blur out any identifying genetic variants in available data. I spoke with Mark Gerstein and Gamze Gursoy, two authors of the study and bioinformatics researchers at Yale, to learn more:

What is the current problem with privacy, and which datasets have this problem?
Gerstein: There is this binary view of privacy: either the data is locked or not locked. It’s hard to aggregate data when it’s locked down and what we’re trying to do in the paper is measure the amount of private information in there so we can just remove that [and access the rest].
Gursoy: Some examples of databases with this problem are the Cancer Genome Atlas, and even the one we manage, called PsychENCODE [for understanding the genetics of psychiatric disorders].

How did you cover up the private data?
Gursoy: We have a reference genome that represents everyone — but there is 1% that’s unique to each of us. So, if you see an “A” in the genetic code of the reference genome, but a “G” in the data you have, you change it to the “A.”
Gerstein: When you use Google’s Street View, the people on the street are unimportant to the information you’re trying to get about stores, etc. The Google car takes pictures of people’s faces, but then finds people in the images and blurs them out. Our process is similar.
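
For readers who want to see the idea in code, here is a minimal Python sketch of the substitution Gursoy describes: wherever the observed sequence differs from the reference, the reference base is written back in. This is an illustration only, not the researchers' actual program. The function name `sanitize_sequence`, the optional `identifying_positions` argument, and the toy sequences are all assumptions made for this example, and it assumes the two sequences are already aligned to the same length.

```python
from typing import Optional, Set, Tuple


def sanitize_sequence(
    reference: str,
    observed: str,
    identifying_positions: Optional[Set[int]] = None,
) -> Tuple[str, int]:
    """Replace differing bases with the reference base (hypothetical helper).

    If identifying_positions is given, only mismatches at those 0-based
    positions are replaced; otherwise every mismatch is replaced.
    Returns the sanitized sequence and the number of bases changed.
    """
    if len(reference) != len(observed):
        raise ValueError("sequences must be aligned to the same length")

    sanitized = []
    changed = 0
    for position, (ref_base, obs_base) in enumerate(zip(reference, observed)):
        is_variant = obs_base != ref_base
        should_mask = identifying_positions is None or position in identifying_positions
        if is_variant and should_mask:
            # Write the reference base over the individual's variant.
            sanitized.append(ref_base)
            changed += 1
        else:
            sanitized.append(obs_base)
    return "".join(sanitized), changed


# Toy example: the reference has "A" at position 4, the data has "G".
clean, n_changed = sanitize_sequence("ACGTAC", "ACGTGC")
```

The count of changed bases is a crude stand-in for the "amount of private information" Gerstein mentions; the study's real measure is not described in this interview.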