ENCODE and the dance between genes and DNA/RNA
Since 2003, the lab of Yale’s Mark Gerstein has played a major role in an international effort to catalog data on the complex interactions between genes and the segments of DNA and RNA that regulate their functions. The latest findings of the ENCODE project were published July 29 in 30 papers, four spearheaded by Gerstein’s lab, in a variety of scientific journals. Jing Zhang and Donghoon Lee from Gerstein’s lab have created a video illustrating science’s evolving understanding of the complex regulatory networks that can contribute to cancer and other diseases. The latest findings by the Gerstein lab and other major ENCODE contributors can be found on the Gerstein lab website.

But it’s the very specificity of genomic data that threatens privacy. Although most genomic databases strip away any information linking a name to a genome, such information is very hard to keep anonymous. “I’m not convinced you can truly de-identify the data,” says Mark Gerstein, a Yale professor who studies large genetic databases and is a fierce privacy advocate. He is concerned about whether even the most cutting-edge protections can safeguard personal data. “I am not a believer that large-scale technical solutions or ‘super-encryption’ will solely work,” he says. “There also needs to be a process for credentialing the individuals who access this data.”

Threats to privacy could multiply once there is an active market for genetic data. Wood speculates that it could be valuable to life insurance companies, which could use it to raise your premiums; or it could become a tool for those who want to prove or disprove paternity. White nationalist groups, who have become preoccupied with genetic testing, might find a way to weaponize the ancestry data the tests can show. It would not be the first time genetic information was used against a race or races. “Genetics has a very troubled history, from Darwin on,” says Yale’s Mark Gerstein.

Yet Columbia’s Yaniv Erlich and others, including Church, fear differential privacy could compromise biomedical research, with smudged data making it harder to get clear results. Mark Gerstein at Yale believes that scientists would be better off testing hypotheses on small amounts of publicly available but pure data, even if it’s not representative of the overall population, rather than using larger quantities of imperfect data.

Genetic tests and genome sequencing are generating terabytes of sensitive private data. How can they be kept safe?

“The dark proteome could be an evolutionary playground for trying out new folds

Ultimately one would expect particularly useful variations to get fixed at the genetic level. But it needn’t be where that variation begins. What’s more, organisms needn’t be quite so dependent for their molecular repertoire on their evolutionary heritage. O’Donoghue thinks that all organisms probably have a significant fraction of proteins unique just to them.

‘The fact that the dark matter of the proteome has less evolutionary constraint than the other bits of proteome may suggest that it’s under less selection,’ says Gerstein. ‘This is perhaps because it’s more flexible structurally, but also in a sense more flexible in terms of accommodating various amino-acid changes compared to the structurally inflexible and fixed parts of the crystallised proteome.’ This adds momentum to the picture of genomics as a rather more fluid affair than is suggested by the old picture of identical proteins being mass-produced from a fixed genetic template.

Gerstein feels that studying the dark proteome opens up a host of interesting questions. For example, although known bacteria have a smaller dark proteome than eukaryotes, there’s a huge ‘dark microbiome’ of unculturable bacteria. Might that be more full of dark proteins – perhaps useful ones?

And what about us? ‘How does the human dark proteome compare to that of eukaryotes as a whole?’ Gerstein wonders. How well, really, do we know ourselves?”

