The splicing code is just one part of the noncoding genome, the area that does not produce proteins. But it’s a very important one. Approximately 90 percent of genes undergo alternative splicing, and scientists estimate that variations in the splicing code make up anywhere between 10 and 50 percent of all disease-linked mutations. “When you have mutations in the regulatory code, things can go very wrong,” Frey said.
“People have historically focused on mutations in the protein-coding regions, to some degree because they have a much better handle on what these mutations do,” said Mark Gerstein, a bioinformatician at Yale University, who was not involved in the study. “As we gain a better understanding of [the DNA sequences] outside of the protein-coding regions, we’ll get a better sense of how important they are in terms of disease.”
Scientists have made some headway in understanding how the cell chooses a particular protein configuration, but much of the code that governs this process has remained an enigma. Frey’s team was able to decipher some of these regulatory regions in a paper published in 2010, identifying a rough code within the mouse genome that regulates splicing. Over the past four years, the quality of genetics data, particularly human data, has improved dramatically, and machine-learning techniques have become much more sophisticated, enabling Frey and his collaborators to predict how splicing is affected by specific mutations at many sites across the human genome. “Genome-wide data sets are finally able to enable predictions like this,” said Manolis Kellis, a computational biologist at MIT who was not involved in the study.