To engineer proteins with useful functions, researchers usually begin with a natural protein that has a desirable function, such as emitting fluorescent light, and put it through many rounds of random mutation that eventually generate an optimized version of the protein.
This process has yielded optimized versions of many important proteins, including green fluorescent protein (GFP). However, for other proteins, it has proven difficult to generate an optimized version. MIT researchers have now developed a computational approach that makes it easier to predict mutations that will lead to better proteins, based on a relatively small amount of data.
Using this model, the researchers generated proteins with mutations that were predicted to lead to improved versions of GFP and a protein from adeno-associated virus (AAV), which is used to deliver DNA for gene therapy. They hope the model can also be used to develop additional tools for neuroscience research and medical applications.
“Protein design is a hard problem because the mapping from DNA sequence to protein structure and function is really complex. There might be a great protein just 10 changes away in the sequence, but each intermediate change might correspond to a completely nonfunctional protein. It’s like trying to find your way to the river basin in a mountain range, when there are craggy peaks along the way that block your view. The current work tries to make the riverbed easier to find,” says Ila Fiete, a professor of brain and cognitive sciences at MIT, a member of MIT’s McGovern Institute for Brain Research, director of the K. Lisa Yang Integrative Computational Neuroscience Center, and one of the senior authors of the study.
Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health at MIT, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering and Computer Science at MIT, are also senior authors of an open-access paper on the work, which will be presented at the International Conference on Learning Representations in May. MIT graduate students Andrew Kirjner and Jason Yim are the lead authors of the study. Other authors include Shahar Bracha, an MIT postdoc, and Raman Samusevich, a graduate student at Czech Technical University.
Optimizing proteins
Many naturally occurring proteins have functions that could make them useful for research or medical applications, but they need a little extra engineering to optimize them. In this study, the researchers were originally interested in developing proteins that could be used in living cells as voltage indicators. These proteins, produced by some bacteria and algae, emit fluorescent light when an electric potential is detected. If engineered for use in mammalian cells, such proteins could allow researchers to measure neuron activity without using electrodes.
While decades of research have gone into engineering these proteins to produce a stronger fluorescent signal on a faster timescale, they haven’t become effective enough for widespread use. Bracha, who works in Edward Boyden’s lab at the McGovern Institute, reached out to Fiete’s lab to see if they could work together on a computational approach that might help speed up the process of optimizing the proteins.
“This work exemplifies the human serendipity that characterizes so much scientific discovery,” Fiete says. “It grew out of the Yang Tan Collective retreat, a scientific meeting of researchers from multiple centers at MIT with distinct missions unified by the shared support of K. Lisa Yang. We learned that some of our interests and tools in modeling how brains learn and optimize could be applied in the very different domain of protein design, as being practiced in the Boyden lab.”
For any given protein that researchers might want to optimize, there is a nearly infinite number of possible sequences that could be generated by swapping in different amino acids at each point within the sequence. With so many possible variants, it is impossible to test all of them experimentally, so researchers have turned to computational modeling to try to predict which ones will work best.
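A quick back-of-the-envelope calculation conveys the scale of the problem. The figures below assume, for illustration, a GFP-length protein of 238 amino acids with 20 possible residues per position:

```python
# Back-of-the-envelope size of protein sequence space, assuming for
# illustration a GFP-length protein of 238 amino acids.
protein_length = 238
n_amino_acids = 20

# Every position can hold any of the 20 standard amino acids.
total_sequences = n_amino_acids ** protein_length
print(f"full sequence space: ~10^{len(str(total_sequences)) - 1} variants")

# Even single-point mutants of one starting sequence add up quickly:
# 238 positions x 19 alternative residues each.
single_mutants = protein_length * (n_amino_acids - 1)
print(f"single-point mutants alone: {single_mutants}")
```

The full space (~10^309 sequences) is astronomically larger than anything an experiment — or even an exhaustive computation — could cover, which is why a predictive model trained on a small sample is needed.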
In this study, the researchers set out to overcome those challenges, using data from GFP to develop and test a computational model that could predict better versions of the protein.
They began by training a type of model known as a convolutional neural network (CNN) on experimental data consisting of GFP sequences and their brightness — the feature that they wanted to optimize.
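As a rough sketch of the data flow in such a model — not the authors’ actual architecture — protein sequences are typically one-hot encoded, and a 1-D convolution slides along the sequence to produce a single predicted-fitness score such as brightness. The toy below uses random, untrained filters purely to illustrate the shapes involved:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> np.ndarray:
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[a] for a in seq]] = 1.0
    return x

def conv_score(x: np.ndarray, filters: np.ndarray) -> float:
    """One 1-D conv layer + ReLU + global mean pooling, standing in for
    a trained CNN's sequence -> predicted-fitness mapping."""
    width = filters.shape[1]
    windows = range(x.shape[0] - width + 1)
    acts = np.array([[np.sum(x[i:i + width] * f) for i in windows]
                     for f in filters])
    return float(np.maximum(acts, 0.0).mean())

rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, 20))  # 4 untrained filters, window of 3
score = conv_score(one_hot("MSKGEELF"), filters)  # short example sequence
print(f"predicted fitness (untrained): {score:.3f}")
```

In the real model the filters are learned from the ~1,000 labeled GFP variants, so the output tracks measured brightness rather than noise.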
The model was able to create a “fitness landscape” — a three-dimensional map that depicts the fitness of a given protein and how much it differs from the original sequence — based on a relatively small amount of experimental data (from about 1,000 variants of GFP).
These landscapes contain peaks that represent fitter proteins and valleys that represent less fit proteins. Predicting the path that a protein needs to follow to reach the peaks of fitness can be difficult, because often a protein will need to undergo a mutation that makes it less fit before it reaches a nearby peak of higher fitness. To overcome this problem, the researchers used an existing computational technique to “smooth” the fitness landscape.
Once these small bumps in the landscape were smoothed, the researchers retrained the CNN model and found that it was able to reach greater fitness peaks more easily. The model was able to predict optimized GFP sequences that differed by as many as seven amino acids from the protein sequence they started with, and the best of those proteins were estimated to be about 2.5 times fitter than the original.
“Once we have this landscape that represents what the model thinks is nearby, we smooth it out and then we retrain the model on the smoother version of the landscape,” Kirjner says. “Now there is a smooth path from your starting point to the top, which the model is able to reach by iteratively making small improvements. The same is often impossible for unsmoothed landscapes.”
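The smooth-then-climb idea can be illustrated on a toy landscape. The sketch below is not the paper’s technique (which smooths a learned neural-network landscape); it simply averages each sequence’s fitness with its single-mutation neighbors, then hill-climbs greedily. On the raw landscape the climber is trapped at the start because the first mutation is deleterious; on the smoothed landscape it reaches the global peak:

```python
from itertools import product

ALPHABET = "AB"
LENGTH = 6
SEQS = ["".join(s) for s in product(ALPHABET, repeat=LENGTH)]

def fitness(seq: str) -> float:
    """Rugged toy landscape: fitness grows with the number of 'B's, but
    the very first mutation is deleterious -- a valley blocking the path."""
    b = seq.count("B")
    return -1.0 if b == 1 else float(b)

def neighbors(seq: str) -> list:
    """All sequences one point mutation away."""
    return ["".join(seq[:i] + a + seq[i + 1:])
            for i in range(len(seq)) for a in ALPHABET if a != seq[i]]

def smooth(landscape: dict) -> dict:
    """Average each sequence's fitness over its single-mutation neighborhood."""
    out = {}
    for s in landscape:
        vals = [landscape[s]] + [landscape[n] for n in neighbors(s)]
        out[s] = sum(vals) / len(vals)
    return out

def hill_climb(start: str, landscape: dict) -> str:
    """Greedily take the best single mutation until no neighbor improves."""
    seq = start
    while True:
        best = max(neighbors(seq), key=lambda n: landscape[n])
        if landscape[best] <= landscape[seq]:
            return seq
        seq = best

raw = {s: fitness(s) for s in SEQS}
smoothed = smooth(raw)

print(hill_climb("AAAAAA", raw))       # stuck at "AAAAAA": first step looks bad
print(hill_climb("AAAAAA", smoothed))  # reaches the true peak "BBBBBB"
```

Averaging washes out the valley at one mutation, so every greedy step on the smoothed landscape points toward the peak — the same intuition behind retraining the CNN on a smoothed version of its own landscape.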
Proof-of-concept
The researchers also showed that this approach worked well in identifying new sequences for the viral capsid of adeno-associated virus (AAV), a viral vector that is commonly used to deliver DNA. In that case, they optimized the capsid for its ability to package a DNA payload.
“We used GFP and AAV as a proof of concept to show that this is a method that works on data sets that are very well-characterized, and because of that, it should be applicable to other protein engineering problems,” Bracha says.
The researchers now plan to use this computational technique on data that Bracha has been generating on voltage indicator proteins.
“Dozens of labs have been working on that for 20 years, and still there isn’t anything better,” she says. “The hope is that now, with generation of a smaller data set, we could train a model in silico and make predictions that could be better than the past 20 years of manual testing.”
The research was funded, in part, by the U.S. National Science Foundation, the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging Threats program, the DARPA Accelerated Molecular Discovery program, the Sanofi Computational Antibody Design grant, the U.S. Office of Naval Research, the Howard Hughes Medical Institute, the National Institutes of Health, the K. Lisa Yang ICoN Center, and the K. Lisa Yang and Hock E. Tan Center for Molecular Therapeutics at MIT.