In a study published in Science today, Berger and her colleagues bring several of these strands together, using NLP to predict mutations that allow viruses to avoid being detected by antibodies in the human immune system, a process known as viral immune escape. The basic idea is that the interpretation of a virus by an immune system is analogous to the interpretation of a sentence by a human.
“It’s a neat paper, building on the momentum of previous work,” says Ali Madani, a scientist at Salesforce who is using NLP to predict protein sequences.
Berger’s team draws on two different linguistic concepts: grammar and semantics (or meaning). The genetic or evolutionary fitness of a virus – such as its ability to infect a host – can be interpreted in terms of grammatical correctness. A virus that successfully infects is grammatically correct; one that fails is not.
Likewise, mutations in a virus can be interpreted in terms of semantics. A virus that mutates in a way that changes how things in its environment see it – like mutations in its surface proteins that make it invisible to certain antibodies – has changed its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.
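To make the analogy concrete, here is a toy Python sketch of the kind of scoring this framing implies: a candidate mutation is a promising escape candidate only if it is both “grammatical” (the mutated sequence still looks viable to a language model) and “semantically” changed (it moves far in embedding space). The mutation names, the scores, and the product-based scoring rule below are all invented for illustration; they are not taken from the paper.

```python
# Toy illustration of the grammar/semantics framing.
# Each candidate mutation gets two made-up scores:
#   grammaticality  - how plausible the mutated sequence is to a language model
#   semantic_change - how far the mutation moves the sequence in embedding space
candidates = {
    # mutation: (grammaticality, semantic_change) -- invented numbers
    "M1": (0.90, 0.70),
    "M2": (0.85, 0.95),
    "M3": (0.95, 0.10),  # viable, but barely changes "meaning"
    "M4": (0.05, 0.99),  # big change, but an implausible (ungrammatical) sequence
}

def escape_score(grammaticality, semantic_change):
    """High only when the mutation is both viable AND changes 'meaning'."""
    return grammaticality * semantic_change

# Rank candidates from most to least likely escape mutation under this toy rule.
ranked = sorted(candidates, key=lambda m: escape_score(*candidates[m]), reverse=True)
print(ranked[0])  # M2: viable and strongly changed in "meaning"
```

Under this toy rule, a mutation that merely breaks the virus (low grammaticality) or merely preserves how it looks to antibodies (low semantic change) scores poorly; only the combination flags a likely escape.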
To model these properties, the researchers used an LSTM, a type of neural network that predates the transformer-based ones used by large language models like GPT-3. These older networks can be trained on much less data than transformers and still perform well for many applications.
Instead of millions of sentences, they trained the NLP model on thousands of genetic sequences taken from three different viruses: 45,000 unique sequences for a strain of influenza, 60,000 for a strain of HIV, and between 3,000 and 4,000 for a strain of SARS-CoV-2, the virus that causes covid-19. “There is less data on the coronavirus because there is less surveillance,” says Brian Hie of MIT, who built the models.
NLP models work by encoding words in a mathematical space such that words with similar meanings sit closer together than words with different meanings; this representation is called an embedding. For viruses, embedding their genetic sequences grouped viruses together based on the similarity of their mutations. This makes it easier to predict which mutations are more likely for a particular strain than for others.
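A minimal sketch of what “closer in the model” means: sequences become vectors, and similarity is typically measured with something like cosine similarity. The vectors below are invented purely for illustration; a real model would learn them from the sequence data.

```python
import math

# Toy embeddings: each viral sequence is a vector in a shared space.
# These three-dimensional vectors are made up for illustration only.
embeddings = {
    "strain_A":         [0.90, 0.10, 0.00],
    "strain_A_variant": [0.85, 0.15, 0.05],  # a close mutation of strain_A
    "strain_B":         [0.00, 0.20, 0.95],  # an unrelated strain
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The variant sits closer to its parent strain than to the unrelated strain,
# which is what lets the model group viruses by the similarity of their mutations.
sim_parent = cosine(embeddings["strain_A"], embeddings["strain_A_variant"])
sim_other = cosine(embeddings["strain_A"], embeddings["strain_B"])
print(sim_parent > sim_other)  # True
```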
The overall aim of the approach is to identify mutations that could let a virus escape an immune system without making it less infectious – that is, mutations that change the meaning of a virus without making it grammatically incorrect. To test the tool, the team used a common metric for assessing predictions made by machine-learning models, which scores them on a scale between 0.5 (no better than chance) and 1 (perfect). In this case, they took the top mutations identified by the tool and checked, using real viruses in a lab, how many of them were actual escape mutations. Their results ranged from 0.69 for HIV to 0.85 for a strain of coronavirus. That’s better than other state-of-the-art models, they say.
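A scale running from 0.5 (chance) to 1 (perfect) is characteristic of the area under the ROC curve (AUC). Assuming that is the metric meant here, this is a minimal sketch of how it is computed: the fraction of (escape, non-escape) pairs that the model ranks in the right order. The scores and lab labels below are invented for illustration.

```python
# Toy sketch of a 0.5-to-1 metric (AUC): the probability that a randomly
# chosen true escape mutation is scored higher than a randomly chosen
# non-escape mutation. Ties count as half right.
def auc(scores, labels):
    """Rank-based AUC over all (positive, negative) pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    pairs = [(p, n) for p in pos for n in neg]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return correct / len(pairs)

# Invented model scores for candidate mutations, and whether each was
# confirmed as a real escape mutation in the lab (1) or not (0).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]

print(round(auc(scores, labels), 2))  # 0.89: most escape mutations ranked on top
```

A model that scored every mutation identically would get 0.5 here, and one that ranked every real escape mutation above every non-escape one would get 1.0, matching the scale described above.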
Knowing which mutations might be coming could make it easier for hospitals and public-health authorities to plan ahead. For example, asking the model how much a strain of flu has changed in meaning since last year would give you an idea of how well the antibodies people have already developed are going to work this year.
The team says they are now running the models on new variants of the coronavirus, including the so-called UK variant, the Danish mink variant, and variants from South Africa, Singapore, and Malaysia. They have found a high potential for immune escape in almost all of them, although this has yet to be tested in the wild. One exception is the South African variant, which has raised concerns that it may escape vaccines but was not flagged by the tool. They are trying to figure out why.
Using NLP speeds up a slow process. Previously, the genome of a virus taken from a covid-19 patient in a hospital could be sequenced and its mutations re-created and studied in a lab. But that can take weeks, says Bryan Bryson, a biologist at MIT who also works on the project. The NLP model predicts potential mutations straight away, which focuses the lab work and speeds it up.
“It’s an amazing time to be working on this,” says Bryson. New viral sequences come out every week. “It’s great to simultaneously update your model and then run to the lab to test it in experiments. It’s the best of computational biology.”
But this is only the beginning. Treating genetic mutations as changes in meaning could be applied in many different ways across biology. “A good analogy can go a long way,” says Bryson.
For example, Hie believes their approach can be applied to drug resistance. “Think of a cancer protein that acquires resistance to chemotherapy, or a bacterial protein that acquires resistance to an antibiotic,” he says. These mutations can again be thought of as changes in meaning. “There are many creative ways to start interpreting language models.”
“I think synthetic biology is on the cusp of a revolution,” says Madani. “We are now moving from just collecting data to learning to understand it in depth.”
Researchers are watching the progress of NLP and brainstorming new analogies between language and biology to take advantage of it. But Bryson, Berger, and Hie believe the cross-fertilization could go both ways, with new NLP algorithms inspired by concepts from biology. “Biology has its own language,” says Berger.