How a scientist tested the AlphaFold AI on chemistry

Artificial intelligence has changed the way science is done by enabling researchers to analyze the vast amounts of data that modern scientific instruments generate. It can find a needle in a million haystacks of information and, with the help of deep learning, it can learn from the data itself. AI is accelerating progress in gene hunting, medicine, drug design and the creation of organic compounds.

Deep learning uses algorithms, often neural networks trained on large amounts of data, to extract information from new data. It is very different from traditional computing with its step-by-step instructions. Instead, it learns from data. Deep learning is much less transparent than traditional computer programming and leaves open important questions: what has the system learned, what does it know?
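The contrast between step-by-step instructions and learning from data can be illustrated with a toy sketch in Python. This is not a deep network, and it has nothing to do with how AlphaFold works internally; it is a single pair of numbers fitted by gradient descent. The temperature-conversion task, the function names and the training settings are all assumptions chosen for illustration.

```python
# Traditional programming: the rule is written out explicitly.
def fahrenheit_rule(celsius):
    return 1.8 * celsius + 32


# Learning from data: start knowing nothing and adjust two numbers
# (w and b) until the predictions match the examples.
def learn_linear(examples, steps=20000, learning_rate=0.002):
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(steps):
        # Average gradient of the squared error over all examples.
        grad_w = sum((w * x + b - y) * x for x, y in examples) / n
        grad_b = sum((w * x + b - y) for x, y in examples) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data: a few Celsius/Fahrenheit pairs.
examples = [(c, fahrenheit_rule(c)) for c in (-10, 0, 10, 20, 30)]
w, b = learn_linear(examples)
print(f"learned: fahrenheit = {w:.3f} * celsius + {b:.3f}")
```

Here the learned numbers can be read off and compared with the known rule. In a deep network with millions of parameters, that kind of inspection is no longer straightforward, which is why the question of what the system has learned is hard to answer.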

As a chemistry professor, I like to design tests with at least one difficult question that stretches students’ knowledge to determine whether they can combine different ideas and synthesize new ideas and concepts. We came up with such a question for the poster child of AI advocates, AlphaFold, which solved the protein folding problem.

Protein folding

Proteins are present in all living organisms. They give structure to cells, catalyze reactions, transport small molecules, digest food and do much more. They are made up of long chains of amino acids, like beads on a string. But for a protein to do its job in the cell, it must twist and bend in a complex three-dimensional structure, a process called protein folding. Misfolded proteins can lead to disease.

Christian Anfinsen posited in his 1972 Nobel Prize in Chemistry acceptance speech that it should be possible to calculate the three-dimensional structure of a protein from the sequence of its building blocks, the amino acids.

Just as the order and spacing of the letters in this article give it meaning and message, so the order of the amino acids determines the identity and shape of the protein, which results in its function.

Due to the inherent flexibility of the amino acid building blocks, a typical protein can take an estimated 10 to the power of 300 different shapes. This is a huge number, more than the number of atoms in the universe. But within a millisecond, each protein in an organism will fold into its own specific shape: the lowest-energy arrangement of all the chemical bonds that make up the protein. Change just one amino acid out of the hundreds of amino acids typically found in a protein and it can misfold and stop working.
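To get a feel for the size of that number, here is a rough back-of-the-envelope sketch in Python. Only the estimate of 10 to the power of 300 shapes comes from the text above. The figure of roughly 10 to the power of 80 atoms in the observable universe is a commonly cited estimate, and the rate of a trillion shapes tried per second is an arbitrary assumption made purely for illustration.

```python
import math

# All quantities are handled as powers of ten (log10 values).
SHAPES_EXP = 300           # from the article: ~10^300 possible shapes
ATOMS_EXP = 80             # assumption: commonly cited ~10^80 atoms
TRIES_PER_SECOND_EXP = 12  # assumption: hypothetical 10^12 shapes per second

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def orders_of_magnitude_gap(larger_exp, smaller_exp):
    """How many powers of ten separate the two quantities."""
    return larger_exp - smaller_exp


def log10_years_to_try_all(shapes_exp, rate_exp):
    """log10 of the years needed to try every shape once at the given rate."""
    return shapes_exp - rate_exp - math.log10(SECONDS_PER_YEAR)


gap = orders_of_magnitude_gap(SHAPES_EXP, ATOMS_EXP)
years_exp = log10_years_to_try_all(SHAPES_EXP, TRIES_PER_SECOND_EXP)

print(f"Shapes outnumber atoms by a factor of about 10^{gap}")
print(f"Trying every shape would take about 10^{years_exp:.0f} years")
```

Under these assumptions, an exhaustive search would take unimaginably longer than the age of the universe, so a protein that folds within a millisecond cannot be trying every shape in turn.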


For 50 years, computer scientists have been trying to solve the protein folding problem, with little success. In 2016, DeepMind, an AI subsidiary of Google’s parent company Alphabet, started its AlphaFold program. It used the Protein Data Bank as its training set, which contains the experimentally determined structures of more than 150,000 proteins.

In less than five years, AlphaFold had solved the protein folding problem, or at least the most useful part of it: determining a protein’s structure from its amino acid sequence. AlphaFold does not explain how proteins fold so quickly and accurately. It was a big win for AI: it not only earned tremendous scientific prestige but was also a major scientific advance that could affect everyone’s lives.

Thanks to programs like AlphaFold2 and RoseTTAFold, researchers like me can now determine the three-dimensional structure of proteins from the sequence of amino acids that make up the protein – at no cost – in an hour or two. Before AlphaFold2, we had to crystallize the proteins and solve the structures using X-ray crystallography, a process that took months and cost tens of thousands of dollars per structure.

We also now have access to the AlphaFold Protein Structure Database, where DeepMind has deposited the 3D structures of nearly all proteins found in humans, mice and more than 20 other species. To date, they have solved over a million structures and plan to add 100 million structures this year alone. The knowledge of proteins has skyrocketed. The structure of half of all known proteins is likely to be documented by the end of 2022, including many new unique structures associated with new useful functions.

Think like a chemist

AlphaFold2 was not designed to predict how proteins would interact with each other, but it has been able to model how individual proteins combine to form large complex units composed of multiple proteins. We had a challenging question for AlphaFold: did the structural training set teach it some chemistry? Could it tell if amino acids would react with each other – a rare but important event?

I am a computational chemist interested in fluorescent proteins. These are proteins found in hundreds of marine organisms such as jellyfish and coral. Their glow can be used to illuminate and study diseases.

There are 578 fluorescent proteins in the Protein Data Bank, 10 of which are “broken” and do not fluoresce. Proteins rarely attack themselves, a process called autocatalytic post-translational modification, and it is very difficult to predict which proteins will react with themselves and which won’t.

Only a chemist with significant knowledge of fluorescent proteins would be able to look at an amino acid sequence and pick out the proteins that have the right sequence to undergo the chemical transformations required to make them fluorescent. When we presented AlphaFold2 with the sequences of 44 fluorescent proteins that are not in the Protein Data Bank, it folded the working fluorescent proteins differently from the broken ones.

The result surprised us: AlphaFold2 had learned some chemistry. It had discovered which amino acids in fluorescent proteins do the chemistry that makes them glow. We suspect that the Protein Data Bank training set and multiple sequence alignments allow AlphaFold2 to “think” like chemists and search for the amino acids needed to react with each other to make the protein fluorescent.

A folding program that learns some chemistry from its training set also has broader implications. What else can be gained from other deep learning algorithms by asking the right questions? Can facial recognition algorithms find hidden markers for diseases? Could algorithms designed to predict consumer spending patterns also find a propensity for petty theft or cheating? And most importantly, is this capability, along with similar leaps in ability in other AI systems, desirable?

Marc Zimmer is a professor of chemistry at Connecticut College.

This article is republished from The Conversation under a Creative Commons license. Read the original article.
