Nature didn’t evolve all the proteins we need, but maybe artificial intelligence can help. Jacob and Saloni explore how tools like AlphaFold and ProteinMPNN are helping researchers re-engineer proteins, to make them safer, more stable, and more effective. They talk about how new technologies could help make a long-sought vaccine against Strep A, which causes scarlet fever and rheumatic heart disease, and how similar tools have already led to breakthroughs against COVID and RSV.
Hard Drugs is a new podcast from Works in Progress and Open Philanthropy about medical innovation presented by Saloni Dattani and Jacob Trefethen.
You can watch or listen on YouTube, Spotify, or Apple Podcasts.
Saloni’s Substack newsletter: https://www.scientificdiscovery.dev/
Jacob’s blog: https://blog.jacobtrefethen.com/
Courses:
EMBL-EBI. AlphaFold: A practical guide https://www.ebi.ac.uk/training/online/courses/alphafold/
Articles:
Monica Jain et al. (2022) Exosite binding modulates the specificity of the immunomodulatory enzyme ScpA, a C5a inactivating bacterial protease. https://pmc.ncbi.nlm.nih.gov/articles/PMC9464890/
Jakki Cooney et al. (2008) Crystal structure of C5a peptidase https://www.rcsb.org/structure/3EIF
Hui Li et al. (2017) Mutagenesis and immunological evaluation of group A streptococcal C5a peptidase as an antigen for vaccine development and as a carrier protein for glycoconjugate vaccine design https://pubs.rsc.org/en/content/articlelanding/2017/ra/c7ra07923k
Lectures:
Rosetta Commons (2024) AlphaFold – ML for protein structure prediction
Rosetta Commons (2024) MPNN – ML for protein sequence design
Acknowledgements:
Aria Babu, editor at Works in Progress
Graham Bessellieu, video editor
Rachel Shu, on-site editor
Anna Magpie, fact-checking
Abhishaike Mahajan, cover art
Atalanta Arden-Miller, art direction
David Hackett, composer
Works in Progress & Open Philanthropy
Transcript
Jacob Trefethen:
Proteins are abundant in our body, doing lots of different and amazing things to keep us alive. They are a big part of medicines too, like synthetic insulin, which we talked about last episode, antibodies, and protein vaccines – just to name a few. And over the last decade, it’s become possible to use AI to predict the structure of proteins much more accurately, and use that knowledge to design new proteins that have never been seen before in nature, improve existing proteins to use for medicines, other industries, agriculture and more.
Saloni Dattani:
Welcome to Hard Drugs. I’m Saloni Dattani, and this is Jacob Trefethen. I’m a co-founder at Works in Progress magazine and Jacob leads science and global health R&D funding at Open Philanthropy. And in today’s episode we’re talking about proteins – how to improve proteins and how to design new proteins.
Jacob Trefethen:
Last episode, you taught me about insulin – a protein that I know is pretty useful and the body makes naturally, and that’s also useful if you make it outside of the body, and provide it as a medicine for people who need it.
Saloni Dattani:
Right. It’s used for diabetes – insulin helps control blood sugar levels. So in the episode we just did, we talked about how it used to be extracted from animals in the early 20th century. Then, in the 1970s, people developed a way to reproduce it in bacteria that were used in this giant bacterial churning soup machine [a bioreactor], in such a way that you could scale up this protein much more than you could with animals – that are being factory farmed – so that they can be used for many more diabetes patients around the world.
Jacob Trefethen:
That was a game changer over the last fifty years. Today, we want to speed through to the last five years, really, and talk about some game-changers of new AI technologies that help you not just take a protein nature has designed and reproduce it, but actually tweak or improve a protein that nature has designed for an even more useful medical purpose. That really started getting more and more possible with tools invented since 2020.
Saloni Dattani:
Whoa, that’s very recent.
Jacob Trefethen:
It’s very recent and people probably have heard of AlphaFold. AlphaFold2 came out, or was first used, in 2020. And some of the other tools that help you improve on existing proteins were made since then, in 2022 and after.
Saloni Dattani:
But first...
Jacob Trefethen:
Yes.
Saloni Dattani:
Why proteins? Why are proteins so cool? We talked about it a little bit in a previous episode.
I think there are a few things. One is, a single protein could be doing lots of different things. You can have modular parts of proteins that are doing a bunch of things together: maybe one part is doing an enzymatic reaction, another one is binding to something that tells it how to speed up that reaction or to slow it down. Maybe there are some other signalling parts that are like, “Hey, protein, stop working now.” and stuff like that. So that’s one.
I think the other is, yeah, it reacts to the environment – changes in temperature, or acidity, or things like that could change how much the protein is doing something. Also they’re quite small. If you were a chemist who was trying to make some reactions happen, and you’re doing a series of reactions in a bunch of machines, a protein can do that at a tiny scale. This tiny protein is doing all that stuff. It’s super specific as well. It could bind to a tiny molecule, or a metal, or another protein or another receptor or so many different things, and it’s super specific. So, lots of reasons.
Jacob Trefethen:
Lots of reasons. And I want to convince you now that that’s not enough in that-
Saloni Dattani:
Ooh. I wasn’t expecting that.
Jacob Trefethen:
-nature has not evolved all of the proteins we might find useful in our bodies, and that’s why these new AI tools are pretty useful.
Saloni Dattani:
Okay, so the potential uses of proteins are very, very large, I think.
Jacob Trefethen:
Yes.
Saloni Dattani:
But the ones that we see in nature are not necessarily, and maybe they’ve only evolved to do certain things in certain environments. Is that right?
Jacob Trefethen:
Yep, exactly. And I want to make it specific by talking about... like we had insulin last episode, let’s talk about a medicine that we don’t yet have that we might be able to make with the help of new tools.
Saloni Dattani:
What’s the protein?
Jacob Trefethen:
I picked one called ScpA.
Saloni Dattani:
ScpA.
Jacob Trefethen:
This is a protein which is on the outside of a bacteria – the bacteria is Strep A. As some of my friends are well aware, I am a bit-
Saloni Dattani:
You’re a fan boy.
Jacob Trefethen:
I’m a fan boy, or really, I hate-
Saloni Dattani:
Well, you’re a hater.
Jacob Trefethen:
I’m a hater of Strep A, and the reason is that Strep A, as a bacteria, kills half a million - or maybe more than half a million people a year, because it can lead to different diseases if you get repeated infections.
So in particular, it leads to rheumatic heart disease, which used to kill a lot of people in the US, the UK, and other high income countries; still kills a lot of people in lower and middle income countries. As a side note, Strep A is also the bacteria that- have you ever seen that flesh eating stuff that goes up your leg and then eats your whole body?
Saloni Dattani:
Ooh. Well, I guess I’ve seen a few different diseases that do that, but yeah, that’s scary.
Jacob Trefethen:
Yeah, Strep A is one of the biggies.
Saloni Dattani:
And wait, doesn’t it also- isn’t that also Strep throat?
Jacob Trefethen:
Yes, Strep throat - and Scarlet fever and different names for these diseases - yeah.
Saloni Dattani:
I think the one thing that I know about this is: the bacteria maybe first causes an infection of the throat - so it first causes Scarlet fever or Strep throat - and then maybe later on, people develop heart disease from it.
Jacob Trefethen:
Yes, exactly. Because what happens is, your body’s making an immune response to that bacteria that can actually get confused and start attacking your own heart valves, which... not so good.
Saloni Dattani:
That’s terrible. Yeah, you also wouldn’t usually- I think the classical way people think about infections is not that they’re linked to heart disease, but in fact, there are many ways that they can be connected.
Jacob Trefethen:
Absolutely. So, we don’t have a vaccine against this. It would be pretty nice if we had one. The way that you make vaccines often, in a modern context, is you try and take protein antigens from a given pathogen.
Saloni Dattani:
What is an antigen?
Jacob Trefethen:
An antigen is a part of, in this case, this bacteria, that prompts an immune response in your body. So there’s many different things that are involved in Strep A the bacteria reproducing and living, but some of them are better things to target, for your antibodies and other parts of your immune system, than others.
Saloni Dattani:
So if I was trying to recognise you-
Jacob Trefethen:
Yes.
Saloni Dattani:
-from far away, it would be more helpful for me to know what your face looks like than what your clothes look like. Because you might change your clothes.
Jacob Trefethen:
That’s exactly right. And sometimes if I’m really stealthy, I might change my face, but that is a bit more costly.
Saloni Dattani:
Ooh. I used to watch these- well, I didn’t watch them- but on Indian television, there were soap operas that would be ongoing for years or decades. Sometimes the actors would not be interested in working on that show anymore, and they’d leave the show, and the producers would replace the actor with someone else!
Jacob Trefethen:
That’s what we call a stealth pathogen.
Saloni Dattani:
So they would have to explain what happened, and usually it would be like they had cosmetic surgery, or they got into an accident. So they write off the old actor from the show, but they still want the character in the show, so they just get someone else to play them. And they’re like, “Oh... now they look different.”
Jacob Trefethen:
I’m always impressed that the main characters in Harry Potter were still the main actors by the end because that was what a decade of- anyway, so someone doing good.
Saloni Dattani:
That’s impressive.
Jacob Trefethen:
Well done.
Jacob Trefethen:
So I’m Emma Watson.
Saloni Dattani:
You’re Emma Watson?!
Jacob Trefethen:
No, I’m not, I’m actually Jacob, but I am Strep A. That’s what I-
Saloni Dattani:
Okay- oh??
Jacob Trefethen:
-and we are trying to find antigens on the outside of Strep A that could be used as a vaccine, or as part of a vaccine. There’s a lot of attempts to do this already, so you might not end up needing AI, but AI is starting to help. And I’ll walk you through an experience I had, of sort of playing around with these tools to get to grips with them a bit more.
So I went to visit the University of Washington in March this year, and got to spend a couple weeks at the Institute for Protein Design, which focuses on a lot of protein design tools and has invented some themselves. And while I was there, I focused on this one antigen – ScpA – and tried to use a tool called ProteinMPNN to revise it, to make an even better antigen, or immunogen, for the immune system.
Saloni Dattani:
Okay. So let’s say again, I am trying to recognise you from far away.
Jacob Trefethen:
Yes.
Saloni Dattani:
And I know what your face looks like.
Jacob Trefethen:
Yes.
Saloni Dattani:
Why would I want to improve that?
Jacob Trefethen:
Okay, so let’s start with what ScpA looks like. This is a protein that’s stuck on the outside of the bacteria. Now that’s a good start, because your antibodies can access what’s on the outside of a bacteria much more than they can access what’s on the inside.
Saloni Dattani:
Right.
Jacob Trefethen:
So that’s a good start.
Saloni Dattani:
They don’t fit in. They can’t get in.
Jacob Trefethen:
They don’t get in. Now, you might say, okay, great, why don’t I just take that thing and use it as a vaccine? Well, you could try that, and people do try that. You can make changes, though, that improve on the properties of that protein from the point of view of making an actual vaccine - making a product that you could inject.
For example, any given protein, it’s not guaranteed to be that easy or cheap to make. You can make it in different systems - different bacteria, in yeast - and if you have to make a protein in mammalian cells, for example, instead of in yeast or in bacteria, that’s going to be more expensive. That’s not great from a product development point of view.
It also might not be as stable as it is in its natural occurrence, on the outside of a bacteria. If you’re plucking it out of this membrane and you’re just putting it in and, well, are you sure that’s going to not just clump up together with other versions of itself if you’ve got a lot of them? Are you sure it’s going to be soluble in water, which is a required property for a vaccine? And are you sure it’s going to not deform into something, once it’s not plugged in in the same way?
So you want to make alterations to that too. And finally, are you sure if, on its own, it’s going to be as immunogenic as when it was in its natural form on the outside. “Immunogenic” in the sense of: prompting the immune response we’re looking for. This particular antigen, it’s about a thousand amino acids long.
Saloni Dattani:
Okay. That’s like moderate, I guess.
Jacob Trefethen:
Yeah, it’s sort of medium-big, yeah.
Saloni Dattani:
Yeah, we talked about some tiny proteins before and they were like 20 to 30 amino acids. Then we talked about a huge protein, titin, which is like 30,000 - 33,000 something amino acids.
Jacob Trefethen:
Yep, so this one’s sort of in between those. And if you can get away with it, it might be nicer to take only the bits that really matter for the immune response, and make it smaller.
If you can still prompt an immune response, it’d be kind of nice if it was only 200 amino acids long – it might be cheaper to produce, for example. It might be- if you put it on a scaffold of a thing, of a soccer ball, and you want to put lots of different antigens on that soccer ball to prompt an immune response - if it’s smaller, you might be able to fit more on.
Saloni Dattani:
So I’m unfortunately still thinking about the analogy where I’m trying to recognise your face, and now this soccer ball has little Jacob faces all over it.
Jacob Trefethen:
Well, you would definitely recognise that.
Saloni Dattani:
It would be scary though.
But also, this makes me think of, okay- in history, we didn’t have vaccines that were just one antigen. They weren’t just one part of the thing. It was usually the entire virus, or the entire bacterium. And that would be killed in some way, maybe it’s chemically inactivated, or it’s attenuated - so it’s put into cell culture and made to evolve into something that doesn’t infect us or cause harm - and then we’re using the whole pathogen. And this is very different; this is a very precise part of the pathogen.
And it turns out, sometimes we only need that to recognise the whole pathogen. And maybe that’s also useful because there are other parts of the pathogen that are harmful to us in some ways.
Jacob Trefethen:
Exactly. In this case, with Strep A, if you know that the whole pathogen prompts an immune response that might hurt your own heart, then you sure enough want to get rid of some of it.
Saloni Dattani:
Right. And then, are there other reasons that we would want that? What is this bacteria doing? What are the proteins doing?
Jacob Trefethen:
Well, actually in this case, yes! So I’m going to take a quick detour that’s not central to the point. Are you okay to bear with me?
Saloni Dattani:
Yes.
Jacob Trefethen:
Okay, great. This particular antigen is kind of messed up. What it does is it hangs out on the outside of the bacteria and it’s a peptidase. Imagine it kind of looks like - for people watching the video. [crocodile jaw clapping sounds]
Saloni Dattani:
Oh, it’s like a crocodile.
Jacob Trefethen:
It’s like a crocodile. And what happens is that it’s evolved to cleave, or chop in half, say, some signalling proteins that our body sends to the pathogen to actually recruit even more immune proteins, so C5a. If you send those immune proteins, and they’re going to bring some buddies, it’s going, “Nope. Nope.”
Saloni Dattani:
Oh my god. So wait, this bacterium is trying to infect me, and then my immune response is trying to attack it in response, and it’s sent out all of these immune cells to attack it. But the signals just get cut up.
Jacob Trefethen:
They’re just getting cut up.
Saloni Dattani:
That’s really sad.
Jacob Trefethen:
It’s so sad because this is one of those things where our immune system is pretty good, and it is getting activated, and it’s going to try and mess up that bacteria, and the bacteria is like, “Hmm, no you don’t.”
Saloni Dattani:
Oh, that’s scary. So if we could somehow change this protein or this bacterium, we could find a way for us to recognise it without it cutting up those signalling proteins.
Jacob Trefethen:
Correct. Now, the reason this is a detour on the main point is that, that is not that hard. You can actually just look at this cleaver in question and, imagine it’s the crocodile jaws, you just pick the amino acid residue that’s most at the jaw.
Saloni Dattani:
So you put something in- you’re saying you could block the cutting by putting something into it.
Jacob Trefethen:
Yeah, essentially, but just changing the string of amino acids, so that at the position that’s usually histidine-193, you put leucine, or I forget, you put a different amino acid. You actually only have to make one change to this thousand amino acid long protein, and it will no longer perform the harmful function.
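[A minimal sketch, in Python, of the single-residue swap described here. The sequence is a made-up toy, not the real ScpA sequence; the function name `substitute` is hypothetical, and alanine is an arbitrary choice of replacement, since the exact substitution is not stated above.]

```python
def substitute(sequence: str, position: int, expected: str, replacement: str) -> str:
    """Return a copy of `sequence` with one residue swapped (1-indexed position)."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    # Guard against editing the wrong residue: check what is there first.
    if sequence[index] != expected:
        raise ValueError(
            f"expected {expected} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


# Toy 10-residue sequence (made up for illustration, not ScpA).
toy = "MKTAHLLDGS"
mutant = substitute(toy, 5, "H", "A")  # swap the histidine at position 5 for alanine
print(mutant)
```

[Checking the expected residue before swapping is a cheap way to catch an off-by-one in the numbering, which matters when only one position out of a thousand changes.]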
Saloni Dattani:
That’s amazing.
Jacob Trefethen:
You don’t need AI for that. You can validate that you’ve neutralised that aspect of it.
So just stepping back, we can get to a pretty useful initial test of a vaccine without AI, where what we did was, we said: we got this bacteria that’s invading us, and we don’t like. We took one of the things on the outside that antibodies do bind to, and we’re just going to inject that after changing a residue, so it doesn’t hurt us.
What will happen if that works? Well, the immune system broadly- but let’s just visualise antibodies- will bind that and you’ll generate an adaptive immune response that remembers how to produce those antibodies. If you get an infection of Strep A later, those antibodies are going to come and hit that bit of it.
Saloni Dattani:
So basically, I recognise your face, and if you were in a different place, like in a sand pit on a beach; you’ve dug up a hole...
Jacob Trefethen:
Who put me there?
Saloni Dattani:
I don’t know, maybe you did it yourself. Don’t people like doing that? And then maybe you’re inside, and only your head is sticking out, and I’m like, “That’s Jacob.” I know that it’s you and I don’t need to see the whole body to know that.
Jacob Trefethen:
Yes. And actually, in the context of the immune system, it’s a good metaphor because you’re trying to neutralise me, so you probably brought a bow and arrow, and if you shoot me in the head - that will actually kill all of me. I’m neutralised, so you don’t have to shoot my hands, you don’t have to shoot my legs. You got that kill shot in the head, so well done. And while I was in this sand! But I couldn’t-
Saloni Dattani:
Well, yeah, you probably couldn’t move at that point.
Jacob Trefethen:
I couldn’t move; that wasn’t my fault.
Jacob Trefethen:
So...
Saloni Dattani:
So.
Jacob Trefethen:
AI.
Saloni Dattani:
Wait, okay, wait, this protein- we just have the one protein. I think the other thing that’s really helpful about this is that - it’s not the entire bacteria that’s like harming us; it’s not invading different parts of our body, and things like that. We’ve just honed in on your face.
And that’s really useful and that’s like- okay, we can recognise this when it comes later on. Then maybe, if there are other types of... if you had an alter ego who shared your face, or if there was a different species, or there was a different strain that had some similarities like your face, but different body, then I would still be able to recognise that as well.
Jacob Trefethen:
You could neutralise jacked Jacob, you could neutralise short Jacob. Absolutely.
Saloni Dattani:
Cool. Okay, so we talked about protein vaccines and we can improve them. This is not good enough?
Jacob Trefethen:
Yes, so I mean, it may be, I think in this particular case, it probably wouldn’t be. You might want to improve that head of mine; maybe give me a little bit of Botox or something, so that I’m even more recognisable to your recognition system.
So there’s this new tool, ProteinMPNN. ProteinMPNN was made in the Baker lab, by David Baker, who just won the Nobel Prize alongside the AlphaFold inventors, and by some students in particular: Justas Dauparas, I think, was one of the lead authors. It basically takes the structure of a protein and predicts what sequence of amino acids leads to that structure.
So the structure in the sense of- think visually, 3D, where do the residues - where do the carbon atoms in this backbone - actually appear once the protein’s all folded up? And sequence, think: what is the string of amino acids?
Saloni Dattani:
So I’m imagining a bead of strings [beads on a string], which is the protein, and the bead of strings is folded up into this larger structure. Maybe it’s a knitted ball like a kitten plays with, or something, and it’s not as symmetrical or anything. And we are trying to predict exactly what amino acid is each of the beads. We know what the string of beads looks like - what that shape of the folding is - but we’re trying to predict each of the beads.
Jacob Trefethen:
Correct. Exactly right. And we may not have a- there may be multiple strings that end up folding up to kind of similar- so you’re not always just trying to predict “What’s the exact string?” You’re often trying to give me some hypotheses here that you then want to test. In the case of this antigen we’ve been talking about, you can say: I want to make this smaller; it’s currently made of five domains – which are these subunits of the protein that self-assemble and then assemble together: Could I generate the same immune response for just two of them, the two most important ones?
So let’s just chop off- let’s just chop off the other three domains and look at these two together. Then I’m going to ask ProteinMPNN. I’m going to say, and I did this when I was at the University of Washington, I was like, okay, here’s what it looks like; here’s what I want it to look like...
Saloni Dattani:
... How do I make that?
Jacob Trefethen:
How do I make that? Once you chop off some of those domains, you can’t immediately use what’s left because there will be amino acid residues that have evolved to be in particular points inside the protein that are doing great. But if they’re exposed, some of those, if they’re exposed, will be hydrophobic, which means that what you’ve just created is not going to be soluble. So you’re going to want to mutate the amino acid sequence a little bit so you’re not exposing, for example, hydrophobic residues.
Saloni Dattani:
So I remember learning a little bit about protein structure, and one thing that I remember is, okay, so if a protein is soluble, then the outside has to be attracted to water, “hydrophilic”. But the inside is usually “hydrophobic”. And that makes it unlike oil, say, so if it was hydrophobic on the surface, then it wouldn’t dissolve.
Jacob Trefethen:
But we want it to dissolve. So we’re going to have to make some changes.
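[A rough Python sketch of flagging hydrophobic residues in a sequence, using the Kyte-Doolittle hydropathy scale. The threshold of 1.5 and the toy sequence are assumptions for illustration. Whether a residue is actually exposed depends on the 3D structure, which this sketch ignores.]

```python
# Kyte-Doolittle hydropathy values: higher means more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_positions(sequence, threshold=1.5):
    """List (1-indexed position, residue) pairs whose hydropathy exceeds the threshold."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence, start=1)
        if KYTE_DOOLITTLE[residue] > threshold
    ]


# Made-up fragment, standing in for a surface left bare after trimming domains.
toy = "MKVLDESG"
print(hydrophobic_positions(toy))
```

[In a real workflow the flagged positions would be cross-checked against the structure, so that only residues newly exposed to water are candidates for change.]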
Saloni Dattani:
I guess, well, if I’m thinking about the environments of a protein, it’s usually in the blood or in cells. It’s in these places where there’s a lot of water content.
Jacob Trefethen:
I mean, most of life’s chemistry happens in aqueous solutions, so we got to be ready for that.
Saloni Dattani:
So that would be helpful not just to cut it down so you could only make two domains, but just generally. Let’s say you just had a mystery protein in front of you, and you’re thinking, how do I recreate this? I would maybe imagine- I don’t know if people actually do this, but I would just imagine, I don’t know, I’m a pharma company, or I’m a biotech company, and I found a protein and it’s doing something really cool. Or I found my competitor’s protein, and I’m like, how did they make this? I want to make this.
Then I would use this tool to try to figure out the letters that make it. And you know what, I really like the name “ProteinMPNN”, because I’m thinking it’s trying to predict each of the amino acids in the bead of strings [beads on a string], and you represent each of the amino acids with a letter, right? So it’s like “M-P-N-N.” Does that stand for an amino acid chain? Maybe it does.
Jacob Trefethen:
I wish... I’m going to look it up.
Saloni Dattani:
Does it?
Jacob Trefethen:
It does!
Saloni Dattani:
Oh my god, it does. Wow. What does it stand for?
Jacob Trefethen:
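It stands for methionine (M), proline (P), asparagine (N), and asparagine (N). [The letters do match those one-letter amino acid codes, but the name itself stands for “message passing neural network”.]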
Saloni Dattani:
[sound effects of an exploding brain] Whoa, that’s very smart. So yeah, I mean that’s a really good way to think about it. You’ve got the structure now you’re like: what amino acid sequence makes this? And as you said, there could be multiple amino acid sequences that make that particular structure.
I think what this tool is doing is that, at each point of the beads on the string, maybe it’s predicting one amino acid at a time, and it’s sort of going, “Maybe let’s try from the start of the string, and I think that’s aspargine [sic]. And okay, that’s that, and so given I know that, what could the next one be?” And you’re building it up, by thinking about the larger structure, but also, now that you have already predicted some of the previous amino acids on that string, that gives you a bit more information about what the next amino acid could be.
So you’re using the information that you already have from the neighbours to predict the next one. But sometimes that probably leads to you getting stuck in dead ends, sometimes it doesn’t work, and maybe that’s why it produces many predictions, and then some of them won’t be that accurate.
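[A toy Python sketch of the one-residue-at-a-time idea described here. The scoring function `toy_scores` is a hypothetical stand-in, not ProteinMPNN: a real model would score each amino acid with a neural network that looks at the target backbone structure, which this toy omits entirely.]

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_scores(prefix, position):
    """Hypothetical stand-in for a trained model: one score per amino acid.

    It only discourages repeating the previous residue.
    """
    return [(-2.0 if prefix and aa == prefix[-1] else 0.0) for aa in AMINO_ACIDS]


def sample_sequence(length, score_fn, rng, temperature=1.0):
    """Build a sequence one residue at a time, conditioning on earlier picks."""
    chosen = []
    for position in range(length):
        scores = score_fn(chosen, position)
        # Turn scores into sampling weights; lower temperature is greedier.
        weights = [math.exp(score / temperature) for score in scores]
        chosen.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(chosen)


# Each seed gives one candidate; sampling several yields several hypotheses.
candidates = [sample_sequence(12, toy_scores, random.Random(seed)) for seed in range(3)]
for candidate in candidates:
    print(candidate)
```

[Sampling rather than always taking the top-scoring residue is what produces several different candidate sequences for the same target, some better than others.]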
Jacob Trefethen:
It could be mispredicting the structure. And what a lot of people who work on protein improvements like this want to do next is, to run a check on: hold on, is this going to end up folding up like I thought it was? And I think you probably know how they do that.
Saloni Dattani:
Ooh, is it AlphaFold?
Jacob Trefethen:
It’s AlphaFold.
Saloni Dattani:
That makes sense. Okay, so just recapping. ProteinMPNN is predicting the amino acid sequence from the structure.
Jacob Trefethen:
Yes.
Saloni Dattani:
And I knew that AlphaFold predicts the structure from the sequence. So if you have predicted the sequence and then you’re like: Wait, does this actually fold into that structure that I started out with? You would use AlphaFold then.
Jacob Trefethen:
Exactly. You can kind of validate: does this fold up how I want it?
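[A minimal Python sketch of that check: compare each predicted structure against the target and keep the designs that stay close. The coordinates are made up, the 2.0 angstrom cutoff is an assumption, and the predicted coordinates are hard-coded here rather than produced by a structure predictor such as AlphaFold. The function names are hypothetical.]

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superimposed; real comparisons
    align them first.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def keep_close_designs(target, predictions, cutoff=2.0):
    """Return names of designs whose predicted structure is within `cutoff` of the target."""
    return [name for name, coords in predictions.items() if rmsd(target, coords) <= cutoff]


# Made-up coordinates for three backbone positions (units: angstroms).
target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predictions = {
    "design_1": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "design_2": [(0.0, 3.0, 0.0), (3.8, 3.0, 0.0), (7.6, 3.0, 0.0)],
}
print(keep_close_designs(target, predictions))
```

[A filter like this only narrows down the hypotheses on a computer; the surviving designs would still need to be made and tested in the lab.]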
Saloni Dattani:
Right. But it’s not truly validating, right? Because you would want that to be done in a lab.
Jacob Trefethen:
Absolutely right. And so what I did in this toy project was just the steps we just described. I did not have enough time to validate in the lab whether the tweaked protein molecule was what I thought it was, or was doing what I wanted it to.
But what’s amazing about these tools is that they’ve gotten so good that, in combination- you know, I was talking to PhD students and post-docs there who had sort of been in the lab before the deep learning revolution, been in the lab during and after the deep learning revolution, and they’re like: everything’s changed.
Now, they can generate, say, their fifty favourite protein sequences that are hypotheses, that they’ve validated with AlphaFold, but haven’t truly validated. And they can say, okay, I’m going to order those up online and I’m going to validate them in the lab... next week or the week after.
Saloni Dattani:
You mentioned the change between now and the grad students before.
Jacob Trefethen:
Yeah.
Saloni Dattani:
And that kind of reminded me of what it used to be like to do this. I was recently reading this book called The Code Breaker, and it’s just about Jennifer Doudna and how she developed CRISPR and stuff like that. But the author actually starts kind of much earlier on, and he describes what’s happening in the 1940s and ‘50s, when people are discovering the structure of DNA.
And this is relevant, because I think when Watson and Crick were trying to figure out the 3D structure of DNA, they had to actually use these physical models of atoms in a lab. They’d have these ball-and-stick type figures and then be like: okay, what angles are working, or what is going here? What is going there? What are the specific atoms that are making this DNA?
And they didn’t even know that it was a double helix at first; I think they thought it was a triple helix, and one of them had just misremembered the amount of water that was in the total molecule or something, and that got them confused for a bit.
But that’s sort of what would be happening here before, where if you’re trying to figure out the amino acid sequence of a structure initially, and I think this was maybe pre-2000s or something, people would be actually making physical models of the same thing. Does this work if I put this amino acid here, and then the next one there, and does it fit together? Is one of them negatively charged, and the other one also negatively charged? How does it all work? And that sounds extremely complicated when you have a much bigger protein or yeah, I mean it’s just complicated. And there are so many possibilities.
Jacob Trefethen:
So we no longer have to go physical ball and physical ball.
Saloni Dattani:
I think there were some things between that and now. There were some kind of statistical modelling techniques in the 2000s and 2010s, and then there were some AI tools, and deep learning, and stuff like that that were used as well.
But this is really different because I think it’s trained on data of the structures of proteins. What that means is, over time, people would be working in the lab to figure out exactly what a protein looks like with x-ray crystallography - so you’re crystallising the protein, you’re getting an X-ray image, you’re trying to determine what it actually looks like - or various other techniques.
And then they are trying to figure out, again, what does this look like? And because you would then figure out the coordinates of the different atoms, you then have a lot of data that’s collected over the last few decades that has gone into this database called Protein Data Bank, and I think that’s the biggest one.
Using all of that data – of the coordinates of the atoms, and which amino acid sequences there are, and maybe similar types of proteins, and things like that – you could then make these predictions a lot better. So I think that’s what this is doing. It’s this graph-based deep neural network. So it has data on the coordinates, which you’ve also put into it with the overall structure that you have, and then it’s like: which amino acid sequence is in each part of the chain?
Jacob Trefethen:
Yeah, and that data is- we’re talking over a hundred thousand - I think it was 170,000 - structures that AlphaFold2 trained on.
Saloni Dattani:
So that’s individual structures that people have worked out in the lab through these other methods, like x-ray crystallography and making these physical models-
Jacob Trefethen:
Which is crazy.
Saloni Dattani:
It’s a huge amount.
Jacob Trefethen:
Decades, probably fifty years of work of thousands, tens of thousands, of grad students, postdocs, professors - mostly on public funding.
Saloni Dattani:
I guess I’m imagining in the past, you might just be an individual person who’s like, “Let me try to guess what the amino acid sequence is based on stuff that I’ve learned.” and that’s much harder than if you just are this computer model that’s trained on so much of it. You don’t have to try to remember each of the structures.
Jacob Trefethen:
I can do it, but I know a lot of other people struggle. Yeah, they really did it different.
Saloni Dattani:
Okay, so quick recap. We have this protein in the streptococcus bacteria that we want to make and we want to change a bit. What was the change that you mentioned that we’re doing here?
Jacob Trefethen:
We’re going to make it smaller and just use a couple of its domains.
Saloni Dattani:
So we’re only using a few domains, and that should be enough for us to recognise it without causing other problems for us.
Jacob Trefethen:
That’s the hope.
Saloni Dattani:
Right. Alright. And then what’s next?
Jacob Trefethen:
Well, what you would do next – that I didn’t get to do because I had to move on, would be trying to validate what ProteinMPNN and AlphaFold have helped you get to as hypotheses, but validate them in the lab.
Saloni Dattani:
So we’ve got the domains that we want, we know the structure that we want, we then have figured out the amino acid sequence that leads to that, and then we want to check: does that sequence actually create that structure?
Jacob Trefethen:
Yes. And the asterisk I’d give though is that we have hypotheses of multiple sequences where they’re a little bit different because ProteinMPNN is giving us some hypothesis.
Saloni Dattani:
Right, and some of them might be wrong.
Jacob Trefethen:
Some of them might be wrong, some might be better than others, and a lot can change with just one amino acid change here or there.
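As a rough illustration of what "one amino acid change here or there" looks like when comparing candidate sequences, the sketch below lists the positions where a design differs from a reference, using the usual shorthand of original letter, position, new letter. The sequences are invented for the example and are not real ScpA or ProteinMPNN output, and `point_differences` is a hypothetical helper.

```python
def point_differences(reference: str, candidate: str) -> list[str]:
    """List substitutions between two equal-length amino acid sequences.

    Each difference is written as <original><1-based position><new>, e.g. "I6L".
    """
    if len(reference) != len(candidate):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        f"{ref}{position}{alt}"
        for position, (ref, alt) in enumerate(zip(reference, candidate), start=1)
        if ref != alt
    ]


# Invented sequences, purely for illustration.
reference = "MKTAYIAKQR"
designs = {
    "design_a": "MKTAYLAKQR",
    "design_b": "MRTAYIAEQR",
}

for name, sequence in designs.items():
    print(name, point_differences(reference, sequence))
```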
Saloni Dattani:
Right. As you said before, that one amino acid change means that it can no longer cut up our signalling proteins.
Jacob Trefethen:
So what you would do here- and as a spoiler, after I’d done all this thinking and computational stuff, I was like, you know, before I order these up, let me just check if someone has done this. And you’ll be amazed to hear that a group at Shandong University in China had already done all these experiments three years ago.
Saloni Dattani:
Aww.
Jacob Trefethen:
It’s terrible to have an idea.
Saloni Dattani:
So they had already come up with this slightly adjusted version of this protein with just those domains, and they had tested that it worked in a lab... and? Did they?
Jacob Trefethen:
They had done the subdomain analysis. I’m not sure which alterations they’ve made, and if they used AI for those alterations. But the basic punchline of what they did was- Let me walk through what validation would look like here.
At the very end of the day, the validation we care about is: if you inject this, or inhale it, or something, as a vaccine: will it protect you as a human being against a future Strep A infection? We’re not going to test that.
We’re going to test some earlier things to get an initial idea. So what you can do is, you can order - let’s say we had 50 hypothesis amino acid strings - you can order up the DNA that would code for each of those amino acid strings. Remember from our first episode, we go DNA to RNA to proteins, so we want the DNA sequence.
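Going from an amino acid string to DNA you can order can be sketched as a reverse translation. The version below is a simplification: it picks one arbitrary codon per amino acid from the standard genetic code, whereas a real order would usually be codon-optimised for the host organism. The names `PREFERRED_CODON`, `reverse_translate` and `translate` are hypothetical, and the three-letter input is a toy sequence.

```python
# One valid codon per amino acid from the standard genetic code.
# The particular choice of codon here is arbitrary, for illustration only.
PREFERRED_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "TCT", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTT",
}
STOP_CODON = "TAA"

# Inverse lookup; it only knows the 20 codons chosen above.
CODON_TO_AMINO_ACID = {codon: aa for aa, codon in PREFERRED_CODON.items()}


def reverse_translate(protein: str) -> str:
    """Return a DNA sequence that codes for the protein, ending in a stop codon."""
    return "".join(PREFERRED_CODON[aa] for aa in protein) + STOP_CODON


def translate(dna: str) -> str:
    """Read the DNA three letters at a time until the stop codon."""
    protein = []
    for start in range(0, len(dna), 3):
        codon = dna[start:start + 3]
        if codon == STOP_CODON:
            break
        protein.append(CODON_TO_AMINO_ACID[codon])
    return "".join(protein)


dna = reverse_translate("MKT")
print(dna)             # ATGAAAACCTAA
print(translate(dna))  # MKT
```

Because most amino acids can be written with several different codons, many DNA strings code for the same protein, which is why the provider or the researcher has to make a choice.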
Saloni Dattani:
So we’ve basically made each of the sequences that ProteinMPNN has spit out.
Jacob Trefethen:
Yes.
Saloni Dattani:
We’ve made a bunch of them.
Jacob Trefethen:
We’ve made a bunch of them, say, fifty, and we’re getting the DNA strings. You can order those online from your favourite provider, Twist Bioscience or IDT- you can get them delivered to the lab, maybe next week, it’ll be pretty quick. You can then take those DNA strings, and try and grow up the proteins by putting the DNA inside a living system - so say bacteria like E. coli or yeast or-
Saloni Dattani:
What is grow-? Why are you telling this protein to grow up?
Jacob Trefethen:
I’m telling YOU to grow up, Saloni! I am telling the protein to grow up because I need to use it in experiments myself.
Saloni Dattani:
But what is growing up? What is-
Jacob Trefethen:
Growing up is-
Saloni Dattani:
What is a baby protein?
Jacob Trefethen:
A baby protein is, well- more like a butterfly situation here, where I want the butterfly, so the caterpillar is a string of DNA, and I’m going to have to get a little cocoon going. That cocoon, in this case, is a bunch of yeast cells that I’ve put in a little vat, and I’m going to feed nutrients to and I’m going to jostle around for two days, and they’re going to have a good time in that.
Saloni Dattani:
Aww! That’s really cute.
Jacob Trefethen:
And by the end, I’m going to slice ’em up! And I’m going to take the protein-
Saloni Dattani:
This got really violent.
Jacob Trefethen:
I would never hurt a butterfly, but a yeast cell... watch out.
Saloni Dattani:
Okay, so you are trying to recreate this protein by growing it up in a bacterium.
Jacob Trefethen:
Yes.
Saloni Dattani:
From the DNA?
Jacob Trefethen:
Yes, exactly.
Saloni Dattani:
And you have now made lots of them.
Jacob Trefethen:
Now I’ve made lots of them, and here’s what I might do with them. Experiment number one: I would probably take a bunch of those proteins and inject them - as if they’re a vaccine - into some mice. And the mice are going to have some sort of response to that, or maybe they won’t.
The response I’ll first check is: take a sample of blood from them, I might say a month later and see, okay, did any antibodies get produced? And if so, sign number one we’re headed in the right direction. If not, uh-uh, this ain’t so good.
Saloni Dattani:
Uh oh. And well, you’re not just testing: does it produce any antibodies? You’re like: does it produce antibodies against the specific protein?
Jacob Trefethen:
Absolutely. Well, the next experiment I would do is against the specific- what I would do is: take a ‘wild type’ of the protein, so don’t take this thing I just grew up, get some Strep A bacteria and take this isolate somehow.
Saloni Dattani:
Right. So, does the mouse now create antibodies against the protein that I was trying to use as the target?
Jacob Trefethen:
Exactly. So I would take the serum, blood, and say, are the antibodies produced? Let’s say there were some produced, so it’s immunogenic. Are they cross-reactive to the wild type of the protein?
Saloni Dattani:
Right.
Jacob Trefethen:
So do they bind it tightly? Do they bind it a lot?
Saloni Dattani:
Yeah. So did your little protein vaccine actually help the mouse protect itself with antibodies against the natural protein?
Jacob Trefethen:
Yes, but I would go even one step further, which is, so that’s the next experiment, but to protect the mouse, you actually care about: is it protected against the bacteria? Not just the protein. The third experiment’s still in a dish.
I’m still taking my serum that has antibodies in, I’m putting it against the bacteria and I’m saying, does it bind the bacteria or neutralise bacteria? The best thing after that would be: okay, we injected this mouse and now we’re going to challenge it with some sort of - I know, I know - with a bacteria. And then some of them-
Saloni Dattani:
-are probably going to die.
Jacob Trefethen:
Are probably going to die, yeah.
Saloni Dattani:
How dangerous is this infection?
Jacob Trefethen:
For most humans? Not super dangerous. So probably true for most mice.
Saloni Dattani:
Okay. So you’ve now checked that a different research group has made this and validated it. And now you know that that protein that you improved could have worked.
Jacob Trefethen:
Yeah, basically it turns out that, of the five domains that make up that protein, if you take some subsets, it’s not looking good. You take some other subsets, it’s looking pretty good.
So it actually looks like it’s probably worth doing. If you take those subsets that look like maybe they work, you can then create a vaccine involving maybe just that, or involving other subsections of other proteins that might be antigens.
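For a sense of how many "subsets" there are to consider with five domains, here is a small sketch that enumerates them. The labels D1 to D5 are placeholders, since the conversation does not name the domains, and in practice a lab would likely test only a handful of these rather than all of them.

```python
from itertools import combinations

# Hypothetical labels: the transcript only says the protein has five domains.
DOMAINS = ["D1", "D2", "D3", "D4", "D5"]


def domain_subsets(domains):
    """Return every non-empty subset of domains, smallest subsets first."""
    subsets = []
    for size in range(1, len(domains) + 1):
        subsets.extend(combinations(domains, size))
    return subsets


candidates = domain_subsets(DOMAINS)
print(len(candidates))  # 31, i.e. 2**5 - 1
print(candidates[:7])
```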
Saloni Dattani:
That’s really cool.
Jacob Trefethen:
It’s really cool. It’s really cool.
Saloni Dattani:
But this would still just be the start of the whole process of developing a new vaccine. You would then need to test it in human clinical trials like- phase one, phase two, phase three- that could take another eight years, ten years, something like that.
Jacob Trefethen:
That’s right, and AI has not sped up that yet, so there is more to do.
Saloni Dattani:
Has this been done before? Are there drugs and vaccines already that have been improved with AI?
Jacob Trefethen:
There is a vaccine that was made during the COVID pandemic by Neil King, at the Institute for Protein Design, and David Veesler and students there. They used predecessor tools to design it around just the RBD, the receptor binding domain, of the spike protein.
So the vaccines that I got, at least, I got the- what did I get? I got Pfizer followed by Moderna maybe? And those are using the full spike protein and that was pretty good. But if you can get away with not using the full one, you might be able to do even better.
And sure enough, Neil King managed to say, I’m going to use the really important domain - the receptor binding domain only - and I’m going to display it on a nanoparticle, a bit like a soccer ball, and I’m going to stick out a lot of them, and it’s going to generate a really good antibody response-
Saloni Dattani:
That’s very cool.
Jacob Trefethen:
- and that happened within the last five years!
Saloni Dattani:
Wow.
Jacob Trefethen:
It’s wild. Yeah.
Saloni Dattani:
I’m also thinking, it’s kind of reminded me of the RSV vaccines. Were they also improved with AI, I think?
Jacob Trefethen:
The story I remember there is Jason McLellan at the VRC in 2013. My guess is that that main breakthrough for RSV pre-dated the deep learning revolution, but I don’t know the story of it. Probably did involve cryo-electron microscopy, which probably was helped by ML?
Saloni Dattani:
So I think what I remember is RSV - which is respiratory syncytial virus; it’s a lung infection that’s one of the most common reasons that infants go to hospital in the US. In the 1960s and ’70s, people tried to make RSV vaccines, and there was this protein on the surface of the virus that they used as a target, but it wasn’t working very well. It had a lot of side effects, and in some cases, it was actually making the infection much worse when people got infected.
So a lot of people just gave up at that point, and it was just seen as this- this is this unsolvable challenge: “We’re not going to be able to develop vaccines against RSV, sorry.”
Then something changed in the 2010s, which was that this type of electron microscopy became much better, and I think it was a software improvement that- you could then figure out, at a much higher resolution, what these proteins looked like. So what they figured out was: at the surface of the virus, the protein looks a certain way before it infects the cells, but a different way after it infects the cells, because it uses this protein to fuse to the cell and enter it. Unfortunately, the previous vaccines were using the ‘after’ version of the protein.
Jacob Trefethen:
The post-fusion, yeah.
Saloni Dattani:
But that doesn’t work. Because if the virus is swimming around in your blood, and your immune cells only know what it looks like after it’s entered, well, that’s too late. You need to figure out what it looks like before.
So they figured out what it looked like before it entered the cell, and that is called the ‘pre-fusion’ version of the protein. In order to remake vaccines with that version, I think they used AI to introduce stabilizing mutations and keep it that way.
And now, we have at least three RSV vaccines that have been approved in the last two or three years. This breakthrough, in the 2010s with microscopy, meant suddenly we know how to design an RSV vaccine now. So there were multiple people who were like: “Oh, well, now we can do it.” And so it wasn’t just one person who made the breakthrough, but I think this was the big breakthrough- was the microscopy-
Jacob Trefethen:
And that big breakthrough-
Saloni Dattani:
-and AI.
Jacob Trefethen:
And the microscopy breakthrough happened on public funding at the NIH Vaccine Research Center.
Saloni Dattani:
Whoa.
Jacob Trefethen:
And then the vaccines that- there are now many vaccines-
Saloni Dattani:
Saving a lot of babies.
Jacob Trefethen:
And saving a lot of babies. So that’s my pitch, implicit pitch, that often you need a breakthrough. Often, it takes researchers who aren’t trying to go after a product, they’re actually trying to understand something.
Saloni Dattani:
And something might look like an unsolvable challenge for decades and then something changes and now three people can do it.
Jacob Trefethen:
So, stepping back to summarise the whole story here. We started with an invader we want to make a vaccine against. We took a protein that might be an antigen, and we started tweaking that protein with the help of a couple of AI systems - ProteinMPNN and AlphaFold - to come up with some hypotheses of sub-units of that protein, smaller versions of it, that could be vaccines. The initial results are that maybe some of them actually look promising, and should be taken further and explored further.
Saloni Dattani:
Right, and so where is the one that you talked about, the ScpA? Is that in clinical trials right now, what’s going on?
Jacob Trefethen:
Not yet, because everything is slower than it should be in vaccine design, especially for global health. But I would say it’s one of probably the top 10 antigens that people are exploring pretty seriously. People are looking at combinations of those antigens in four or five different proteins, in combination with other adjuvants that help prompt an even stronger immune response. And I’m cautiously hopeful that one of those combinations will actually prove to work.
Saloni Dattani:
Just to indulge you a bit on Strep A: what is the reason that- what’s going on with the field? Why don’t we have much more interest in this? And how far away are we from a vaccine?
Jacob Trefethen:
I think that there are a couple answers to why there’s not more interest. The really predominant one is that most of the deaths that Strep A leads to - so the biggest forms of harm - occur in countries that are not wealthy, so there’s not as much of an incentive as there could be, for pharmaceutical companies to prioritise it.
That said, there’s a pretty decent incentive because a lot of parents with young kids are not exactly fans of Strep and would be probably quite excited if there were an available vaccine just to prevent Strep throat, or sore throat pharyngitis, and scarlet fever, and all of that.
I think there are other reasons too. What you really are making me want to do is a whole episode on Strep, because it’s so interesting. To answer the last part of your question, I am actually fairly hopeful that this should be a solvable issue. I think you should be able, with modern techniques, to design around some of the previous problems, and you should be able to test whether these things work, and get an answer. What we care about most is figuring out how they work in children, so that you can prevent these repeated infections that lead to problems.
Saloni Dattani:
And so, we talked about AI being used to: one, improve or cut up this protein into a specific domain, and then see if this domain is still soluble, and if not, improve that solubility. Then I also mentioned RSV vaccines, where AI has been used to stabilize a particular protein that’s used in the vaccine. What other uses of AI are there, when you’re applying it to proteins and improving them?
Jacob Trefethen:
Two more come to mind, of properties that are often really nice for vaccines. One is thermo-stability. Human proteins have evolved to behave very well at human body temperature. If you’re shipping a vaccine around the world, sometimes you’re in colder and sometimes you’re in hotter temperatures than the human body. So if you can make sure that that won’t denature your protein, then you’re going to still have a useful vaccine out the other end. So you might want to make some tweaks for that.
Another one is, we talked about immunogenicity a bit earlier, of- does this prompt any antibodies, for example. You also care about what’s called “immuno-focusing” sometimes, where: can you present the parts of a given protein, say, the epitopes that are most reactive, as much as possible, or in the right geometry, so that you can target the response to the most productive bits. So it’s not just, are you prompting any antibodies, but are you prompting the right ones? Are you prompting it as frequently as possible and binding as tight as possible?
Saloni Dattani:
Right. So these are all ways of making sure that our immune system recognises the protein or the whole pathogen well, how well it’s doing that, and then also maybe optimizing the way that it’s doing that.
I think you mentioned this at some point, when we were talking earlier, about how sometimes if you are recognising a pathogen, that might confuse your immune system, because parts of a pathogen might look like other parts of your body, and then you could develop an autoimmune reaction to other parts of your body, because your immune cells have treated that as “foreign” as well, after seeing the pathogen. So we’re trying to do all of these things, potentially.
Jacob Trefethen:
All of these things.
Saloni Dattani:
Is there anything else? Are there other uses that we would have for AI?
Jacob Trefethen:
I bet you there are, and I can’t wait for listeners to write in and tell us or start working in their basement on some of these applications.
Saloni Dattani:
Cool, right, but this is also only the beginning. There has to be a lot of testing and stuff before things get to the market, so that it can be used for people.
And also, there’s lots of stuff that happens before this process – like collecting all of the data in the first place, doing the laboratory research, figuring out these structures, making sure that there is data for AI models to train on and help us make these improvements.
Jacob Trefethen:
Okay, so we’re improving proteins found in nature. What about if... we could... design... entirely new ones... never seen before?
Saloni Dattani:
[gasps] I want to do that!