This is reposted from a Q&A we did in September 2023. Check out our Slack for more.
We had the privilege to chat with Tristan Bepler, CEO & co-founder of OpenProtein.AI! OpenProtein.AI is a deep learning platform that accelerates protein engineering.
Read more to learn about:
Biggest open problems in the field of protein engineering
Recent advances in protein engineering
New generative protein language model: PoET
and more!
Interview by Nicholas Larus-Stone, founder of Bits In Bio
Question #1:
Nicholas: Welcome @tristan! very excited to have you here to talk about openprotein ai and the intersection of ml & protein engineering. i’ll get us started with a few questions and hopefully the community will jump in as well!
let’s start with a bit of background. how did you get interested in science and the intersection of ml and science?
Tristan: Great to be here!
Tristan: Hmm, that's a tricky one to really pin down. as an undergrad, i first worked in an rna biology lab (mariano garcia-blanco's group, then at duke) and then moved to raluca gordan's group as i became more interested in algorithms and software for understanding biology. my first foray into ml/science was with raluca, applying ml to predict the dna sequence specificities of transcription factors. what really drew me to ml approaches was that these systems are really complicated, and it wasn't clear to me that we would ever be able to tease out all the mechanistic factors and build a sufficiently predictive mechanistic model through traditional reductionist approaches, so ml and computational learning theory seemed like the most promising way to develop algorithms that could accurately predict what was going on
Tristan: When i started my phd, i became increasingly convinced that we needed to use these kinds of approaches, especially after working on biocompilers and mechanistic modeling of gene and protein signaling circuits with ron weiss. we worked on stochastic methods for modeling a complex protein-level signaling network from first principles, which was really cool, but ultimately had to do a lot of inference of parameters that were extremely underdetermined and we didn't even know if the topology (if you will) of our model was correct. plus, how would you even start to build a first principles model of how the synthetic circuit interacts with endogenous systems of the host cell?
Tristan: That brought me back to proteins as the fundamental building block we needed to understand and it made sense to start thinking about big data approaches to understanding proteins from evolutionary sequence data
Tristan: So i guess i've always been interested in how we can know things and how we can use computers to help understand and predict the behaviours of really complex systems that evade the understanding of our puny brains
Question #2:
Nicholas: Tell us a bit more about openprotein — what do you do and why did you start it?
Tristan: We're building ml models, algorithms, and software to accelerate protein engineering. there's been a lot of exciting work in large scale deep learning models for understanding proteins, but translating these into practical tools to enable protein engineering and optimization is non-trivial from an infrastructure, algorithms, and statistics perspective, so we wanted to build something that addressed that need: a platform that could be used by biologists with an easy-to-use interface and could also enable bioinformaticians and data scientists via apis.
Tristan: Ultimately i really wanted to make protein language models and their (imo transformative) potential into something that could be easily accessed by the community
Tristan: There's also still a lot of space to improve our foundation models (as we've been working on with models like poet https://arxiv.org/abs/2306.06156) and build out specialized/fine-tuned models for solving specific problems like protein structure prediction, protein-protein and protein-small molecule interaction prediction, etc.
Question #3:
Nicholas: What are the biggest open problems in the field of protein engineering and how do you think ml can help? any ideas for how the bits in bio community could contribute?
Tristan: I think there are a number of fundamental scientific(?) problems and there are also practical problems. one very fundamental problem:
• do we have a language to describe protein function that is sufficiently robust and expressive? how would you communicate the complete set of constraints and requirements that a designed protein needs to satisfy, in detail, in such a way that a computer can understand it?
some other fundamental problems that are maybe more actionable are
• protein-protein interaction prediction and protein binder design that incorporates modeling of flexibility of the target. proteins aren't rigid bricks and interfaces deform and change conformation when interacting which can make designing binders tricky (recent advances here with rfdiffusion are exciting!)
• similar to ^ but for protein-small molecule interaction prediction, which is critical for small molecule drug development, especially for using predicted structures which may not present useful holo structure binding pockets
• predicting multimer protein structures
• predicting core protein properties such as expressibility, stability, thermostability, and solubility remains hard. solubility in particular is surprisingly hard to predict and difficult to assay in high throughput
• immunogenicity
Tristan: On a, perhaps, more practical front i think there are also challenges in building tools that work with how protein engineers, especially in industry, actually do things day-to-day
• data management and processing - many different assays, different machines, and different conditions. sometimes sequences are known and sometimes they aren't. some do replicates and some don't. assay conditions are often changed as development progresses, etc.
• algorithms and software that are easy to integrate with existing processes
• helping the wet lab scientists to understand how they can change their processes to be more efficient and approach the sequence design and optimization problems more quantitatively
Question #4:
Nicholas: What have been the recent advances in the field which you’ve been most excited about?
Tristan: Rfdiffusion is very cool
Tristan: Of course alphafold2 was extremely exciting as are newer advances like esm2 and esmfold
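To make that concrete, here is a minimal sketch of pulling per-residue embeddings from ESM-2 using the fair-esm package. It follows that package's published usage; the example sequence is just a placeholder, and the mean-pooling at the end is one common but arbitrary choice for turning per-residue representations into a per-sequence embedding.

```python
import torch
import esm

# Load ESM-2 (650M parameters) and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic embeddings

# Placeholder sequence; replace with your protein of interest.
data = [("example_protein", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33])  # layer 33 is the final layer of this model
token_reprs = results["representations"][33]

# Mean-pool over residue positions (skipping the BOS token) for a per-sequence embedding.
seq_len = len(data[0][1])
sequence_embedding = token_reprs[0, 1 : seq_len + 1].mean(dim=0)
print(sequence_embedding.shape)  # torch.Size([1280])
```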
Tristan: Bayesian approaches to protein optimization (e.g., as first considered in some of frances arnold and kevin yang's work) and ideas around addressing domain shift, like https://arxiv.org/abs/1901.10060, are really interesting to me
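To illustrate the Bayesian optimization idea in the simplest terms, here is a toy sketch of a design-measure loop over protein variants using a Gaussian-process surrogate and an expected-improvement acquisition function. The parent sequence, single-mutant candidate library, and random "assay" are all made up for illustration; this is a generic sketch of the technique, not the specific methods from the papers above.

```python
# Toy Bayesian-optimization loop: fit a surrogate to assay data, score unmeasured
# variants with an acquisition function, and pick the next batch to test.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from scipy.stats import norm

AAS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)

def one_hot(seq):
    """Flatten a sequence into a one-hot feature vector for the surrogate model."""
    x = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        x[i, AAS.index(aa)] = 1.0
    return x.ravel()

# Candidate library: all single mutants of a short, made-up parent sequence.
parent = "MKTAYIAK"
candidates = [parent[:i] + aa + parent[i + 1:] for i in range(len(parent)) for aa in AAS]

def assay(seq):
    """Stand-in for a wet-lab fitness measurement (random toy function here)."""
    return rng.normal(sum(seq.count(a) for a in "KR"), 0.1)

# Seed the loop with a few measured variants, then iterate.
measured = {s: assay(s) for s in rng.choice(candidates, 8, replace=False)}
for round_ in range(3):
    X = np.array([one_hot(s) for s in measured])
    y = np.array(list(measured.values()))
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True).fit(X, y)

    pool = [s for s in candidates if s not in measured]
    mu, sigma = gp.predict(np.array([one_hot(s) for s in pool]), return_std=True)

    # Expected improvement over the best variant assayed so far.
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # "Assay" the top-scoring variants and add them to the training set.
    for idx in np.argsort(-ei)[:4]:
        measured[pool[idx]] = assay(pool[idx])

print(max(measured, key=measured.get))  # best variant found so far
```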
Tristan: Within the space of understanding the fundamental nature of protein sequence design as a statistical learning problem, i think this paper (https://arxiv.org/abs/2306.00872) by clara fannjiang and jennifer listgarten is quite insightful
Tristan: This is also an extremely interesting recent paper on the ability to predict protein function from evolutionary models: https://www.biorxiv.org/content/10.1101/2022.01.29.478324v1
Question #5:
Nicholas: Let’s talk more about poet (https://arxiv.org/abs/2306.06156) — what do you see as the fundamental advance here and why do current models fall short
Tristan: One of the major limitations of current language models (for proteins and otherwise) is the need to summarize the entire training corpus in the model parameters. this is a big reason we've seen the trend of larger and larger models in nlp and, to a lesser extent, proteins. these models become prohibitively expensive to train and also to use for inference
Tristan: This was the main problem we wanted to address - can we (partly) decouple our language model from the training corpus, instead retrieving relevant data from the corpus and augmenting the model with that data at inference time, so that we can use fewer parameters while also providing explicit examples for our model to condition on when making predictions? the secondary problem we wanted to address was to remove the requirement for sequences to be aligned
Tristan: What this means is we trained poet to observe a set of sequences at inference time and then model the generative distribution over those sequences (extrapolating their fitness landscape). this allowed us to use fewer parameters while still improving the state-of-the-art in zero-shot variant effect prediction, by learning the general principles of evolutionary processes rather than memorizing specific protein families
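To make the retrieval-augmented setup concrete, here is a minimal, self-contained sketch of the general pattern: gather a set of related sequences, let an autoregressive model condition on them, and score a query variant by its conditional log-likelihood. The toy tokenizer, separator token, untrained stand-in model, and hard-coded homolog list are all illustrative; this is not PoET's actual architecture or API.

```python
# Schematic of retrieval-conditioned scoring: append the query variant to a set of
# retrieved homologs and sum its conditional token log-likelihoods. The tiny untrained
# "model" below is only a stand-in so the snippet runs end to end.
import torch
import torch.nn as nn

AAS = "ACDEFGHIKLMNPQRSTVWY|"  # "|" used here as a toy sequence-separator token

def tokenize(seqs):
    """Concatenate sequences (separated by '|') into a 1 x L tensor of token ids."""
    joined = "|".join(seqs) + "|"
    return torch.tensor([[AAS.index(c) for c in joined]])

class ToyAutoregressiveLM(nn.Module):
    """Placeholder autoregressive model: embeds tokens and predicts the next one."""
    def __init__(self, vocab=len(AAS), dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.embed(ids))  # [1, L, vocab] next-token logits

def score_variant(model, homologs, variant):
    """Log-likelihood of `variant`, conditioned on the retrieved homolog set."""
    ids = tokenize(homologs + [variant])
    with torch.no_grad():
        logits = model(ids)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predict token i+1 from its prefix
    per_token = log_probs.gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n_var = len(variant) + 1                                 # variant tokens + trailing separator
    return per_token[-n_var:].sum().item()                   # score only the query tokens

# Zero-shot ranking of point mutants: higher conditional likelihood = predicted fitter.
model = ToyAutoregressiveLM()
homologs = ["MKTAYIAK", "MKSAYIAK", "MRTAYLAK"]              # stand-in for a retrieval step
variants = ["MKTAYIAR", "MKTAYIAV"]
print({v: score_variant(model, homologs, v) for v in variants})
```

In a real pipeline, the homolog set would come from a sequence search against a large database, and the model would be trained so that these conditional likelihoods track variant fitness, which is how the zero-shot variant effect predictions described above are obtained.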
Question #6:
Nicholas: You’ve been talking a lot about sequence models, but we’ve seen some of the most impressive advances come in the field of protein structure — how do you think about using both sequence and structural information
Tristan: I don't think of these as being that fundamentally different - ultimately if we're talking about protein engineering we can only design the sequence. we may have some idea of what structure we want that sequence to form to accomplish a functional objective, but we can't directly manipulate it
Tristan: Designing structural backbones and then designing sequences based on their propensity to form those backbones is a powerful tool if you know what structure you want and that it will accomplish your functional objectives, but often this is limited to binding interfaces
Tristan: So it comes down to function/phenotype prediction
Tristan: And if you're working from predicted structure, then those are actually sequence-based models, so you can look at sequence → structure → function, but structure is often actually a nuisance variable in that case. it's the functional objective that matters and the sequence is what you can control
Nicholas: Arguably binding interfaces (if you include active sites) make up the majority of what people are interested in?
Tristan: I'd argue in most cases it's more nuanced. for example, if you want to change the substrate specificity of an enzyme, is it enough for it to just bind to the new substrate? no, it needs to bind in the right configuration and have sufficient flexibility and accessibility to catalyze the reaction and release the products
Tristan: In the case of "simple" binders like antibodies, many might claim that binder design is already solved by high throughput technologies like phage display, but actually getting those antibodies to be well behaved and produce the desired phenotypic outcomes is a challenge
Tristan: There are examples of antibodies that only produce therapeutic efficacy if the binding affinity is within a specific range, for example, rather than stronger being better. other properties like expressibility, aggregation potential, and immunogenicity don't have any clear relationship to a binding interface
Tristan: That's not to say binding interfaces aren't a huge category of protein design problems - they are. but it's often not the hardest part of the problem
Nicholas: I want to thank @tristan for being generous with his time on a Sunday and his really thorough answers! if people have more questions, feel free to add them to the channel and Tristan may answer if he has time
Tristan: My pleasure, happy to answer more questions async
Tristan: As a final plug, if you're interested in learning more about how you can integrate ml into your protein engineering processes to design better proteins faster and cheaper, you can check out our website at https://www.openprotein.ai/explore and sign up for early access at http://23939013.hs-sites.com/openprotein.ai-early-access-sign-up-form?utm_source=twitter&utm_medium=post&utm_campaign=easu