Poster Presentation The 48th Lorne Conference on Protein Structure and Function 2023

Predicting toxicity of a protein from its primary sequence (#142)

Vladimir Morozov 1, Carlos H.M. Rodrigues 1, David B. Ascher 1
  1. University of Queensland and Baker Institute, Melbourne, VIC, Australia

Biologics are one of the most rapidly expanding classes of therapeutics, but despite the enormous potential of peptide- and protein-based drugs, they can be associated with a range of toxic properties. In small-molecule drug development, early identification of potential toxicity has led to a significant reduction in clinical trial failures; however, we currently lack similarly robust qualitative rules or predictive tools for peptide- and protein-based biologics.

To address this, we have manually curated the largest set of high-quality experimental data on peptide and protein toxicities. In total, toxicity information on 217,000 peptides and proteins was identified. Using the 2,344 datapoints not present in previously published databases, we found that existing approaches performed poorly. This may be due in part to errors in previous data curation attempts.

Harnessing this data, we developed a novel in silico protein toxicity classifier that relies solely on the protein primary sequence. Protein sequence information was encoded using an adapted version of the deep learning language model BERT. Although BERT was originally designed to process natural language, it has been applied to “biological” language, where residues act as words and protein sequences act as sentences. Our predictive models achieved robust and generalisable performance across multiple non-redundant blind tests. We are now using interpretative approaches to better understand the biological basis of protein toxicity. This work will serve as a valuable platform to minimise potential toxicity in the biologic development pipeline.
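
The idea of residues acting as words and sequences acting as sentences can be illustrated with a minimal sketch of residue-level tokenisation, the step that turns a primary sequence into the integer input a BERT-style encoder expects. This is not the tokeniser used in this work: the vocabulary, the special tokens, the function name `encode_sequence` and the `max_length` parameter are all assumptions chosen for illustration.

```python
# Illustrative sketch only: residue-level tokenisation in the style of BERT inputs.
# The vocabulary, special tokens and max_length are assumptions, not the authors' setup.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one "word" each
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]
VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode_sequence(sequence, max_length=16):
    """Map a protein sequence to padded token ids and an attention mask."""
    # Each residue is one token; truncate to leave room for [CLS] and [SEP].
    residues = list(sequence.strip().upper())[: max_length - 2]
    tokens = ["[CLS]"] + residues + ["[SEP]"]

    # Non-standard residue codes (e.g. X) fall back to [UNK].
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(ids)

    # Pad to a fixed length so sequences can be batched together.
    padding = max_length - len(ids)
    ids += [VOCAB["[PAD]"]] * padding
    attention_mask += [0] * padding
    return ids, attention_mask


if __name__ == "__main__":
    ids, mask = encode_sequence("ACDX", max_length=8)
    print(ids)
    print(mask)
```

In a full pipeline, the token ids would be passed to the language model to obtain a sequence embedding, and a classification layer on top of that embedding would produce the toxicity prediction; those stages are omitted here because the abstract does not specify them.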