Poster Presentation The 48th Lorne Conference on Protein Structure and Function 2023

Characterisation of the pathogenic effect of missense mutations in proteins via machine learning (#137)

Qisheng Pan 1, Georgina Parra 1, Stephanie Portelli 1, Thanh-Binh Nguyen 1, David Ascher 1,2
  1. University of Queensland and Baker Institute, Melbourne, VIC, Australia
  2. Department of Biochemistry, University of Cambridge, Cambridge, UK

Proteins control most fundamental cellular and biological processes. Small changes in the protein sequence, however, can alter these tightly regulated functions and may be associated with a wide range of diseases. Unfortunately, it is time-consuming and technically challenging to elucidate experimentally the effects of all possible missense variants. While the gold-standard tools for assessing pathogenicity rely primarily on gene or protein sequence, we hypothesised that considering these mutations in the context of the protein structure and its interactions would be more informative. To test this, we developed a machine-learning pipeline that uses computational structural and biophysical tools to better predict clinical pathogenicity. Here, we present two applications of this methodology, in cancer and in Alzheimer’s disease (AD).

In our first case study, we generated two separate, interpretable machine-learning models to detect pathogenicity in the tumour suppressor p53. When experimental data were added to our method, we were able to identify oncogenic p53 mutations with a Matthews correlation coefficient (MCC) of 0.88 on a non-redundant test set and 0.83 on a clinical validation set, indicating model robustness and promising potential for clinical translation. When difficult-to-characterise experimental data were excluded, model performance deteriorated by only 8%, suggesting that protein structural information was providing complementary insight. Interpreting our models revealed that residual p53 activity, polar atom distances, and changes in p53 stability were instrumental in the models' decisions. Our tools outperformed widely used pathogenicity predictors and performed comparably to p53-specific methods. Our predictors offered clinical diagnostic utility, which is crucial for patient monitoring and personalised cancer treatments.
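
For readers unfamiliar with the metric, the MCC reported above summarises a binary confusion matrix in a single value between -1 and 1, where 1 is perfect prediction and 0 is no better than chance. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name and the counts in the usage line are hypothetical and are not taken from this study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp/tn: correctly predicted pathogenic / benign variants
    fp/fn: benign predicted pathogenic / pathogenic predicted benign
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is taken as 0 when any marginal total is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for illustration only (not data from the abstract).
print(round(matthews_corrcoef(tp=45, tn=45, fp=5, fn=5), 2))  # 0.8
```

Because the numerator and denominator use all four cells of the confusion matrix, the MCC remains informative when pathogenic and benign variants are unevenly represented, which is a common reason for preferring it to raw accuracy.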

Following our successful approach in p53, we applied the same principles to identify AD-causing mutations across 22 different proteins. Our combined preliminary model obtained an MCC of up to 0.87 on non-redundant tests, demonstrating model robustness and generalisability. This generic model was compared with protein-specific models in order to refine the rules of our clinical computational pipeline for applications in other diseases. This work aims not only to provide clinically relevant tools but also to lay a foundation for better understanding the relationships between protein sequence, structure, function, and pathogenicity.