TY - JOUR
T1 - Superior protein thermophilicity prediction with protein language model embeddings
AU - Haselbeck, Florian
AU - John, Maura
AU - Zhang, Yuqi
AU - Pirnay, Jonathan
AU - Fuenzalida-Werner, Juan Pablo
AU - Costa, Rubén D.
AU - Grimm, Dominik G.
N1 - Publisher Copyright: © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
PY - 2023/12/1
Y1 - 2023/12/1
AB - Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
UR - http://www.scopus.com/inward/record.url?scp=85175471494&partnerID=8YFLogxK
U2 - 10.1093/nargab/lqad087
DO - 10.1093/nargab/lqad087
M3 - Article
AN - SCOPUS:85175471494
SN - 2631-9268
VL - 5
JO - NAR Genomics and Bioinformatics
JF - NAR Genomics and Bioinformatics
IS - 4
M1 - lqad087
ER -