TY - JOUR
T1 - ProNA2020 predicts protein–DNA, protein–RNA, and protein–protein binding proteins and residues from sequence
AU - Qiu, Jiajun
AU - Bernhofer, Michael
AU - Heinzinger, Michael
AU - Kemper, Sofie
AU - Norambuena, Tomas
AU - Melo, Francisco
AU - Rost, Burkhard
N1 - Publisher Copyright: © 2020 Elsevier Ltd
PY - 2020/3/27
Y1 - 2020/3/27
N2 - The intricate details of how proteins bind to proteins, DNA, and RNA are crucial for the understanding of almost all biological processes. Disease-causing sequence variants often affect binding residues. Here, we described a new, comprehensive system of in silico methods that take only protein sequence as input to predict binding of protein to DNA, RNA, and other proteins. Firstly, we needed to develop several new methods to predict whether or not proteins bind (per-protein prediction). Secondly, we developed independent methods that predict which residues bind (per-residue). Not requiring three-dimensional information, the system can predict the actual binding residue. The system combined homology-based inference with machine learning and motif-based profile-kernel approaches with word-based (ProtVec) solutions to machine learning protein level predictions. This achieved an overall non-exclusive three-state accuracy of 77% ± 1% (±one standard error) corresponding to a 1.8-fold improvement over random (best classification for protein–protein with F1 = 91 ± 0.8%). Standard neural networks for per-residue binding residue predictions appeared best for DNA-binding (Q2 = 81 ± 0.9%) followed by RNA-binding (Q2 = 80 ± 1%) and worst for protein–protein binding (Q2 = 69 ± 0.8%). The new method, dubbed ProNA2020, is available as code through GitHub (https://github.com/Rostlab/ProNA2020.git) and through PredictProtein (www.predictprotein.org).
AB - The intricate details of how proteins bind to proteins, DNA, and RNA are crucial for the understanding of almost all biological processes. Disease-causing sequence variants often affect binding residues. Here, we described a new, comprehensive system of in silico methods that take only protein sequence as input to predict binding of protein to DNA, RNA, and other proteins. Firstly, we needed to develop several new methods to predict whether or not proteins bind (per-protein prediction). Secondly, we developed independent methods that predict which residues bind (per-residue). Not requiring three-dimensional information, the system can predict the actual binding residue. The system combined homology-based inference with machine learning and motif-based profile-kernel approaches with word-based (ProtVec) solutions to machine learning protein level predictions. This achieved an overall non-exclusive three-state accuracy of 77% ± 1% (±one standard error) corresponding to a 1.8-fold improvement over random (best classification for protein–protein with F1 = 91 ± 0.8%). Standard neural networks for per-residue binding residue predictions appeared best for DNA-binding (Q2 = 81 ± 0.9%) followed by RNA-binding (Q2 = 80 ± 1%) and worst for protein–protein binding (Q2 = 69 ± 0.8%). The new method, dubbed ProNA2020, is available as code through GitHub (https://github.com/Rostlab/ProNA2020.git) and through PredictProtein (www.predictprotein.org).
KW - ProtVec
KW - binding protein prediction
KW - binding residue prediction
KW - machine learning
KW - profile kernel SVM
UR - http://www.scopus.com/inward/record.url?scp=85082481064&partnerID=8YFLogxK
U2 - 10.1016/j.jmb.2020.02.026
DO - 10.1016/j.jmb.2020.02.026
M3 - Article
C2 - 32142788
AN - SCOPUS:85082481064
SN - 0022-2836
VL - 432
SP - 2428
EP - 2443
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
IS - 7
ER -