CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models

Vamsi Nallapareddy, Nicola Bordin, Ian Sillitoe, Michael Heinzinger, Maria Littmann, Vaishali P. Waman, Neeladri Sen, Burkhard Rost, Christine Orengo

Research output: Contribution to journal › Article › peer-review

9 Scopus citations


Motivation: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set.

Results: The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned.
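
The abstract does not say how the "expected error rate <0.5%" cutoff was derived. As an illustration only, the sketch below shows one common way such a filter could be built: pick a confidence threshold on held-out predictions so that the retained set has an observed error rate within budget, then keep only new predictions at or above that threshold. The function names, the input layout, and the procedure itself are assumptions for illustration, not the CATHe implementation.

```python
def pick_confidence_threshold(confidences, is_correct, max_error_rate=0.005):
    """Hypothetical helper: return the lowest confidence cutoff for which the
    held-out predictions at or above it have an error rate <= max_error_rate.
    Returns None if no cutoff meets the budget."""
    # Rank held-out predictions from most to least confident.
    ranked = sorted(zip(confidences, is_correct), key=lambda p: p[0], reverse=True)
    best = None
    errors = 0
    for i, (conf, correct) in enumerate(ranked, start=1):
        if not correct:
            errors += 1
        # Evaluate a cutoff only at the end of a run of tied confidences,
        # so every prediction sharing that confidence is counted.
        last_of_tie = i == len(ranked) or ranked[i][0] < conf
        if last_of_tie and errors / i <= max_error_rate:
            best = conf
    return best


def keep_reliable(predictions, threshold):
    """Hypothetical helper: keep (label, confidence) pairs at or above threshold."""
    if threshold is None:
        return []
    return [p for p in predictions if p[1] >= threshold]


# Toy held-out set: classifier confidences and whether each call was correct.
held_out_conf = [0.99, 0.98, 0.97, 0.60, 0.55]
held_out_ok = [True, True, True, False, True]

# With a zero-error budget, only the top three predictions qualify.
cutoff = pick_confidence_threshold(held_out_conf, held_out_ok, max_error_rate=0.0)
reliable = keep_reliable([("sf_A", 0.995), ("sf_B", 0.70)], cutoff)
```

The error rate measured this way is only an estimate from the held-out set, so a real pipeline would want a held-out set large enough for a 0.5% rate to be meaningful.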

Original language: English
Article number: btad029
Issue number: 1
State: Published - 1 Jan 2023

