TY - JOUR
T1 - Deep Learning Enhances Precision of Citrullination Identification in Human and Plant Tissue Proteomes
AU - Gabriel, Wassim
AU - González, Rebecca Meelker
AU - Laposchan, Sophia
AU - Riedel, Erik
AU - Dündar, Gönül
AU - Poppenberger, Brigitte
AU - Wilhelm, Mathias
AU - Lee, Chien Yun
N1 - Publisher Copyright: © 2025 THE AUTHORS.
PY - 2025/3
Y1 - 2025/3
N2 - Citrullination is a critical yet understudied post-translational modification (PTM) implicated in various biological processes. Exploring its role in health and disease requires a comprehensive understanding of the prevalence of this PTM at a proteome-wide scale. Although mass spectrometry has enabled the identification of citrullination sites in complex biological samples, it faces significant challenges, including limited enrichment tools and a high rate of false positives due to the identical mass with deamidation (+0.9840 Da) and errors in monoisotopic ion selection. These issues often necessitate manual spectrum inspection, reducing throughput in large-scale studies. In this work, we present a novel data analysis pipeline that incorporates the deep learning model Prosit-Cit into the MS database search workflow to improve both the sensitivity and the precision of citrullination site identification. Prosit-Cit, an extension of the existing Prosit model, has been trained on ~53,000 spectra from ~2500 synthetic citrullinated peptides and provides precise predictions for chromatographic retention time and fragment ion intensities of both citrullinated and deamidated peptides. This enhances the accuracy of identification and reduces false positives. Our pipeline demonstrated high precision on the evaluation dataset, recovering the majority of known citrullination sites in human tissue proteomes and improving sensitivity by identifying up to 14 times more citrullinated sites. Sequence motif analysis revealed consistency with previously reported findings, validating the reliability of our approach. Furthermore, extending the pipeline to a tissue proteome dataset of the model plant Arabidopsis thaliana enabled the identification of ~200 citrullination sites across 169 proteins from 30 tissues, representing the first large-scale citrullination mapping in plants. This pipeline can be seamlessly applied to existing proteomics datasets, offering a robust tool for advancing biological discoveries and deepening our understanding of protein citrullination across species.
UR - http://www.scopus.com/inward/record.url?scp=105001199302&partnerID=8YFLogxK
U2 - 10.1016/j.mcpro.2025.100924
DO - 10.1016/j.mcpro.2025.100924
M3 - Article
C2 - 39921205
AN - SCOPUS:105001199302
SN - 1535-9476
VL - 24
JO - Molecular and Cellular Proteomics
JF - Molecular and Cellular Proteomics
IS - 3
M1 - 100924
ER -