TY - JOUR
T1 - Nala
T2 - Text mining natural language mutation mentions
AU - Cejuela, Juan Miguel
AU - Bojchevski, Aleksandar
AU - Uhlig, Carsten
AU - Bekmukhametov, Rustem
AU - Kumar Karn, Sanjeev
AU - Mahmuti, Shpend
AU - Baghudana, Ashish
AU - Dubey, Ankit
AU - Satagopam, Venkata P.
AU - Rost, Burkhard
N1 - Publisher Copyright: © The Author 2017. Published by Oxford University Press.
PY - 2017/6/15
Y1 - 2017/6/15
N2 - Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. 'E6V'), leaving relevant natural language (NL) mentions largely untapped (e.g. 'glutamic acid was substituted by valine at residue 6'). Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28-77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions, the corresponding value shot up to 100% nala-only. Availability and Implementation: Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+.
AB - Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. 'E6V'), leaving relevant natural language (NL) mentions largely untapped (e.g. 'glutamic acid was substituted by valine at residue 6'). Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28-77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions, the corresponding value shot up to 100% nala-only. Availability and Implementation: Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+.
UR - http://www.scopus.com/inward/record.url?scp=85021376614&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx083
DO - 10.1093/bioinformatics/btx083
M3 - Article
C2 - 28200120
AN - SCOPUS:85021376614
SN - 1367-4803
VL - 33
SP - 1852
EP - 1858
JO - Bioinformatics
JF - Bioinformatics
IS - 12
ER -