TY - GEN
T1 - Learning a dual-language vector space for domain-specific cross-lingual question retrieval
AU - Chen, Guibin
AU - Chen, Chunyang
AU - Xing, Zhenchang
AU - Xu, Bowen
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/8/25
Y1 - 2016/8/25
N2 - The lingual barrier limits the ability of millions of non-English speaking developers to make effective use of the tremendous knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods that first translate the non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from semantic deviation due to inappropriate translation, especially for domain-specific terms, and lexical gap between queries and questions that share few words in common. To overcome the above issues, we propose a novel cross-lingual question retrieval based on word embeddings and convolutional neural network (CNN) which are the state-of-the-art deep learning techniques to capture word- and sentence-level semantics. The CNN model is trained with large amounts of examples from Stack Overflow duplicate questions and their corresponding translation by machine, which guides the CNN to learn to capture informative word and sentence features to recognize and quantify semantic similarity in the presence of semantic deviations and lexical gaps. A uniqueness of our approach is that the trained CNN can map documents in two languages (e.g., Chinese queries and English questions) in a dual-language vector space, and thus reduce the cross-lingual question retrieval problem to a simple k-nearest neighbors search problem in the dual-language vector space, where no query or question translation is required. Our evaluation shows that our approach significantly outperforms the translation-based method, and can be extended to dual-language documents retrieval from different sources.
AB - The lingual barrier limits the ability of millions of non-English speaking developers to make effective use of the tremendous knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods that first translate the non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from semantic deviation due to inappropriate translation, especially for domain-specific terms, and lexical gap between queries and questions that share few words in common. To overcome the above issues, we propose a novel cross-lingual question retrieval based on word embeddings and convolutional neural network (CNN) which are the state-of-the-art deep learning techniques to capture word- and sentence-level semantics. The CNN model is trained with large amounts of examples from Stack Overflow duplicate questions and their corresponding translation by machine, which guides the CNN to learn to capture informative word and sentence features to recognize and quantify semantic similarity in the presence of semantic deviations and lexical gaps. A uniqueness of our approach is that the trained CNN can map documents in two languages (e.g., Chinese queries and English questions) in a dual-language vector space, and thus reduce the cross-lingual question retrieval problem to a simple k-nearest neighbors search problem in the dual-language vector space, where no query or question translation is required. Our evaluation shows that our approach significantly outperforms the translation-based method, and can be extended to dual-language documents retrieval from different sources.
KW - Convolutional Neural Network
KW - Cross-lingual question retrieval
KW - Dual-Language Vector Space
KW - Word embeddings
UR - http://www.scopus.com/inward/record.url?scp=84989204355&partnerID=8YFLogxK
U2 - 10.1145/2970276.2970317
DO - 10.1145/2970276.2970317
M3 - Conference contribution
AN - SCOPUS:84989204355
T3 - ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering
SP - 744
EP - 755
BT - ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering
A2 - Khurshid, Sarfraz
A2 - Lo, David
A2 - Apel, Sven
PB - Association for Computing Machinery, Inc
T2 - 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016
Y2 - 3 September 2016 through 7 September 2016
ER -