TY - JOUR
T1 - SEthesaurus
T2 - WordNet in software engineering
AU - Chen, Xiang
AU - Chen, Chunyang
AU - Zhang, Dun
AU - Xing, Zhenchang
N1 - Publisher Copyright:
© 1976-2012 IEEE.
PY - 2021/9/1
Y1 - 2021/9/1
N2 - Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language process (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make an effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly-used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus in a manual way is a challenge task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-cumulated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.
AB - Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language process (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make an effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly-used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus in a manual way is a challenge task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-cumulated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.
KW - morphological form
KW - natural language processing
KW - Software-specific thesaurus
KW - word embedding
UR - http://www.scopus.com/inward/record.url?scp=85072213263&partnerID=8YFLogxK
U2 - 10.1109/TSE.2019.2940439
DO - 10.1109/TSE.2019.2940439
M3 - Article
AN - SCOPUS:85072213263
SN - 0098-5589
VL - 47
SP - 1960
EP - 1979
JO - IEEE Transactions on Software Engineering
JF - IEEE Transactions on Software Engineering
IS - 9
ER -