TY - GEN
T1 - Human-level ordinal maintainability prediction based on static code metrics
AU - Schnappinger, Markus
AU - Fietzke, Arnaud
AU - Pretschner, Alexander
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6/21
Y1 - 2021/6/21
AB - One of the greatest challenges in software quality control is the efficient and effective measurement of maintainability. Thorough expert assessments are precise yet slow and expensive, whereas automated static analysis yields rapid but imprecise feedback. Several machine learning approaches aim to combine the advantages of both. However, most prior studies did not adhere to expert judgment and instead predicted the number of changed lines as a proxy for maintainability, or were biased towards a small group of experts. In contrast, the present study builds on a manually labeled and validated dataset. Prediction uses static code metrics, among which simple structural metrics such as the size of a class and its methods yield the highest predictive power for maintainability. Using just a small set of these metrics, our models distinguish easy-to-maintain from hard-to-maintain code with an F-score of 91.3% and an AUC of 82.3%. In addition, we perform a more fine-grained ordinal classification and compare its quality with the performance of experts, using the deviations between individual experts' ratings and the eventually determined consensus of all experts. In sum, our models achieve the same level of performance as an average human expert; in terms of accuracy and mean squared error, they even outperform it. We hence argue that our models provide an automated and trustworthy prediction of software maintainability.
KW - Expert Judgment
KW - Machine Learning
KW - Maintainability Prediction
KW - Ordinal Classification
KW - Software Maintainability
UR - http://www.scopus.com/inward/record.url?scp=85108910934&partnerID=8YFLogxK
U2 - 10.1145/3463274.3463315
DO - 10.1145/3463274.3463315
M3 - Conference contribution
AN - SCOPUS:85108910934
T3 - ACM International Conference Proceeding Series
SP - 160
EP - 169
BT - Proceedings of EASE 2021 - Evaluation and Assessment in Software Engineering
PB - Association for Computing Machinery
T2 - 25th Evaluation and Assessment in Software Engineering Conference, EASE 2021
Y2 - 21 June 2021 through 24 June 2021
ER -
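
The record above does not include the paper's code, so the following Python sketch only illustrates the kind of pipeline the abstract describes: train a classifier on simple structural size metrics, then evaluate it both as an easy-versus-hard binary split (F-score, AUC) and as an ordinal prediction (accuracy, mean squared error). It assumes scikit-learn and uses synthetic data and hypothetical metric names (class LOC, mean and maximum method length); it is not the authors' implementation.

    # Illustrative sketch only (not the authors' code): ordinal maintainability
    # prediction from simple structural size metrics, assuming scikit-learn.
    # Metric names, the 4-level label scale, and the data are hypothetical.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (f1_score, roc_auc_score,
                                 accuracy_score, mean_squared_error)

    rng = np.random.default_rng(0)
    n = 500

    # Hypothetical static size metrics per class.
    X = np.column_stack([
        rng.integers(20, 2000, n),   # class length in LOC
        rng.uniform(3, 60, n),       # mean method length
        rng.integers(5, 300, n),     # maximum method length
    ])

    # Hypothetical ordinal labels (0 = easy ... 3 = hard to maintain),
    # synthesized here to correlate with size, as the paper reports.
    score = 0.002 * X[:, 0] + 0.05 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 1, n)
    y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)

    # Binary view (easy: labels 0-1 vs. hard: labels 2-3), mirroring the
    # easy-versus-hard evaluation with F-score and AUC.
    bin_true = (y_te >= 2).astype(int)
    bin_pred = (pred >= 2).astype(int)
    hard_proba = clf.predict_proba(X_te)[:, np.isin(clf.classes_, [2, 3])].sum(axis=1)
    print("F1 :", f1_score(bin_true, bin_pred))
    print("AUC:", roc_auc_score(bin_true, hard_proba))

    # Ordinal view: accuracy and mean squared error against the consensus
    # labels, the measures the paper compares with expert performance.
    print("ACC:", accuracy_score(y_te, pred))
    print("MSE:", mean_squared_error(y_te, pred))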