TY - GEN
T1 - DeepMemory
T2 - 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021
AU - Zhu, Derui
AU - Chen, Jinfu
AU - Shang, Weiyi
AU - Zhou, Xuebing
AU - Grossklags, Jens
AU - Hassan, Ahmed E.
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Neural network models are having a significant impact on many real-world applications. Unfortunately, the increasing popularity and complexity of these models also amplify their security and privacy challenges, with privacy leakage from training data being one of the most prominent issues. In this context, prior studies proposed to analyze the abstraction behavior of neural network models, e.g., RNNs, to understand their robustness. However, the existing research rarely addresses privacy breaches caused by memorization in neural language models. To fill this gap, we propose a novel approach, DeepMemory, that analyzes the memorization behavior of a neural language model. We first construct a memorization-analysis-oriented model, taking both training data and a neural language model as input. We then build a semantic first-order Markov model to bind the constructed memorization-analysis-oriented model to the training data to analyze the memorization distribution. Finally, we apply our approach to address data leakage issues associated with memorization and to assist in dememorization. We evaluate our approach on one of the most popular neural language models, the LSTM-based language model, with three public datasets, namely, WikiText-103, WMT2017, and IWSLT2016. We find that sentences in the studied datasets with low perplexity are more likely to be memorized. Our approach achieves an average AUC of 0.73 in automatically identifying data leakage issues during assessment. We also show that with the assistance of DeepMemory, data breaches due to memorization in neural language models can be successfully mitigated by mutating training data without reducing the performance of neural language models.
AB - Neural network models are having a significant impact on many real-world applications. Unfortunately, the increasing popularity and complexity of these models also amplify their security and privacy challenges, with privacy leakage from training data being one of the most prominent issues. In this context, prior studies proposed to analyze the abstraction behavior of neural network models, e.g., RNNs, to understand their robustness. However, the existing research rarely addresses privacy breaches caused by memorization in neural language models. To fill this gap, we propose a novel approach, DeepMemory, that analyzes the memorization behavior of a neural language model. We first construct a memorization-analysis-oriented model, taking both training data and a neural language model as input. We then build a semantic first-order Markov model to bind the constructed memorization-analysis-oriented model to the training data to analyze the memorization distribution. Finally, we apply our approach to address data leakage issues associated with memorization and to assist in dememorization. We evaluate our approach on one of the most popular neural language models, the LSTM-based language model, with three public datasets, namely, WikiText-103, WMT2017, and IWSLT2016. We find that sentences in the studied datasets with low perplexity are more likely to be memorized. Our approach achieves an average AUC of 0.73 in automatically identifying data leakage issues during assessment. We also show that with the assistance of DeepMemory, data breaches due to memorization in neural language models can be successfully mitigated by mutating training data without reducing the performance of neural language models.
KW - Deep learning
KW - Memorization
KW - Model-based analysis
KW - Neural language model
KW - Privacy
UR - http://www.scopus.com/inward/record.url?scp=85125460931&partnerID=8YFLogxK
U2 - 10.1109/ASE51524.2021.9678871
DO - 10.1109/ASE51524.2021.9678871
M3 - Conference contribution
AN - SCOPUS:85125460931
T3 - Proceedings - 2021 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021
SP - 1003
EP - 1015
BT - Proceedings - 2021 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 November 2021 through 19 November 2021
ER -