TY - GEN
T1 - A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy
AU - Meisenbacher, Stephen
AU - Chevli, Maulik
AU - Matthes, Florian
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic levels on which a proposed mechanism operates, often taking the form of word-level or document-level privatization. Recently, several word-level Metric Differential Privacy approaches have been proposed, which rely on this generalized DP notion to operate in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence or document level is possible only through a basic composition of word perturbations. In this work, we strive to address these challenges by operating between the word and sentence levels, namely with collocations. By perturbing n-grams rather than single words, we devise a method whose composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
UR - http://www.scopus.com/inward/record.url?scp=85204436571&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204436571
T3 - PrivateNLP 2024 - 5th Workshop on Privacy in Natural Language Processing, Proceedings of the Workshop
SP - 39
EP - 51
BT - PrivateNLP 2024 - 5th Workshop on Privacy in Natural Language Processing, Proceedings of the Workshop
A2 - Habernal, Ivan
A2 - Ghanavati, Sepideh
A2 - Ravichander, Abhilasha
A2 - Jain, Vijayanta
A2 - Thaine, Patricia
A2 - Igamberdiev, Timour
A2 - Mireshghallah, Niloofar
A2 - Feyisetan, Oluwaseyi
PB - Association for Computational Linguistics (ACL)
T2 - 5th Workshop on Privacy in Natural Language Processing, PrivateNLP 2024 - Co-located with ACL 2024
Y2 - 15 August 2024
ER -