TY - JOUR
T1 - Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs with the Embedding Space Attack
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
AU - Schwinn, Leo
AU - Dobre, David
AU - Xhonneux, Sophie
AU - Gidel, Gauthier
AU - Günnemann, Stephan
N1 - Publisher Copyright:
© 2024 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2024
Y1 - 2024
AB - Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Additionally, we demonstrate that models compromised by embedding attacks can be used to create discrete jailbreaks in natural language. Lastly, we present a novel threat model in the context of unlearning and data extraction and show that embedding space attacks can extract supposedly deleted information from unlearned models, and to a certain extent, even recover pretraining data in LLMs. Our findings highlight embedding space attacks as an important threat model in open-source LLMs.
UR - http://www.scopus.com/inward/record.url?scp=105000485358&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:105000485358
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
Y2 - 9 December 2024 through 15 December 2024
ER -
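
For illustration, below is a minimal sketch of the kind of continuous embedding-space attack the abstract above describes: a short suffix of free embedding vectors is appended to the prompt and optimized by signed gradient descent so that the model produces a chosen target continuation. This is not the authors' implementation; the model name, prompt, target, suffix length, step size, and step count are placeholder assumptions for a generic HuggingFace causal LM.

# Minimal, illustrative sketch of a continuous embedding-space attack on an
# open-source causal LM (PyTorch + HuggingFace transformers). All concrete
# choices below (model name, prompt, target, suffix length, step size,
# number of steps) are assumptions for illustration, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the adversarial embeddings are optimized

prompt = "Explain how to ..."      # placeholder user prompt
target = " Sure, here is how to"   # placeholder target continuation

emb_layer = model.get_input_embeddings()
prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids
with torch.no_grad():
    prompt_emb = emb_layer(prompt_ids)   # (1, P, d), kept fixed
    target_emb = emb_layer(target_ids)   # (1, T, d), kept fixed

# Adversarial suffix: free continuous vectors appended after the prompt.
n_adv, d = 20, prompt_emb.size(-1)
adv = (0.01 * torch.randn(1, n_adv, d)).requires_grad_()

alpha, steps = 1e-3, 100  # assumed signed-gradient step size and step count
T = target_ids.size(1)
for _ in range(steps):
    inputs_embeds = torch.cat([prompt_emb, adv, target_emb], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Each target token is predicted from the position directly before it.
    pred = logits[:, -T - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    loss.backward()
    with torch.no_grad():
        adv -= alpha * adv.grad.sign()  # signed gradient step on the embeddings
    adv.grad = None

Because the optimization runs directly in the continuous input space rather than over discrete tokens, each step is a plain gradient update, which is consistent with the abstract's claim that such attacks trigger harmful behaviors more efficiently than discrete attacks or fine-tuning, though it requires white-box access to an open-source model.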