Improving Textual Adversarial Attacks using Metric-Guided Rewrite and Rollback
Preprint, Working Paper, Year: 2024


Abstract

Adversarial examples are helpful for analyzing and improving the robustness of text classifiers. Generating high-quality adversarial examples is challenging: the adversarial sentences must be fluent, semantically similar to the originals, and still lead to misclassification. Existing methods prioritize misclassification by maximizing each perturbation's effectiveness at misleading a text classifier; as a result, the generated adversarial examples fall short in fluency and similarity. In this paper, we define a critique score that synthesizes fluency, similarity, and misclassification metrics. We propose a rewrite and rollback (R&R) framework, guided by the optimization of this score, to improve the adversarial attack. R&R generates high-quality adversarial examples by allowing exploration of perturbations that have no immediate impact on misclassification, while optimizing the critique score for better fluency and similarity. We evaluate our method on 5 representative datasets and 3 classifier architectures.

Our method outperforms the current state-of-the-art in attack success rate by +16.2%, +12.8%, and +14.0% on the three classifier architectures, respectively. All code and results will be publicly available.

1 Introduction

Recently, adversarial attacks in text classification have received a great deal of attention. Adversarial attacks are subtle perturbations of the input text that cause a classifier to misclassify it. They serve as a tool to analyze and improve the robustness of text classifiers, and are increasingly important as security-critical classifiers are widely deployed.
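The abstract describes optimizing a critique score that combines fluency, similarity, and misclassification metrics, accepting rewrites that improve it and rolling back those that do not. The sketch below illustrates that general accept/rollback loop; the linear weighted score, the function names, and the hill-climbing acceptance rule are illustrative assumptions, not the paper's actual formulation.

```python
import random

def critique_score(fluency, similarity, misclass_conf,
                   w_f=1.0, w_s=1.0, w_m=1.0):
    """Illustrative synthesis of the three metrics (each assumed in [0, 1]).

    A simple weighted sum stands in for the paper's critique score; the
    weights w_f, w_s, w_m are hypothetical.
    """
    return w_f * fluency + w_s * similarity + w_m * misclass_conf

def rewrite_and_rollback(sentence, propose, score, steps=100, seed=0):
    """Sketch of a rewrite-and-rollback loop driven by a critique score.

    `propose(sentence, rng)` returns a candidate rewrite; `score(sentence)`
    returns its critique score. A candidate that does not improve the score
    is rolled back (discarded); an improving one becomes the new current
    sentence. Both callbacks are placeholders for the paper's components.
    """
    rng = random.Random(seed)
    best, best_score = sentence, score(sentence)
    for _ in range(steps):
        candidate = propose(best, rng)
        cand_score = score(candidate)
        if cand_score > best_score:
            # keep the rewrite
            best, best_score = candidate, cand_score
        # otherwise: rollback, i.e. keep `best` unchanged
    return best, best_score
```

Because rejected rewrites are rolled back, the loop can explore perturbations that do not immediately flip the classifier's decision while the score steers the search toward fluent, similar candidates; the real framework would use a language model for fluency, a sentence encoder for similarity, and the victim classifier's output for the misclassification term.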


Dates and versions

hal-04823065, version 1 (06-12-2024)

Identifiers

  • HAL Id: hal-04823065, version 1

Cite

Lei Xu, L. Berti-Equille, A. Cuesta-Infante, K. Veeramachaneni. Improving Textual Adversarial Attacks using Metric-Guided Rewrite and Rollback. 2024. ⟨hal-04823065⟩
