Improving Textual Adversarial Attacks using Metric-Guided Rewrite and Rollback
Preprint, Working Paper, Year: 2024


Abstract

Adversarial examples are helpful for analyzing and improving the robustness of text classifiers. Generating high-quality adversarial examples is challenging: the adversarial sentences must be fluent, semantically similar to the originals, and still lead to misclassification. Existing methods prioritize misclassification by maximizing each perturbation's effectiveness at misleading a text classifier; as a result, the generated adversarial examples fall short in fluency and similarity. In this paper, we define a critique score that synthesizes fluency, similarity, and misclassification metrics. We propose a rewrite and rollback (R&R) framework, guided by the optimization of this score, to improve the adversarial attack. R&R generates high-quality adversarial examples by allowing exploration of perturbations that have no immediate impact on misclassification, while optimizing the critique score for better fluency and similarity. We evaluate our method on 5 representative datasets and 3 classifier architectures.

Our method outperforms the current state-of-the-art in attack success rate by +16.2%, +12.8%, and +14.0% on the three classifier architectures, respectively. All code and results will be publicly available.

1 Introduction

Recently, adversarial attacks in text classification have received a great deal of attention. Adversarial attacks are subtle perturbations of the input text that cause a classifier to misclassify it. They serve as a tool to analyze and improve the robustness of text classifiers, and are increasingly important as security-critical classifiers are widely deployed.
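The abstract describes optimizing a critique score that combines fluency, similarity, and misclassification metrics, accepting rewrites that improve it and rolling back those that do not. The sketch below illustrates that general accept/rollback loop; the linear weighted score, the function names, and the hill-climbing acceptance rule are illustrative assumptions, not the paper's actual formulation.

```python
import random

def critique_score(fluency, similarity, misclass_conf,
                   w_f=1.0, w_s=1.0, w_m=1.0):
    """Illustrative synthesis of the three metrics (each assumed in [0, 1]).

    A simple weighted sum stands in for the paper's critique score; the
    weights w_f, w_s, w_m are hypothetical.
    """
    return w_f * fluency + w_s * similarity + w_m * misclass_conf

def rewrite_and_rollback(sentence, propose, score, steps=100, seed=0):
    """Sketch of a rewrite-and-rollback loop driven by a critique score.

    `propose(sentence, rng)` returns a candidate rewrite; `score(sentence)`
    returns its critique score. A candidate that does not improve the score
    is rolled back (discarded); an improving one becomes the new current
    sentence. Both callbacks are placeholders for the paper's components.
    """
    rng = random.Random(seed)
    best, best_score = sentence, score(sentence)
    for _ in range(steps):
        candidate = propose(best, rng)
        cand_score = score(candidate)
        if cand_score > best_score:
            # keep the rewrite
            best, best_score = candidate, cand_score
        # otherwise: rollback, i.e. keep `best` unchanged
    return best, best_score
```

Because rejected rewrites are rolled back, the loop can explore perturbations that do not immediately flip the classifier's decision while the score steers the search toward fluent, similar candidates; the real framework would use a language model for fluency, a sentence encoder for similarity, and the victim classifier's output for the misclassification term.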


Dates and versions

hal-04823065, version 1 (06-12-2024)

Identifiers

  • HAL Id: hal-04823065, version 1

Cite

Lei Xu, L. Berti-Equille, A. Cuesta-Infante, K. Veeramachaneni. Improving Textual Adversarial Attacks using Metric-Guided Rewrite and Rollback. 2024. ⟨hal-04823065⟩
