Author: Denis Avetisyan
New research reveals that even advanced document visual question answering systems are susceptible to subtle visual manipulations that can alter their responses.

This study demonstrates successful adversarial attacks against OCR-free document VQA models, highlighting vulnerabilities in end-to-end multimodal systems.
Despite advances in Document Visual Question Answering (DocVQA), current systems remain susceptible to subtle manipulation. This work, ‘Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering’, introduces a novel attack demonstrating that visually imperceptible forgeries can induce incorrect or targeted answers from state-of-the-art, OCR-free DocVQA models. We show that carefully crafted perturbations to document images can reliably mislead these systems, even without relying on text recognition. How can we build more robust DocVQA systems capable of discerning genuine document content from adversarial manipulations?
The Inherent Fragility of Document Understanding Systems
Document Visual Question Answering (DocVQA) systems, designed to interpret and respond to questions about document content, are fundamentally reliant on Optical Character Recognition (OCR) technology. This creates a critical vulnerability; the entire process begins with converting images of text into machine-readable characters, and any inaccuracies in this initial step cascade through subsequent layers of analysis. Essentially, if the OCR misinterprets a character or word, the DocVQA model receives flawed information, impacting its ability to correctly understand the document and provide an accurate answer. This dependency on OCR introduces a significant point of failure, limiting the overall robustness and reliability of even the most advanced DocVQA systems, especially when confronted with imperfect or challenging document images.
The accuracy of Document Visual Question Answering (DocVQA) systems is fundamentally challenged by the cascading effect of Optical Character Recognition (OCR) errors. Even minor inaccuracies in text extraction – a misplaced character or a misread word – become amplified as the information progresses through subsequent processing stages, such as semantic understanding and reasoning. Consequently, these propagated errors significantly diminish the robustness of DocVQA models, leading to incorrect answers and unreliable performance, particularly when encountering documents with poor image quality, unusual layouts, or complex typography. This sensitivity highlights a critical vulnerability: a DocVQA system can only be as reliable as its initial OCR transcription, creating a demonstrable bottleneck in achieving truly intelligent document understanding.
The inherent reliance on Optical Character Recognition (OCR) establishes a clear performance bottleneck in document understanding systems, one that becomes acutely pronounced when processing documents of suboptimal quality or intricate layout. Degraded images – those affected by noise, blur, or distortion – present significant challenges for OCR algorithms, leading to increased error rates and diminished text extraction accuracy. Similarly, complex documents – featuring multi-column formats, tables, or handwritten elements – often confound standard OCR processes, hindering their ability to correctly identify and interpret textual information. These OCR-induced errors aren’t isolated; they propagate throughout the entire document analysis pipeline, ultimately limiting the overall precision and reliability of the system’s ability to answer questions or extract meaningful insights from the document content. Consequently, improving robustness to variations in document quality and complexity is critical for advancing the field of document understanding.

Beyond Transcription: A Paradigm Shift in Document Understanding
Traditional Document Visual Question Answering (DocVQA) systems typically employ Optical Character Recognition (OCR) to convert document images into machine-readable text, which is then used as input for question answering models. OCR-free DocVQA techniques bypass this intermediate text extraction step by utilizing deep learning architectures capable of directly processing document images. This end-to-end training approach allows the model to learn visual features and their relationships to answer questions without the potential errors introduced by OCR, resulting in a more streamlined and potentially more accurate system. These models learn to map visual document representations directly to answers, eliminating the need for a separate text recognition phase and its associated complexities.
Pix2Struct and Donut represent a class of Document Visual Question Answering (DocVQA) models that bypass traditional Optical Character Recognition (OCR) steps by directly processing document images as pixel data. These models employ sequence-to-sequence architectures, typically utilizing transformers, to learn mappings between visual document features and corresponding answer sequences. Pix2Struct frames the problem as a structured prediction task, while Donut leverages a unified approach with a transformer encoder-decoder architecture. Both models are trained end-to-end, allowing them to learn relevant visual features – such as layout, formatting, and table structures – directly from the image data, without the potential errors or information loss inherent in OCR-based pipelines.
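As a concrete illustration of the end-to-end interface these models expose, the sketch below queries the publicly released Donut DocVQA checkpoint through the Hugging Face transformers library. The checkpoint name and prompt format follow that public release; the file name is a placeholder, and this is a minimal usage sketch rather than the evaluation pipeline used in the paper.

```python
# Minimal sketch: asking Donut a question about a document image with
# no OCR step. Uses the public Hugging Face checkpoint; "invoice.png"
# is a placeholder for any document scan.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-docvqa"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("invoice.png").convert("RGB")
question = "What is the total amount due?"
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model.generate(pixel_values,
                             decoder_input_ids=decoder_input_ids,
                             max_length=128)
# Output contains the prompt followed by the predicted answer tokens.
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```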
OCR-free DocVQA methods demonstrate improved accuracy and robustness compared to traditional OCR-based pipelines, particularly when processing complex document types. Traditional systems are vulnerable to errors introduced during the OCR process, which are then propagated to the question answering stage. Documents with low image quality, unusual fonts, or complex layouts frequently cause OCR failures. By directly processing document images, OCR-free models avoid these intermediate text extraction errors, learning to associate visual features with answers. This end-to-end approach allows the model to implicitly correct for distortions and noise, resulting in more reliable performance on challenging document types such as historical manuscripts, forms, and tables.

The Persistent Threat: Adversarial Vulnerabilities in Visual Reasoning
OCR-free Document Visual Question Answering (DocVQA) models eliminate the failure modes associated with Optical Character Recognition, yet they remain susceptible to adversarial perturbations. These perturbations consist of carefully designed, often imperceptible, modifications to the input document image. Such manipulations do not alter the semantic content as perceived by a human, but can nonetheless cause the model to produce incorrect answers. The susceptibility stems from the model’s reliance on visual features and its learned associations between those features and correct responses; even small changes to these features can disrupt the model’s decision-making process. The effect is analogous to adversarial attacks on image classification models, transplanted to the domain of document understanding, and can be achieved through pixel-level modifications or more complex transformations.
A formalized threat model for OCR-free Document Visual Question Answering (DocVQA) systems is essential for systematically identifying potential vulnerabilities and guiding the development of robust defenses. This model must delineate the attacker’s capabilities – including knowledge of the system architecture, access to training data, and permissible manipulation of input documents – and define the attack surface, encompassing all potential entry points for malicious input. Specifically, the threat model should categorize attacks based on their goals – such as causing misclassification, extracting sensitive information, or denial-of-service – and articulate measurable security goals, like maintaining a specified accuracy level under adversarial conditions. Without a clearly defined threat model, evaluating the effectiveness of defense mechanisms and comparing the security of different DocVQA architectures becomes significantly more difficult, hindering progress in securing these systems against real-world attacks.
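One way to make such a threat model operational is to encode its axes as data, so attack evaluations can be enumerated systematically. The sketch below is a hypothetical encoding – the field and class names are ours, not the paper’s – of the dimensions just described.

```python
# Sketch: encoding a DocVQA threat model as data for systematic
# evaluation. All names are illustrative, not taken from the paper.
from dataclasses import dataclass
from enum import Enum, auto

class Knowledge(Enum):
    WHITE_BOX = auto()   # full access to weights and gradients
    GRAY_BOX = auto()    # architecture known, weights unknown
    BLACK_BOX = auto()   # query access only

class Goal(Enum):
    UNTARGETED = auto()  # any wrong answer suffices
    TARGETED = auto()    # force a specific forged answer

@dataclass
class ThreatModel:
    knowledge: Knowledge
    goal: Goal
    eps_linf: float                    # allowed L-infinity budget
    attack_surface: str                # e.g. "input image", "resizing step"
    min_accuracy_under_attack: float   # measurable security goal

tm = ThreatModel(Knowledge.WHITE_BOX, Goal.TARGETED,
                 4 / 255, "resizing step", 0.9)
```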
Adversarial attacks leveraging perturbations in image preprocessing steps, specifically resizing operations, represent a significant vulnerability in OCR-free Document Visual Question Answering (DocVQA) systems. These attacks exploit the sensitivity of deep learning models to input variations; even minor alterations during resizing – such as slight scaling or shifts – can introduce distortions that propagate through the model, leading to incorrect answers. Performance degradation is directly correlated to the magnitude and nature of these perturbations, with even imperceptible changes capable of inducing substantial errors. This attack vector is critical because preprocessing is often assumed to be a benign step, and defenses are not typically applied at this level, leaving systems exposed to relatively simple manipulations.
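To make the resizing vector concrete: because bilinear resizing is differentiable, an attacker can optimize a perturbation on the full-resolution document so that the resized image drifts toward an attacker-chosen target. The sketch below illustrates this general idea under an L-infinity budget; it is an illustrative reconstruction, not the paper’s attack.

```python
# Sketch of an image-scaling attack: optimize a full-resolution document
# so its *resized* version approaches an attacker-chosen target image.
# Illustrative only; the paper's attack and parameters may differ.
import torch
import torch.nn.functional as F

def scaling_attack(src, target, out_size, steps=200, lr=0.01, eps=8 / 255):
    """src: (1,3,H,W) clean document in [0,1]; target: (1,3,h,w) target;
    out_size: (h, w) resolution the victim pipeline resizes to."""
    delta = torch.zeros_like(src, requires_grad=True)
    for _ in range(steps):
        resized = F.interpolate(src + delta, size=out_size,
                                mode="bilinear", align_corners=False)
        loss = F.mse_loss(resized, target)   # push the resized view toward target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()  # signed gradient step
            delta.clamp_(-eps, eps)          # keep full-resolution change small
            delta.grad.zero_()
    return (src + delta).clamp(0, 1).detach()
```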

Empirical Assessment: Quantifying Robustness Under Adversarial Conditions
Evaluation of OCR-free Document Visual Question Answering (DocVQA) models on the PFL-DocVQA dataset revealed significant vulnerabilities to adversarial attacks. Utilizing Projected Gradient Descent (PGD), researchers successfully generated imperceptible perturbations to input documents, consistently manipulating the models’ predicted answers with a near 100% success rate under specific conditions. This indicates that even minor, carefully crafted alterations to the document image are sufficient to induce incorrect responses, highlighting a critical weakness in current OCR-free DocVQA architectures and their reliance on visual features without robust error correction mechanisms.
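A minimal PGD sketch of this style of attack is shown below, assuming white-box access to a differentiable OCR-free model whose forward pass maps pixel values and decoder inputs to token logits. `model`, the tensor shapes, and the hyperparameters are illustrative placeholders, not the authors’ implementation.

```python
# Minimal PGD sketch against a differentiable OCR-free DocVQA model.
# `model` is a placeholder for an end-to-end image-to-answer model;
# teacher-forcing details for the decoder are omitted for brevity.
import torch

def pgd_forgery(model, image, question_ids, target_ids,
                eps=4 / 255, alpha=1 / 255, steps=100):
    """Craft an L-infinity-bounded perturbation that pushes the model's
    decoded answer toward `target_ids` (a targeted attack)."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        logits = model(pixel_values=adv,
                       decoder_input_ids=question_ids).logits
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), target_ids.view(-1))
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # make target likely
            adv = image + (adv - image).clamp(-eps, eps)  # project into eps-ball
            adv = adv.clamp(0, 1).detach()
    return adv
```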
Average Normalized Levenshtein Similarity (ANLS) was utilized to assess the degree of alteration in predicted answers following adversarial attacks. ANLS measures the similarity between the predicted answer and the ground truth by normalizing the Levenshtein distance – the minimum number of edits needed to change one string into the other – by the length of the longer string. During experimentation, consistently high ANLS scores were observed for non-targeted questions – those the attack was not designed to influence – even when the document carried adversarial perturbations. This indicates the forgeries are localized in effect: they alter the targeted answer while leaving the model’s ability to extract correct information for other questions largely intact.
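For reference, ANLS can be computed as in the sketch below, which follows the standard DocVQA formulation (one ground truth per question for simplicity; the benchmark takes the maximum over several reference answers and zeroes out scores below a threshold, commonly 0.5).

```python
# ANLS sketch: similarity is 1 - Levenshtein(pred, gt) / max(len), with
# scores below a threshold tau zeroed, averaged over questions.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(preds, truths, tau=0.5):
    scores = []
    for p, t in zip(preds, truths):
        p, t = p.strip().lower(), t.strip().lower()
        nl = levenshtein(p, t) / max(len(p), len(t), 1)
        scores.append(1 - nl if 1 - nl >= tau else 0.0)
    return sum(scores) / len(scores)

print(anls(["42 USD"], ["42 usd"]))  # 1.0 after normalization
```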
Current OCR-free Document Visual Question Answering (DocVQA) models demonstrate vulnerability to adversarial attacks, necessitating the development of robust defense mechanisms. Experiments indicate that while manipulating a single question-answer pair can be highly effective, the Donut model exhibits significantly improved resilience to attacks targeting multiple question-answer pairs concurrently, with the attack success rate dropping to nearly 0% under such conditions. This suggests that architectural features or training methodologies within the Donut model provide a degree of inherent robustness against coordinated adversarial perturbations, and further investigation into these aspects could inform the design of more secure DocVQA systems.
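Attacking several question-answer pairs at once amounts to optimizing one perturbation against a joint objective, for example the sum of per-question targeted losses; a plausible reading of the Donut result is that a single perturbation struggles to satisfy several competing constraints simultaneously. The sketch below shows such a joint loss, to be minimized by the same PGD loop as the single-pair attack; it is illustrative, not the paper’s code.

```python
# Sketch: joint targeted loss over several QA pairs. One perturbation
# must lower every term at once. Illustrative placeholders throughout.
import torch
import torch.nn.functional as F

def multi_qa_loss(model, adv_image, qa_pairs):
    """qa_pairs: list of (question_ids, target_ids) tensor tuples."""
    total = 0.0
    for question_ids, target_ids in qa_pairs:
        logits = model(pixel_values=adv_image,
                       decoder_input_ids=question_ids).logits
        total = total + F.cross_entropy(
            logits.view(-1, logits.size(-1)), target_ids.view(-1))
    return total
```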

Towards Resilient Document AI: Charting a Path for Future Research
Contemporary document AI security research largely centers on generating adversarial examples – subtle alterations to documents designed to mislead AI systems – tailored to the specific characteristics of each individual file. While effective against targeted attacks, this approach struggles with scalability and real-world applicability. A promising, yet significantly more difficult, avenue lies in identifying universal perturbations – minimal, document-agnostic modifications that consistently disrupt the performance of document AI models across a diverse range of inputs. Discovering such perturbations requires overcoming substantial hurdles, including the high dimensionality of document data and the complexity of neural network architectures, but success would yield a far more robust defense strategy, capable of proactively mitigating threats without requiring per-document analysis. The development of these generalized adversarial patterns represents a key frontier in building document AI systems resilient to unforeseen attacks and capable of reliable performance in dynamic environments.
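In code, the search for such a perturbation might look like the sketch below, in the spirit of universal adversarial perturbations for image classifiers: a single shared delta is updated with signed gradients across many documents. This is an illustrative reconstruction under stated assumptions (fixed input resolution, white-box gradient access), not a method from the article.

```python
# Sketch: accumulate one document-agnostic perturbation across a corpus.
# Assumes a fixed model input resolution and white-box gradients.
import torch

def universal_delta(model, loader, loss_fn, shape,
                    eps=4 / 255, alpha=1 / 255, epochs=5):
    """shape: (1, 3, H, W), broadcast over every batch in `loader`."""
    delta = torch.zeros(shape, requires_grad=True)
    for _ in range(epochs):
        for image, question_ids, answer_ids in loader:
            logits = model(pixel_values=(image + delta).clamp(0, 1),
                           decoder_input_ids=question_ids).logits
            loss = loss_fn(logits, answer_ids)   # e.g. push answers off the truth
            loss.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()  # one shared, untargeted ascent
                delta.clamp_(-eps, eps)
                delta.grad.zero_()
    return delta.detach()
```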
The development of truly robust Document AI hinges on the creation of end-to-end differentiable pipelines. These systems allow for the seamless propagation of gradients from the output of a document processing model – such as text extraction or table understanding – back through all stages of the pipeline, including image preprocessing and layout analysis. This complete differentiability is crucial for crafting effective adversarial attacks, where subtle, strategically designed perturbations are applied to documents to intentionally mislead the AI. More importantly, it enables the development of corresponding defenses by allowing researchers to directly optimize the model’s resistance to these attacks. Without this holistic, gradient-based approach, optimizing for robustness becomes a significantly more challenging and less effective endeavor, limiting the potential for deploying Document AI in sensitive, real-world applications where reliability is paramount.
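The point can be demonstrated in a few lines: when resizing and normalization are implemented with differentiable operations, the answer loss back-propagates all the way to the raw scan, which is precisely what both gradient-based attacks and adversarial training require. The pipeline below is an illustrative placeholder, not a specific system.

```python
# Sketch: a fully differentiable preprocess-then-model pipeline. The
# gradient of the answer loss reaches the raw image, enabling both
# attack crafting and robustness training. Placeholders throughout.
import torch
import torch.nn.functional as F

def pipeline_loss(model, raw_image, question_ids, answer_ids, size=(960, 720)):
    x = F.interpolate(raw_image, size=size, mode="bilinear",
                      align_corners=False)   # differentiable resize
    x = (x - 0.5) / 0.5                      # differentiable normalization
    logits = model(pixel_values=x, decoder_input_ids=question_ids).logits
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           answer_ids.view(-1))

raw = torch.rand(1, 3, 2000, 1500, requires_grad=True)  # stand-in raw scan
# loss = pipeline_loss(model, raw, question_ids, answer_ids)
# loss.backward()   # raw.grad then holds d(loss)/d(pixel), end to end
```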
The development of truly robust document AI hinges on overcoming current vulnerabilities to even subtle alterations in input data. Successfully navigating the challenges of universal perturbations and establishing end-to-end differentiability will unlock systems capable of performing reliably in unpredictable, real-world conditions. These advancements promise to move beyond carefully curated datasets and controlled environments, allowing document AI to function effectively with scanned paperwork, low-resolution images, and diverse document layouts. Such resilient systems are crucial for applications ranging from automated financial processing and legal document review to healthcare data extraction and government services, ultimately fostering greater trust and wider adoption of this transformative technology.

The pursuit of robustness in Document Visual Question Answering systems, as highlighted in the study, reveals a fundamental truth about algorithmic design. Even architectures that bypass traditional Optical Character Recognition – and therefore seemingly sidestep a common vulnerability – remain susceptible to manipulation through adversarial attacks. This echoes Andrew Ng’s assertion: “The pattern of how you get things done is more important than what you get done.” The elegance of an OCR-free approach is irrelevant if the underlying system’s decision-making process lacks consistency and is vulnerable to subtly crafted perturbations. The focus must shift from simply achieving a correct answer on a given dataset to ensuring the provability of the algorithm’s integrity against all possible inputs, however cleverly disguised.
What Lies Ahead?
The demonstrated vulnerability of end-to-end Document Visual Question Answering systems to subtly crafted adversarial perturbations is not merely a technical curiosity, but a fundamental indictment of current architectural approaches. The reliance on correlation, rather than true comprehension, is laid bare. These systems, despite achieving impressive performance on benchmark datasets, exhibit a fragility that belies any claim of genuine intelligence. The fact that such attacks succeed even without utilizing Optical Character Recognition – bypassing the traditionally assumed point of failure – is particularly damning.
Future work must move beyond empirical defenses and focus on provable robustness. Patch-based attacks, while effective, represent a relatively weak form of adversarial manipulation. The true challenge lies in crafting perturbations that are not merely visually imperceptible, but also mathematically indistinguishable from natural document variations. A robust system should not be defined by its ability to withstand specific attacks, but by its adherence to fundamental principles of information processing – a verifiable guarantee of correctness, not merely a high score on a test set.
Ultimately, the field requires a shift in focus. The pursuit of ever-larger datasets and more complex models is a distraction. The elegance of an algorithm is not measured in the number of parameters, but in its asymptotic behavior and scalability. True progress will come from identifying the minimal set of axioms necessary for document understanding, and constructing systems that are provably consistent with those axioms.
Original article: https://arxiv.org/pdf/2512.04554.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/