Author: Denis Avetisyan
New research explores whether large language models can perform the logical reasoning needed for formal program verification, and reveals significant limitations in their ability to handle complex proofs.

This paper introduces VCoT-Bench, a benchmark for evaluating LLMs on formal verification tasks, and VCoT-Lift, a framework to assess reasoning by explicitly lifting low-level theorem proving steps.
Despite advances in secure software development, the capacity of Large Language Models (LLMs) to perform rigorous formal verification remains largely unproven. This work, ‘Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought’, introduces VCoT-Lift-a framework for exposing the logical steps underlying automated theorem proving-and VCoT-Bench, a benchmark designed to assess LLMs’ understanding of the entire verification process. Our evaluation of ten state-of-the-art models reveals significant fragility, indicating current LLMs struggle with reasoning beyond superficial patterns and local context. Can we develop LLMs capable of truly understanding and contributing to the complex logic required for robust program verification?
Unmasking the Machine: The Challenge of Verifiable Proof
Despite the remarkable capabilities of automated theorem provers such as Z3, a significant hurdle remains in their widespread adoption: the opacity of their generated proofs. While these tools can rigorously determine the correctness of complex systems, the proof trails they produce are often unintelligible to human analysts. This presents a critical problem, as verifying a proof’s validity is as important as obtaining it; if a human cannot readily understand why a system is deemed correct, establishing trust and identifying potential errors within the verification process becomes exceptionally difficult. Consequently, debugging formal verification results-pinpointing the source of an issue when verification fails-can be a laborious and time-consuming undertaking, effectively diminishing the practical benefits of automated formal methods, particularly in domains demanding high assurance and human oversight.
The pursuit of increasingly robust formal verification faces a significant hurdle: translating the exacting precision of machine proofs into a format readily understood by human analysts. Current methods excel at identifying whether a system meets its specifications, but often generate proofs that are opaque and difficult to follow – essentially a ‘black box’ confirmation. This disparity hinders debugging, prevents effective validation of the proof itself, and limits the scalability of formal methods. Bridging this gap necessitates developing techniques that prioritize not only correctness, but also clarity and interpretability, allowing engineers to confidently assess and trust the underlying reasoning – a crucial step towards wider adoption in safety-critical applications where human oversight remains paramount.
The practical implementation of formal verification methods, despite their potential for enhancing system safety, faces a significant hurdle in domains demanding human oversight. Safety-critical systems-including those governing aviation, medical devices, and autonomous vehicles-require not only assurances of correctness, but also the ability for engineers to understand why a system is deemed safe. Current automated verification tools often generate proofs that, while mathematically valid, are opaque and unintelligible to human reviewers. This lack of transparent reasoning undermines trust in the verification process, as engineers cannot readily audit the logic or identify potential flaws in the underlying assumptions. Consequently, the adoption of formal methods is slowed, as organizations prioritize explainability and the capacity for human-in-the-loop debugging over purely machine-verified correctness, particularly when faced with high-stakes consequences from system failures.

Decoding the Machine: VCoT-Lift and the Human-Readable Proof
VCoT-Lift utilizes Large Language Models (LLMs) to convert formal proofs generated by the Z3 theorem prover into a human-readable Verification Chain-of-Thought (VCoT). Z3 outputs proofs as complex terms that are difficult for humans to interpret. VCoT-Lift addresses this by employing an LLM to translate these proof terms into a sequence of logical steps, effectively reconstructing the reasoning process. This transformation aims to provide a more accessible and understandable representation of the verification process, bridging the gap between automated formal verification and human intuition. The LLM is tasked with generating a step-by-step explanation that mirrors the logical flow of the original Z3 proof, while maintaining semantic equivalence.
The VCoT-Lift system employs a Proof Transformer to convert formal proofs generated by the Z3 theorem prover into equivalent specifications expressed in the Verus functional verification language. This transformation facilitates human readability and understanding of the proof process. Following transformation, a Proof Checker component verifies the completeness of the resulting Verus specification against the original Z3 proof term, ensuring no logical steps are omitted during conversion and maintaining proof validity. This two-step process-transformation followed by completeness checking-is central to enabling the translation of machine-level proofs into a format suitable for human reasoning and analysis.
The Proof Pruner and Proof Repair modules are essential for generating human-readable verification outputs from formally-proven results. The Pruner identifies and eliminates redundant steps within the transformed proof, reducing its length and improving clarity without altering its logical validity. This process relies on identifying steps that do not contribute new information to the overall proof. Proof Repair addresses errors that may arise during the transformation process from Z3 proof terms to Verus specifications; it attempts to correct these errors by identifying the source of the issue and applying logical transformations to restore the proof’s soundness. Both modules operate iteratively to refine the proof until a complete and concise verification chain is achieved.
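The Pruner and Checker roles described above can be sketched in plain Rust. This is a hypothetical illustration only: the type and function names are invented for this sketch, and string matching stands in for the real modules, which operate on Verus specifications and Z3 proof terms.

```rust
// Hypothetical sketch of two VCoT-Lift stages: pruning redundant steps
// and checking that every original proof obligation is still covered.
// All names and data shapes here are illustrative, not the paper's API.

#[derive(Debug, Clone)]
struct ProofStep {
    text: String,
    adds_information: bool, // whether the step contributes a new fact
}

/// Proof Pruner analogue: drop steps that add no new information,
/// shortening the chain without changing its logical content.
fn prune(steps: Vec<ProofStep>) -> Vec<ProofStep> {
    steps.into_iter().filter(|s| s.adds_information).collect()
}

/// Proof Checker analogue: a completeness check, reduced here to
/// verifying that every obligation appears somewhere in the chain.
fn is_complete(steps: &[ProofStep], obligations: &[&str]) -> bool {
    obligations
        .iter()
        .all(|ob| steps.iter().any(|s| s.text.contains(*ob)))
}

fn main() {
    let steps = vec![
        ProofStep { text: "assert x >= 0".to_string(), adds_information: true },
        ProofStep { text: "restate x >= 0".to_string(), adds_information: false },
        ProofStep { text: "conclude x + 1 > 0".to_string(), adds_information: true },
    ];
    let pruned = prune(steps);
    assert_eq!(pruned.len(), 2);
    assert!(is_complete(&pruned, &["x >= 0", "x + 1 > 0"]));
    println!("pruned chain: {} steps, complete", pruned.len());
}
```

The iterative refinement loop in the actual system would alternate these two checks with Proof Repair until the chain is both concise and complete.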

Orchestrating Logic: Guiding LLMs with a Z3 Rule Hierarchy
A Z3 Rule Hierarchy is utilized to structure the proof process for Large Language Models (LLMs) during the generation of Verified Chain-of-Thought (VCoTs). This hierarchy defines an order of application for logical rules within the Z3 theorem prover, effectively prioritizing proof steps. By presenting rules to the LLM in this pre-defined order, the model’s attention is directed towards the most critical reasoning components first. This prioritization mechanism improves the quality of the generated VCoTs by increasing the likelihood of successful verification and reducing the occurrence of logical errors during the transformation process. The hierarchical structure ensures that foundational reasoning precedes more complex steps, fostering a more robust and reliable VCoT generation.
The Z3 rule hierarchy operates on a principle of prioritized reasoning, initially applying abstract, high-level rules to identify and execute core logical steps. This ensures that the most critical transformations are captured early in the verification process, establishing the foundational elements of the VCoT. Subsequent application of lower-level, more granular rules then refines these initial steps, addressing specific details and completing the proof. This staged approach prevents the LLM from being overwhelmed by complexity and focuses its attention on essential reasoning before addressing implementation-level concerns, ultimately improving the efficiency and accuracy of the verification process.
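As a rough illustration of this staged prioritization, one can imagine each rule carrying an abstraction level, with high-level rules surfaced first. The rule names below are drawn from Z3's proof vocabulary, but the level assignments are invented for this sketch:

```rust
// Illustrative sketch of prioritized rule application. The rule names come
// from Z3's proof rules; the level values are assumptions for this example.
#[derive(Debug)]
struct Rule {
    name: &'static str,
    level: u8, // 0 = most abstract, applied/presented first
}

/// Order rules so high-level reasoning steps reach the model before
/// granular, implementation-level ones.
fn prioritize(mut rules: Vec<Rule>) -> Vec<Rule> {
    rules.sort_by_key(|r| r.level);
    rules
}

fn main() {
    let ordered = prioritize(vec![
        Rule { name: "rewrite", level: 2 },
        Rule { name: "modus-ponens", level: 0 },
        Rule { name: "unit-resolution", level: 1 },
    ]);
    assert_eq!(ordered[0].name, "modus-ponens");
    println!("{:?}", ordered.iter().map(|r| r.name).collect::<Vec<_>>());
}
```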
Employing a hierarchical rule structure during VCoT generation results in more concise outputs by prioritizing essential reasoning steps and deferring less critical details. This prioritization directly impacts interpretability, as a streamlined VCoT focuses attention on the core logic, reducing cognitive load for human review. Furthermore, the structured approach enhances trustworthiness; by explicitly defining the order of reasoning and reducing extraneous information, the VCoT provides a clearer audit trail and facilitates verification of the LLM’s decision-making process, increasing confidence in the generated results.

Measuring Reasoning Depth: VCoT-Bench and the Limits of Current LLMs
VCoT-Bench establishes a novel evaluation framework for large language models, specifically designed to assess their capacity for formal verification reasoning. Built upon a foundation of rigorously verified programs from the Verus project, the benchmark presents tasks requiring models to complete Verification Chains-of-Thought (VCoTs). These VCoTs demand not merely the provision of correct answers, but the construction of logically sound, step-by-step proofs – mirroring the process a human verifier would undertake. By leveraging existing, formally verified code, VCoT-Bench moves beyond typical question-answering formats and challenges LLMs to demonstrate genuine reasoning capabilities within a mathematically precise domain, offering a more robust and reliable measure of their formal verification skills.
VCoT-Bench constructs its reasoning challenges through meticulously designed Semantic Blocks, specifically leveraging Lemma, Invariant, and Assertion Blocks as fundamental components. These blocks aren’t simply presented as complete proofs; instead, the benchmark strategically introduces varying degrees of missing information within them, forcing language models to actively reconstruct the logical connections. Lemma Blocks establish foundational truths, Invariant Blocks define properties that remain consistent throughout a process, and Assertion Blocks confirm the final outcome. By systematically removing portions of these blocks, VCoT-Bench creates tasks that demand more than simple pattern completion; it requires models to demonstrate a genuine understanding of formal verification principles and the ability to infer missing logical steps, thus providing a nuanced evaluation of reasoning capabilities beyond superficial accuracy.
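A plain-Rust analogue makes the three block kinds concrete. The real benchmark expresses these as Verus clauses; here `debug_assert!` stands in for them, showing what a model must reconstruct when, say, the invariant line is masked:

```rust
// Plain-Rust analogue of VCoT-Bench's Semantic Blocks (illustrative only;
// the benchmark itself uses Verus syntax, not runtime assertions).

/// Lemma Block analogue: a foundational closed-form fact, sum of 1..=n.
fn closed_form(n: u64) -> u64 {
    n * (n + 1) / 2
}

/// Sums 1..=n, annotated with invariant and assertion analogues.
fn sum_to(n: u64) -> u64 {
    let mut total = 0u64;
    let mut i = 1u64;
    while i <= n {
        // Invariant Block analogue: holds at every loop head.
        debug_assert!(total == (i - 1) * i / 2);
        total += i;
        i += 1;
    }
    // Assertion Block analogue: result matches the lemma's closed form.
    debug_assert!(total == closed_form(n));
    total
}

fn main() {
    assert_eq!(sum_to(10), 55);
    println!("sum_to(10) = {}", sum_to(10));
}
```

Masking the `debug_assert!` on the loop head and asking a model to restore it is the flavor of completion task the benchmark poses, except stated over Verus proofs rather than runtime checks.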
Evaluations utilizing the VCoT-Bench benchmark reveal a substantial deficiency in the reasoning abilities of current large language models. When challenged with completing Verification Chains-of-Thought, even the leading model, Qwen 3, achieves a mere 0.66% accuracy. This strikingly low performance underscores the difficulty these models face when tasked with formal verification – a process demanding precise logical deduction and the seamless integration of verified program components. The results suggest that while LLMs excel at pattern recognition and text generation, they struggle with the rigorous demands of completing complex, logically-structured proofs, indicating a critical gap between their current capabilities and true reasoning proficiency.
Despite advancements in large language models, formal verification reasoning remains a substantial hurdle, as demonstrated by recent evaluations using VCoT-Bench. Even when presented with Verification Chains-of-Thought tasks where only 10% of the information is removed, leading models such as Claude Sonnet 4.5 achieve a completion accuracy of just 71.58%. This relatively low score underscores the difficulty these models face when tasked with deductive reasoning and completing proofs, even with substantial context provided. The results suggest that current LLMs struggle to reliably bridge gaps in information and maintain logical consistency when applied to the rigorous demands of formal verification-a critical area for ensuring software and hardware reliability.
The design of VCoT-Bench introduces a substantial increase in complexity when contrasted with the established Verus-Bench benchmark. Analyses reveal that completing Verification Chains-of-Thought within VCoT-Bench necessitates 6.5 times more proof lines, indicating a considerably more rigorous verification process. Furthermore, the benchmark demands 13.4 times the number of assertions, forcing models to explicitly confirm a far greater range of conditions. This heightened demand for logical confirmation is coupled with a 1.94-fold increase in the number of lemma functions required, signifying a deeper reliance on foundational, reusable proofs. These metrics collectively demonstrate that VCoT-Bench isn’t simply assessing existing verification skills, but actively probing a model’s capacity to construct and navigate significantly more complex formal arguments.

Beyond Correctness: Towards Trustworthy Rust and the Future of Verification
The increasing adoption of Rust in safety-critical systems necessitates robust verification tools, and VCoT-Lift and VCoT-Bench address this need by leveraging the Verus formal verification framework. Verus enables developers to write provably correct Rust code through the creation of specifications and the automated generation of proofs. VCoT-Lift builds upon this foundation, extending Verus’ capabilities to handle more complex programs, while VCoT-Bench provides a standardized suite of tests for evaluating the effectiveness of verification techniques. This combination significantly enhances the reliability of Rust applications by allowing for the detection of subtle bugs and vulnerabilities that traditional testing methods might miss, ultimately contributing to the development of more secure and dependable software.
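For a flavor of what Verus verifies, the sketch below states a contract in a comment and checks it at runtime in plain Rust. In actual Verus, the postcondition is written as an `ensures` clause inside a `verus!` block and discharged statically, with no runtime cost; this function and its contract are invented for illustration.

```rust
// Plain-Rust stand-in for a Verus contract (illustrative sketch).

/// Verus-style contract, roughly:
///   fn abs_diff(a: u64, b: u64) -> (r: u64)
///       ensures r == if a >= b { a - b } else { b - a }
fn abs_diff(a: u64, b: u64) -> u64 {
    let r = if a >= b { a - b } else { b - a };
    // Runtime stand-in for the `ensures` postcondition above.
    assert!(r == if a >= b { a - b } else { b - a });
    r
}

fn main() {
    assert_eq!(abs_diff(3, 10), 7);
    println!("abs_diff(3, 10) = {}", abs_diff(3, 10));
}
```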
A significant advancement within formal verification lies in the generation of proofs accessible to both developers and specialists. The Verus framework, upon which VCoT-Lift and VCoT-Bench are built, doesn’t merely confirm a program’s correctness with a simple ‘verified’ or ‘failed’ result; it constructs detailed, human-readable proofs outlining why the code meets its specifications. This transparency is crucial for fostering collaboration; developers can understand the verification process without needing deep expertise in formal methods, while verification experts can easily review and contribute to the proofs, ensuring their validity and identifying potential improvements. This shared understanding accelerates the development of reliable software, bridging the gap between implementation and rigorous correctness guarantees, and ultimately builds confidence in complex systems.
Ongoing development centers on significantly broadening the scope of the VCoT-Lift benchmark suite to encompass a more diverse array of Rust programs, thereby strengthening its capacity to assess verification techniques across various codebases and complexities. Researchers are also investigating the potential of VCoT-Lift not merely as a verification tool, but as a powerful debugging and explanatory aid; the framework’s formal proofs could illuminate program behavior, pinpoint the root causes of errors with greater precision, and offer developers deeper insights into the execution of their code. This dual functionality-verification and explanation-promises to transform how Rust software is developed and maintained, fostering greater confidence in its reliability and security.

The pursuit of formal verification, as demonstrated by VCoT-Bench, isn’t simply about achieving correct outputs; it’s about meticulously dissecting the process of reaching those outputs. This echoes Claude Shannon’s assertion: ‘The most important thing is to have a method.’ The framework reveals a critical limitation in current Large Language Models – an inability to consistently reason beyond immediate context, a failure to trace the logical progression inherent in verification chains. The study demonstrates that LLMs often prioritize syntactic patterns over semantic correctness, highlighting the necessity of a ‘method’ – a structured approach to reasoning – rather than merely pattern recognition. Like reverse-engineering a complex system, the VCoT-Lift framework exposes the vulnerabilities in LLM reasoning, prompting a deeper examination of their underlying mechanisms.
Where Do We Go From Here?
The exercise of forcing Large Language Models to mimic the rigor of formal verification isn’t about creating perfect provers-it’s about exposing the fault lines in what these models think they understand. VCoT-Bench, and the VCoT-Lift framework, don’t demonstrate capability; they highlight a persistent reliance on surface-level patterns. The illusion of reasoning, it seems, dissolves rapidly when confronted with anything beyond local context. One suspects the models are less ‘solving’ verification problems and more ‘remembering’ similar ones, cleverly disguised as deduction.
Future work shouldn’t focus on scaling up the training data – that merely refines the mimicry. The real challenge lies in constructing benchmarks that actively resist pattern-matching. Problems deliberately designed to require genuine abstraction, counterfactual reasoning, and the ability to identify subtle logical fallacies-not just syntactic errors-will be crucial. The goal isn’t to teach the models to pass tests; it’s to build tests that reveal the absence of actual understanding.
Perhaps, in dismantling the façade of reasoning, one might stumble upon the principles of actual intelligence. Or, more likely, reaffirm that intelligence, in its truest form, is a messy, unpredictable, and thoroughly un-scalable phenomenon. The attempt to reverse-engineer it, after all, may be fundamentally flawed. The click of truth, it appears, is often the sound of a system admitting its own limitations.
Original article: https://arxiv.org/pdf/2603.18334.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-20 14:01