Can AI Truly Verify Code?

Compared with Verus-Bench, VCoT-Bench-Org consistently has significantly fewer total proof lines per program, suggesting smaller proofs and, in turn, potentially more efficient verification.

New research explores whether large language models can perform the logical reasoning needed for formal program verification and reveals significant limitations in their ability to handle complex proofs.