AI Learns to Find Software Bugs with Human-Like Reasoning

Author: Denis Avetisyan


A new framework combines the power of artificial intelligence with formal verification techniques to autonomously discover vulnerabilities in software code.

The distribution of finding severity varies predictably across configurations, suggesting that system architecture fundamentally shapes the nature and prevalence of emergent failures.

QRS leverages large language models to synthesize CodeQL queries, achieving higher accuracy and uncovering previously unknown software flaws.

Despite advances in Static Application Security Testing (SAST), current tools struggle with false positives and are limited to predefined vulnerability patterns. This limitation motivates the development of ‘QRS: A Rule-Synthesizing Neuro-Symbolic Triad for Autonomous Vulnerability Discovery’, a novel neuro-symbolic framework that inverts the traditional SAST paradigm by employing Large Language Models to generate CodeQL queries and refine vulnerability detection. Evaluated on real-world Python packages, QRS achieves high accuracy and uncovers previously unknown vulnerabilities, including findings corroborated by both new CVE assignments and independent researcher discovery. Could this approach usher in a new era of autonomous, adaptable software security testing that surpasses the limitations of existing rule-based systems?


The Expanding Fracture: A System’s Inevitable Exposure

The expanding digital landscape presents software with a continually growing attack surface, fundamentally altering the risk profile for applications and systems. Modern software isn’t simply code; it’s a complex interplay of applications, libraries, APIs, and network connections, each representing a potential entry point for malicious actors. Critically, vulnerabilities such as Remote Code Execution – allowing attackers to control systems remotely – and Path Traversal – enabling unauthorized access to files – are not isolated incidents but rather consistently emerging threats. These flaws arise from increasing code complexity, the rapid adoption of open-source components, and the constant pressure to deliver features quickly. Consequently, software is perpetually exposed to new and evolving exploits, demanding a proactive and adaptive security posture to mitigate the inherent risks associated with this expanding attack surface.
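To make the Path Traversal class concrete, here is a minimal, hypothetical sketch in Python (the names `resolve_under`, `base_dir`, and the paths are illustrative, not from the paper): a naive `open(os.path.join(base_dir, filename))` lets a filename like `"../../etc/passwd"` walk out of the intended directory, while resolving and checking the path blocks the escape.

```python
import os

def resolve_under(base_dir: str, filename: str) -> str:
    """Resolve `filename` against `base_dir`, rejecting directory escapes.

    The vulnerable pattern is opening os.path.join(base_dir, filename)
    directly: "../../etc/passwd" traverses out of base_dir.
    """
    base = os.path.realpath(base_dir)
    path = os.path.realpath(os.path.join(base_dir, filename))
    # After normalization, the resolved path must still live under base.
    if os.path.commonpath([base, path]) != base:
        raise ValueError("path traversal attempt blocked")
    return path
```

The safer idiom is to normalize first and compare against the base directory, rather than trusting the raw user-supplied name.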

Contemporary software development practices, characterized by rapid iteration, microservices, and extensive third-party dependencies, have dramatically outpaced the capabilities of conventional vulnerability detection techniques. Static and dynamic analysis, while foundational, often struggle to accurately model the complex interactions within these sprawling codebases, resulting in a high rate of false positives and, more critically, missed vulnerabilities. This inability to scale effectively with modern software architectures creates significant risk, as attackers require fewer successful exploits to compromise increasingly complex systems. Furthermore, the sheer volume of code and the constant introduction of new features and libraries amplify the challenge, rendering manual review impractical and automated tools less reliable without substantial refinement and integration with intelligent threat modeling.

The relentless surge in reported software vulnerabilities, meticulously cataloged in resources like the Common Vulnerabilities and Exposures (CVE) database, presents a growing challenge for cybersecurity professionals. This constant influx overwhelms security analysts, contributing to a phenomenon known as alert fatigue – where the sheer volume of warnings diminishes their ability to effectively prioritize and address genuine threats. Recent analysis highlights the severity of this issue, revealing that 34 vulnerabilities were discovered within the top 100 most downloaded Python packages on the PyPI repository. This indicates that even commonly used, widely-distributed software components are susceptible to flaws, amplifying the risk for applications and systems that rely on them and underscoring the need for automated vulnerability management solutions and improved software supply chain security.

The Neuro-Symbolic Glimmer: A System’s Attempt at Self-Awareness

The QRS Framework combines Large Language Models (LLMs) and CodeQL to improve vulnerability detection capabilities. LLMs are utilized for hypothesis generation and semantic understanding of code, which guides the creation of focused CodeQL queries. CodeQL, a semantic code analysis engine, then precisely examines the codebase for patterns matching the generated hypotheses. This synergistic approach leverages the LLM’s ability to reason about potential vulnerabilities and CodeQL’s accuracy in identifying specific code instances, resulting in enhanced detection rates and a reduction in false positives compared to traditional Static Application Security Testing (SAST) methods. Testing demonstrated the framework’s ability to identify 34 Common Vulnerabilities and Exposures (CVEs), including five previously unknown vulnerabilities.

The QRS Framework utilizes a sequential pipeline of three specialized agents to facilitate vulnerability detection. The Query Agent is responsible for generating targeted CodeQL queries based on hypothesized vulnerabilities. Following query execution, the Review Agent analyzes the results, filtering and prioritizing potential vulnerabilities while minimizing false positives. Finally, the Sanitize Agent refines the identified issues, providing a clear and concise report of actionable vulnerabilities with supporting evidence derived from both the LLM and CodeQL analysis. This agent-based approach enables a focused and efficient workflow, combining the strengths of neural and symbolic techniques.
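The three-stage flow above can be sketched as a toy pipeline. This is a minimal illustration of the Query → Review → Sanitize hand-off; the class names, the `Finding` structure, and the filtering rule are assumptions for demonstration, not the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule: str          # hypothesized vulnerability class that fired
    location: str      # file:line of the match
    confirmed: bool = False

class QueryAgent:
    def run(self, hypotheses):
        # Stand-in for CodeQL execution: one finding per hypothesis.
        return [Finding(rule=h, location=f"pkg/module.py:{i}")
                for i, h in enumerate(hypotheses, start=1)]

class ReviewAgent:
    def run(self, findings):
        # Stand-in for semantic validation: keep only rules this toy
        # reviewer "confirms"; a real agent reasons over data flow.
        for f in findings:
            f.confirmed = "traversal" in f.rule or "rce" in f.rule
        return [f for f in findings if f.confirmed]

class SanitizeAgent:
    def run(self, findings):
        # Produce the concise, actionable report.
        return [f"{f.rule} at {f.location}" for f in findings]

def qrs_pipeline(hypotheses):
    findings = QueryAgent().run(hypotheses)
    reviewed = ReviewAgent().run(findings)
    return SanitizeAgent().run(reviewed)
```

The point of the sequential design is that each stage narrows the previous one's output, so the final report contains only reviewed, deduplicated findings.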

The Query Agent within the QRS Framework utilizes hypothesis generation as a primary mechanism for focusing its CodeQL analysis. This process involves formulating specific, testable assertions about potential vulnerabilities within the target codebase. These hypotheses, derived from various sources including vulnerability patterns and LLM-based reasoning, are then translated into precise CodeQL queries. By generating targeted queries based on these hypotheses, the Query Agent avoids broad, inefficient scans and instead concentrates its efforts on areas of code most likely to contain vulnerabilities, increasing the precision and effectiveness of the vulnerability detection process.
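As a rough illustration of hypothesis-to-query translation, the sketch below fills a CodeQL-style template from a hypothesized dangerous sink. Both the template and the helper are hypothetical: real synthesized queries are far richer, and this is not the paper's generation mechanism, only the shape of the idea.

```python
# Hypothetical CodeQL-style template for "a call to a named sink".
# Real CodeQL queries model taint flow, not just call names.
QL_TEMPLATE = """import python

from Call call
where call.getFunc().(Name).getId() = "{sink}"
select call, "{message}"
"""

def hypothesis_to_query(sink: str, message: str) -> str:
    """Turn a hypothesis ('calls to <sink> are risky') into query text."""
    return QL_TEMPLATE.format(sink=sink, message=message)
```

A hypothesis such as "unvalidated input reaches `eval`" would instantiate the template with `sink="eval"`, yielding a query that scans only for that pattern instead of the whole codebase.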

The QRS Framework builds upon existing Static Application Security Testing (SAST) techniques to achieve enhanced vulnerability detection and a reduction in false positive results. Evaluation of the framework through testing identified a total of 34 Common Vulnerabilities and Exposures (CVEs); notably, this included the discovery of 5 previously unknown vulnerabilities. This performance demonstrates a statistically significant improvement in detection capabilities compared to traditional SAST solutions, indicating QRS’s ability to uncover a broader range of security flaws with greater accuracy.

The Deep Dive: Unearthing Hidden Flaws in a Complex System

The Review Agent employs Semantic Validation and Data Flow Analysis to assess code behavior by tracing execution paths and analyzing data dependencies. Semantic Validation confirms that code operations are logically consistent and adhere to expected behavior, while Data Flow Analysis tracks the movement and modification of data throughout the code. This combined approach allows the agent to identify vulnerabilities arising from incorrect data handling, such as time-of-check to time-of-use (TOCTOU) errors and Abstract Syntax Notation One (ASN.1) memory exhaustion issues, which are often missed by static analysis due to their runtime dependencies. The system examines how data is used at different points in the execution, identifying potential security flaws based on the flow of information.

The Review Agent employs Data Flow Analysis to identify runtime vulnerabilities that frequently evade static analysis techniques. Specifically, Time-of-Check to Time-of-Use (TOCTOU) vulnerabilities, arising from race conditions, and ASN.1 Memory Exhaustion flaws, stemming from improperly handled data structures, are detectable through tracing data propagation. This analysis traces how data values change along simulated execution paths, allowing the system to confirm whether assumptions made during static analysis hold true at runtime. By monitoring data access and modification, the framework can identify instances where data is used after its validity has changed, or where memory allocation fails due to maliciously crafted ASN.1 structures, thus providing detection capabilities beyond those of traditional static code analysis.
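The TOCTOU pattern mentioned above has a classic Python shape, sketched here with hypothetical function names: checking permissions with `os.access` and then opening the file leaves a window in which the file's state can change, whereas attempting the operation and handling failure removes the race.

```python
import os

def write_log_unsafe(path: str, line: str) -> None:
    # TOCTOU: the file can be swapped (e.g. for a symlink) between the
    # access() check (time of check) and the open() call (time of use).
    if os.access(path, os.W_OK):
        with open(path, "a") as f:
            f.write(line)

def write_log_safer(path: str, line: str) -> None:
    # No check-then-use window: attempt the operation and handle the
    # failure, so there is no gap for the state to change in.
    try:
        with open(path, "a") as f:
            f.write(line)
    except PermissionError:
        pass  # report or escalate as appropriate
```

Flagging the first variant requires knowing that the check and the use refer to the same path across two statements, which is exactly the cross-statement data-flow reasoning the Review Agent applies.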

The Review Agent incorporates Dependency Analysis to broaden vulnerability detection beyond the code itself, examining external libraries and components for known weaknesses. This process identifies vulnerable dependencies by cross-referencing project dependencies against publicly available vulnerability databases, such as the National Vulnerability Database (NVD). By identifying and flagging vulnerable third-party code, the framework proactively addresses risks introduced through external components, enhancing overall application security and reducing the attack surface. This analysis complements Semantic Validation and Data Flow Analysis, providing a more comprehensive assessment of potential vulnerabilities.
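The cross-referencing step reduces to a lookup of (package, version) pairs against advisory data. The sketch below uses an in-memory table with invented advisory IDs purely for illustration; a real implementation would query a vulnerability database such as the NVD or OSV instead.

```python
# Hypothetical advisory table; entries and IDs are invented for the
# example. A real checker would query NVD/OSV-style feeds.
ADVISORIES = {
    ("examplepkg", "1.0.0"): ["EXAMPLE-ADVISORY-0001"],
}

def flag_vulnerable_deps(dependencies):
    """Map each known-vulnerable (name, version) pair to its advisories.

    `dependencies` is an iterable of (name, version) tuples, e.g. as
    parsed from a lockfile.
    """
    return {dep: ADVISORIES[dep] for dep in dependencies if dep in ADVISORIES}
```

In practice the hard part is not the lookup but normalizing version ranges ("affected before 2.6.0") against pinned versions, which is why dedicated databases publish machine-readable affected-range metadata.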

The Review Agent’s detailed semantic analysis demonstrably reduces false positive alerts, mitigating alert fatigue for security teams. Quantitative results indicate a 65% reduction in alert fatigue when compared to conventional security review processes. Performance metrics on the Hist20 dataset further validate this improvement, with the system achieving 90.62% accuracy, 86.96% precision, and 100% recall. These figures suggest a substantial improvement in identifying genuine vulnerabilities while minimizing unproductive investigation of non-issues.
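The reported figures follow the standard confusion-matrix definitions. As a sanity check, the counts below (TP=20, FP=3, FN=0, TN=9) are one matrix consistent with the reported rates; the article does not give the actual Hist20 counts, so these are inferred for illustration only.

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """Standard accuracy / precision / recall from a confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Inferred counts consistent with 90.62% / 86.96% / 100%:
acc, prec, rec = metrics(tp=20, fp=3, fn=0, tn=9)
```

The 100% recall corresponds to zero false negatives: on this benchmark, every genuine vulnerability was surfaced, and the 86.96% precision means roughly one alert in eight was a false positive.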

The Inevitable Cascade: A System’s Resilience in the Face of Failure

The QRS Framework significantly bolsters software resilience through its broad vulnerability detection capabilities. It doesn’t simply flag common issues; the system actively seeks out a diverse range of threats, including particularly dangerous exploits like Open Redirects – which can facilitate phishing attacks – and Remote Code Execution vulnerabilities, allowing attackers to take complete control of a system. This proactive approach extends beyond typical security scans, offering a layered defense that strengthens the entire software supply chain. By identifying and mitigating these weaknesses before they can be exploited, the framework minimizes the attack surface and helps organizations maintain the integrity and availability of their applications, ultimately reducing the risk of significant operational disruption and data compromise.
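For the Open Redirect class mentioned above, a common defense is to validate the redirect target before honoring it. The helper below is a hedged, stdlib-only sketch (the function name and policy are assumptions, not the paper's): it accepts relative URLs and absolute URLs on the expected host, and rejects everything else, including the scheme-relative `//evil.example` trick.

```python
from urllib.parse import urlparse

def safe_redirect_target(url: str, allowed_host: str) -> bool:
    """Return True only if `url` stays on-site or on `allowed_host`.

    A handler doing redirect(request.args["next"]) unchecked lets an
    attacker craft links that bounce victims to a phishing site.
    """
    parsed = urlparse(url)
    # Relative URLs have empty scheme and netloc; scheme-relative URLs
    # like "//evil.example/x" carry a netloc and are caught here.
    return (parsed.netloc in ("", allowed_host)
            and parsed.scheme in ("", "http", "https"))
```

The key design choice is an allowlist (same host or relative path) rather than a blocklist of known-bad destinations, which attackers can trivially route around.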

A significant challenge in vulnerability management is the sheer volume of alerts, often burdened by false positives that overwhelm security analysts. The QRS Framework directly addresses this issue by employing advanced filtering and correlation techniques, dramatically reducing the number of inaccurate alerts. This reduction isn’t merely about fewer notifications; it fundamentally alters the workflow for security teams, freeing up valuable time and cognitive resources. Instead of exhaustively investigating non-threats, analysts can concentrate their expertise on verifying and remediating genuine vulnerabilities, ultimately strengthening an organization’s security posture and minimizing the risk of critical exploits.

The QRS Framework doesn’t operate in isolation; rather, it’s designed to seamlessly integrate with Static Application Security Testing (SAST) tools already in use by development and security teams. This interoperability avoids the disruption and expense of completely overhauling existing workflows. By augmenting established SAST pipelines, QRS streamlines vulnerability management, reducing the time and effort required to identify and remediate flaws. This collaborative approach allows security professionals to leverage their current investments while simultaneously benefiting from QRS’s enhanced detection capabilities, accelerating the path from vulnerability discovery to secure code deployment and bolstering overall application resilience.

A robust vulnerability management strategy is paramount in today’s interconnected digital landscape, and a comprehensive approach significantly bolsters the security of the broader software ecosystem. Recent evaluations demonstrate that the QRS framework surpasses conventional tools – including widely adopted solutions like GitHub’s CodeQL, Dependabot, and Copilot – in identifying critical weaknesses. Notably, QRS successfully pinpointed five previously unknown Common Vulnerabilities and Exposures (CVEs) that escaped detection by these established platforms, highlighting its advanced capabilities. This ability to uncover previously hidden vulnerabilities directly translates to reduced risk for organizations, shielding them from potentially devastating financial losses and the erosion of public trust associated with data breaches and security incidents.

The pursuit of automated vulnerability discovery, as demonstrated by QRS, inevitably courts imperfection. The system, in its attempt to synthesize CodeQL queries and iteratively refine detection, doesn’t strive for flawless security, but rather for a dynamic resilience. As Ken Thompson observed, “A system that never breaks is dead.” QRS embraces this truth; its neuro-symbolic approach isn’t about eliminating all vulnerabilities, but about establishing a process for continuous discovery and adaptation. The framework’s iterative refinement mirrors a natural selection of queries, strengthening the system through exposure to edge cases and previously unknown flaws, a testament to the inherent value of controlled failure.

What’s Next?

The pursuit of automated vulnerability discovery, as exemplified by QRS, inevitably shifts the focus from isolated flaws to the vulnerabilities inherent in the automation itself. The system proposes solutions, but each solution introduces new dependencies, new surfaces for future compromise. It splits the problem, but not its fate. The elegance of neuro-symbolic synthesis belies a simple truth: every query, however cleverly generated, is a limited view, a prophecy of what won’t be found.

Future work will undoubtedly explore scaling these systems, broadening their coverage. Yet, the real challenge isn’t quantity, but the increasing opacity of the software ecosystem. As codebases grow and interdependencies proliferate, the very notion of “discovery” becomes suspect. The system may identify existing weaknesses, but it also reshapes the attack surface, creating novel avenues for exploitation that were previously unforeseen.

The path forward isn’t simply more automation, but a deeper understanding of systemic risk. The goal isn’t to eliminate vulnerabilities (that’s a category error) but to build systems that are resilient in the face of inevitable compromise. Everything connected will someday fall together; the task isn’t to prevent the fall, but to design for graceful degradation.


Original article: https://arxiv.org/pdf/2602.09774.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-02-11 23:10