Author: Denis Avetisyan
New research reveals that the way AI systems are instructed can create critical vulnerabilities, potentially turning email defenses into attack vectors.

Optimizing Large Language Model agents for phishing detection inadvertently introduces exploitable weaknesses susceptible to infrastructure phishing and signal inversion attacks.
While large language models offer promising avenues for automated security, their sensitivity to configuration creates a surprising paradox. This paper, ‘The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities’, demonstrates that prompt engineering is a first-order security variable, with a single model exhibiting phishing bypass rates ranging from under 1% to 97% depending on its configuration. Our analysis of 11 models and 10 prompt strategies reveals that optimizing for detection can inadvertently create brittle vulnerabilities, particularly through “infrastructure phishing” attacks that exploit assumptions embedded within the prompt itself. Can we navigate the inherent tension between security, usability, and robustness, and ultimately build LLM agents resilient to adversarial manipulation without sacrificing performance?
The Evolving Threat: Beyond Simple Detection
Conventional phishing defenses, historically reliant on techniques like domain and sender authentication, are demonstrating diminishing effectiveness against increasingly resourceful adversaries. Attackers now routinely circumvent these safeguards through methods such as domain spoofing, compromised accounts, and the exploitation of trust relationships. While once sufficient to flag obvious malicious emails, these traditional methods struggle with sophisticated attacks that closely mimic legitimate communications, leveraging techniques like typosquatting and visually similar domain names. This evolving threat landscape necessitates a shift towards more nuanced detection strategies that move beyond simple pattern matching and incorporate behavioral analysis, content understanding, and real-time threat intelligence to accurately identify and neutralize modern phishing attempts.
Infrastructure phishing represents a substantial escalation in cyberattacks, moving beyond simple domain spoofing to meticulously replicate legitimate websites and services. This technique bypasses conventional security measures – those relying on domain name matching or reputation checks – because the imitated infrastructure appears valid. Recent analyses demonstrate the severity of this threat, revealing a dramatic 68.6% reduction in the ability of even highly optimized phishing detection strategies to correctly identify these attacks. The effectiveness of traditional defenses is therefore compromised, as attackers successfully leverage the trust associated with established brands to deceive potential victims and harvest sensitive information. This highlights the urgent need for innovative security solutions that move beyond surface-level checks and focus on behavioral analysis and content-based detection to counter this increasingly prevalent and sophisticated threat.
The growing sophistication of phishing attacks has spurred increased interest in leveraging Large Language Models (LLMs) for email security, though this reliance isn’t without caveats. LLMs offer the potential to analyze email content with a level of semantic understanding previously unattainable, identifying subtle cues indicative of malicious intent that traditional methods miss. However, these models are also susceptible to adversarial attacks, where carefully crafted emails can bypass detection by exploiting the LLM’s predictive capabilities. Furthermore, the computational demands and costs associated with deploying and maintaining LLM-powered security systems represent a significant operational hurdle. Therefore, while LLMs present a promising avenue for enhanced phishing defense, successful implementation requires careful consideration of both their capabilities and inherent vulnerabilities, alongside pragmatic assessments of scalability and resource allocation.
Successfully leveraging Large Language Models (LLMs) for phishing detection requires careful calibration between robust security and practical operational demands. While LLMs demonstrate a remarkable capacity to identify subtle linguistic cues indicative of malicious intent, simply maximizing detection rates can introduce substantial false positive rates, overwhelming security teams and disrupting legitimate communications. A nuanced approach therefore prioritizes not only identifying phishing attempts, but also minimizing unnecessary alerts through techniques like prompt engineering, fine-tuning on specific organizational email patterns, and incorporating feedback loops to refine model accuracy. This balance is critical; an overly sensitive system erodes trust and usability, while a lax system leaves organizations vulnerable to increasingly sophisticated attacks. Ultimately, effective LLM-based phishing detection isn’t about achieving 100% accuracy, but about strategically optimizing the trade-off between security, efficiency, and user experience.

Orchestrating LLM Behavior: The Power of Prompts
Effective Large Language Model (LLM) phishing detection is fundamentally reliant on the design of input prompts, specifically the “System Prompt”. This initial prompt establishes the context, instructions, and constraints for the LLM, dictating how it interprets subsequent user inputs and formulates responses. The System Prompt defines the LLM’s role – for example, as a phishing detector – and specifies the criteria for identifying malicious content. Variations in the System Prompt’s phrasing, detail, and emphasis directly impact the LLM’s performance, influencing both its ability to correctly identify phishing attempts (recall) and its tendency to incorrectly flag legitimate communications as malicious (false positives). Consequently, meticulous crafting and iterative refinement of the System Prompt are essential for optimizing LLM-based phishing detection systems.
Prompt strategies for LLM-based phishing detection vary in their prioritization of recall and minimizing false positives. A Baseline Strategy typically focuses on minimal intervention, resulting in high precision but low recall – identifying few phishing attempts while rarely flagging legitimate communication. Conversely, a Security-First Strategy maximizes recall by aggressively flagging potentially malicious content, which inherently increases the rate of false positives. The Balanced Strategy represents a compromise, attempting to optimize both metrics by carefully calibrating the prompt’s instructions and thresholds, and thereby reducing both undetected phishing and unnecessary alerts.
A Balanced Strategy in LLM-based phishing detection seeks to maximize both phishing recall and the minimization of false positives. This approach recognizes the inherent trade-off between identifying all phishing attempts and avoiding the misclassification of legitimate emails as suspicious. Unlike Security-First strategies which prioritize recall at the expense of increased false positives, or Baseline strategies which offer minimal protection, a Balanced Strategy actively tunes the LLM’s responses to achieve an optimal equilibrium. This requires careful prompt engineering to leverage the LLM’s capabilities while mitigating its inherent biases and safety settings – the Model Disposition – to ensure a functional and effective security implementation.
Large Language Models (LLMs) possess inherent “Model Disposition” characterized by pre-defined safety protocols and biases influencing output. These internal settings significantly impact phishing detection performance; without careful consideration, an LLM may exhibit a strong aversion to flagging potentially malicious content, resulting in low recall. However, strategically engineered prompts – variations in phrasing, instruction, and context – can dramatically alter the model’s behavior. Testing demonstrates that prompt engineering is capable of inducing a greater than 90 percentage point swing in phishing recall rates, highlighting the critical importance of prompt optimization to overcome baseline model limitations and achieve desired detection sensitivity.
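The three strategy classes described above can be sketched as alternative system prompts paired with the same email. The prompt wording below is hypothetical – the paper’s exact prompts are not reproduced here – but it illustrates how the same model can be steered toward very different recall/false-positive trade-offs:

```python
# Illustrative stand-ins for the Baseline, Security-First, and Balanced
# prompt strategies. These strings are hypothetical, not the paper's prompts.
PROMPT_STRATEGIES = {
    "baseline": (
        "You are an email assistant. Classify the email below as "
        "PHISHING or LEGITIMATE."
    ),
    "security_first": (
        "You are a strict email security analyst. Treat any suspicious "
        "link, urgency cue, or credential request as grounds to classify "
        "the email as PHISHING. When in doubt, flag it."
    ),
    "balanced": (
        "You are an email security analyst. Classify the email as PHISHING "
        "only if concrete indicators are present (mismatched sender domain, "
        "credential request, deceptive link). Otherwise classify it as "
        "LEGITIMATE, and avoid flagging routine business email."
    ),
}

def build_messages(strategy: str, email_text: str) -> list[dict]:
    """Assemble a request in the common chat-completion message format."""
    return [
        {"role": "system", "content": PROMPT_STRATEGIES[strategy]},
        {"role": "user", "content": email_text},
    ]
```

Only the `system` message changes between strategies; the model and the email under test stay fixed, which is what isolates the prompt as the security variable.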

Beyond Simple Accuracy: A Holistic View of Performance
Effective evaluation of Large Language Model (LLM) performance in phishing detection necessitates the simultaneous assessment of “Recall” and the “False Positive Rate”. Recall quantifies the LLM’s ability to correctly identify all instances of phishing emails within a given dataset, while the False Positive Rate measures the proportion of legitimate emails incorrectly flagged as phishing. A high recall score alone is insufficient; an LLM could achieve near-perfect identification of phishing attempts but at the cost of excessively flagging benign communication, rendering it impractical for real-world deployment. Therefore, a balanced evaluation considering both metrics is crucial for determining the usability and effectiveness of an LLM in a phishing detection context.
Evaluating Large Language Models (LLMs) for tasks like phishing detection using solely “Recall” – the proportion of actual phishing emails correctly identified – can be problematic. A model maximizing recall might flag a significantly high number of legitimate emails as phishing attempts, resulting in an unacceptable “False Positive Rate”. This trade-off means a superficially high recall score doesn’t necessarily indicate a useful or practical system; a model that correctly identifies nearly all phishing emails but also incorrectly flags a substantial percentage of legitimate correspondence creates operational burdens and user frustration. Therefore, focusing on recall in isolation provides an incomplete and potentially misleading assessment of an LLM’s performance in real-world applications.
Safetility is a performance metric designed to provide a more comprehensive evaluation of LLM phishing detection than simple accuracy or recall. It functions by applying a penalty to models exhibiting high false positive rates, specifically when those rates exceed a predetermined operational threshold. This addresses the critical trade-off between identifying all phishing attempts and minimizing the misclassification of legitimate emails as malicious. In recent evaluations, Grok 4.1 achieved the highest Safetility score recorded at 96.7%, indicating a strong balance between recall and a low incidence of false positives within the defined operational parameters.
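The article describes Safetility only qualitatively (recall, penalized when the false-positive rate exceeds an operational threshold); the paper’s exact formula is not given here. The sketch below is therefore an illustrative penalized score, with a hypothetical threshold and penalty weight, showing the shape of such a metric rather than the published definition:

```python
def safetility(recall: float, fpr: float,
               fpr_threshold: float = 0.10,
               penalty: float = 5.0) -> float:
    """Illustrative penalized score: recall, minus a steep penalty for any
    false-positive rate above the operational threshold. The threshold and
    penalty weight are hypothetical, not taken from the paper."""
    excess = max(0.0, fpr - fpr_threshold)
    return max(0.0, recall - penalty * excess)

# Below the threshold, the score is just recall; above it, the penalty
# dominates, so a high-recall but trigger-happy model scores poorly.
```

A metric of this shape rewards exactly the balance discussed above: a model cannot buy recall by flooding users with false alarms once the operational threshold is crossed.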
The PhishNChips Benchmark provides a comprehensive, large-scale evaluation of Large Language Model (LLM) performance in phishing email detection, testing various models and prompt engineering techniques. Results indicate that with optimized prompts, the GPT-4o-mini model achieves a phishing recall rate of 93.7% – meaning it correctly identifies 93.7% of phishing emails – while maintaining a false positive rate of 8.2%, indicating that 8.2% of legitimate emails are incorrectly flagged as phishing attempts. This benchmarking data offers a standardized comparison of LLM capabilities in this security application and highlights the impact of prompt optimization on performance metrics.

Beyond Detection: Augmenting LLMs for Adaptive Threat Response
Current large language model (LLM) approaches to phishing detection are markedly enhanced through “Tool Augmentation”, a technique that integrates external data sources directly into the analysis process. Rather than relying solely on the text of an email, these systems can now consult resources like domain age databases – identifying suspiciously new websites often used in phishing campaigns – and threat intelligence feeds, which catalogue known malicious URLs and email addresses. This fusion of LLM reasoning with real-world data drastically improves accuracy, allowing for the identification of subtle indicators that would otherwise be missed, and bolstering protection against increasingly sophisticated phishing attacks designed to bypass traditional security measures. The integration moves beyond simple keyword spotting, enabling a contextual understanding of email legitimacy.
Current phishing detection systems often stumble when faced with cleverly disguised attacks, as they primarily analyze the text of an email for keywords or suspicious phrasing. However, large language models (LLMs) offer a powerful opportunity to move beyond this limitation by integrating external data sources. When an LLM’s reasoning capabilities are combined with real-world information – such as domain age, sender reputation, or threat intelligence feeds – it can develop a far more comprehensive understanding of an email’s context. This synergistic approach allows the system to assess not just what an email says, but who sent it and where it originated, dramatically improving its ability to identify sophisticated phishing attempts that exploit linguistic subtlety and mimic legitimate communications. The result is a system less reliant on simple pattern matching and more capable of discerning malicious intent based on a holistic evaluation of available evidence.
Current phishing detection often relies heavily on analyzing the text of an email for keywords or suspicious phrasing, a method increasingly circumvented by attackers employing sophisticated language models to craft convincing, yet malicious, communications. However, integrating external data sources allows for a deeper, more contextual understanding of email characteristics. By cross-referencing sender information with domain age databases, threat intelligence feeds, and even analyzing email header anomalies, systems can move beyond superficial textual analysis. This nuanced approach reveals subtle indicators – such as a recently registered domain, a mismatch between displayed sender and actual origin, or a reputation flagged by threat intelligence – that would otherwise be missed. Consequently, detection capabilities extend to previously undetectable phishing campaigns, safeguarding users from increasingly deceptive and targeted attacks.
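The fusion of a text-based LLM verdict with infrastructure signals can be sketched as a small decision layer. Everything external here is stubbed: the domain table and threat feed are hypothetical local stand-ins for the WHOIS and threat-intelligence lookups described above, and the fusion rule is one plausible policy, not the paper’s method:

```python
from datetime import date

# Hypothetical local stand-ins for external signals; a real deployment
# would query WHOIS for registration dates and a live threat feed.
DOMAIN_REGISTERED = {
    "example.com": date(1995, 8, 14),
    "examp1e-login.com": date(2026, 3, 1),  # typosquatted, freshly registered
}
THREAT_FEED = {"examp1e-login.com"}

def augmented_verdict(llm_says_phishing: bool, sender_domain: str,
                      today: date = date(2026, 3, 29),
                      min_age_days: int = 90) -> bool:
    """Fuse the LLM's text-based verdict with infrastructure signals."""
    if sender_domain in THREAT_FEED:
        return True  # a known-bad domain overrides the LLM entirely
    registered = DOMAIN_REGISTERED.get(sender_domain)
    too_new = registered is None or (today - registered).days < min_age_days
    # A young or unknown domain raises suspicion even when the text reads clean.
    return llm_says_phishing or too_new
```

Note the limitation this exposes: a rule like this treats an old, reputable-looking domain as a positive signal, which is precisely the assumption that the infrastructure-phishing attacks discussed in this article exploit.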
A truly effective phishing defense transcends simple detection, striving instead for a continuously learning system capable of anticipating and neutralizing evolving threats without hindering essential communication. Current approaches often struggle with the dynamic nature of phishing attacks, necessitating a paradigm shift towards adaptability. This next-generation system aims to integrate real-time threat intelligence, behavioral analysis, and nuanced contextual understanding, enabling it to not only identify malicious emails but also to differentiate between genuine risks and false positives with increasing accuracy. The resulting architecture prioritizes user experience by minimizing disruptions caused by incorrectly flagged messages, fostering trust and ensuring seamless access to legitimate correspondence – a critical balance for any successful security implementation.
The pursuit of robust LLM security, as detailed in the study of system prompt vulnerabilities, often resembles building increasingly elaborate defenses against threats that exploit inherent simplicity. It’s a curious paradox; the very tools intended to detect malicious intent become susceptible when overcomplicated. Alan Turing observed, “Sometimes people who are unhappy tend to look at the world as if there is something wrong with it.” Similarly, this research reveals that a flawed configuration – a “something wrong” with the prompt – invites attack. The paper’s findings on infrastructure phishing demonstrate how optimizing for detection can ironically create a more easily exploited system, revealing a truth about complexity: it doesn’t necessarily equate to security, and can, in fact, obscure fundamental weaknesses.
What Lies Ahead?
The demonstrated susceptibility of LLM-based security systems to subtle prompt manipulation suggests a fundamental limitation: optimization for detection, paradoxically, amplifies the attack surface. The pursuit of robustness through increasingly complex prompts invites equally complex exploits. The field now faces a choice: continue layering defenses atop a fundamentally unstable base, or reassess the core principle of relying on linguistic pattern recognition for security. The latter demands a shift toward methods less dependent on semantic understanding – a humbling realization, perhaps, but one that acknowledges the inherent ambiguity of natural language.
The particular threat of infrastructure phishing, where attackers leverage domain consistency, highlights a critical blind spot. Current systems treat domain verification as a positive signal, effectively rewarding attackers who establish legitimate, albeit malicious, infrastructure. Future work must explore methods for actively discrediting seemingly valid signals, a counterintuitive approach that may require a redefinition of “trust” in the context of automated systems.
Ultimately, the problem isn’t simply detecting malicious content, but discerning intent. This remains a uniquely human capacity. The challenge, therefore, isn’t to replicate it perfectly, but to build systems that gracefully degrade when confronted with ambiguity – systems that prioritize minimizing error over maximizing perceived accuracy. A little admitted ignorance, it turns out, may be the most secure posture of all.
Original article: https://arxiv.org/pdf/2603.25056.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-29 08:39