Author: Denis Avetisyan
New research reveals that the way AI systems are instructed can create critical vulnerabilities, potentially turning email defenses into attack vectors.

Optimizing Large Language Model agents for phishing detection inadvertently introduces exploitable weaknesses susceptible to infrastructure phishing and signal inversion attacks.
While large language models offer promising avenues for automated security, their sensitivity to configuration creates a surprising paradox. This paper, ‘The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities’, demonstrates that prompt engineering is a first-order security variable, with a single model exhibiting phishing bypass rates ranging from under 1% to 97% depending on its configuration. Our analysis of 11 models and 10 prompt strategies reveals that optimizing for detection can inadvertently create brittle vulnerabilities, particularly through “infrastructure phishing” attacks that exploit assumptions embedded within the prompt itself. Can we navigate the inherent tension between security, usability, and robustness, and ultimately build LLM agents resilient to adversarial manipulation without sacrificing performance?
The Evolving Threat: Beyond Simple Detection
Conventional phishing defenses, historically reliant on techniques like domain and sender authentication, are demonstrating diminishing effectiveness against increasingly resourceful adversaries. Attackers now routinely circumvent these safeguards through methods such as domain spoofing, compromised accounts, and the exploitation of trust relationships. While once sufficient to flag obvious malicious emails, these traditional methods struggle with sophisticated attacks that closely mimic legitimate communications, leveraging techniques like typosquatting and visually similar domain names. This evolving threat landscape necessitates a shift towards more nuanced detection strategies that move beyond simple pattern matching and incorporate behavioral analysis, content understanding, and real-time threat intelligence to accurately identify and neutralize modern phishing attempts.
Infrastructure phishing represents a substantial escalation in cyberattacks, moving beyond simple domain spoofing to meticulously replicate legitimate websites and services. This technique bypasses conventional security measures – those relying on domain name matching or reputation checks – because the imitated infrastructure appears valid. Recent analyses demonstrate the severity of this threat, revealing a dramatic 68.6% reduction in the ability of even highly optimized phishing detection strategies to correctly identify these attacks. The effectiveness of traditional defenses is therefore compromised, as attackers successfully leverage the trust associated with established brands to deceive potential victims and harvest sensitive information. This highlights the urgent need for innovative security solutions that move beyond surface-level checks and focus on behavioral analysis and content-based detection to counter this increasingly prevalent and sophisticated threat.
The growing sophistication of phishing attacks has spurred increased interest in leveraging Large Language Models (LLMs) for email security, though this reliance isn’t without caveats. LLMs offer the potential to analyze email content with a level of semantic understanding previously unattainable, identifying subtle cues indicative of malicious intent that traditional methods miss. However, these models are also susceptible to adversarial attacks, where carefully crafted emails can bypass detection by exploiting the LLM’s predictive capabilities. Furthermore, the computational demands and costs associated with deploying and maintaining LLM-powered security systems represent a significant operational hurdle. Therefore, while LLMs present a promising avenue for enhanced phishing defense, successful implementation requires careful consideration of both their capabilities and inherent vulnerabilities, alongside pragmatic assessments of scalability and resource allocation.
Successfully leveraging Large Language Models (LLMs) for phishing detection requires careful calibration between robust security and practical operational demands. While LLMs demonstrate a remarkable capacity to identify subtle linguistic cues indicative of malicious intent, simply maximizing detection rates can introduce substantial false positive rates, overwhelming security teams and disrupting legitimate communications. A nuanced approach therefore prioritizes not only identifying phishing attempts, but also minimizing unnecessary alerts through techniques like prompt engineering, fine-tuning on specific organizational email patterns, and incorporating feedback loops to refine model accuracy. This balance is critical; an overly sensitive system erodes trust and usability, while a lax system leaves organizations vulnerable to increasingly sophisticated attacks. Ultimately, effective LLM-based phishing detection isn’t about achieving 100% accuracy, but about strategically optimizing the trade-off between security, efficiency, and user experience.

Orchestrating LLM Behavior: The Power of Prompts
Effective Large Language Model (LLM) phishing detection is fundamentally reliant on the design of input prompts, specifically the “System Prompt”. This initial prompt establishes the context, instructions, and constraints for the LLM, dictating how it interprets subsequent user inputs and formulates responses. The System Prompt defines the LLM’s role – for example, as a phishing detector – and specifies the criteria for identifying malicious content. Variations in the System Prompt’s phrasing, detail, and emphasis directly impact the LLM’s performance, influencing both its ability to correctly identify phishing attempts (recall) and its tendency to incorrectly flag legitimate communications as malicious (false positives). Consequently, meticulous crafting and iterative refinement of the System Prompt are essential for optimizing LLM-based phishing detection systems.
Prompt strategies for LLM-based phishing detection vary in their prioritization of recall and minimizing false positives. A Baseline Strategy typically focuses on minimal intervention, resulting in high precision but low recall – identifying few phishing attempts while rarely flagging legitimate communication. Conversely, a Security-First Strategy maximizes recall by aggressively flagging potentially malicious content, which inherently increases the rate of false positives. The Balanced Strategy represents a compromise, attempting to optimize both metrics by carefully calibrating the prompt’s instructions and thresholds, and thereby reducing both undetected phishing and unnecessary alerts.
A Balanced Strategy in LLM-based phishing detection seeks to maximize both phishing recall and the minimization of false positives. This approach recognizes the inherent trade-off between identifying all phishing attempts and avoiding the misclassification of legitimate emails as suspicious. Unlike Security-First strategies which prioritize recall at the expense of increased false positives, or Baseline strategies which offer minimal protection, a Balanced Strategy actively tunes the LLM’s responses to achieve an optimal equilibrium. This requires careful prompt engineering to leverage the LLM’s capabilities while mitigating its inherent biases and safety settings – the Model Disposition – to ensure a functional and effective security implementation.
Large Language Models (LLMs) possess inherent “Model Disposition” characterized by pre-defined safety protocols and biases influencing output. These internal settings significantly impact phishing detection performance; without careful consideration, an LLM may exhibit a strong aversion to flagging potentially malicious content, resulting in low recall. However, strategically engineered prompts – variations in phrasing, instruction, and context – can dramatically alter the model’s behavior. Testing demonstrates that prompt engineering is capable of inducing a greater than 90 percentage point swing in phishing recall rates, highlighting the critical importance of prompt optimization to overcome baseline model limitations and achieve desired detection sensitivity.
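The three strategy classes described above can be sketched as alternative system prompts paired with the same email. The prompt wording below is hypothetical – the paper’s exact prompts are not reproduced here – but it illustrates how the same model can be steered toward very different recall/false-positive trade-offs:

```python
# Illustrative stand-ins for the Baseline, Security-First, and Balanced
# prompt strategies. These strings are hypothetical, not the paper's prompts.
PROMPT_STRATEGIES = {
    "baseline": (
        "You are an email assistant. Classify the email below as "
        "PHISHING or LEGITIMATE."
    ),
    "security_first": (
        "You are a strict email security analyst. Treat any suspicious "
        "link, urgency cue, or credential request as grounds to classify "
        "the email as PHISHING. When in doubt, flag it."
    ),
    "balanced": (
        "You are an email security analyst. Classify the email as PHISHING "
        "only if concrete indicators are present (mismatched sender domain, "
        "credential request, deceptive link). Otherwise classify it as "
        "LEGITIMATE, and avoid flagging routine business email."
    ),
}

def build_messages(strategy: str, email_text: str) -> list[dict]:
    """Assemble a request in the common chat-completion message format."""
    return [
        {"role": "system", "content": PROMPT_STRATEGIES[strategy]},
        {"role": "user", "content": email_text},
    ]
```

Only the `system` message changes between strategies; the model and the email under test stay fixed, which is what isolates the prompt as the security variable.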

Beyond Simple Accuracy: A Holistic View of Performance
Effective evaluation of Large Language Model (LLM) performance in phishing detection necessitates the simultaneous assessment of “Recall” and the “False Positive Rate”. Recall quantifies the LLM’s ability to correctly identify all instances of phishing emails within a given dataset, while the False Positive Rate measures the proportion of legitimate emails incorrectly flagged as phishing. A high recall score alone is insufficient; an LLM could achieve near-perfect identification of phishing attempts but at the cost of excessively flagging benign communication, rendering it impractical for real-world deployment. Therefore, a balanced evaluation considering both metrics is crucial for determining the usability and effectiveness of an LLM in a phishing detection context.
Evaluating Large Language Models (LLMs) for tasks like phishing detection using solely “Recall” – the proportion of actual phishing emails correctly identified – can be problematic. A model maximizing recall might flag a significantly high number of legitimate emails as phishing attempts, resulting in an unacceptable “False Positive Rate”. This trade-off means a superficially high recall score doesn’t necessarily indicate a useful or practical system; a model that correctly identifies nearly all phishing emails but also incorrectly flags a substantial percentage of legitimate correspondence creates operational burdens and user frustration. Therefore, focusing on recall in isolation provides an incomplete and potentially misleading assessment of an LLM’s performance in real-world applications.
Safetility is a performance metric designed to provide a more comprehensive evaluation of LLM phishing detection than simple accuracy or recall. It functions by applying a penalty to models exhibiting high false positive rates, specifically when those rates exceed a predetermined operational threshold. This addresses the critical trade-off between identifying all phishing attempts and minimizing the misclassification of legitimate emails as malicious. In recent evaluations, Grok 4.1 achieved the highest Safetility score recorded at 96.7%, indicating a strong balance between recall and a low incidence of false positives within the defined operational parameters.
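The article describes Safetility only qualitatively (recall, penalized when the false-positive rate exceeds an operational threshold); the paper’s exact formula is not given here. The sketch below is therefore an illustrative penalized score, with a hypothetical threshold and penalty weight, showing the shape of such a metric rather than the published definition:

```python
def safetility(recall: float, fpr: float,
               fpr_threshold: float = 0.10,
               penalty: float = 5.0) -> float:
    """Illustrative penalized score: recall, minus a steep penalty for any
    false-positive rate above the operational threshold. The threshold and
    penalty weight are hypothetical, not taken from the paper."""
    excess = max(0.0, fpr - fpr_threshold)
    return max(0.0, recall - penalty * excess)

# Below the threshold, the score is just recall; above it, the penalty
# dominates, so a high-recall but trigger-happy model scores poorly.
```

A metric of this shape rewards exactly the balance discussed above: a model cannot buy recall by flooding users with false alarms once the operational threshold is crossed.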
The PhishNChips Benchmark provides a comprehensive, large-scale evaluation of Large Language Model (LLM) performance in phishing email detection, testing various models and prompt engineering techniques. Results indicate that with optimized prompts, the GPT-4o-mini model achieves a phishing recall rate of 93.7% – meaning it correctly identifies 93.7% of phishing emails – while maintaining a false positive rate of 8.2%, indicating that 8.2% of legitimate emails are incorrectly flagged as phishing attempts. This benchmarking data offers a standardized comparison of LLM capabilities in this security application and highlights the impact of prompt optimization on performance metrics.

Beyond Detection: Augmenting LLMs for Adaptive Threat Response
Current large language model (LLM) approaches to phishing detection are markedly enhanced through “Tool Augmentation”, a technique that integrates external data sources directly into the analysis process. Rather than relying solely on the text of an email, these systems can now consult resources like domain age databases – identifying suspiciously new websites often used in phishing campaigns – and threat intelligence feeds, which catalogue known malicious URLs and email addresses. This fusion of LLM reasoning with real-world data drastically improves accuracy, allowing for the identification of subtle indicators that would otherwise be missed, and bolstering protection against increasingly sophisticated phishing attacks designed to bypass traditional security measures. The integration moves beyond simple keyword spotting, enabling a contextual understanding of email legitimacy.
Current phishing detection systems often stumble when faced with cleverly disguised attacks, as they primarily analyze the text of an email for keywords or suspicious phrasing. However, large language models (LLMs) offer a powerful opportunity to move beyond this limitation by integrating external data sources. When an LLM’s reasoning capabilities are combined with real-world information – such as domain age, sender reputation, or threat intelligence feeds – it can develop a far more comprehensive understanding of an email’s context. This synergistic approach allows the system to assess not just what an email says, but who sent it and where it originated, dramatically improving its ability to identify sophisticated phishing attempts that exploit linguistic subtlety and mimic legitimate communications. The result is a system less reliant on simple pattern matching and more capable of discerning malicious intent based on a holistic evaluation of available evidence.
Current phishing detection often relies heavily on analyzing the text of an email for keywords or suspicious phrasing, a method increasingly circumvented by attackers employing sophisticated language models to craft convincing, yet malicious, communications. However, integrating external data sources allows for a deeper, more contextual understanding of email characteristics. By cross-referencing sender information with domain age databases, threat intelligence feeds, and even analyzing email header anomalies, systems can move beyond superficial textual analysis. This nuanced approach reveals subtle indicators – such as a recently registered domain, a mismatch between displayed sender and actual origin, or a reputation flagged by threat intelligence – that would otherwise be missed. Consequently, detection capabilities extend to previously undetectable phishing campaigns, safeguarding users from increasingly deceptive and targeted attacks.
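The fusion of a text-based LLM verdict with infrastructure signals can be sketched as a small decision layer. Everything external here is stubbed: the domain table and threat feed are hypothetical local stand-ins for the WHOIS and threat-intelligence lookups described above, and the fusion rule is one plausible policy, not the paper’s method:

```python
from datetime import date

# Hypothetical local stand-ins for external signals; a real deployment
# would query WHOIS for registration dates and a live threat feed.
DOMAIN_REGISTERED = {
    "example.com": date(1995, 8, 14),
    "examp1e-login.com": date(2026, 3, 1),  # typosquatted, freshly registered
}
THREAT_FEED = {"examp1e-login.com"}

def augmented_verdict(llm_says_phishing: bool, sender_domain: str,
                      today: date = date(2026, 3, 29),
                      min_age_days: int = 90) -> bool:
    """Fuse the LLM's text-based verdict with infrastructure signals."""
    if sender_domain in THREAT_FEED:
        return True  # a known-bad domain overrides the LLM entirely
    registered = DOMAIN_REGISTERED.get(sender_domain)
    too_new = registered is None or (today - registered).days < min_age_days
    # A young or unknown domain raises suspicion even when the text reads clean.
    return llm_says_phishing or too_new
```

Note the limitation this exposes: a rule like this treats an old, reputable-looking domain as a positive signal, which is precisely the assumption that the infrastructure-phishing attacks discussed in this article exploit.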
A truly effective phishing defense transcends simple detection, striving instead for a continuously learning system capable of anticipating and neutralizing evolving threats without hindering essential communication. Current approaches often struggle with the dynamic nature of phishing attacks, necessitating a paradigm shift towards adaptability. This next-generation system aims to integrate real-time threat intelligence, behavioral analysis, and nuanced contextual understanding, enabling it to not only identify malicious emails but also to differentiate between genuine risks and false positives with increasing accuracy. The resulting architecture prioritizes user experience by minimizing disruptions caused by incorrectly flagged messages, fostering trust and ensuring seamless access to legitimate correspondence – a critical balance for any successful security implementation.
The pursuit of robust LLM security, as detailed in the study of system prompt vulnerabilities, often resembles building increasingly elaborate defenses against threats that exploit inherent simplicity. It’s a curious paradox; the very tools intended to detect malicious intent become susceptible when overcomplicated. Alan Turing observed, “Sometimes people who are unhappy tend to look at the world as if there is something wrong with it.” Similarly, this research reveals that a flawed configuration – a “something wrong” with the prompt – invites attack. The paper’s findings on infrastructure phishing demonstrate how optimizing for detection can ironically create a more easily exploited system, revealing a truth about complexity: it doesn’t necessarily equate to security, and can, in fact, obscure fundamental weaknesses.
What Lies Ahead?
The demonstrated susceptibility of LLM-based security systems to subtle prompt manipulation suggests a fundamental limitation: optimization for detection, paradoxically, amplifies the attack surface. The pursuit of robustness through increasingly complex prompts invites equally complex exploits. The field now faces a choice: continue layering defenses atop a fundamentally unstable base, or reassess the core principle of relying on linguistic pattern recognition for security. The latter demands a shift toward methods less dependent on semantic understanding – a humbling realization, perhaps, but one that acknowledges the inherent ambiguity of natural language.
The particular threat of infrastructure phishing, where attackers leverage domain consistency, highlights a critical blind spot. Current systems treat domain verification as a positive signal, effectively rewarding attackers who establish legitimate, albeit malicious, infrastructure. Future work must explore methods for actively discrediting seemingly valid signals, a counterintuitive approach that may require a redefinition of “trust” in the context of automated systems.
Ultimately, the problem isn’t simply detecting malicious content, but discerning intent. This remains a uniquely human capacity. The challenge, therefore, isn’t to replicate it perfectly, but to build systems that gracefully degrade when confronted with ambiguity – systems that prioritize minimizing error over maximizing perceived accuracy. A little admitted ignorance, it turns out, may be the most secure posture of all.
Original article: https://arxiv.org/pdf/2603.25056.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-29 08:39