Author: Denis Avetisyan
A new dataset is challenging large language models to move beyond conceptual understanding and tackle the rigorous demands of cryptographic problem-solving.
Researchers introduce CryptoQA, a large-scale question-answering benchmark designed to evaluate and improve AI performance on tasks requiring formal reasoning and mathematical accuracy in the field of cryptography.
Despite advances in natural language processing, large language models consistently struggle with tasks demanding rigorous mathematical analysis and formal reasoning. To address this limitation, we introduce CryptoQA: A Large-scale Question-answering Dataset for AI-assisted Cryptography, a comprehensive benchmark comprising over two million question-answer pairs curated from cryptographic literature. Our analysis of fifteen state-of-the-art LLMs using CryptoQA reveals substantial performance gaps, particularly in areas requiring precise mathematical knowledge and logical deduction. This highlights an urgent need for specialized LLM tools tailored to support cryptography research, and raises the question of how we can best leverage AI to fortify the foundations of secure communication.
The Gap Between Fluency and Formal Reasoning
Large language models demonstrate remarkable proficiency in manipulating and generating human language, yet their strengths diverge significantly from the demands of cryptography. While these models are adept at recognizing patterns in text, cryptographic tasks necessitate a level of precise mathematical and logical deduction that frequently challenges them. Cryptography isn't simply about identifying keywords or stylistic elements; it relies on the consistent application of formal rules and the ability to reason about abstract concepts like prime numbers, modular arithmetic, and encryption algorithms – areas where LLMs, trained primarily on statistical correlations within text, often falter. This disparity arises because LLMs excel at associative reasoning – finding relationships between concepts – but struggle with deductive reasoning, where conclusions must follow logically and inevitably from established axioms and principles. Consequently, even seemingly simple cryptographic problems can expose fundamental limitations in an LLM's ability to perform reliable, error-free computations, highlighting a crucial gap between linguistic fluency and genuine mathematical understanding.
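To make the contrast concrete, consider modular exponentiation, a staple of public-key cryptography: the value of 7^128 mod 13 is fully determined by arithmetic, and a plausible-sounding approximation is simply wrong. The short Python check below is purely illustrative (not drawn from the paper) and shows how little room such a computation leaves for associative guessing:

```python
# Modular exponentiation has exactly one correct answer: the kind of
# computation where pattern-matching cannot substitute for deduction.
base, exponent, modulus = 7, 128, 13

# Fermat's little theorem: since 13 is prime and gcd(7, 13) = 1,
# 7^12 = 1 (mod 13), so 7^128 = 7^(128 mod 12) = 7^8 (mod 13).
reduced_exponent = exponent % (modulus - 1)

assert pow(base, exponent, modulus) == pow(base, reduced_exponent, modulus)
print(pow(base, exponent, modulus))  # 3
```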
Conventional Large Language Models, despite their proficiency in processing natural language, frequently falter when confronted with cryptographic challenges that demand rigorous multi-step deduction. These models often struggle to consistently apply fundamental cryptographic principles, leading to errors in reasoning that would be unacceptable in security-sensitive contexts. Unlike tasks where statistical patterns suffice, cryptography requires precise logical inference; a single misapplied rule or overlooked detail can invalidate an entire proof or security argument. This isn't merely a matter of lacking knowledge, but a limitation in the models' ability to execute complex, sequential reasoning with absolute accuracy – a critical distinction that highlights the gap between linguistic fluency and formal, mathematical rigor. Consequently, LLMs often produce plausible-sounding but ultimately incorrect cryptographic analyses, underscoring the need for specialized training and evaluation methods.
Assessing the cryptographic competence of Large Language Models requires a departure from conventional Natural Language Processing evaluation methods. Standard benchmarks, designed for fluency and grammatical correctness, fail to capture the critical need for mathematical precision and logical validity inherent in cryptographic reasoning. A single factual error in a cryptographic context – such as an incorrect application of a cryptographic principle or a flawed calculation – can have severe consequences, rendering an entire analysis useless. Consequently, specialized datasets, meticulously crafted to test specific cryptographic concepts, and novel metrics focused on correctness – rather than simply coherence – are essential. These metrics must prioritize identifying even subtle errors that would be inconsequential in general language tasks, but are catastrophic when dealing with secure systems and protocols, demanding a far more rigorous and fault-intolerant evaluation paradigm.
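One way to picture such a correctness-first metric is a normalized exact-match score, which gives no credit for fluent but subtly wrong answers. The sketch below is illustrative only; the normalization rules and function names are assumptions, not CryptoQA's evaluation code.

```python
# A minimal sketch of a correctness-first metric: exact match on normalized
# answers, rather than fluency- or overlap-based scoring (e.g. BLEU/ROUGE).
# The normalization rules here are illustrative assumptions.

def normalize(answer: str) -> str:
    """Lowercase, strip whitespace and a trailing period."""
    return answer.strip().lower().rstrip(".")

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["The order of the group is 12.", "gcd(12, 18) = 4"]
refs = ["the order of the group is 12", "gcd(12, 18) = 6"]
print(exact_match_accuracy(preds, refs))  # 0.5: one subtle arithmetic error is fatal
```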
Introducing CryptoQA: A Benchmark for Cryptographic Reasoning
The CryptoQA dataset is a large-scale benchmark comprising over 2,000,000 question-answer pairs designed for evaluating Large Language Models (LLMs) specifically on cryptographic topics. Rigorous curation processes were employed to ensure the quality and relevance of the data, focusing on the complexities inherent in cryptographic reasoning. This substantial size allows for more robust evaluation metrics and a greater capacity to test LLM performance across a wide range of cryptographic concepts and problem types, exceeding the scale of previously available datasets for this domain.
The CryptoQA dataset was generated utilizing the DeepSeek-V3 large language model to specifically target questions and answers pertaining to cryptography. This approach ensures the dataset's content is highly relevant to the domain, moving beyond general knowledge benchmarks. The use of DeepSeek-V3 also facilitated the creation of complex questions requiring more than simple information retrieval, incorporating concepts and nuances specific to cryptographic principles and practices. This focus on cryptographic concepts distinguishes CryptoQA from broader QA datasets and provides a more rigorous assessment of LLM performance in this specialized field.
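The paper does not spell out its generation pipeline, but a minimal sketch of how such LLM-driven synthesis might look is shown below, assuming DeepSeek-V3 is reached through an OpenAI-compatible chat endpoint. The prompt wording, output schema, and deployment details are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of LLM-driven question-answer synthesis for cryptography.
# Assumes DeepSeek-V3 is served via an OpenAI-compatible chat API; the prompt
# text and JSON schema are illustrative, not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

def build_prompt(excerpt: str) -> str:
    return (
        "You are a cryptography instructor. From the excerpt below, write one "
        "question that requires a precise calculation or formal argument, plus "
        'its answer. Respond as JSON with keys "question" and "answer".\n\n'
        "Excerpt:\n" + excerpt
    )

def generate_qa_pair(excerpt: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",  # DeepSeek-V3 chat model (assumed deployment name)
        messages=[{"role": "user", "content": build_prompt(excerpt)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```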
The CryptoQA dataset is designed to assess three core Large Language Model (LLM) capabilities critical for cryptographic applications: factual accuracy, mathematical reasoning, and consistency. Factual accuracy is evaluated through questions requiring recall of cryptographic definitions, protocols, and historical context. Mathematical reasoning is tested via problems involving cryptographic calculations, such as modular arithmetic, prime number factorization, and elliptic curve operations. Consistency is assessed by presenting variations of the same problem or question to determine if the LLM provides coherent and logically sound responses. Reliable performance in these areas is paramount, as inaccuracies or inconsistencies in cryptographic reasoning can lead to vulnerabilities and security breaches in real-world implementations.
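As an illustration of the consistency dimension, a probe might pose logically equivalent phrasings of one problem and require identical answers. The harness below is hypothetical: `ask_model` stands in for whatever LLM interface is under test, and the question variants are examples rather than dataset items.

```python
# A minimal sketch of a consistency probe: ask logically equivalent phrasings
# of the same question and check that the (normalized) answers agree.

def ask_model(question: str) -> str:
    # Placeholder for the LLM under test; returns a canned answer so the
    # harness runs end to end. 2^10 = 1024 and 1024 mod 11 = 1.
    return "1"

VARIANTS = [
    "What is 2^10 mod 11?",
    "Compute the remainder when 2 to the power 10 is divided by 11.",
    "Reduce 1024 modulo 11.",
]

def is_consistent(variants: list[str]) -> bool:
    """Every phrasing must yield the same normalized answer (here: 1)."""
    answers = {ask_model(v).strip().lower() for v in variants}
    return len(answers) == 1

print(is_consistent(VARIANTS))  # True for this stub; an inconsistent model fails
```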
Qwen2.5-72B-Instruct: Demonstrating Enhanced Cryptographic Capacity
Qwen2.5-72B-Instruct exhibits notable capabilities in cryptographic reasoning, as evidenced by its performance on the CryptoQA dataset. This dataset, specifically designed to assess reasoning about cryptographic primitives and protocols, served as a benchmark for evaluating the model's ability to understand and apply cryptographic concepts. Initial evaluations demonstrate that the model can accurately answer questions requiring analysis of cryptographic systems, indicating a foundational understanding of the domain. Performance metrics on CryptoQA suggest the model effectively processes information related to cryptographic algorithms, security vulnerabilities, and protocol implementations, establishing its potential for applications in areas like automated security analysis and cryptographic education.
Fine-tuning the Qwen2.5-72B-Instruct large language model (LLM) using the CryptoQA dataset yielded a measurable performance increase of 7 to 13 percent. This improvement, assessed through standard evaluation metrics on the CryptoQA benchmark, indicates the dataset's efficacy for specialized refinement of cryptographic reasoning capabilities within the LLM. The observed gains demonstrate that targeted training with domain-specific data, like CryptoQA, can significantly enhance an LLM's performance on complex, specialized tasks beyond general language understanding.
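The paper does not publish its training recipe, but a supervised fine-tuning setup for this model might look roughly like the sketch below, using Hugging Face Transformers with LoRA adapters. The data formatting, hyperparameters, and target modules are assumptions; only the model identifier comes from the text.

```python
# A minimal sketch of supervised fine-tuning on CryptoQA-style pairs with LoRA.
# Hyperparameters, target modules, and formatting are illustrative assumptions,
# not the paper's recipe. In practice a 72B model also needs quantization or
# multi-GPU sharding.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Train a small set of low-rank adapters instead of all 72B parameters.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def format_example(question: str, answer: str) -> str:
    # Chat-style formatting; the template actually used in the paper is not specified.
    messages = [{"role": "user", "content": question},
                {"role": "assistant", "content": answer}]
    return tokenizer.apply_chat_template(messages, tokenize=False)

# The formatted texts would then be tokenized and passed to a standard
# causal-language-modeling trainer.
```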
Comprehensive evaluation of large language models, such as Qwen2.5-72B-Instruct, necessitates a multi-faceted approach beyond automated metrics. Human evaluation, involving expert assessment of model outputs for correctness, coherence, and relevance, provides critical qualitative insights. Furthermore, Reinforcement Learning from Human Feedback (RLHF) is a key technique for aligning model behavior with human preferences. This process involves training a reward model based on human feedback data, which is then used to optimize the LLM's policy through reinforcement learning. The iterative cycle of human evaluation and RLHF is essential not only for performance improvement but also for ensuring the model's reliability, safety, and trustworthiness in real-world applications, mitigating potential biases and harmful outputs.
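At the heart of RLHF's reward modelling is a pairwise preference objective: the reward assigned to the human-preferred answer should exceed the reward of the rejected one. The snippet below is a conceptual sketch of that standard Bradley-Terry-style loss in PyTorch, not the training code behind Qwen2.5-72B-Instruct.

```python
# Conceptual sketch of the pairwise preference loss used to train a reward
# model from human feedback. The tensors stand in for scalar rewards the
# model assigns to "chosen" vs. "rejected" answers in annotated pairs.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected): pushes chosen answers above rejected ones."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: rewards for three preference pairs labelled by human annotators.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.7, 0.9, -0.5])
print(preference_loss(r_chosen, r_rejected))  # lower when chosen clearly outranks rejected
```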
Beyond Accuracy: The Pursuit of Robust and Reliable Reasoning
Beyond simply achieving high accuracy on benchmark datasets, a comprehensive evaluation of Large Language Models (LLMs) demands rigorous testing of their robustness – their ability to maintain reliable performance when confronted with intentionally deceptive or manipulative inputs. Adversarial prompts, carefully crafted to exploit subtle vulnerabilities in an LLM's reasoning or knowledge base, can reveal unexpected failures even in models that otherwise appear proficient. These prompts might involve paraphrasing questions in misleading ways, introducing irrelevant information, or subtly altering the context to nudge the model towards an incorrect conclusion. Identifying such weaknesses is paramount, particularly as LLMs are increasingly deployed in security-sensitive applications where even a single compromised response could have significant consequences; therefore, proactive testing with adversarial inputs is essential to building trustworthy and resilient AI systems.
The evaluation of large language models extends beyond simply measuring correct answers; a critical component involves deliberately probing for vulnerabilities. Researchers are increasingly employing techniques such as adversarial prompting – carefully crafting inputs designed to mislead the model or expose hidden biases – to assess robustness. Simultaneously, tests of backward reasoning – where the model must justify its conclusions by tracing back through its logical steps – reveal potential flaws in its inferential capabilities. These rigorous evaluations are not merely academic exercises; they are essential for identifying weaknesses that could be exploited in real-world applications, particularly those demanding high reliability and security, and ultimately guide the development of more resilient and trustworthy artificial intelligence systems.
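A concrete, if simplified, picture of these two probes: one compares answers with and without an irrelevant but suggestive hint, the other asks the model to verify its own conclusion by working backwards. The templates, distractor text, and `ask_model` placeholder below are illustrative assumptions, not the evaluation protocol from the paper.

```python
# An illustrative harness for adversarial prompting and backward reasoning.

def ask_model(prompt: str) -> str:
    # Placeholder: a real harness would call the model under evaluation here.
    return "91 = 7 * 13, so 91 is not prime."

QUESTION = "Is n = 91 prime?"
DISTRACTOR = "Note that 91 ends in 1, and many primes end in 1."  # irrelevant, suggestive

def adversarial_probe() -> bool:
    """Does an irrelevant but suggestive hint change the model's answer?"""
    plain = ask_model(QUESTION)
    nudged = ask_model(DISTRACTOR + " " + QUESTION)
    return plain == nudged

def backward_probe(answer: str) -> str:
    """Ask the model to re-derive and verify its own conclusion step by step."""
    return ask_model(f"You answered: {answer!r}. Working backwards from that "
                     f"conclusion, verify each step for the question: {QUESTION}")

print(adversarial_probe())                  # True for this stub; a brittle model may flip
print(backward_probe(ask_model(QUESTION)))
```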
For large language models to gain traction in security-sensitive fields like cryptography, consistent performance is paramount; a model offering varied outputs to logically equivalent prompts erodes user confidence and introduces unacceptable risk. Researchers are exploring methods to enhance this reliability, with particular focus on providing explicit source context through a technique called DOI Prompting. This involves supplementing queries with the Digital Object Identifier, a unique identifier for academic papers, effectively grounding the model's response in verifiable information. By anchoring the LLM to established knowledge, DOI Prompting mitigates the potential for hallucination or inconsistent reasoning, fostering a more dependable and trustworthy system crucial for applications where precision and repeatability are non-negotiable.
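In practice, DOI Prompting amounts to prepending the identifier of the source publication to the query so the model anchors its answer to that reference. A minimal sketch follows, with an assumed template and a placeholder DOI; the paper does not specify the exact prompt wording.

```python
# A minimal sketch of DOI Prompting: the query is supplemented with the DOI of
# the source publication so the model grounds its answer in that reference.
# The template wording below is an assumption.

def doi_prompt(question: str, doi: str) -> str:
    return (
        f"Answer the following question using the publication with DOI {doi} "
        f"as your authoritative source. If the source does not support an answer, say so.\n\n"
        f"Question: {question}"
    )

print(doi_prompt(
    "What security notion does the scheme in this paper achieve?",
    "10.1000/placeholder-doi",  # placeholder DOI, not a real reference
))
```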
The creation of CryptoQA exemplifies a commitment to paring back complexity in the pursuit of clarity. This dataset doesn't merely present cryptographic problems; it distills them into a format accessible for evaluation by large language models, exposing both conceptual strengths and, crucially, limitations in formal reasoning. As Tim Berners-Lee observed, "The Web as I envisaged it, we have not seen it yet. The future is still so much bigger than the past." Similarly, CryptoQA isn't a final solution, but a crucial step toward realizing the full potential of AI assistance in a field demanding absolute precision – a deliberate reduction toward a more useful, and ultimately, more powerful system.
What Remains to be Seen?
The creation of CryptoQA exposes, rather than resolves. A dataset, however large, merely illuminates the shape of ignorance. The observed aptitude of large language models for conceptualizing cryptography, contrasted with their failures in executing formal reasoning, is not a surprising deficiency – it is the expected state. A system that requires prompting to avoid arithmetic error was flawed from the outset. The challenge isn't to teach these models cryptography, but to demand of them a different foundation entirely.
Future work will undoubtedly explore scaling these models, refining the dataset, and engineering more elaborate prompts. These are, however, exercises in diminishing returns. True progress lies in acknowledging the fundamental limitations of pattern recognition as a substitute for logical deduction. The pursuit of "AI assistance" risks becoming a euphemism for automating error.
The ultimate test will not be whether a machine can answer cryptographic questions, but whether it can refrain from asking them. A truly intelligent system, confronted with a well-defined problem, should, ideally, remain silent – for the solution already exists within the axioms. Clarity is, after all, a form of courtesy.
Original article: https://arxiv.org/pdf/2512.02625.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/