When Disaster Strikes, Can AI Answer the Call?

Author: Denis Avetisyan


A new benchmark assesses how well artificial intelligence systems can provide accurate and complete answers to critical questions during disaster events.

Discrepancies in model rankings between general question answering benchmarks and a disaster-response subset, most visibly as pronounced deviations from a diagonal correlation, suggest that established leaderboards are an unreliable guide to relative performance in high-stakes disaster scenarios, as detailed in Table 13.

Researchers introduce DisastQA, a comprehensive dataset for evaluating question answering capabilities in disaster management, revealing limitations in current large language models regarding factual completeness and reliable reasoning.

Despite advances in large language models, reliable reasoning over uncertain and conflicting information, a capability critical for effective disaster response, remains a significant challenge. To address this gap, we introduce DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management, a large-scale resource comprising 3,000 rigorously verified questions spanning eight disaster types. Our evaluation reveals substantial performance divergences between general-purpose leaderboards and disaster-specific reasoning, highlighting critical reliability gaps even in recent open-weight models when exposed to realistic noise. Can we develop QA systems that not only answer questions during crises, but also demonstrably reason under pressure with incomplete and potentially misleading data?


The Inevitable Cascade: Information in Crisis

The efficacy of disaster response hinges fundamentally on the swift and reliable dissemination of information. During crises – be they natural disasters, public health emergencies, or industrial accidents – access to accurate details regarding the unfolding situation, available resources, and safety protocols is not merely helpful, but potentially life-saving. Delays or inaccuracies can exacerbate panic, hinder evacuation efforts, and impede the delivery of critical aid. Consequently, robust systems for collecting, verifying, and distributing information are paramount, requiring a coordinated effort between governmental agencies, emergency services, and affected communities. The ability to quickly assess needs, allocate resources effectively, and communicate vital instructions directly impacts the scale of devastation and the speed of recovery, underscoring information access as a cornerstone of effective disaster management.

Conventional question answering systems, while adept at factual recall, frequently falter when confronted with the complex and rapidly evolving nature of crisis-related inquiries. These systems typically rely on pre-defined knowledge bases and struggle to interpret ambiguous language, contextual shifts, or novel situations common during disasters. The nuances of human communication – such as implied meaning, urgent requests for specific assistance, or geographically-specific needs – are often lost in translation, leading to inaccurate or irrelevant responses. This unreliability stems from a lack of adaptability and an inability to process the dynamic information flow inherent in crisis events, potentially hindering effective disaster response and jeopardizing public safety.

The DisastQA pipeline leverages both human refinement of multiple-choice questions and keypoint annotation of open-ended questions, evaluated across varying evidence conditions, to construct a high-quality disaster-related question-answering dataset.

Constructing a Benchmark Against the Tide

DisastQA is a newly developed benchmark intended to assess the performance of question answering (QA) systems when applied to the domain of disaster management. Unlike general-purpose QA benchmarks, DisastQA focuses specifically on the challenges presented by crisis-related information needs, such as situational awareness, damage assessment, and resource allocation. The benchmark aims to provide a standardized evaluation platform for researchers and developers working on QA systems designed to support disaster response efforts, enabling comparative analysis and driving improvements in system accuracy and reliability within this critical application area.

The DisastQA benchmark is constructed upon the DisastIR collection, resulting in a dataset of realistic, crisis-related questions designed to challenge question answering systems. A key characteristic of DisastQA is the factual complexity of its answers; analysis indicates an average of 4.4 keypoints are required to fully address each question. This metric provides a quantifiable measure of the information density and detail necessary for comprehensive responses within the disaster management domain, distinguishing it from benchmarks with simpler answer requirements and allowing for more nuanced evaluation of system capabilities.

DisastQA employs both open-ended (OE) and multiple-choice question (MCQ) formats to provide a comprehensive assessment of question answering system performance in disaster-related scenarios. The inclusion of OE questions allows for the evaluation of systems’ ability to generate free-form answers, demanding nuanced understanding and synthesis of information. Conversely, the MCQ format enables quantifiable evaluation of accuracy and facilitates statistical analysis across different models. This dual approach ensures a more robust and multifaceted benchmark compared to datasets relying on a single question type, allowing for a detailed understanding of a system’s strengths and weaknesses when addressing crisis-related information needs.
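For concreteness, the two item types can be pictured as records along the following lines; the field names and example content are illustrative assumptions for exposition, not the released DisastQA schema.

```python
# Illustrative item layouts for the two DisastQA question formats (field names are assumed).

mcq_item = {
    "disaster_type": "hurricane",
    "question": "Which U.S. agency coordinates federal disaster response?",
    "choices": ["FEMA", "NOAA", "USGS", "CDC"],
    "answer": "FEMA",          # a single correct option allows exact-match accuracy
}

oe_item = {
    "disaster_type": "tsunami",
    "question": "What immediate actions should coastal residents take after a tsunami warning?",
    "keypoints": [             # annotated facts a complete answer is expected to cover
        "move inland or to high ground immediately",
        "do not wait for official confirmation before evacuating",
        "stay away from the coast until authorities issue an all-clear",
    ],
}
```

The MCQ items lend themselves to exact-match accuracy, while the keypoint lists attached to open-ended items support the coverage-based scoring described below.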

The Measure of Comprehension: Beyond Lexical Overlap

Evaluating factual completeness is paramount in disaster response scenarios due to the critical need for accurate information to guide decision-making and resource allocation. However, automatic evaluation metrics commonly used in natural language generation, such as ROUGE and BLEU, primarily assess lexical overlap with reference texts and are therefore insufficient for determining factual accuracy. These metrics can be easily satisfied by responses that paraphrase source material without necessarily conveying correct or complete information, especially when dealing with complex or nuanced disaster-related events. Consequently, a high ROUGE or BLEU score does not guarantee a factually complete or reliable response, necessitating the development of metrics specifically designed to evaluate the presence and correctness of key factual elements.
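As a toy illustration of this failure mode (not an example drawn from the paper), a response can reproduce most of a reference's vocabulary while dropping its single operationally critical detail:

```python
# Toy demonstration: high lexical overlap despite a missing critical fact.
import re

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that appear in the candidate (a crude ROUGE-1 recall)."""
    ref, cand = tokens(reference), set(tokens(candidate))
    return sum(1 for t in ref if t in cand) / len(ref)

reference = "Evacuate the downtown area before 6 pm; shelters are open at the high school."
candidate = "Residents should evacuate the downtown area; shelters are open at the high school."

print(f"ROUGE-1 recall: {rouge1_recall(reference, candidate):.2f}")
# ~0.79: high lexical overlap, yet the deadline ("before 6 pm") is missing entirely.
```

A metric that checks for the presence of required facts, rather than shared wording, would penalize the omitted deadline immediately.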

Keypoint Coverage, utilized within the DisastQA evaluation framework, measures the extent to which a generated response includes essential factual details relevant to a given disaster scenario. This metric operates by identifying a predefined set of ‘keypoints’ – critical pieces of information necessary for a comprehensive understanding of the situation – and calculating the proportion of these keypoints that are explicitly addressed in the system’s output. The score is determined by comparing the keypoints present in the generated text against those identified in a reference or ‘golden’ response, providing a quantifiable assessment of factual completeness beyond simple lexical overlap as measured by traditional metrics.
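A minimal sketch of such a coverage score is shown below; keypoint matching is delegated to a pluggable judge, here a naive token-overlap heuristic that stands in for whatever matching procedure the benchmark actually uses.

```python
import re
from typing import Callable, List

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def naive_match(keypoint: str, response: str, threshold: float = 0.6) -> bool:
    """Crude stand-in for a keypoint judge: share of keypoint tokens found in the response."""
    kp = _tokens(keypoint)
    return len(kp & _tokens(response)) / max(len(kp), 1) >= threshold

def keypoint_coverage(response: str, keypoints: List[str],
                      is_covered: Callable[[str, str], bool] = naive_match) -> float:
    """Proportion of annotated keypoints addressed by the generated response."""
    if not keypoints:
        return 1.0
    return sum(is_covered(kp, response) for kp in keypoints) / len(keypoints)

# A fluent answer that addresses two of the three annotated keypoints scores ~0.67.
answer = ("Residents should move to high ground immediately and stay away from "
          "the coast until authorities issue an all-clear.")
keypoints = [
    "move inland or to high ground immediately",
    "do not wait for official confirmation before evacuating",
    "stay away from the coast until authorities issue an all-clear",
]
print(f"Keypoint coverage: {keypoint_coverage(answer, keypoints):.2f}")  # 0.67
```

In practice the judging function matters a great deal; a stricter judge, such as an entailment model or an LLM grader, would be needed to avoid crediting superficial token overlap.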

In evaluations utilizing the “Golden” context setting – characterized by the provision of complete and accurate information – state-of-the-art language models, including GPT-5.2, consistently achieve Keypoint Coverage scores exceeding 99%. This metric, measuring the proportion of critical facts from a reference text successfully incorporated into a generated response, indicates a capacity for near-perfect factual recall under ideal conditions. The high performance observed suggests that, given comprehensive and accurate input data, current large language models possess the ability to accurately synthesize and present factual information with minimal omission of key details.

Keypoint coverage consistently improves across increasing difficulty levels when using the Mix and Golden strategies compared to the Base strategy.

The Shifting Sands of Context: Testing Resilience

The DisastQA benchmark distinguishes itself through a carefully constructed evaluation framework designed to rigorously test question answering models under realistic conditions. It moves beyond simple accuracy metrics by employing three distinct settings: Base, Mix, and Golden. The 'Base' setting supplies no retrieved evidence, probing what a model can answer from its parametric knowledge alone. The 'Mix' setting introduces deliberately irrelevant passages, termed 'Retrieval Noise', alongside relevant ones, mimicking the imperfections of real-world information retrieval. The 'Golden' setting presents models with exclusively pertinent contextual evidence, establishing an upper performance bound achievable with perfect retrieval. This tiered approach allows researchers to pinpoint precisely where models falter, whether due to gaps in parametric knowledge or susceptibility to distracting data, and thereby drive improvements in robustness and reliability.
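To make the three conditions concrete, the sketch below shows one way such evaluation prompts could be assembled; the function names and noise-injection details are illustrative assumptions rather than the benchmark's released code.

```python
import random
from typing import List

def build_context(setting: str, relevant: List[str], noise_pool: List[str],
                  n_noise: int = 4, seed: int = 0) -> List[str]:
    """Assemble the evidence passages shown to the model under each DisastQA-style setting.

    Base   -> no retrieved evidence; the model relies on parametric knowledge alone.
    Mix    -> relevant passages interleaved with irrelevant 'retrieval noise'.
    Golden -> only the relevant passages (an upper bound with perfect retrieval).
    """
    rng = random.Random(seed)
    if setting == "base":
        return []
    if setting == "golden":
        return list(relevant)
    if setting == "mix":
        mixed = list(relevant) + rng.sample(noise_pool, k=min(n_noise, len(noise_pool)))
        rng.shuffle(mixed)
        return mixed
    raise ValueError(f"unknown setting: {setting}")

def make_prompt(question: str, passages: List[str]) -> str:
    """Format the question with whatever evidence the setting provides."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    header = f"Evidence:\n{evidence}\n\n" if passages else ""
    return f"{header}Question: {question}\nAnswer:"
```

Comparing scores across the three settings then separates what a model knows in isolation (Base), how much it gains from perfect evidence (Golden), and how much of that gain survives noisy retrieval (Mix).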

The introduction of irrelevant, distracting passages, the 'Retrieval Noise' that defines the 'Mix' evaluation setting, significantly degrades the performance of question-answering systems in DisastQA. The setting assesses a model's ability to pick out relevant data amidst irrelevant context, and it reveals substantial performance declines in models like Qwen-3-8B when faced with such distractions. The results underscore that even advanced language models can be led astray by extraneous information, highlighting the need for robust retrieval mechanisms and improved context filtering to ensure reliable and accurate responses in real-world applications where data quality is often imperfect.

The efficacy of question answering systems is fundamentally linked to the quality of provided contextual information, as demonstrated by the DisastQA ‘Golden’ evaluation setting. This setting isolates performance by presenting models exclusively with relevant evidence, thereby establishing a ceiling for achievable accuracy. Recent evaluations utilizing this approach reveal that GPT-5.2 currently achieves state-of-the-art results, attaining 93.1% accuracy on multiple-choice question answering tasks. This benchmark underscores that, when relieved of processing extraneous or misleading data, advanced models possess a remarkable capacity for accurate knowledge retrieval and reasoning – highlighting the critical need for effective information filtering and retrieval mechanisms in real-world applications.

The significant performance gap between the Base and Golden settings on specialized event types, such as biological or extraterrestrial events, demonstrates that current models heavily depend on retrieval mechanisms to access and reason about long-tail knowledge.

Toward Adaptive Systems: A Symbiosis of Machine and Mind

The development of truly reliable crisis information systems hinges on comprehensive evaluation, and DisastQA provides a notable step forward through its rigorous methodology. Unlike traditional question answering benchmarks, DisastQA doesn’t simply assess if an answer contains the right information, but rather if it captures all keypoints relevant to a crisis situation – a metric termed ‘Keypoint Coverage’. This nuanced approach moves beyond superficial accuracy, demanding that systems demonstrate a complete understanding of the event’s critical details. By focusing on comprehensive recall, rather than just precision, DisastQA pushes the boundaries of what constitutes a robust response, offering a pathway toward systems that can consistently deliver the vital information needed during times of emergency and ultimately improving disaster response effectiveness.

Current crisis information systems, while increasingly sophisticated, often struggle with the sheer volume of data generated during emergencies. Future development must prioritize enhancing a model’s capacity to discern critical information from noise, a process requiring not just keyword recognition, but a deeper understanding of context. Effectively leveraging contextual evidence – such as the location of an event, the time elapsed since its occurrence, and corroborating reports from multiple sources – is paramount. This necessitates moving beyond simple question-answering towards systems capable of reasoning about the reliability and relevance of information, ultimately delivering more accurate and actionable insights during critical situations. Such improvements will allow these systems to move past simply identifying facts and towards providing a comprehensive and nuanced understanding of unfolding events.

The future of crisis information systems lies in synergistic partnerships between large language models and human expertise. While LLMs demonstrate remarkable capabilities in processing and synthesizing vast amounts of data, they are not infallible, particularly when faced with the nuanced and rapidly evolving nature of crisis events. Integrating human oversight allows for critical validation of LLM outputs, correction of inaccuracies, and the incorporation of real-world context often missed by automated systems. This collaborative approach, in which humans and LLMs work in tandem, not only enhances the reliability and accuracy of crisis information but also fosters a more adaptive and resilient system capable of responding effectively to unforeseen challenges. Such human-in-the-loop systems promise to move beyond simple information retrieval toward genuine understanding and actionable insights during critical incidents.

The pursuit of robust question answering, as exemplified by DisastQA, highlights the ephemeral nature of system reliability. The benchmark’s focus on factual completeness and keypoint coverage underscores a fundamental truth: information, like all systems, decays. Donald Davies observed, “Time is not a metric; it’s the medium in which systems exist.” This resonates deeply with the findings presented, where even advanced Large Language Models exhibit vulnerabilities in high-stakes disaster scenarios. The illusion of stability is maintained only as long as the system’s cached knowledge remains relevant, a precarious balance in the face of rapidly evolving events and incomplete data. Latency, the unavoidable delay in accessing information, represents the tax paid for this temporary coherence.

What Lies Ahead?

The emergence of DisastQA, as with any meticulous record in the annals of disaster response evaluation, merely clarifies the shape of the challenges that remain. Each commit – each iteration of models tested against this benchmark – highlights not a destination achieved, but the growing awareness of what is still imperfect. The current state reveals a predictable pattern: large language models demonstrate increasing fluency, yet consistently falter when pressed for factual completeness and robust reasoning – a tax on ambition, perhaps, for prioritizing scale over substantiation.

Future work must resist the temptation to view this as a problem of data quantity. While more training examples will undoubtedly yield incremental gains, the core issue is one of systemic fragility. The benchmark itself will evolve, of course, demanding not just answers, but demonstrable evidence of their validity, tracing reasoning steps, and quantifying uncertainty.

Ultimately, the true measure of progress will not be how closely machines mimic human response, but how effectively they augment it. The study of human-LLM collaboration, as DisastQA suggests, is not about replacing expertise, but extending its reach. Every version of this benchmark, then, is a chapter in a longer, more complex story, one where graceful decay, not perfect prediction, is the most realistic outcome.


Original article: https://arxiv.org/pdf/2601.03670.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
