Author: Denis Avetisyan
Researchers have released a large-scale dataset designed to improve question answering systems for Bangla, addressing the critical need for resources in low-resource languages.

This paper introduces NCTB-QA, a comprehensive educational question answering dataset built from Bangla textbooks, and presents benchmark results for extractive QA models.
Reading comprehension systems struggle with unanswerable questions, a critical limitation that is particularly pronounced in low-resource languages. To address this, we introduce NCTB-QA, a new resource comprising over 87,000 question-answer pairs extracted from Bangladeshi national curriculum textbooks and balanced to include a significant proportion of unanswerable questions. Fine-tuning transformer-based models on NCTB-QA demonstrates substantial performance gains, with BERT achieving a 313% relative improvement in F1 score, highlighting the importance of domain-specific data for robust question answering. Will this balanced and challenging benchmark accelerate progress in Bangla NLP and facilitate the development of more reliable educational tools?
The Inevitable Decay of Data: Bridging the Bangla NLP Gap
The remarkable progress in Natural Language Processing (NLP) has not been evenly distributed; languages like Bangla, considered low-resource, continue to lag behind due to a fundamental constraint: data scarcity. Unlike English or Mandarin, which benefit from vast digital corpora, Bangla suffers from a lack of readily available, high-quality text and labeled datasets. This shortage directly impacts the performance of machine learning models, which are notoriously data-hungry; their ability to learn patterns, understand nuances, and generalize effectively is severely limited when trained on insufficient data. Consequently, even sophisticated NLP techniques often struggle to achieve comparable accuracy in Bangla, hindering the development of applications like machine translation, sentiment analysis, and information retrieval. The challenge isn’t simply a matter of quantity; the available Bangla data often lacks the diversity needed to represent the full spectrum of the language, further compounding the issue and demanding innovative approaches to data augmentation and model training.
Current Bangla question answering datasets frequently struggle with representing the nuances of real-world inquiries, limiting their effectiveness in truly assessing a model’s comprehension. Many existing resources prioritize simple, fact-based questions, failing to incorporate the complex reasoning, contextual understanding, and varied linguistic styles prevalent in natural language. This lack of diversity extends to question types – datasets often underrepresent questions requiring inference, comparative analysis, or those with ambiguous phrasing. Consequently, models trained on these limited datasets may exhibit strong performance on narrow benchmarks but falter when confronted with the breadth and complexity of authentic Bangla queries, hindering the development of genuinely robust and versatile QA systems.
Current Bangla natural language processing models frequently struggle with identifying questions that lack answers within a given context, a deficiency that manifests as ‘hallucinations’ – the confident generation of incorrect or fabricated information. This isn’t simply a matter of accuracy; it represents a fundamental limitation in a model’s understanding of its own knowledge. Without robust evaluation metrics specifically designed to test this discernment, systems may present plausible-sounding but entirely untrue responses, undermining trust and reliability. The problem is exacerbated by the nuances of the Bangla language and cultural context, where implicit information and indirect phrasing are common, requiring models to not only process linguistic input but also infer the presence – or absence – of an explicit answer. Addressing this gap demands the creation of specialized datasets and evaluation protocols focused on unanswerable questions, pushing the field toward more truthful and dependable Bangla NLP applications.
The development of dependable Bangla natural language processing applications necessitates overcoming current limitations in data and evaluation metrics. Without robust datasets and the ability to accurately identify unanswerable questions, models risk generating incorrect or fabricated responses – a phenomenon known as ‘hallucination’ – which erodes user trust and hinders practical implementation. Addressing these shortcomings isn’t merely about improving accuracy scores; it’s about establishing a foundation for NLP tools that are consistently reliable, particularly crucial in sensitive domains like healthcare, education, and information access for a vast and diverse linguistic community. Consequently, focused research on data augmentation, nuanced evaluation protocols, and methods for uncertainty estimation are paramount to unlocking the full potential of Bangla NLP and fostering widespread adoption.

NCTB-QA: A Measured Response to Data Scarcity
NCTB-QA is a Bangla question answering dataset constructed from the text of national curriculum textbooks, comprising a total of 87,805 question-answer pairs. This large scale allows for comprehensive training and evaluation of question answering models in the Bangla language. The dataset’s derivation from educational materials ensures a focus on factual knowledge and a diverse range of topics covered within the national curriculum. The substantial size of NCTB-QA facilitates the development of more robust and accurate Bangla language models compared to those trained on smaller datasets.
The generation of question-answer pairs for NCTB-QA utilized the Gemini 2.5 Pro large language model. This model was selected for its capacity to produce semantically accurate and contextually relevant content, crucial for a high-quality question answering dataset. Gemini 2.5 Pro was prompted to generate questions based on the textbook content, and then to provide corresponding answers directly extracted from the source text. The model’s output was then reviewed to ensure fidelity to the original material and adherence to grammatical standards, minimizing inaccuracies and maintaining the integrity of the knowledge base.
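The paper does not publish its generation prompts, so the following is a purely illustrative sketch of how a prompt for extractive QA-pair generation might be assembled before being sent to a model such as Gemini 2.5 Pro; the function name and prompt wording are hypothetical:

```python
def build_qa_prompt(passage: str, n_questions: int = 3) -> str:
    """Assemble a prompt asking an LLM for extractive QA pairs.

    Illustrative only: the actual prompts used to build NCTB-QA
    are not published.
    """
    return (
        "You are given a passage from a Bangla textbook.\n"
        f"Generate {n_questions} question-answer pairs in Bangla.\n"
        "Each answer must be a span copied verbatim from the passage.\n\n"
        f"Passage:\n{passage}\n"
    )

prompt = build_qa_prompt("পদার্থবিজ্ঞান হলো প্রকৃতির মৌলিক বিজ্ঞান।", n_questions=2)
print(prompt)
```

Constraining answers to verbatim spans, as the lead-in to each prompt does here, is what makes the downstream review step tractable: a generated answer can be checked mechanically against the source text.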
The NCTB-QA dataset intentionally incorporates a significant proportion of unanswerable questions to facilitate a more thorough assessment of question answering model capabilities. This design choice moves beyond simple accuracy metrics, allowing researchers to evaluate a model’s ability to abstain from answering when sufficient evidence is absent from the provided context. Specifically, including unanswerable questions assesses a model’s robustness against misleading or incomplete information, and its capacity to avoid generating potentially incorrect responses. The presence of these negative samples is crucial for gauging a model’s calibration and preventing overconfidence in its predictions, ultimately leading to more reliable and trustworthy performance.
The creation of NCTB-QA involved the automated extraction of text from national curriculum textbooks via web scraping techniques. This process facilitated the collection of a large volume of data for question-answer pair generation. Following extraction, all textual content underwent formatting in Markdown to standardize the dataset and improve its usability. Markdown formatting ensured consistent presentation, removed extraneous characters, and enabled easy parsing and integration with various natural language processing tools, ultimately contributing to data cleanliness and accessibility for research and development purposes.
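As an illustrative sketch (not the paper's actual pipeline code), scraped textbook text might be normalized and wrapped in Markdown along these lines:

```python
import re

def clean_scraped_text(raw: str) -> str:
    """Normalize scraped text before Markdown formatting:
    drop stray control characters and collapse excess whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", raw)  # control chars (not \n, \t)
    text = re.sub(r"[ \t]+", " ", text)              # runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)           # excess blank lines
    return text.strip()

def to_markdown_section(title: str, body: str) -> str:
    """Wrap a cleaned passage as a Markdown section."""
    return f"## {title}\n\n{clean_scraped_text(body)}\n"
```

A cleanup pass like this is what makes the resulting corpus easy to parse with standard NLP tooling, as the paragraph above describes.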

Evaluating Resilience: Performance on the NCTB-QA Dataset
The performance of three Transformer-based models – BERT, RoBERTa, and ELECTRA – was evaluated using the NCTB-QA dataset. To maximize accuracy, a context extraction technique was implemented prior to model input, identifying and providing relevant passages to each model for question answering. This approach allowed for a focused assessment of each model’s ability to process and understand contextual information within the NCTB-QA domain. The selected models represent a range of Transformer architectures commonly used in natural language processing tasks, enabling a comparative analysis of their effectiveness on this specific question answering benchmark.
The evaluation of Transformer-based models on the NCTB-QA dataset utilized three primary metrics to assess performance. Exact Match (EM) measures the percentage of predictions that exactly match the ground truth answers. The F1 Score calculates the harmonic mean of precision and recall between the predicted and ground truth answers, providing a measure of overlap. Finally, BERTScore leverages pre-trained contextual embeddings from BERT to compute a similarity score between the predicted and reference answers, capturing semantic similarity even when lexical overlap is limited. The combined use of these metrics provides a comprehensive evaluation of both answer accuracy and semantic relevance.
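EM and token-level F1 follow the standard SQuAD-style definitions and can be sketched in a few lines; BERTScore, by contrast, is usually computed with the off-the-shelf bert-score package rather than reimplemented:

```python
def exact_match(pred: str, gold: str) -> int:
    """1 if the normalized prediction equals the gold answer, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (SQuAD-style)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = 0
    g_left = list(g)
    for tok in p:          # count overlapping tokens, respecting multiplicity
        if tok in g_left:
            common += 1
            g_left.remove(tok)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# Token order does not matter, so a reordered but complete answer scores 1.0
print(token_f1("ঢাকা বাংলাদেশের রাজধানী", "বাংলাদেশের রাজধানী ঢাকা"))  # 1.0
```

This is why F1 is a softer metric than EM: a prediction containing the right tokens in a different order or with extra words still earns partial (or full) credit.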
Fine-tuning Transformer-based models on the NCTB-QA dataset yielded a significant performance increase, with the best F1 score reaching 0.620. Specifically, the BERT model improved from a zero-shot baseline F1 of 0.150 to 0.620 after fine-tuning, a 313% relative gain attributable to task-specific adaptation on the NCTB-QA dataset.
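The reported 313% figure follows directly from the standard relative-improvement formula:

```python
def relative_improvement(baseline: float, fine_tuned: float) -> float:
    """Relative gain over the baseline, as a percentage."""
    return (fine_tuned - baseline) / baseline * 100

# BERT: zero-shot F1 0.150 -> fine-tuned F1 0.620
print(round(relative_improvement(0.150, 0.620)))  # 313
```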
Following fine-tuning on the NCTB-QA dataset, RoBERTa and ELECTRA exhibited significant performance gains. RoBERTa’s F1 score increased to 0.469, representing a 9.6% relative improvement, while ELECTRA achieved an F1 score of 0.550, a 35.5% relative improvement. Notably, all three models – BERT, RoBERTa, and ELECTRA – demonstrated strong performance in identifying unanswerable questions, consistently achieving F1 scores between 0.82 and 0.98 on this subset of the dataset.

Towards Enduring Systems: Implications and Future Directions
The newly created NCTB-QA dataset represents a significant step forward for Bangla Natural Language Processing (NLP). By providing a sizable and meticulously curated collection of question-answer pairs sourced from national curriculum textbooks, it directly addresses the critical lack of resources that has historically hampered progress in this low-resource language. This dataset empowers researchers to train and evaluate question answering models with greater accuracy and reliability, moving beyond the limitations of smaller, less representative datasets. The availability of NCTB-QA not only facilitates the development of more robust systems capable of understanding and responding to Bangla queries, but also serves as a foundation for tackling increasingly complex NLP tasks and ultimately broadening access to information for Bangla speakers.
The inclusion of unanswerable questions within the NCTB-QA dataset represents a significant step toward building more reliable question answering systems for Bangla. Traditionally, models are trained to always produce an answer, even when the provided context lacks sufficient information – a tendency that often leads to fabricated or ‘hallucinatory’ responses. By explicitly incorporating questions without answers, the dataset compels models to develop the crucial ability to recognize informational gaps and abstain from responding. This not only enhances the trustworthiness of Bangla QA systems but also promotes the development of more nuanced evaluation metrics that accurately assess a model’s capacity for both knowledge retrieval and confident uncertainty. The ability to refrain from answering when appropriate is therefore a key advancement, moving beyond simply seeking correct answers to fostering a more responsible and reliable interaction with information.
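The standard mechanism for abstention in extractive QA, popularized by SQuAD 2.0 (an assumption here, not the paper's documented method), compares the best answer span's score against a no-answer score and declines to answer when the margin is too small:

```python
def answer_or_abstain(best_span: str, span_score: float,
                      null_score: float, threshold: float = 0.0) -> str:
    """SQuAD 2.0-style decision rule: return the best span only if its
    score beats the no-answer score by more than `threshold`; otherwise
    abstain with an empty string. Scores here are illustrative."""
    if span_score - null_score > threshold:
        return best_span
    return ""  # abstain: the context does not support an answer

print(answer_or_abstain("১৯৭১", span_score=4.2, null_score=1.1))  # answers
print(answer_or_abstain("১৯৭১", span_score=1.0, null_score=3.5))  # abstains
```

Tuning `threshold` on a development set trades recall on answerable questions against precision on unanswerable ones, which is exactly the calibration behavior the balanced NCTB-QA design is meant to exercise.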
The development of robust natural language processing (NLP) systems often faces significant hurdles when applied to low-resource languages like Bangla. This research underscores a critical need to proactively tackle data scarcity, not simply as a limitation, but as a central challenge demanding innovative solutions. Targeted dataset creation, such as the NCTB-QA dataset, proves invaluable, focusing on specific domains and question types to maximize impact. Equally important are data augmentation strategies, which can artificially expand the training data by introducing variations and paraphrases. These approaches are not merely stopgaps; they represent a fundamental shift towards building NLP tools for all languages, demonstrating that careful data engineering can overcome resource constraints and unlock the potential of previously underserved linguistic communities.
Advancing Bangla question answering necessitates a shift towards more nuanced computational approaches. Current systems often struggle with questions demanding complex reasoning: those requiring synthesis of information, common-sense inference, or multi-hop deduction. Future studies should therefore investigate techniques like graph neural networks to model relationships between entities, and explore methods for incorporating external knowledge sources, such as Bangla encyclopedias or curated knowledge graphs. Furthermore, research into neuro-symbolic approaches, which combine the strengths of neural networks with symbolic reasoning, holds promise for building QA systems capable of not only retrieving relevant information, but also explaining why a particular answer is correct, ultimately enhancing both performance and trustworthiness in this low-resource language context.

The creation of NCTB-QA speaks to a fundamental truth about all systems: they require constant tending to avoid inevitable decay. This dataset isn’t merely a collection of questions and answers; it’s an attempt to fortify Bangla NLP against the eroding effects of limited resources. As Claude Shannon observed, “Communication is the conveyance of information, not merely its transmission.” NCTB-QA actively conveys knowledge, bridging a critical gap in educational resources and ensuring that Bangla, as a language and a system of learning, doesn’t suffer from the silent degradation of neglect. The dataset’s focus on extractive QA highlights a pragmatic approach, recognizing that preserving existing knowledge is often more effective than attempting wholesale reconstruction. Every failure to provide adequate resources is, in effect, a signal from time: a warning that a system is losing its capacity.
What Lies Ahead?
The creation of NCTB-QA is, inevitably, a moment on the timeline of Bangla NLP. It establishes a baseline, a fixed point against which future models will be measured. But baselines, like all things, are subject to entropy. The dataset’s value isn’t in its current size, though scale is useful, but in its capacity to reveal what remains unseen in Bangla comprehension. The questions it cannot answer will prove more informative than those it does.
The logging of this dataset’s development (the errors, the ambiguities, the linguistic nuances missed during curation) constitutes a chronicle of the challenges inherent in low-resource language processing. Future work must address the inherent biases within any curated resource. Simply expanding the dataset isn’t sufficient; a deeper investigation into question generation, modeling the cognitive processes of educational inquiry, will be crucial.
The true test won’t be achieving higher scores on this benchmark, but in building systems that gracefully degrade as they encounter the inevitable noise and complexity of real-world language. NCTB-QA is not an ending, but an invitation to map the contours of what remains unknown, and to accept that the map itself will never be complete.
Original article: https://arxiv.org/pdf/2603.05462.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-08 20:59