Asking the Right Questions: Fine-Tuning Language Models for Clinical Insights

Author: Denis Avetisyan


Researchers demonstrate a system that effectively answers patient-focused clinical questions by leveraging efficient fine-tuning techniques and improved evidence retrieval.

This work details the QU-NLP system’s two-stage QLoRA fine-tuning of Qwen3-4B and a lexical-neural retrieval ensemble for evidence alignment, achieving competitive performance on the ArchEHR-QA 2026 shared task using the MIMIC-III dataset.

Despite advances in natural language processing, effectively bridging the gap between complex clinical queries and nuanced electronic health record data remains a significant challenge. This paper, ‘QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment’, details a system leveraging two-stage Quantised Low-Rank Adaptation (QLoRA) of the Qwen3-4B language model, coupled with a lexical-neural retrieval ensemble, to address both answer generation and evidence alignment tasks. Achieving competitive performance on the ArchEHR-QA shared task, our results demonstrate the potential of parameter-efficient fine-tuning for clinical NLP, but also highlight the critical need for expanded training datasets, suggesting that data augmentation strategies may be the most impactful path toward robust clinical question answering systems.


The Inevitable Pursuit of Clinical Depth

Effective clinical question answering transcends the limitations of simply locating relevant information; it necessitates a robust capacity for deep reasoning. Unlike traditional information retrieval systems that focus on keyword matching, addressing complex medical inquiries demands the integration of knowledge from multiple sources, the ability to interpret nuanced clinical contexts, and the application of logical inference. A physician doesn’t merely find data; they synthesize patient history, physical exam findings, laboratory results, and current medical literature to formulate a diagnosis and treatment plan. Replicating this process requires artificial intelligence capable of understanding relationships between concepts, handling uncertainty, and drawing conclusions based on incomplete or ambiguous data – a level of cognitive function far beyond the scope of basic search algorithms. Consequently, the pursuit of truly accurate clinical AI necessitates a shift from retrieval-based approaches toward systems grounded in knowledge representation, reasoning, and potentially, even causal inference.

Current clinical decision support systems often falter not because of a lack of data, but due to an inability to synthesize it effectively. These systems typically rely on keyword matching or pre-defined rules, proving inadequate when faced with the complexity of real-world patient cases where crucial information exists across multiple, seemingly unconnected sources – a progress note here, a lab result there, and imaging reports elsewhere. This fragmentation hinders a holistic understanding, leading to answers that may be technically correct but clinically incomplete or even misleading. The challenge lies in developing methods capable of identifying subtle relationships and inferring meaning from this disparate data, ultimately providing clinicians with trustworthy and comprehensive insights that support accurate diagnoses and treatment plans.

Balancing Capacity and Efficiency: The Qwen3-4B Foundation

Qwen3-4B is a language model consisting of 4 billion parameters, representing a relatively compact size within the landscape of large language models. This model architecture was selected as the base for our clinical question answering system due to its balance between performance capabilities and computational efficiency. The 4-billion parameter count allows for a substantial capacity to encode and process information relevant to clinical domains, while remaining manageable for fine-tuning and deployment on available hardware. The model’s architecture incorporates features designed to optimize performance in natural language understanding and generation tasks, making it well-suited for the specific demands of answering complex clinical questions.

QLoRA is a parameter-efficient fine-tuning method that combines Low-Rank Adaptation (LoRA) with 4-bit NormalFloat quantization. LoRA freezes the pre-trained model weights and injects trainable low-rank matrices into each layer, reducing the number of trainable parameters. Coupled with 4-bit NormalFloat quantization, which represents weights using 4 bits, QLoRA significantly reduces the memory footprint of the model during fine-tuning. This allows for adaptation of large language models on hardware with limited resources, such as a single GPU, while maintaining performance comparable to full fine-tuning.
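The LoRA half of this recipe can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: the layer sizes, rank, and scaling factor below are hypothetical, and in a real transformer the adapters are injected into attention and MLP projection matrices rather than a standalone layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; real projection layers are far larger.
d_in, d_out, r, alpha = 8, 8, 2, 16

W = rng.normal(size=(d_in, d_out))     # frozen pre-trained weight
A = rng.normal(size=(d_in, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, d_out))               # zero-initialised so the update starts at 0

def lora_forward(x):
    # Frozen path plus scaled low-rank update, with LoRA's alpha / r scaling.
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(1, d_in))
# With B = 0 the adapted layer reproduces the frozen layer exactly,
# which is why training can start from the pre-trained model's behaviour.
assert np.allclose(lora_forward(x), x @ W)
```

Only `A` and `B` receive gradients, so the trainable parameter count drops from `d_in * d_out` to `r * (d_in + d_out)` per adapted matrix.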

The utilization of QLoRA – combining Low-Rank Adaptation (LoRA) with 4-bit NormalFloat quantization – significantly reduced the computational demands of fine-tuning Qwen3-4B for clinical question answering. By quantizing the model weights to 4-bit precision and introducing trainable low-rank matrices, the number of trainable parameters was substantially decreased. This reduction in parameter count allowed for effective fine-tuning using a single NVIDIA RTX 3090 GPU with 24GB of VRAM, which would be impractical with full fine-tuning of a 4-billion parameter model. The resulting model achieved performance comparable to larger, fully fine-tuned models while requiring considerably fewer computational resources and enabling faster training iterations.
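The memory saving comes from storing the frozen weights in 4-bit blocks with one higher-precision scale per block. The sketch below uses uniform absmax levels for simplicity; actual NF4 places its 16 levels at quantiles of a normal distribution, and the block size here is an arbitrary choice, not the one used in practice.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Blockwise absmax quantization to 4-bit integers.

    Uniform levels in [-7, 7] for illustration; NF4 proper uses
    non-uniform, normal-quantile levels, but the storage arithmetic
    (4 bits per weight plus one scale per block) is the same."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True)  # one scale per block
    q = np.round(blocks / scale * 7).astype(np.int8)   # fits in 4 bits
    return q, scale

def dequantize_4bit(q, scale):
    return q / 7.0 * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256,)).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale).reshape(-1)
# Reconstruction error stays within half a quantization step per block.
print(float(np.abs(w - w_hat).max()))
```

At 4 bits per weight plus a small per-block scale, the frozen base model occupies roughly an eighth of its float32 footprint, which is what makes a 24GB GPU sufficient.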

Layered Expertise: A Two-Stage Training Regimen

The initial phase of model training utilized the emrQA-MedSQuAD dataset, a resource specifically designed for question answering within the clinical domain. This pre-training step aimed to impart a foundational understanding of medical terminology, clinical note structure, and common question types encountered in electronic health records. The dataset consists of question-answer pairs derived from real clinical notes, enabling the model to learn associations between questions and their corresponding answers prior to task-specific fine-tuning. By establishing this baseline understanding of clinical language, the subsequent training stages could focus on refining the model’s ability to address more complex reasoning and inference tasks.

Following initial pre-training, the model underwent fine-tuning utilizing the ArchEHR-QA development dataset, a collection specifically designed for evaluating performance on patient-oriented question answering tasks. This dataset consists of clinical notes paired with questions formulated from a patient’s perspective, requiring the model to extract and synthesize information relevant to a typical patient inquiry. The fine-tuning process involved supervised learning, optimizing model parameters to maximize accuracy in answering these patient-focused questions, thereby enhancing its ability to address real-world clinical queries as they might be posed by a patient seeking information about their health or treatment.

The implementation of a two-stage training pipeline resulted in measurable gains in clinical question answering performance. Initial pre-training on the emrQA-MedSQuAD dataset provided a foundational understanding of medical terminology and question formats. Subsequent fine-tuning on the ArchEHR-QA development set, which focuses on complex, patient-centered queries, further refined the model’s ability to extract and synthesize information. Evaluations demonstrated a statistically significant improvement in both response accuracy and relevance compared to models trained using a single-stage approach, indicating that the sequential training process effectively transfers knowledge and enhances performance on clinically relevant tasks.
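The two-stage recipe, broad pre-training followed by fine-tuning on a small target set, can be illustrated with a toy convex model. Everything below (data shapes, learning rate, step counts) is invented for illustration; only the structure of sequential training mirrors the pipeline described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.05, steps=200):
    # Plain full-batch gradient descent on a least-squares objective.
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(X)
    return w

# Stage 1: large, broad "pre-training" set (stand-in for emrQA-MedSQuAD).
X1 = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y1 = X1 @ w_true + 0.1 * rng.normal(size=200)

# Stage 2: small task-specific set (stand-in for the ArchEHR-QA dev data).
X2 = rng.normal(size=(20, 5))
y2 = X2 @ w_true + 0.05 * rng.normal(size=20)

w = np.zeros(5)
loss0 = mse(w, X2, y2)        # before any training
w = train(w, X1, y1)          # stage 1: learn the shared structure
loss1 = mse(w, X2, y2)
w = train(w, X2, y2)          # stage 2: adapt to the target task
loss2 = mse(w, X2, y2)
assert loss2 <= loss1 < loss0
```

The stage-1 model already transfers well to the target data; stage 2 then refines it, which is the same intuition behind pre-training on emrQA-MedSQuAD before fine-tuning on ArchEHR-QA.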

Measuring Clinical Insight: Performance and Results

To rigorously evaluate the quality and factual grounding of generated answers, the system’s performance in answer generation was assessed with a suite of established metrics. BLEU and ROUGE (specifically ROUGE-L) measured n-gram and subsequence overlap between the generated text and reference answers, providing insight into fluency and content similarity. Recognizing the limitations of surface overlap, the evaluation also incorporated AlignScore and MEDCON: AlignScore assessed the alignment between the generated answer and the supporting evidence, while MEDCON gauged factual consistency by measuring the overlap of clinical concepts between the generated and reference text. This multi-faceted approach captured not only whether responses are coherent, but whether they are firmly grounded in the provided evidence.
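As a concrete reference point, ROUGE-L scores the longest common subsequence (LCS) between candidate and reference. A minimal pure-Python version (whitespace tokenisation only, with none of the stemming or normalisation standard ROUGE implementations apply):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the patient was given iv antibiotics",
                 "the patient received iv antibiotics"))  # ≈ 0.727
```

Because LCS tolerates gaps, the paraphrase above still scores well despite the word substitution, which is exactly why ROUGE-L is preferred over strict n-gram matching for free-form answers.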

The system’s ability to pinpoint relevant evidence within complex medical texts was enhanced through a carefully constructed weighted ensemble for the evidence alignment task. This approach combined the strengths of three retrieval methods: BM25, a probabilistic lexical ranking function; TF-IDF, which weights terms by their frequency in a document and rarity across the corpus; and a Cross-Encoder, a neural model that scores question-sentence pairs jointly and captures contextual relationships. By assigning different weights to each method, the system leveraged the speed of the lexical scorers alongside the accuracy of the Cross-Encoder, resulting in a robust and effective evidence retrieval process. Performance was assessed using the Micro-F1 score, a standard metric for this kind of retrieval task, demonstrating significant gains over a baseline relying solely on the Cross-Encoder.
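A score-level weighted ensemble of this kind can be sketched as follows. The weights, threshold, and candidate scores below are all made up for illustration; the paper’s actual weighting scheme is not reproduced here.

```python
import numpy as np

def minmax(s):
    # Normalise one scorer's outputs to [0, 1] so scorers are comparable.
    s = np.asarray(s, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def ensemble_scores(bm25, tfidf, cross, weights=(0.2, 0.2, 0.6)):
    # Weighted sum of normalised scores; weights are illustrative only.
    parts = [minmax(s) for s in (bm25, tfidf, cross)]
    return sum(w * p for w, p in zip(weights, parts))

# Scores for four candidate note sentences (invented numbers).
bm25  = [12.1, 3.4, 7.8, 0.5]
tfidf = [0.31, 0.05, 0.22, 0.01]
cross = [0.10, 0.85, 0.40, 0.05]

combined = ensemble_scores(bm25, tfidf, cross)
# Keep sentences above a decision threshold; the threshold is a free parameter.
selected = [i for i, s in enumerate(combined) if s >= 0.5]
print(combined.round(3), selected)
```

Note how sentence 2 is selected only because the lexical and neural signals agree: neither the Cross-Encoder nor BM25 alone ranks it confidently, which is the complementarity the ensemble exploits.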

The system attained a score of 32.87 on the challenging ArchEHR-QA test-2026 split, signifying robust capabilities in both answer generation and evidence retrieval. This performance indicates the model effectively synthesizes information to formulate accurate responses while simultaneously pinpointing the relevant supporting evidence within complex clinical texts. The achieved score reflects a balance between linguistic quality and factual grounding, demonstrating the system’s ability to not only sound correct but also to be reliably supported by the source material – a critical attribute for applications in healthcare and medical research.

The system demonstrated a significant advancement in identifying relevant evidence to support generated answers, achieving a Micro-F1 score of 67.16 for evidence sentence alignment. This score represents a substantial 6.3 point improvement compared to a baseline model utilizing a Cross-Encoder alone, which achieved a score of 60.82. This enhanced performance suggests the weighted ensemble approach effectively combines the strengths of BM25, TF-IDF, and the Cross-Encoder, leading to more accurate and reliable identification of supporting evidence within complex clinical texts. The improvement highlights the efficacy of the ensemble in discerning nuanced relationships between questions and relevant information, ultimately bolstering the system’s ability to provide well-supported answers.
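Micro-F1, the metric behind these numbers, pools true positives, false positives, and false negatives across all questions before computing F1, so questions with more gold evidence sentences carry proportionally more weight. A small sketch with invented sentence indices:

```python
def micro_f1(gold_sets, pred_sets):
    # Pool counts over all questions, then compute a single F1 (micro-averaging).
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # correctly retrieved evidence sentences
        fp += len(pred - gold)   # retrieved but not in the gold set
        fn += len(gold - pred)   # gold sentences the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Gold and predicted evidence-sentence indices for three questions (made up).
gold = [{1, 4}, {2}, {0, 3, 5}]
pred = [{1, 4}, {2, 6}, {0, 3}]
print(micro_f1(gold, pred))  # ≈ 0.833
```

Scores like 67.16 and 60.82 in the text correspond to this value expressed as a percentage.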

Evaluations conducted on the ArchEHR-QA test-2026 split demonstrate the system’s capacity for generating coherent and relevant responses, with both BLEU and ROUGE-L scores of approximately 32.87. These metrics, widely used in natural language processing, assess the similarity between generated text and human-written reference answers, with BLEU focusing on n-gram precision and ROUGE-L on the longest common subsequence. The comparable scores across both metrics suggest the system not only captures the essential information but also presents it in a form close to the human-authored references, indicating solid linguistic fluency and accuracy in its responses.

The system’s performance on the ArchEHR-QA test-2026 split demonstrated a remarkable level of competitiveness, achieving BLEU scores within 0.50 points and ROUGE-L scores within 0.81 points of the highest-performing system. This near state-of-the-art result underscores the efficacy of the implemented approach to answer generation and evidence retrieval. The minimal performance gap highlights a substantial advancement in the system’s ability to not only formulate accurate responses but also to closely mirror the quality and fluency of leading models in the field, suggesting a highly refined and effective natural language processing pipeline.

The pursuit of robust clinical question answering systems, as demonstrated by this work with Qwen3-4B and QLoRA, echoes a fundamental truth about all complex systems: continuous refinement is paramount. The authors navigate the inherent decay of information retrieval through iterative fine-tuning, acknowledging that even the most advanced models require adaptation to maintain efficacy. This resonates with Paul Erdős, who once stated, “A mathematician knows a lot of things, but knows nothing deeply.” Similarly, this system doesn’t claim absolute knowledge, but rather a carefully constructed approximation, constantly updated through versioning and addressing the inevitable drift in clinical data and query patterns. The two-stage approach to QLoRA and evidence alignment isn’t merely about achieving a high score; it’s about building a system that ages gracefully, acknowledging that every commit is a record in the annals, and every version a chapter in its ongoing evolution.

What Lies Ahead?

The presented system, while demonstrating competence, merely sketches a provisional boundary around a far more substantial challenge. The architecture, built upon the foundation of Qwen3-4B and QLoRA, reveals the predictable trade-offs inherent in all such constructions: gains in efficiency achieved through compression inevitably introduce a degree of informational entropy. Every delay in achieving general clinical reasoning is, in effect, the price of understanding these limitations. The system’s reliance on lexical-neural ensembles for evidence alignment, while functional, hints at a persistent fragility – a dependence on surface-level features that may not generalize across the inevitable shifts in medical terminology and documentation practices.

Future work must address the issue of temporal drift. Electronic Health Records are not static repositories, but evolving narratives. A system trained on MIMIC-III today will, inevitably, exhibit diminished performance tomorrow. The true measure of success will not be achieving a high score on a shared task, but building systems capable of continuous adaptation and self-correction. Architecture without a mechanism for historical awareness, an understanding of its own evolution, is, ultimately, ephemeral.

The pursuit of patient-oriented clinical question answering is, fundamentally, a search for a robust and enduring system. The current focus on model size and fine-tuning techniques, while necessary, should not eclipse the more profound questions surrounding knowledge representation, reasoning, and the elusive goal of true clinical understanding.


Original article: https://arxiv.org/pdf/2604.14175.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-18 20:20