Advanced AI models can diagnose medical conditions with impressive accuracy when physicians provide them with relevant research papers, but their performance crashes when forced to find that evidence on their own. A new benchmark called CURE reveals this critical weakness in how AI systems handle clinical decision-making, highlighting a fundamental challenge in deploying these tools in real medical settings.
Researchers have developed the Clinical Understanding and Retrieval Evaluation (CURE) benchmark to assess how well multimodal AI systems can both reason about medical cases and retrieve supporting literature. The results expose a stark performance gap that could influence how medical AI is deployed in practice.
CURE evaluates AI models across 500 multimodal clinical cases, each mapped to physician-cited reference literature. Unlike existing benchmarks that focus on end-to-end medical question answering, CURE separates two critical capabilities: the ability to reason through clinical evidence and the ability to find that evidence in the first place.
The benchmark tests state-of-the-art multimodal large language models (MLLMs) in both closed-ended and open-ended diagnosis scenarios. When models receive physician-selected reference materials alongside patient cases, they demonstrate strong clinical reasoning abilities. But when those same models must independently retrieve supporting literature, their diagnostic accuracy plummets.
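The paper's evaluation harness isn't reproduced here, but the two-condition comparison is easy to picture in code. The sketch below is a hypothetical rendering: the `ClinicalCase` schema and the `model.diagnose` / `retriever.search` interfaces are assumptions for illustration, not CURE's actual data format or API.

```python
from dataclasses import dataclass

@dataclass
class ClinicalCase:
    """One hypothetical CURE-style case: multimodal findings plus
    the physician-cited references that support the diagnosis."""
    case_id: str
    presentation: str            # textual history and exam findings
    image_paths: list[str]       # e.g., radiology or pathology images
    cited_references: list[str]  # physician-selected literature
    answer: str                  # ground-truth diagnosis

def evaluate(model, retriever, cases: list[ClinicalCase]) -> dict[str, float]:
    """Score the same model under two conditions: with oracle
    (physician-cited) references, and with self-retrieved ones."""
    oracle_hits = retrieval_hits = 0
    for case in cases:
        # Condition 1: reasoning only -- references handed to the model.
        pred = model.diagnose(case.presentation, case.image_paths,
                              references=case.cited_references)
        oracle_hits += int(pred == case.answer)

        # Condition 2: the model must find its own evidence first.
        retrieved = retriever.search(case.presentation, top_k=5)
        pred = model.diagnose(case.presentation, case.image_paths,
                              references=retrieved)
        retrieval_hits += int(pred == case.answer)

    n = len(cases)
    return {"oracle_accuracy": oracle_hits / n,
            "retrieval_accuracy": retrieval_hits / n}
```

The gap the researchers report corresponds to the difference between the two numbers this kind of harness returns: holding the model fixed and swapping only the source of references isolates retrieval as the failing component.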
The research team, led by Yannian Gu and colleagues from multiple institutions, designed CURE to address limitations in current medical AI evaluation. Most existing benchmarks evaluate models in isolation, without considering how they would perform in real clinical workflows where finding relevant research is as important as interpreting it.
"This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature," the researchers write in their paper, published February 28, 2026.
Key findings from the benchmark:

- AI models show strong diagnostic reasoning when given appropriate references
- Performance drops by nearly 50 percentage points when models must find evidence independently
- The gap exists across both closed-ended and open-ended diagnostic tasks
- Current retrieval mechanisms appear insufficient for clinical literature search
The benchmark reveals that medical AI development has focused heavily on reasoning capabilities while neglecting information retrieval skills. This imbalance could limit the practical deployment of these systems in clinical settings, where physicians regularly consult medical literature to inform their decisions.
Clinical diagnosis inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. The CURE results suggest that while AI has made significant progress on the synthesis component, literature consultation remains a major weakness.
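To make that consultation step concrete, here is a minimal dense-retrieval sketch using the sentence-transformers library. The three-document toy corpus, the general-purpose encoder, and the `retrieve` helper are all illustrative assumptions, not the retrieval mechanism CURE evaluates; real clinical search must rank millions of papers, which is where precision degrades.

```python
from sentence_transformers import SentenceTransformer, util

# A toy corpus standing in for the medical literature.
corpus = [
    "Kawasaki disease: diagnosis and management in children.",
    "Community-acquired pneumonia: imaging findings and treatment.",
    "Scarlet fever presentations in pediatric patients.",
]

# A general-purpose encoder, not one tuned for clinical text.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(case_text: str, top_k: int = 2) -> list[str]:
    """Rank corpus passages by cosine similarity to the case description."""
    query_emb = encoder.encode(case_text, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    best = scores.argsort(descending=True)[:top_k]
    return [corpus[int(i)] for i in best]

case = ("5-year-old with five days of fever, conjunctival injection, "
        "strawberry tongue, and a polymorphous rash.")
print(retrieve(case))
```

Embedding similarity of this kind can surface topically related passages, but CURE's results indicate that finding the precise, physician-grade supporting reference is a much harder problem than this toy setup implies.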
The researchers have made CURE publicly available to encourage further development in this area. The benchmark provides a standardized way to evaluate both components of clinical AI performance, potentially guiding future research toward more balanced capabilities.
The findings have implications beyond medical AI, highlighting a broader challenge in developing AI systems that can both reason about information and effectively search for it. As these models become more sophisticated in their reasoning abilities, the retrieval bottleneck may become an increasingly important factor in their real-world utility.