
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

1ETH Zurich 2Stanford University 3Lund University
4Jawaharlal Institute of Postgraduate Medical Education and Research 5UCSF
6University of Zurich 7University Hospital Zurich 8HOPPR

*Equal Contribution Co-senior Authors

Abstract

Multimodal in-context learning (ICL) remains underexplored despite significant potential for specialized domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts contributed problems, each including (1) a multimodal query and (2) multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. Benchmark evaluation of 15 MLLMs reveals that most exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. Analysis reveals the importance of selecting relevant in-context examples: one noisy or irrelevant example can result in average performance reductions of up to 9.5%. We also identify a recency bias in MLLMs, where placing the most relevant example last in the example list can lead to substantial improvements in performance. SMMILE thus highlights critical limitations and biases in current MLLMs when learning multimodal medical tasks from context.

How is SMMILE created?

👥 Expert Collaboration

Stanford Multimodal Medical In-Context Learning (SMMILE) was developed in collaboration with an international team of 11 medical experts (averaging 6.4 years of clinical experience) from leading institutions. The experts bring specialty expertise in radiology, general medicine, and pathology. Our expert-driven approach ensures the clinical relevance and accuracy of the benchmark problems.

🔧 Problem Creation Process

Each expert contributed 10 problems through a guided, step-by-step web interface. The problems span 6 medical specialties and 13 imaging modalities, ensuring comprehensive coverage of medical visual reasoning tasks. Each problem includes multimodal in-context examples followed by a query that tests the model's ability to learn from the provided demonstrations.

✅ Quality Control

Every problem underwent rigorous manual quality control, including inspection by multiple authors, categorization, and spell-checking. This process resulted in a high-quality dataset with consistent formatting and clinical accuracy. The final dataset contains 111 problems comprising 517 question-image-answer triplets.

Examples of Problems

📋 Problem Structure

SMMILE problems consist of multimodal in-context examples followed by a query that tests the model's ability to learn from the provided demonstrations. Each problem includes the following components (a schematic sketch follows the list):
  • In-context Examples: Expert-curated image-question-answer demonstrations of the task
  • Query: A final image-question pair that the model must answer based on the demonstrated pattern
  • Ground Truth: An expert-validated answer to the query, used for evaluation
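
The structure above maps naturally onto an interleaved, chat-style prompt. The Python sketch below shows one way such a problem could be represented and serialized for an MLLM; the class names, field names, and message format are illustrative assumptions, not the benchmark's actual schema or evaluation code.

# Minimal sketch of a SMMILE-style problem and its serialization into an
# interleaved few-shot prompt. All names here are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    image_path: str   # path to the medical image
    question: str     # question about the image
    answer: str       # expert-provided answer (held out for the query)

@dataclass
class Problem:
    demonstrations: List[Example]  # expert-curated in-context examples
    query: Example                 # final image-question pair to be answered

def build_messages(problem: Problem) -> list:
    """Interleave demonstrations and the query into a chat-style prompt."""
    messages = []
    for ex in problem.demonstrations:
        messages.append({"role": "user",
                         "content": [{"type": "image", "path": ex.image_path},
                                     {"type": "text", "text": ex.question}]})
        messages.append({"role": "assistant", "content": ex.answer})
    # The query repeats the demonstrated pattern but leaves the answer to the model.
    messages.append({"role": "user",
                     "content": [{"type": "image", "path": problem.query.image_path},
                                 {"type": "text", "text": problem.query.question}]})
    return messages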

🏥 Medical Coverage

Medical Specialties Covered: Radiology, Pathology, Dermatology, Ophthalmology, Surgery, and General Medicine

Imaging Modalities: X-Ray, CT, MRI, Ultrasound, Photographs, Staining, ECG, EEG, Mammogram, Fundus Photography, and more

Task Types: Classification problems, diagnostic questions, reasoning tasks, and quantitative analyses, requiring cognitive processes that range from pattern recognition to complex medical reasoning.
Examples of SMMILE problems showing multimodal in-context learning tasks across different medical specialties and imaging modalities.

Results

🔍 Evaluation Overview

We evaluated 15 state-of-the-art multimodal large language models on SMMILE, including both open-source and closed-source models. Our findings reveal significant limitations in current MLLMs' ability to perform multimodal in-context learning in medical settings.

📊 Key Findings

Limited ICL Benefits: Most MLLMs show minimal improvement from in-context learning, with only an average 8% improvement on SMMILE and 9.4% on SMMILE++.

Model Performance: GPT-4o emerged as the overall leader on SMMILE, while Qwen2.5-VL-72B showed superior performance on the larger SMMILE++ dataset. Domain-specific medical models did not consistently outperform general-purpose models of similar size.

⚠️ Critical Biases Identified

We discovered important limitations, including recency bias (models heavily weight the last example) and sensitivity to example quality (a single irrelevant example can reduce performance). Models particularly struggled with numerical answers (achieving 0% accuracy), quantitative reasoning tasks, and certain imaging modalities such as MRI and medical illustrations.
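
To make the two perturbations concrete, the sketch below shows how they could be constructed: swapping one demonstration for an irrelevant distractor, and reordering demonstrations so the most relevant one appears last. This is a minimal sketch under assumed data structures (a list of example records and a relevance-scoring function); it is not the evaluation code used in the paper.

# Illustrative perturbations of the in-context example list. Names are hypothetical.
import random

def inject_irrelevant_example(examples, distractor, position=0):
    """Replace one demonstration with an unrelated example (noise-sensitivity probe)."""
    perturbed = list(examples)
    perturbed[position] = distractor
    return perturbed

def order_most_relevant_last(examples, relevance_to_query):
    """Sort demonstrations by ascending relevance so the most relevant comes last (recency-bias probe)."""
    return sorted(examples, key=relevance_to_query)

def shuffle_examples(examples, seed=0):
    """Random ordering baseline to compare against the relevance-last ordering."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled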

🎯 Implications

These results highlight the substantial gap between current MLLM capabilities and the requirements for reliable clinical applications, pointing to important directions for future model development in medical multimodal in-context learning.

BibTeX


@misc{rieff2025smmileexpertdrivenbenchmarkmultimodal,
      title={SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning},
      author={Melanie Rieff and Maya Varma and Ossian Rabow and Subathra Adithan and Julie Kim and Ken Chang and Hannah Lee and Nidhi Rohatgi and Christian Bluethgen and Mohamed S. Muneer and Jean-Benoit Delbrouck and Michael Moor},
      year={2025},
      eprint={2506.21355},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.21355},
}