<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>QUEST on Shin Li</title><link>https://shin13.github.io/tags/quest/</link><description>Recent content in QUEST on Shin Li</description><generator>Hugo</generator><language>en-US</language><copyright>Shin Li</copyright><lastBuildDate>Wed, 06 May 2026 04:22:36 +0800</lastBuildDate><atom:link href="https://shin13.github.io/tags/quest/index.xml" rel="self" type="application/rss+xml"/><item><title>[Research] A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review</title><link>https://shin13.github.io/notes/framework-for-human-evaluation-of-large-language-models-in-healthcare-derived-from-literature-review/</link><pubDate>Fri, 14 Nov 2025 14:47:00 +0800</pubDate><guid>https://shin13.github.io/notes/framework-for-human-evaluation-of-large-language-models-in-healthcare-derived-from-literature-review/</guid><description>&lt;h2 id="paper-overview"&gt;Paper Overview&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Title:&lt;/strong&gt; A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Journal/Conference:&lt;/strong&gt; npj Digital Medicine&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2024&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DOI/Link:&lt;/strong&gt; &lt;a href="https://doi.org/10.1038/s41746-024-01258-7"&gt;https://doi.org/10.1038/s41746-024-01258-7&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This scoping review analyzes 142 studies that applied human evaluation to healthcare LLMs and finds that current practice is inconsistent and under-reported, with evaluation designs often too weak for high-risk clinical use cases; in response, it proposes the QUEST human evaluation framework.&lt;/p&gt;
&lt;h2 id="selected-figures"&gt;Selected Figures&lt;/h2&gt;
&lt;h3 id="figure-1-healthcare-applications-of-llms"&gt;Figure 1. Healthcare applications of LLMs&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Fig. 1: Healthcare applications of LLMs." loading="lazy" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig1_HTML.png"&gt;&lt;/p&gt;
&lt;p&gt;This figure maps the healthcare application areas where human evaluation of LLMs has been reported most often, including clinical decision support, medical education, patient education, and question answering.&lt;/p&gt;
&lt;h3 id="figure-7-quest-human-evaluation-framework"&gt;Figure 7. QUEST human evaluation framework&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Fig. 7: The proposed QUEST human evaluation framework, delineating the multi-stage process for evaluating healthcare-related LLMs." loading="lazy" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig7_HTML.png"&gt;&lt;/p&gt;
&lt;p&gt;This is the most important figure in the paper: it distills the review findings into a practical, multi-stage evaluation workflow organized around the five QUEST principles of Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.&lt;/p&gt;
&lt;h3 id="figure-9-prisma-flow-diagram"&gt;Figure 9. PRISMA flow diagram&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Fig. 9: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the article screening and identification process." loading="lazy" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig9_HTML.png"&gt;&lt;/p&gt;
&lt;p&gt;This figure summarizes the literature search and screening process behind the 142 included studies.&lt;/p&gt;</description></item></channel></rss>