[Research] A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review
Paper Overview

Title: A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review
Authors: Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang
Journal/Conference: npj Digital Medicine
Year: 2024
DOI/Link: https://doi.org/10.1038/s41746-024-01258-7

This scoping review analyzes 142 studies of human evaluation for healthcare LLMs and argues that current practice is inconsistent, under-specified, and often too weak for high-risk clinical use cases.

Selected Figures

Figure 1. Healthcare applications of LLMs
This figure shows where human evaluation has been used most often: clinical decision support, medical education, patient education, and question answering.

Figure 7. QUEST human evaluation framework
This is the most important figure in the paper because it turns the review findings into a practical evaluation workflow.

Figure 9. PRISMA flow diagram
This figure summarizes the literature search and screening process behind the 142 included studies.

...