<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>QUEST on Shin Li</title><link>https://shin13.github.io/tags/quest/</link><description>Recent content in QUEST on Shin Li</description><generator>Hugo</generator><language>en-US</language><copyright>Shin Li</copyright><lastBuildDate>Wed, 06 May 2026 04:22:36 +0800</lastBuildDate><atom:link href="https://shin13.github.io/tags/quest/index.xml" rel="self" type="application/rss+xml"/><item><title>[Research] A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review</title><link>https://shin13.github.io/notes/framework-for-human-evaluation-of-large-language-models-in-healthcare-derived-from-literature-review/</link><pubDate>Fri, 14 Nov 2025 14:47:00 +0800</pubDate><guid>https://shin13.github.io/notes/framework-for-human-evaluation-of-large-language-models-in-healthcare-derived-from-literature-review/</guid><description>&lt;h2 id="paper-overview"&gt;Paper Overview&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Title:&lt;/strong&gt; A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Journal/Conference:&lt;/strong&gt; npj Digital Medicine&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2024&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DOI/Link:&lt;/strong&gt; &lt;a href="https://doi.org/10.1038/s41746-024-01258-7"&gt;https://doi.org/10.1038/s41746-024-01258-7&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This scoping review analyzes 142 studies that applied human evaluation to healthcare LLMs and finds that current practice is inconsistent and under-reported, with evaluation designs often too weak for high-risk clinical use cases; in response, it proposes the QUEST human evaluation framework.&lt;/p&gt;
&lt;h2 id="selected-figures"&gt;Selected Figures&lt;/h2&gt;
&lt;h3 id="figure-1-healthcare-applications-of-llms"&gt;Figure 1. Healthcare applications of LLMs&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Fig. 1: Healthcare applications of LLMs." loading="lazy" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig1_HTML.png"&gt;&lt;/p&gt;
&lt;p&gt;This figure maps the healthcare application areas where human evaluation of LLMs has been reported most often, including clinical decision support, medical education, patient education, and question answering.&lt;/p&gt;
&lt;h3 id="figure-7-quest-human-evaluation-framework"&gt;Figure 7. QUEST human evaluation framework&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Fig. 7: The proposed QUEST human evaluation framework, delineating the multi-stage process for evaluating healthcare-related LLMs." loading="lazy" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig7_HTML.png"&gt;&lt;/p&gt;
&lt;p&gt;This is the most important figure in the paper: it distills the review findings into a practical, multi-stage evaluation workflow organized around the five QUEST principles of Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.&lt;/p&gt;
&lt;h3 id="figure-9-prisma-flow-diagram"&gt;Figure 9. PRISMA flow diagram&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Fig. 9: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the article screening and identification process." loading="lazy" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig9_HTML.png"&gt;&lt;/p&gt;
&lt;p&gt;This figure summarizes the literature search and screening process behind the 142 included studies.&lt;/p&gt;</description></item></channel></rss>