[{"content":"IBM Technology\u0026rsquo;s The 7 Skills You Need to Build AI Agents makes a point that feels increasingly true: if an agent can act in the real world, then prompt writing is only the starting point.\nThe more useful framing is this: prompt engineering is the recipe, but agent engineering is the kitchen.\nA production-grade agent needs structure, contracts, failure handling, traceability, and a clear product experience. In other words, it needs engineering discipline.\nFig 1: Break agent engineering into seven core skills, then fill in the gaps one module at a time. The 7 skills For engineers who are new to agentic systems, I’d turn the seven skills into these practical rules:\n1) System design Start with one control loop: state in, tool call out, result back, state updated. Keep planning, execution, and persistence separate so failures are easy to trace. 2) Tool and contract design Treat tools like strict APIs. OpenAPI exists for exact request/response contracts, not loose suggestions. Give the model the smallest safe tool set; validate every parameter before execution. 3) Retrieval engineering Retrieval quality sets the ceiling for the agent. Use chunking, metadata, and reranking rather than embeddings alone. Check whether the right evidence was retrieved before tuning prompts. 4) Reliability engineering Expect timeouts, rate limits, and duplicate calls. Make actions idempotent and add bounded retries. Use SLOs and error budgets to decide when to degrade or stop. 5) Security and safety Assume prompt injection and unsafe output handling. Keep untrusted text separate from tool instructions. Use least privilege, allow-lists, and human approval for high-impact actions. 6) Evaluation and observability Log prompts, retrieved context, tool calls, and final outcomes. OpenTelemetry-style traces help connect the whole flow. Build offline evals early, then compare them with real user traces and failure cases. 
7) Product thinking Define success criteria before shipping. Anthropic’s prompt-engineering docs explicitly recommend clear success criteria and empirical tests. Add clarification, escalation, and graceful fallback paths so users can recover when the agent is uncertain.\nWhat I liked most is that the video refuses to romanticize agents. Real systems need boundaries. They need retries, fallbacks, logs, and a human-centered experience.\nThat also means a useful debugging instinct: when an agent fails, trace backward before you rewrite the prompt. Was the right document retrieved? Was the tool schema clear? Did the system fail before the model even had a chance to help?\nFig 2: An agent is not just a bigger model. It needs a balance of design, boundaries, permissions, and guidance. What to study next If you want to go deeper, these are good companion resources for each skill area:\n1) System design Designing Data-Intensive Applications Google SRE book\n2) Tool and contract design API Design Patterns OpenAPI Specification Swagger OpenAPI Specification overview\n3) Retrieval engineering OpenSearch Documentation Elasticsearch Reference Designing Machine Learning Systems\n4) Reliability engineering Release It! Second Edition Google SRE book Testing Strategies in a Microservice Architecture\n5) Security and safety OWASP Top 10 for Large Language Model Applications OpenAPI Specification for strict request/response contracts\n6) Evaluation and observability OpenTelemetry docs Designing Machine Learning Systems\n7) Product thinking Inspired The Product-Minded Engineer Continuous Discovery Habits\nMy takeaway Fig 3: An LLM service is just an input-output loop. An agent system also includes tools, retrieval, state, retries, logs, and safety controls. 
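To make one of those boundaries concrete, the reliability rule from skill 4 (idempotent actions plus bounded retries) can be sketched in a few lines of Python. This is my own illustrative sketch, not code from the video; the tool, payload, idempotency key, and retry limit are all invented names.

```python
import time

# Illustrative sketch: an idempotency cache plus a bounded retry loop
# around a tool call that may time out or be invoked twice.
_completed = {}  # idempotency key -> cached result of a finished action

def call_tool(tool, payload, key, max_retries=3):
    if key in _completed:
        return _completed[key]        # duplicate call: no repeated side effect
    for attempt in range(max_retries):
        try:
            result = tool(payload)
            _completed[key] = result
            return result
        except TimeoutError:
            time.sleep(2 ** attempt)  # exponential backoff, strictly bounded
    raise RuntimeError('retry budget exhausted: degrade or stop')
```

With a tool that times out once and then succeeds, the call recovers on the second attempt, and a repeated call with the same key returns the cached result instead of re-executing the side effect.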
The title “prompt engineer” still describes a useful entry point, but it no longer describes the full job.\nIf we want agents that people can trust, we need to think like engineers, product builders, and system designers at the same time.\nSource YouTube: https://youtu.be/mtiOK2QG9Q0?si=ITxYMB-1FnRJpcGp Video title: The 7 Skills You Need to Build AI Agents Channel: IBM Technology ","permalink":"https://shin13.github.io/notes/the-7-skills-you-need-to-build-ai-agents/","summary":"\u003cp\u003eIBM Technology\u0026rsquo;s \u003cem\u003eThe 7 Skills You Need to Build AI Agents\u003c/em\u003e makes a point that feels increasingly true: if an agent can act in the real world, then prompt writing is only the starting point.\u003c/p\u003e","title":"The 7 Skills You Need to Build AI Agents"},{"content":"I have been looking for a clean way to explain what /goal really does in Codex.\nThe most useful mental model I found is simple: /goal is not a prettier prompt. It is a working contract for long-running agent work. You are telling the agent what success looks like, what the boundary is, and how to know when to stop.\nThat framing matters because the feature is built for work that outlives one turn. If the objective is durable enough, the agent can keep making progress, validate its own steps, and come back to you with a result instead of a half-finished thought.\nWhat /goal is OpenAI describes /goal as an experimental Codex CLI feature for tasks that need Codex to keep working across turns toward a verifiable stopping condition.\nThat wording is surprisingly helpful. 
It says a good goal is not just “something to do.” It is a task with enough structure that the agent can keep moving without being steered every minute.\nI read that as a contract with four parts:\none objective\none validation loop\none stopping condition\none sensible boundary\nIn other words, it works best when the work is bigger than a normal prompt, but smaller than an open-ended backlog.\nWhere it fits well The official examples line up with what I would naturally reach for:\ncode migrations\nlarge refactors\ndeployment retry loops\nexperiments and prototypes\ngames or side projects\nprompt optimization against an eval suite\nWhat these have in common is not just that they are long-running. It is that progress can be checked.\nThat is the key distinction for me. /goal is useful when the agent can move in checkpoints and prove that each step is still pointed at the same end state.\nHow I would write a good /goal A good goal should say more than “do this.” It should tell the agent what to preserve, how to validate, and when to stop.\nThe official starter pattern is:\n/goal Complete [objective] without stopping until [verifiable end state].\nI would usually expand that a little in real work:\n/goal Migrate this feature from [legacy stack] to [target stack]. Keep behavior identical, run the relevant tests after each checkpoint, and stop only when the new path passes the validation suite and the rollback path still works.\nIf I were using it for a prototype, I would make the success condition even more concrete:\n/goal Implement the first usable version of [project]. Keep the scope small, document the checkpoints, verify the output after each step, and stop only when the app builds, launches, and matches the expected behavior. 
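To make the four-part contract concrete for myself, I sketched it as a plain control loop. This is a mental model in Python, not how Codex implements /goal; step, validate, and max_steps are invented names for the objective work, the validation loop, and the boundary.

```python
def run_goal(step, validate, max_steps=20):
    # Mental-model sketch of the /goal contract, not Codex internals.
    # step():     make one checkpoint of progress toward the objective
    # validate(): the verifiable end state, i.e. the stopping condition
    # max_steps:  the sensible boundary on the run
    log = []
    for i in range(max_steps):
        log.append(step())           # one checkpoint of progress
        if validate():               # stop only at the verified end state
            return {'done': True, 'steps': i + 1, 'log': log}
    return {'done': False, 'steps': max_steps, 'log': log}
```

If the end state cannot be expressed as a check like validate(), the loop has no reason to stop, which is exactly the fuzziness a written checklist is meant to catch before the run starts.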
My checklist is usually:\none objective\none stopping condition\na clear set of files or docs to read first\ncommands or artifacts that prove progress\na short progress log\na way to pause, resume, or clear the run\nIf I cannot write those down, the goal is probably too fuzzy.\nWhere to start A practical way to begin is to ask ChatGPT for help: instruct the model to\nsearch and read the official materials about /goal\nprovide your task description to the model\nhave it draft a prompt for /goal to complete your task\nWhat to watch out for This is where I think the real value is.\nThe biggest mistake is to use /goal like a loose backlog bucket. A durable objective is not the same thing as a list of unrelated tasks. If the work has no single success condition, the agent will drift.\nThe second mistake is to over-constrain the work. If the goal is so narrow that it only describes one exact path, you lose the benefit of having an agent explore and recover from small failures.\nThe third mistake is to forget that independence is not the same as permission. Even when the agent can work for a long time, I still want human judgment on scope, risk, and final acceptance.\nFor higher-risk work, I would be conservative. A good /goal should reduce steering overhead, not replace review.\nMy takeaway I like /goal because it makes the agreement explicit.\nThe best use case is not “let the model run forever.” It is “give the model a durable objective with a verifiable end state, and let it work in a loop until it gets there.”\nThat is a much better fit for agentic work than a stream of small commands.\nIf I can state the objective, the boundary, the validation, and the stopping condition in one paragraph, /goal is probably doing something useful. 
If I cannot, I should probably clarify the task first.\nReferences OpenAI Codex use case: Follow a goal — https://developers.openai.com/codex/use-cases/follow-goals OpenAI: Introducing ChatGPT agent — https://openai.com/index/introducing-chatgpt-agent/ OpenAI: Practices for governing agentic AI systems — https://openai.com/index/practices-for-governing-agentic-ai-systems/ ","permalink":"https://shin13.github.io/notes/following-a-goal-with-codex/","summary":"\u003cp\u003eI have been looking for a clean way to explain what \u003ccode\u003e/goal\u003c/code\u003e really does in Codex.\u003c/p\u003e\n\u003cp\u003eThe most useful mental model I found is simple: \u003ccode\u003e/goal\u003c/code\u003e is not a prettier prompt. It is a working contract for long-running agent work. You are telling the agent what success looks like, what the boundary is, and how to know when to stop.\u003c/p\u003e\n\u003cp\u003eThat framing matters because the feature is built for work that outlives one turn. If the objective is durable enough, the agent can keep making progress, validate its own steps, and come back to you with a result instead of a half-finished thought.\u003c/p\u003e","title":"[Dev] Following a Goal with Codex (/goal)"},{"content":"I recently read Matt Pocock’s article, “5 Agent Skills I Use Every Day”. It resonated with my experience using coding agents such as Claude Sonnet and Claude Opus.\nThe article gave me a clearer language for something I have been feeling: good agent work depends on good engineering process. We need better questions, written context, small slices, tests, and codebases that agents can understand.\nWho is Matt Pocock? Matt Pocock is an educator, content creator, and engineer. Many developers know him as the creator of Total TypeScript, a former Vercel developer advocate, and a former XState core team member.\nHe now teaches AI engineering through AI Hero. 
I appreciate his consistent message: software fundamentals become more valuable when agents can produce code quickly.\nThank you to Matt for sharing these practical resources so generously.\nThe idea: skills turn taste into process Matt describes coding agents as a “fleet of middling to good engineers” with one major weakness: they have no memory.\nThat framing feels honest. If agents forget context, then our process must carry the context. Skills give the agent a path to follow: clarify, document, slice, test, and improve the architecture.\nI started to wonder.\nIf an agent mirrors the quality of the process around it, what kind of engineer am I teaching it to become?\ngrill-me: ask before building grill-me asks the agent to interview the user relentlessly until the plan is clear. It walks the design tree one decision at a time.\nThis is one of my favorite skills. It turns Claude from a fast implementation machine into a thoughtful partner. It helps reveal hidden assumptions, missing constraints, and premature decisions.\nI especially enjoy this because I like discussing and designing beautiful systems. Sometimes the most valuable output from an agent is a better question.\nto-prd: make understanding explicit to-prd turns resolved context into a Product Requirements Document.\nThe value is the shared artifact. A PRD records the problem, user stories, implementation decisions, testing direction, and out-of-scope items. It gives both the human and the agent a stable reference point.\nFor healthcare AI and internal tools, this matters. Requirements often touch workflow, safety, traceability, evaluation, and deployment. A PRD keeps those layers visible.\nto-issues: build in vertical slices to-issues breaks a PRD into independently actionable GitHub issues.\nThe key idea is vertical slicing. Each issue should create a small, verifiable behavior across the system. 
This works better than splitting work into disconnected layers such as schema, API, UI, and tests.\nThe “tracer bullet” metaphor is useful. A small end-to-end path through the system teaches us more than a large unfinished layer.\ntdd: give the agent a real feedback loop tdd guides the agent through red, green, refactor.\nThis improves agent output because tests anchor the work in observable behavior. Good tests describe the public behavior of the system. They survive refactors and reduce hallucinated implementation paths.\nThe rhythm is simple:\nConfirm one behavior. Write a failing test. Implement the smallest change. Run the test. Refactor after green. improve-codebase-architecture: make the system legible improve-codebase-architecture feels especially powerful to me.\nAgents inherit the shape of the codebase. Clear names, deep modules, stable interfaces, and coherent boundaries help them reason. Scattered concepts and shallow abstractions make them wander.\nWhen I use Claude Sonnet or Opus with this skill, the conversation becomes deeper. The agent starts noticing coupling, boundaries, test seams, and module shape. It brings out more treasure from the codebase.\nThis matches a first-principles view: intelligence needs a structure to act on. A beautiful system gives both humans and agents fewer things to hold in working memory.\nWhat I learned These skills did not simply make the agent “smarter.” They made the conversation more disciplined.\nThe agent asked better questions. It wrote with more context. It decomposed work more carefully. It treated tests as part of design. It saw architecture as part of the agent workflow.\nThat is the deeper lesson for me. 
Agentic coding is a collaborative system: human judgment, written process, tests, and architecture working together.\nResources Matt’s article: 5 Agent Skills I Use Every Day AI Hero: https://www.aihero.dev/ Matt Pocock’s personal site: https://www.mattpocock.com/ Skills collection: AI Skills for Real Engineers GitHub repository: mattpocock/skills YouTube Full Walkthrough: Workflow for AI Coding — Matt Pocock “Software Fundamentals Matter More Than Ever” — Matt Pocock ","permalink":"https://shin13.github.io/notes/learning-from-matt-pocock-agent-skills/","summary":"\u003cp\u003eI recently read Matt Pocock’s article, \u003ca href=\"https://www.aihero.dev/5-agent-skills-i-use-every-day\"\u003e“5 Agent Skills I Use Every Day”\u003c/a\u003e. It resonated with my experience using coding agents such as Claude Sonnet and Claude Opus.\u003c/p\u003e\n\u003cp\u003eThe article gave me a clearer language for something I have been feeling: good agent work depends on good engineering process. We need better questions, written context, small slices, tests, and codebases that agents can understand.\u003c/p\u003e","title":"[Dev] Learning from Matt Pocock’s Agent Skills"},{"content":"I’ve been using Streamlit for quick internal tools and dashboards, but a colleague introduced me to Reflex, so I’m trying it out as another way to build Python web apps.\nWhat caught my attention is that Reflex is a full-stack Python framework for building web apps with UI, state, backend logic, data models, and deployment in one codebase. This is especially suitable for Python backend developers who seek to build more scalable and production-ready web apps.\nWhat Reflex is Reflex describes itself as an open-source Python framework for building full-stack web apps in pure Python. 
The docs highlight a few things that make it interesting:\nYou build the UI in Python\nApp state and event handlers live in Python\nThe framework supports backend logic and database integration\nYou can run locally and deploy from the same workflow\nIn practice, that makes Reflex feel closer to a Python-first app framework than a lightweight plotting/dashboard layer.\nQuick start for developers According to the Reflex docs, the simplest local setup is:\nmkdir my-app\ncd my-app\nuv init\nuv add reflex\nuv run reflex init\nuv run reflex run\nA few notes:\nReflex recommends Python 3.10+\nuv is the preferred project and package manager in the docs\nreflex init creates a new app in the current directory\nreflex run starts the app in development mode with hot reload\nIf you want logs while debugging, you can run:\nuv run reflex run --loglevel debug\nA tiny mental model If you’re coming from Streamlit, the easiest way to think about Reflex is:\nStreamlit: very fast for data apps and internal dashboards\nReflex: more structured for building app-like experiences with state, routing, and a larger component model\nThat means Reflex may take a bit more setup, but it may also scale better when the app starts feeling less like a notebook and more like a product.\nMinimal example shape A Reflex app usually has:\na state class\nUI components\nevent handlers that update state\na page layout built from components\nThat separation is useful if you want your app to grow beyond a single script.\nA simple beginner path is:\nInitialize the app with reflex init\nOpen the generated project structure\nFind the page and state files\nChange one text label\nAdd one button and one event handler\nRun reflex run and watch hot reload\nThat small loop is enough to understand the framework’s core model quickly.\nUseful links Reflex docs home: https://reflex.dev/docs/ Reflex docs index for LLMs: https://reflex.dev/docs/llms.txt Installation: https://reflex.dev/docs/getting-started/installation/ Introduction: 
https://reflex.dev/docs/getting-started/introduction/ CLI reference: https://reflex.dev/docs/api-reference/cli/ Component library: https://reflex.dev/docs/library/ State overview: https://reflex.dev/docs/state/overview/ GitHub repo: https://github.com/reflex-dev/reflex Useful things to remember reflex init creates the app scaffold reflex run is the main local dev command reflex deploy is the deployment path in the CLI uv run keeps the workflow isolated and reproducible The docs also provide markdown-friendly pages for AI tools through llms.txt My first impression Reflex looks promising if I want to build something more app-like than a Streamlit script, especially when I care about state, structure, and a cleaner separation between UI and logic.\nI’m still early in the learning curve, but this seems worth a serious try for projects that may outgrow a quick dashboard.\nConclusion For now, I’d describe Reflex as:\na good fit for Python-first product prototypes a fuller app framework than Streamlit a candidate for more structured internal tools and workflows I’ll keep experimenting with it and see where it fits best.\n","permalink":"https://shin13.github.io/notes/trying-reflex-python-for-web-apps/","summary":"\u003cp\u003eI’ve been using Streamlit for quick internal tools and dashboards, but a colleague introduced me to Reflex, so I’m trying it out as another way to build Python web apps.\u003c/p\u003e\n\u003cp\u003eWhat caught my attention is that Reflex is a full-stack Python framework for building web apps with UI, state, backend logic, data models, and deployment in one codebase. 
This is especially suitable for Python backend developers who seek to build more scalable and production-ready web apps.\u003c/p\u003e","title":"[Dev] Trying Reflex (Python) for Web Apps"},{"content":"Paper Overview Title: Optimizing Order Sets With a Large Language Model–Powered Multiagent System\nAuthors: Liu S, Huang SS, McCoy AB, Wright AP, Horst S, Wright A\nJournal: JAMA Network Open\nYear: 2025\nDOI: https://doi.org/10.1001/jamanetworkopen.2025.33277\nWhy This Paper? I read this paper because it sits at the intersection of clinical pharmacy, healthcare workflow, and practical AI systems.\nRelevant to clinical decision support and order-set maintenance Uses a multiagent LLM design instead of a single-model prompt Shows the gap between factual correctness and actual clinical usefulness Offers a good example of expert alignment in a high-stakes domain This article is a cleaned-up conversion of my original blog post into the site’s Notes format.\nKey Findings Main Contributions The authors built a five-agent system for reviewing and improving hospital order sets. The system combined retrieval-augmented generation with domain-specific verification and summarization roles. A small set of physician-rated examples improved the judge model’s alignment with expert judgment. Methodology Highlights Approach: LLM-powered multiagent workflow with retrieval, critique, verification, and summarization Data: Hospital order sets plus internal and external medical knowledge sources Novel Aspects: The system was designed to mimic how an expert team would distribute work rather than relying on a single general-purpose model Selected Figures Figure 1. Overview of the multiagent system architecture and evaluation workflow This is the best high-level figure in the paper. It shows the five-agent workflow and the two evaluation phases, so it immediately explains how the system was built and assessed.\nFigure 2. 
LLM-as-a-judge alignment and customized filter This figure is important because it shows the calibration step: the authors did not stop at raw LLM scoring, but used physician feedback to align a usefulness filter with expert preferences.\nFigure 3. Physician ratings of AI-generated suggestions This chart captures the paper’s core message well: suggestions can look accurate while still being less useful or feasible in practice.\nMy Takeaways Immediately Applicable A technically correct suggestion is not necessarily useful in a real workflow. Small amounts of high-quality expert feedback can meaningfully improve an AI judge. In healthcare, context matters as much as correctness. LLM systems are often best framed as support layers, not replacements for expert review. Future Exploration Compare multiagent and single-agent approaches for order-set review Evaluate whether local workflow alignment improves clinical adoption Study how much expert calibration is enough before diminishing returns Questions \u0026 Critiques Questions Raised How generalizable is this setup across institutions with different workflows and knowledge bases? What is the best way to measure usefulness beyond physician ratings? 
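The calibration idea behind Figure 2 can be illustrated with a toy version: sweep a cutoff over the judge’s usefulness scores and keep the one that agrees best with a handful of physician labels. The scores, labels, and simple accuracy objective here are my invention for illustration, not the paper’s actual filter.

```python
def calibrate_cutoff(judge_scores, physician_labels):
    # Toy expert-alignment step: pick the usefulness cutoff on LLM-judge
    # scores that maximizes agreement with physician (expert) labels.
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(judge_scores)):
        preds = [score >= cut for score in judge_scores]
        acc = sum(p == lab for p, lab in zip(preds, physician_labels)) / len(preds)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc
```

Even this toy version shows the paper’s point: a few high-quality expert labels are enough to move a generic judge toward local clinical preferences.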
Potential Limitations Single-center context may limit generalizability Useful suggestions can still be missed if the workflow is too local or vague The study emphasizes practical review efficiency more than end-to-end patient outcome impact Implementation Ideas For Current Projects Project: clinical AI decision support workflows Application: use the paper’s expert-alignment idea for filtering or ranking recommendations Timeline: when reviewing retrieval or recommendation pipelines New Project Possibilities A lightweight expert-alignment layer for clinical suggestion ranking A workflow for turning raw AI suggestions into reviewable, context-aware recommendations Related Work Papers to Read Next A framework for human evaluation of large language models in healthcare Evaluation of generative large language models in stroke care Connections to Previous Reading Connects to other research notes on healthcare LLM evaluation, RAG systems, and human review workflows Rating \u0026 Recommendation My Rating: ⭐⭐⭐⭐☆\nRecommend for:\nHealthcare professionals working on clinical decision support Researchers studying LLM evaluation in medicine Engineers building RAG or multiagent workflows Anyone interested in expert alignment for high-stakes AI Time Investment: A few hours to read, extract, and rewrite into note form\nReference Liu S, Huang SS, McCoy AB, Wright AP, Horst S, Wright A. Optimizing Order Sets With a Large Language Model–Powered Multiagent System. JAMA Network Open. 2025;8(9):e2533277. 
DOI: https://doi.org/10.1001/jamanetworkopen.2025.33277 ","permalink":"https://shin13.github.io/notes/optimizing-order-sets-with-large-language-model-powered-multiagent-system/","summary":"\u003ch2 id=\"paper-overview\"\u003ePaper Overview\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTitle:\u003c/strong\u003e Optimizing Order Sets With a Large Language Model–Powered Multiagent System\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors:\u003c/strong\u003e Liu S, Huang SS, McCoy AB, Wright AP, Horst S, Wright A\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJournal:\u003c/strong\u003e JAMA Network Open\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eYear:\u003c/strong\u003e 2025\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDOI:\u003c/strong\u003e \u003ca href=\"https://doi.org/10.1001/jamanetworkopen.2025.33277\"\u003ehttps://doi.org/10.1001/jamanetworkopen.2025.33277\u003c/a\u003e\u003c/p\u003e\n\u003ch2 id=\"why-this-paper\"\u003eWhy This Paper?\u003c/h2\u003e\n\u003cp\u003eI read this paper because it sits at the intersection of clinical pharmacy, healthcare workflow, and practical AI systems.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eRelevant to clinical decision support and order-set maintenance\u003c/li\u003e\n\u003cli\u003eUses a multiagent LLM design instead of a single-model prompt\u003c/li\u003e\n\u003cli\u003eShows the gap between factual correctness and actual clinical usefulness\u003c/li\u003e\n\u003cli\u003eOffers a good example of expert alignment in a high-stakes domain\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis article is a cleaned-up conversion of my original blog post into the site’s Notes format.\u003c/p\u003e","title":"[Research] Optimizing Order Sets With a Large Language Model–Powered Multiagent System"},{"content":"Paper Overview Title: A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review\nAuthors: Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. 
Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang\nJournal/Conference: npj Digital Medicine\nYear: 2024\nDOI/Link: https://doi.org/10.1038/s41746-024-01258-7\nThis scoping review analyzes 142 studies of human evaluation for healthcare LLMs and argues that current practice is inconsistent, under-specified, and often too weak for high-risk clinical use cases.\nSelected Figures Figure 1. Healthcare applications of LLMs This figure shows where human evaluation has been used most often: clinical decision support, medical education, patient education, and question answering.\nFigure 7. QUEST human evaluation framework This is the most important figure in the paper because it turns the review findings into a practical evaluation workflow.\nFigure 9. PRISMA flow diagram This figure summarizes the literature search and screening process behind the 142 included studies.\nWhy This Paper? I read this because it directly addresses a problem that keeps coming up in medical AI:\nwe want LLMs to be useful in healthcare, but automatic metrics are not enough, and human review is often done in a way that is not rigorous enough. This is especially relevant for clinical decision support, where the risk is high and the cost of weak evaluation is much larger than in a normal consumer app.\nKey Findings Main Contributions The authors performed a scoping review of 142 studies on human evaluation of healthcare LLMs. They found major methodological gaps: limited blinding, inconsistent comparison baselines, and small evaluator counts in high-risk settings. They proposed QUEST, a structured framework for more standardized human evaluation in healthcare. 
Methodology Highlights Approach: Scoping review following PRISMA-ScR style reporting Coverage: English peer-reviewed studies from 2018 to 2024 Novel Aspects: The review does not just criticize current practice; it turns the findings into an actionable evaluation framework My Takeaways Immediately Applicable High-risk clinical AI should not be validated with tiny expert panels and loose review criteria. Blinded human evaluation matters more than current practice reflects. If a system is meant to support real clinicians, the evaluation must include usefulness, not only correctness. Human evaluation should be planned like a protocol, not treated as an informal afterthought. Future Exploration Use QUEST-like dimensions when designing future evaluation rubrics Compare how much expert agreement can be achieved with better reviewer instructions Build lighter-weight but still rigorous evaluation workflows for local clinical AI projects Questions \u0026 Critiques Questions Raised How much of QUEST is directly transferable across institutions with very different workflows? What is the minimum reviewer set that still produces trustworthy results in practice? 
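The minimum-reviewer question is partly an agreement question. A minimal way to quantify it, assuming two blinded reviewers rating the same outputs, is Cohen’s kappa; the paper surveys evaluation practice broadly and does not prescribe this exact statistic, so treat this as one common choice rather than the QUEST method.

```python
def cohens_kappa(rater_a, rater_b):
    # Chance-corrected agreement between two raters over the same items.
    # Undefined when expected agreement is 1.0 (both raters constant).
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)
```

A low kappa on a pilot batch is a signal to sharpen the rubric or reviewer instructions before scaling up the panel, which is cheaper than adding more reviewers.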
Potential Limitations The framework is broad and may still need local adaptation Evaluation quality is constrained by available experts and time A good framework does not automatically solve the labor cost of human review Implementation Ideas For Current Projects Project: clinical LLM evaluation workflow Application: use QUEST categories to structure a review form Timeline: before running the next internal model review New Project Possibilities A lightweight rubric generator for healthcare LLM review A reviewer training checklist for human evaluation in clinical AI Related Work Papers to Read Next Evaluation of generative large language models in stroke care Papers on human evaluation methods for clinical LLMs and medical QA systems Connections to Previous Reading Connects well with other notes on healthcare LLM evaluation, RAG systems, and clinical decision support Rating \u0026 Recommendation My Rating: ⭐⭐⭐⭐⭐\nRecommend for:\nHealthcare AI researchers Clinical informatics teams Engineers building medical LLM systems Anyone designing human evaluation workflows for high-stakes AI Time Investment: A solid paper review session, plus extra time to think through how to apply the framework locally\nReference The blog post summarizes a literature review on human evaluation of large language models in healthcare and introduces the QUEST framework. 
Source article: https://www.nature.com/articles/s41746-024-01258-7 DOI: https://doi.org/10.1038/s41746-024-01258-7 ","permalink":"https://shin13.github.io/notes/framework-for-human-evaluation-of-large-language-models-in-healthcare-derived-from-literature-review/","summary":"\u003ch2 id=\"paper-overview\"\u003ePaper Overview\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTitle:\u003c/strong\u003e A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors:\u003c/strong\u003e Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJournal/Conference:\u003c/strong\u003e npj Digital Medicine\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eYear:\u003c/strong\u003e 2024\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDOI/Link:\u003c/strong\u003e \u003ca href=\"https://doi.org/10.1038/s41746-024-01258-7\"\u003ehttps://doi.org/10.1038/s41746-024-01258-7\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eThis scoping review analyzes 142 studies of human evaluation for healthcare LLMs and argues that current practice is inconsistent, under-specified, and often too weak for high-risk clinical use cases.\u003c/p\u003e\n\u003ch2 id=\"selected-figures\"\u003eSelected Figures\u003c/h2\u003e\n\u003ch3 id=\"figure-1-healthcare-applications-of-llms\"\u003eFigure 1. Healthcare applications of LLMs\u003c/h3\u003e\n\u003cp\u003e\u003cimg alt=\"Fig. 
1: Healthcare applications of LLMs.\" loading=\"lazy\" src=\"https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig1_HTML.png\"\u003e\u003c/p\u003e\n\u003cp\u003eThis figure shows where human evaluation has been used most often: clinical decision support, medical education, patient education, and question answering.\u003c/p\u003e\n\u003ch3 id=\"figure-7-quest-human-evaluation-framework\"\u003eFigure 7. QUEST human evaluation framework\u003c/h3\u003e\n\u003cp\u003e\u003cimg alt=\"Fig. 7: The proposed QUEST human evaluation framework, delineating the multi-stage process for evaluating healthcare-related LLMs.\" loading=\"lazy\" src=\"https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig7_HTML.png\"\u003e\u003c/p\u003e\n\u003cp\u003eThis is the most important figure in the paper because it turns the review findings into a practical evaluation workflow.\u003c/p\u003e\n\u003ch3 id=\"figure-9-prisma-flow-diagram\"\u003eFigure 9. PRISMA flow diagram\u003c/h3\u003e\n\u003cp\u003e\u003cimg alt=\"Fig. 
9: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the article screening and identification process.\" loading=\"lazy\" src=\"https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-024-01258-7/MediaObjects/41746_2024_1258_Fig9_HTML.png\"\u003e\u003c/p\u003e\n\u003cp\u003eThis figure summarizes the literature search and screening process behind the 142 included studies.\u003c/p\u003e","title":"[Research] A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review"},{"content":"This article offers a sample of basic Markdown syntax that can be used in Hugo content files.\nBasic Syntax Headings # Heading 1 ## Heading 2 ### Heading 3 #### Heading 4 ##### Heading 5 ###### Heading 6 Heading 2 Heading 3 Heading 4 Heading 5 Heading 6 Emphasis *This text will be italic* _This will also be italic_ **This text will be bold** __This will also be bold__ _You **can** combine them_ This text will be italic\nThis will also be italic\nThis text will be bold\nThis will also be bold\nYou can combine them\nLists Unordered * Item 1 * Item 2 * Item 2a * Item 2b Item 1 Item 2 Item 2a Item 2b Ordered 1. Item 1 2. Item 2 3. Item 3 1. Item 3a 2. Item 3b Images ![GitHub Logo](https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png) Links [Hugo](https://gohugo.io) Hugo\nBlockquotes As Newton said: \u0026gt; If I have seen further it is by standing on the shoulders of Giants. If I have seen further it is by standing on the shoulders of Giants.\nInline Code Inline `code` has `back-ticks around` it. 
Inline code has back-ticks around it.\nCode Blocks Syntax Highlighting ```go func main() { fmt.Println(\u0026#34;Hello World\u0026#34;) } ``` func main() { fmt.Println(\u0026#34;Hello World\u0026#34;) } Tables | Syntax | Description | | --------- | ----------- | | Header | Title | | Paragraph | Text | Syntax Description Header Title Paragraph Text References Markdown Syntax Hugo Markdown ","permalink":"https://shin13.github.io/notes/markdown/","summary":"\u003cp\u003eThis article offers a sample of basic Markdown syntax that can be used in Hugo content files.\u003c/p\u003e","title":"Markdown Syntax Guide"},{"content":"Hit the ground running This is my first post.\nAdd my first blog post to the website Plan out the next few steps to track progress and stay focused Next steps for building this site:\nCreate an About page Create a Now page Create a Project page Add screenshots to assets for project previews Change Avatar / Logo Remove zh-CN Add zh-TW documents Add menu.main About Now Project Blog (optional) Resume Add GitHub link (optional) Add LinkedIn link ","permalink":"https://shin13.github.io/notes/first-post/","summary":"\u003ch2 id=\"hit-the-ground-running\"\u003eHit the ground running\u003c/h2\u003e\n\u003cp\u003eThis is my first post.\u003c/p\u003e","title":"First Post"},{"content":"I’m Shin Li, a pharmacist, engineer, healthcare AI researcher, educator, and lifelong learner based in Taipei.\nI spend much of my time at the edges between domains: clinical pharmacy and software, healthcare workflows and AI systems, research and teaching, structure and creativity. I like making complex things easier to understand, and I care about tools that are not only technically interesting, but also useful in real clinical and human contexts.\nThreads in my life Healthcare and pharmacy My background is in pharmacy and clinical practice. 
That experience shapes how I think about healthcare technology: real workflows are messy, context matters, and good tools should respect professional judgment rather than replace it.\nAI, data, and systems I work with LLMs, retrieval-augmented generation, healthcare data standards, documentation automation, and evaluation methods for medical AI. I’m especially interested in systems that connect clinical knowledge with reliable, testable, and practical workflows.\nResearch and knowledge work I read, write, summarize, and organize ideas as a way to think. Research papers, technical documentation, project notes, and teaching materials all become part of the same larger practice: turning scattered information into clearer understanding.\nTeaching and explaining I enjoy helping people learn difficult things. Colleagues have often described me as a good explainer, and I see teaching as one of the best ways to test whether I truly understand something.\nLife outside work I also care about music, reading, exercise, personal knowledge systems, and living with more clarity and less noise. This website keeps space for those parts too.\nContact Email: soobahorn@gmail.com LinkedIn: linkedin.com/in/shin-li GitHub: github.com/shin13 ","permalink":"https://shin13.github.io/about/","summary":"\u003cp\u003eI’m Shin Li, a pharmacist, engineer, healthcare AI researcher, educator, and lifelong learner based in Taipei.\u003c/p\u003e\n\u003cp\u003eI spend much of my time at the edges between domains: clinical pharmacy and software, healthcare workflows and AI systems, research and teaching, structure and creativity. I like making complex things easier to understand, and I care about tools that are not only technically interesting, but also useful in real clinical and human contexts.\u003c/p\u003e","title":"About"},{"content":" What I’m doing now, what I’m paying attention to, and what is shaping my days. 
The /now page is part of a movement started by Derek Sivers and Gregory Brown, encouraging people to keep a simple page about their current focus.\nWork / Research Building and studying clinically grounded AI systems for healthcare. Working around LLMs, retrieval-augmented generation, medical AI evaluation, FHIR, and clinical documentation workflows. Thinking about how AI tools can support clinical work without flattening clinical judgment. Learning Deepening my understanding of AI agents, healthcare data standards, evaluation methods, and knowledge systems. Reading papers and translating what I learn into notes, talks, and practical experiments. Continuing to improve how I explain difficult ideas across clinical and technical communities. Building Small systems for organizing knowledge, tasks, research notes, and daily review. Healthcare-related tools and prototypes that connect clinical knowledge with usable software. This website as a calmer personal home for notes, projects, and current questions. Life Practicing music, especially French horn, as a long-term craft outside work. Keeping exercise, reading, reflection, and simple routines as anchors. Looking for ways to live with more clarity, usefulness, and spaciousness. Questions I keep returning to How can AI tools make healthcare work safer, clearer, and more humane? What kinds of personal systems actually help people think and live better? How can I move between pharmacy, software, research, teaching, and life without reducing myself to only one identity? 
Last updated: May 2026.\n","permalink":"https://shin13.github.io/now/","summary":"\u003cdiv class=\"hx-mt-4\"\u003e\u003c/div\u003e\n\u003cp class=\"hx-mb-12 hx-text-center hx-text-lg hx-text-gray-500 dark:hx-text-gray-400\"\u003e\nWhat I’m doing now, what I’m paying attention to, and what is shaping my days.\n\u003c/p\u003e\n\u003cp\u003eThe \u003ccode\u003e/now\u003c/code\u003e page is part of a movement started by \u003ca href=\"https://sive.rs/now\"\u003eDerek Sivers\u003c/a\u003e and \u003ca href=\"https://nownownow.com/about\"\u003eGregory Brown\u003c/a\u003e, encouraging people to keep a simple page about their current focus.\u003c/p\u003e\n\u003ch2 id=\"work--research\"\u003eWork / Research\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eBuilding and studying clinically grounded AI systems for healthcare.\u003c/li\u003e\n\u003cli\u003eWorking around LLMs, retrieval-augmented generation, medical AI evaluation, FHIR, and clinical documentation workflows.\u003c/li\u003e\n\u003cli\u003eThinking about how AI tools can support clinical work without flattening clinical judgment.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"learning\"\u003eLearning\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eDeepening my understanding of AI agents, healthcare data standards, evaluation methods, and knowledge systems.\u003c/li\u003e\n\u003cli\u003eReading papers and translating what I learn into notes, talks, and practical experiments.\u003c/li\u003e\n\u003cli\u003eContinuing to improve how I explain difficult ideas across clinical and technical communities.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"building\"\u003eBuilding\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eSmall systems for organizing knowledge, tasks, research notes, and daily review.\u003c/li\u003e\n\u003cli\u003eHealthcare-related tools and prototypes that connect clinical knowledge with usable software.\u003c/li\u003e\n\u003cli\u003eThis website as a calmer personal home for notes, projects, and current 
questions.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"life\"\u003eLife\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003ePracticing music, especially French horn, as a long-term craft outside work.\u003c/li\u003e\n\u003cli\u003eKeeping exercise, reading, reflection, and simple routines as anchors.\u003c/li\u003e\n\u003cli\u003eLooking for ways to live with more clarity, usefulness, and spaciousness.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"questions-i-keep-returning-to\"\u003eQuestions I keep returning to\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eHow can AI tools make healthcare work safer, clearer, and more humane?\u003c/li\u003e\n\u003cli\u003eWhat kinds of personal systems actually help people think and live better?\u003c/li\u003e\n\u003cli\u003eHow can I move between pharmacy, software, research, teaching, and life without reducing myself to only one identity?\u003c/li\u003e\n\u003c/ul\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cem\u003eLast updated: May 2026.\u003c/em\u003e\u003c/p\u003e","title":"Now"},{"content":"These products and projects include public tools, research traces, and small systems that helped me understand a problem better.\nThey are a map of recurring interests: healthcare, AI, knowledge systems, teaching, and the practical work of making complex things clearer.\nHealthcare knowledge systems Work shaped by pharmacy practice, medication knowledge, and the need for clear clinical information at the right moment.\nHospital Formulary A digital formulary knowledge base created from hospital pharmacy practice.\nWhy it exists\nI wanted medication information to be easier to search, maintain, and explain in daily clinical work.\nWhat it connects\nClinical pharmacy, knowledge organization, documentation, and healthcare workflow design.\nWhat I learned\nGood information systems are not only about storing facts; they are about reducing friction at the moment of decision.\nResearch and presentations Research traces, posters, 
talks, and other artifacts from trying to understand healthcare problems more clearly.\nScientific Poster at ASHP 2023 Effect of an improved antimicrobial stewardship program at a regional hospital in Taiwan.\nAI and automation experiments Small systems and prototypes around healthcare AI, retrieval, documentation, and workflow automation. Some are public; many are still private, internal, or evolving.\nMedical AI / RAG experiments — exploring how retrieval and language models can support clinical knowledge work without replacing clinical judgment. Documentation automation — using software to reduce repetitive writing and make healthcare documentation clearer. Paper discovery and review workflows — systems for finding, reading, and preparing research for journal clubs and presentations. Agentic workflows — experimenting with AI agents as helpers for research, coding, note review, and task orchestration. Learning and personal systems Systems I use to think, remember, review, and decide what deserves attention.\nObsidian knowledge system — notes, daily logs, research fragments, project planning, and periodic review. Daily and weekly review routines — small practices for turning scattered tasks and thoughts into clearer next actions. Reading and paper notes — a growing habit of translating what I read into reusable understanding. Website as a personal home — this site itself is also a project: a quieter place to gather what I’m learning, building, and becoming. Teaching and community Projects are not always software. Some are formats for sharing, explaining, and learning with other people.\nClinical teaching — helping learners connect pharmacy knowledge with real patient-care decisions. AI and healthcare education — preparing explanations, workshops, and journal club materials for mixed clinical and technical audiences. Book clubs and study groups — creating spaces where people can read, ask better questions, and learn together. 
Status This page is intentionally incomplete. I expect it to change as old projects become clearer, private experiments become shareable, and new questions start to matter more.\n","permalink":"https://shin13.github.io/projects/","summary":"\u003cp\u003eThese products and projects include public tools, research traces, and small systems that helped me understand a problem better.\u003c/p\u003e\n\u003cp\u003eThey are a map of recurring interests: healthcare, AI, knowledge systems, teaching, and the practical work of making complex things clearer.\u003c/p\u003e\n\u003ch2 id=\"healthcare-knowledge-systems\"\u003eHealthcare knowledge systems\u003c/h2\u003e\n\u003cp\u003eWork shaped by pharmacy practice, medication knowledge, and the need for clear clinical information at the right moment.\u003c/p\u003e\n\u003ch3 id=\"hospital-formulary\"\u003e\u003ca href=\"https://shin13.gitbook.io/formulary\"\u003eHospital Formulary\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eA digital formulary knowledge base created from hospital pharmacy practice.\u003c/p\u003e","title":"Projects"}]