Link to Source: arXiv Preprint, Interactive Visualization Page
Authors: Seyed Amir Ahmad Safavi-Naini, Elahe Meftah, Josh Mohess, Pooya Mohammadi Kazaj, Georgios Siontis, Zahra Atf, Peter R. Lewis, Mauricio Reyes, Girish Nadkarni, Roland Wiest, Stephan Windecker, Christoph Grani, Ali Soroush, Isaac Shiri
Summary: This work introduces the Clinical World Model and Clinical AI Skill-Mix, a shared framework that organizes medical AI competency across billions of clinical contexts and reframes the field’s central question: not whether clinical AI works, but in which coordinates it has demonstrated reliability, and for whom.
Clinical AI frequently performs well on benchmarks yet degrades in deployment, a gap that reflects the absence of a shared formal model of the clinical world. This work introduces the Clinical World Model and the Clinical AI Skill-Mix, a common grammar that organizes medical AI competency across billions of distinct clinical contexts and reframes evaluation around the coordinates in which reliability has been demonstrated, and for whom.
Clinical artificial intelligence has progressed rapidly, yet a consistent gap separates benchmark performance from clinical reliability. Models achieve high scores on curated datasets and medical licensing examinations, but performance often degrades when they encounter real patients, heterogeneous equipment, and the uncertainty inherent in clinical reasoning. A systematic review of externally validated radiology models found that fewer than six percent maintained their original performance, with the area under the curve declining by approximately eight percent on external validation. Agentic architectures, which augment language models with planning, memory, and tool use, inherit this unreliability while introducing cascading risk, since an early error can propagate through sequential reasoning into an incorrect recommendation. This gap is not solely technical in origin. Existing work addresses evaluation, regulation, and system design in relative isolation, without a shared formal account of the clinical world to connect these efforts, which leaves stakeholders describing the same systems through incommensurable vocabularies.
We propose three interconnected models grounded in validated principles of clinical cognition and human factors. The Clinical World Model formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem, recovering structure that prior frameworks share implicitly rather than introducing an independent account. Parallel decision-making architectures specify how providers, patients, and AI agents transform information into action, mapping human cognitive components such as dual-process reasoning, illness scripts, and metacognitive monitoring onto their computational counterparts. The Clinical AI Skill-Mix then operationalizes competency through eight dimensions, five that characterize the clinical scenario (condition, care phase, care setting, provider role, and task) and three that specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer).
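As a minimal sketch of how a single Skill-Mix coordinate might be written down, the eight dimensions can be held in an immutable record. The class name, field names, and example values below are illustrative assumptions and are not the authors' schema or vocabulary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SkillMixCoordinate:
    """One point in the competency space (hypothetical schema)."""

    # Five dimensions that characterize the clinical scenario
    condition: str
    care_phase: str
    care_setting: str
    provider_role: str
    task: str
    # Three dimensions that specify how AI engages human reasoning
    assigned_authority: str
    agent_facing: str
    anchoring_layer: str


# Placeholder values chosen only for illustration; the source does not
# enumerate the admissible levels of any dimension.
example = SkillMixCoordinate(
    condition="community-acquired pneumonia",
    care_phase="diagnosis",
    care_setting="emergency department",
    provider_role="emergency physician",
    task="chest radiograph interpretation",
    assigned_authority="advisory",
    agent_facing="provider",
    anchoring_layer="illness script",
)

dimension_names = [f.name for f in fields(SkillMixCoordinate)]
print(dimension_names)
```

Freezing the record makes each coordinate hashable, so it can serve as a key when evidence of reliability is recorded per coordinate.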
The combinatorial product of these dimensions defines a competency space of billions of distinct coordinates, and this scale has a direct structural implication. Validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible and indicating that a single-task model, however accurate, addresses only a small fraction of the competencies required for clinical action. The framework supplies a common grammar through which clinicians, regulators, and developers can specify, evaluate, and bound a given system in consistent terms, including the points at which authority shifts as agents hand off work to one another. On this account, the central question is no longer whether clinical AI works but in which competency coordinates a system has demonstrated reliability, and for whom.
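The scale argument can be sketched numerically. The per-dimension level counts below are hypothetical placeholders, since the source does not state them. They are chosen only to show how a product over eight dimensions reaches the billions, and why evidence recorded for one coordinate does not cover a neighbouring one.

```python
from math import prod

# HYPOTHETICAL number of levels per dimension (not taken from the source).
HYPOTHETICAL_LEVELS = {
    "condition": 10_000,
    "care_phase": 5,
    "care_setting": 10,
    "provider_role": 20,
    "task": 50,
    "assigned_authority": 4,
    "agent_facing": 3,
    "anchoring_layer": 5,
}

# Size of the competency space is the product of the dimension sizes.
space_size = prod(HYPOTHETICAL_LEVELS.values())

# A coordinate is an 8-tuple in the dimension order above (placeholder values).
validated_coordinate = (
    "community-acquired pneumonia", "diagnosis", "emergency department",
    "emergency physician", "chest radiograph interpretation",
    "advisory", "provider", "illness script",
)
validated = {validated_coordinate}


def has_demonstrated_reliability(coordinate, validated_coordinates):
    """Exact-match lookup: evidence is not transferred between coordinates."""
    return coordinate in validated_coordinates


# Same scenario, but a different care setting: a distinct coordinate.
neighbour = validated_coordinate[:2] + ("rural primary care clinic",) + validated_coordinate[3:]

print(space_size)                                          # 30000000000 under these placeholder counts
print(has_demonstrated_reliability(validated_coordinate, validated))
print(has_demonstrated_reliability(neighbour, validated))
print(len(validated) / space_size)                         # fraction of the space covered
```

The exact-match lookup is a deliberately strict simplification of "minimal evidence": a real evaluation scheme might allow partial transfer between closely related coordinates, which this sketch does not model.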