In an era where artificial intelligence continues to reshape the landscape of healthcare, researchers at the Mount Sinai Health System have unveiled a transformative approach to medical coding that promises to elevate accuracy and efficiency in clinical documentation. The study, featured in the latest issue of NEJM AI, demonstrates how a nuanced adjustment in the way large language models (LLMs) assign diagnostic codes can drastically enhance their performance, rivaling and in some scenarios surpassing human coders.
Medical coding, particularly with the International Classification of Diseases (ICD) system, is a critical but painstaking process integral to patient care, billing, and healthcare analytics. Physicians in the United States dedicate significant time weekly to coding diagnoses, a task fraught with complexity given the breadth of conditions and specificity required. Despite their prowess, leading AI models such as ChatGPT have traditionally struggled to assign precise ICD codes. Such inaccuracies can lead to billing errors, compromised patient records, and inefficient clinical workflows. This new research takes a bold step toward remedying these challenges by incorporating a reflective “lookup-before-coding” mechanism into the AI’s diagnostic process.
The methodology hinges on prompting the AI models to first interpret and generate a plain-language diagnostic description based on the clinical notes. Unlike traditional AI frameworks that attempt direct code assignment, this approach enriches the model’s context by subsequently retrieving the ten most similar ICD descriptions from an extensive database containing over one million hospital records. Crucially, this retrieval is weighted by the prevalence of these diagnoses, allowing the model to sift through real-world clinical patterns before finalizing its code selection. The technique effectively combines generative AI’s reasoning with an evidence-based retrieval step, mitigating guesswork that previously plagued automated coding systems.
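The lookup step described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the word-overlap similarity measure, the log-scaled prevalence weighting, the function names, and the visit counts in the toy catalog are all assumptions made for illustration. The article does not specify how the researchers computed similarity or applied the prevalence weighting.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IcdEntry:
    code: str
    description: str
    count: int  # hypothetical number of past visits carrying this code


# Toy catalog. The codes and descriptions are ICD-10-CM entries;
# the counts are invented for this example.
CATALOG = [
    IcdEntry("R07.9", "Chest pain, unspecified", 5000),
    IcdEntry("R07.89", "Other chest pain", 900),
    IcdEntry("J18.9", "Pneumonia, unspecified organism", 1200),
    IcdEntry("I10", "Essential (primary) hypertension", 8000),
]


def tokens(text: str) -> set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,()").lower() for w in text.split() if w.strip(".,()")}


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, an assumed stand-in for the real similarity measure."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_candidates(description: str, catalog: list[IcdEntry], k: int = 10) -> list[IcdEntry]:
    """Return up to k entries ranked by similarity weighted by prevalence."""
    scored = []
    for entry in catalog:
        sim = similarity(description, entry.description)
        if sim > 0:
            # Assumed weighting: similarity scaled by log(1 + visit count).
            scored.append((sim * math.log1p(entry.count), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def build_prompt(description: str, candidates: list[IcdEntry]) -> str:
    """Assemble the second-step prompt that asks the model to choose one code."""
    lines = [f"Diagnosis description: {description}", "Candidate ICD codes:"]
    lines += [f"- {c.code}: {c.description}" for c in candidates]
    lines.append("Select the single best-matching code from the list above.")
    return "\n".join(lines)
```

In this sketch, the model's plain-language description (for example, "chest pain of unclear cause") is passed to `retrieve_candidates`, and the resulting shortlist is handed back to the model through `build_prompt`. Constraining the final choice to retrieved candidates is what keeps the model from inventing a code that does not exist.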
This dual-step process was rigorously tested on 500 anonymized Emergency Department patient visits in Mount Sinai hospitals. The researchers engaged nine separate AI models, ranging from large proprietary systems to more modest open-source architectures, to classify each patient’s primary diagnosis. The resulting codes were then evaluated blindly by practicing emergency physicians and independent AI systems to ensure unbiased appraisal of coding accuracy. The results revealed that every model enhanced by retrieval outperformed its non-retrieval counterpart. Remarkably, even smaller open-source models showed marked improvements when equipped with the capacity to cross-reference against real clinical examples.
The implications of this breakthrough are manifold. Primarily, the ability to streamline and augment physician coding can alleviate the substantial administrative burden doctors face, potentially freeing up hours every week that could be redirected toward patient care. Furthermore, hospitals could see a reduction in billing inaccuracies, a persistent issue that affects revenue cycles and reimbursement processes. Quality of medical records, the backbone of clinical decision-making and epidemiological research, could also experience significant advancement in precision and completeness.
Professor Eyal Klang, one of the study’s senior authors and a leading figure in generative AI applications at Icahn School of Medicine, highlights the importance of reflective reasoning in AI’s diagnostic journey. By granting the model an opportunity to consult similar past cases, the team observed a substantial drop in nonsensical or erroneous code assignments that previous AI systems often produced in isolation. This advance exemplifies a move away from blind automation toward intelligent augmentation where AI acts as a reliable assistant rather than a speculative coder.
Girish N. Nadkarni, co-senior author and Chair of Mount Sinai’s Windreich Department of Artificial Intelligence and Human Health, emphasizes that this innovation is not meant to phase out human oversight but to complement it. The researchers envision the retrieval-enhanced system as a supportive tool integrated into electronic health records to propose codes or flag potential mistakes before billing, ensuring both efficiency and accuracy. The system remains in clinical evaluation phases, pending approval for widespread billing applications, but early results are promising in terms of scalability and transparency.
Mount Sinai’s initiative also reflects a broader commitment to ethical and responsible AI integration in medicine. The research capitalizes on an extensive, high-quality database of patient records, reinforcing the importance of data-driven validation and continuous feedback loops in medical AI tools. Moreover, the application of retrieval-augmented models signals a paradigm shift for clinical AI: from static pattern recognition toward dynamic, context-aware reasoning supported by historical clinical evidence.
Looking forward, the research team is embedding this tool in Mount Sinai’s electronic health records system to pilot test its operational viability. Ambitions for future iterations include expanding the coding assistance beyond primary diagnoses to encompass secondary and procedural codes prevalent in diverse medical settings. The incorporation of more complex coding structures could unlock even greater efficiencies and clinical benefits across hospital departments.
This study also underscores the rising impact of AI even within resource-constrained environments, where smaller or open-source language models, when enhanced with retrieval capabilities, can achieve competitive performance. Such democratization bodes well for healthcare systems worldwide, promising cost-effective and transparent technology that does not compromise on quality.
At the heart of these advancements lies the interdisciplinary collaboration between Mount Sinai’s Windreich Department of AI and Human Health and the Hasso Plattner Institute for Digital Health. This partnership unites AI expertise, computational resources, and medical insight to drive innovative healthcare solutions. The department has a track record of leveraging machine learning for high-impact clinical tools, including award-winning applications that accelerate malnutrition diagnosis and resource allocation, demonstrating the real-world potential of cutting-edge AI.
This study epitomizes the trajectory of AI in medicine: steadily moving from theoretical promise to practical, patient-centered applications. By embedding AI in workflows, reducing administrative overhead, and enhancing data quality, the healthcare ecosystem stands to benefit profoundly—from clinicians gaining more time for patient interaction to health systems optimizing resource use and billing accuracy. Ultimately, such technologies aim to strengthen the humanistic core of medicine, empowering providers to deliver attentive, compassionate care with the support of intelligent digital allies.
Subject of Research: People
Article Title: Assessing Retrieval-Augmented Large Language Models for Medical Coding
News Publication Date: 25-Sep-2025
Web References: https://icahn.mssm.edu/about/artificial-intelligence
References: DOI: 10.1056/AIcs2401161
Keywords: Generative AI, Artificial intelligence, Medical coding, ICD codes, Retrieval-augmented models
Tags: AI in medical diagnosis coding, AI vs human coders in healthcare, diagnostic code assignment challenges, enhancing accuracy in healthcare documentation, improving clinical workflow with AI, International Classification of Diseases coding, large language models in healthcare, lookup integration in AI models, Mount Sinai Health System AI research, reducing billing errors with AI, reflective coding mechanisms in AI, transformative approaches in medical coding