Thursday, May 7, 2026
BIOENGINEER.ORG

Study Reveals AI Language Models Encounter Challenges with Basic Hospital Data Tasks

By Bioengineer
May 7, 2026
in Technology

A recent investigation into the practical capabilities of large language models (LLMs) reveals significant limitations in their use for routine administrative tasks within hospital environments. Conducted by Eyal Klang and colleagues at the Icahn School of Medicine at Mount Sinai in New York, the study critically evaluates the performance of state-of-the-art LLMs on essential number-crunching operations that healthcare administrators depend on daily. Published in PLOS Digital Health, these findings provide essential technical insights into the challenges facing AI implementation in clinical administrative workflows.

Hospitals today rely heavily on electronic health records (EHRs) – structured datasets that capture patient information, resource availability, and care events. Administrators use this data to monitor patient loads, allocate resources, and generate operational reports. Traditionally, these tasks are performed by specialized data analysts using programming languages and database queries, a process often fraught with delays when rapid answers are needed for decision-making. The promise of LLMs like GPT-4o and Llama has been to democratize data access by allowing non-technical staff to query these datasets directly using natural language prompts.

In the study, researchers subjected nine leading LLMs to a rigorous battery of tests designed to emulate two foundational administrative functions: counting how many patients meet a specific clinical condition and filtering records based on multiple inclusion criteria simultaneously. The data itself was sourced from a substantial real-world dataset of over 50,000 emergency department visits within the Mount Sinai Health System, grounding the evaluation in practical, messy clinical data rather than synthetic or simplified examples.
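To make the two task types concrete, here is a minimal sketch in plain Python of what "counting" and "multi-criteria filtering" amount to on structured visit records. The table, column names, and thresholds are invented for illustration and are not the study's actual schema:

```python
# Hypothetical miniature stand-in for an emergency-department visits table;
# column names and values are illustrative, not the study's real schema.
visits = [
    {"visit_id": 1, "age": 34, "disposition": "admitted",   "triage_level": 2},
    {"visit_id": 2, "age": 71, "disposition": "discharged", "triage_level": 3},
    {"visit_id": 3, "age": 58, "disposition": "admitted",   "triage_level": 1},
    {"visit_id": 4, "age": 22, "disposition": "discharged", "triage_level": 4},
    {"visit_id": 5, "age": 65, "disposition": "admitted",   "triage_level": 2},
]

# Task type 1: count patients meeting a specific clinical condition.
admitted = sum(1 for v in visits if v["disposition"] == "admitted")

# Task type 2: filter records on multiple inclusion criteria simultaneously.
complex_filter = [
    v for v in visits
    if v["disposition"] == "admitted"
    and v["age"] >= 65
    and v["triage_level"] <= 2
]

print(admitted)              # 3
print(len(complex_filter))   # 1
```

Trivial as these operations look in code, the study's point is that asking an LLM to perform them directly over a table rendered as text is far less reliable.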

The initial experiments employed straightforward prompting techniques, where models were simply asked direct questions such as “How many patients were admitted from this table?” Across the board, all tested LLMs demonstrated subpar accuracy, failing to provide reliable answers when handling these structured queries. This underscores a fundamental disconnect between what LLMs are trained to do and the numerical and logical operations that real-world healthcare datasets demand.

To enhance performance, the researchers explored a chain-of-thought prompting approach. This method instructs the model to transparently reason through the problem step-by-step before arriving at the final answer, theoretically enabling more accurate and consistent outputs. However, the results were underwhelming; only modest improvements were observed on smaller tables, and as the size and complexity of the data increased, accuracy declined precipitously. For instance, even GPT-4o, the best performing model under this regime, saw accuracy plummet from approximately 95% on small datasets to below 60% when confronted with larger tables.
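As an illustration of the technique (not the study's actual prompt, which is detailed in the paper's methods), a chain-of-thought prompt wraps the table and question with an explicit instruction to reason row by row before answering:

```python
# Illustrative chain-of-thought prompt construction; the wording is a
# generic example, not the prompt used in the PLOS Digital Health study.
table_text = "visit_id,disposition\n1,admitted\n2,discharged\n3,admitted"
question = "How many patients were admitted from this table?"

cot_prompt = (
    f"{table_text}\n\n{question}\n"
    "Think step by step: list each row that matches the condition, "
    "then state the final count on the last line."
)
print(cot_prompt)
```

The study found that this kind of explicit reasoning scaffold helped only on small tables; as row counts grew, the model still had to track every matching row in its own text, which is exactly where accuracy collapsed.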

Recognizing that prompting alone may not suffice, the research shifted focus to a tool-based model execution approach. Here, LLMs were tasked with generating executable code, such as SQL or Python scripts, to process the data programmatically. This method leverages the LLM’s natural language understanding to translate queries into precise machine-readable commands, which are then run directly against the EHR data for guaranteed accuracy. Impressively, this approach substantially improved results for the most advanced models. GPT-4o and Qwen-2.5-72B demonstrated near-perfect accuracy under these conditions, successfully navigating the intricacies of complex filters and large datasets.
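The tool-based route described above can be sketched as follows: the model's only job is to emit a query string, and the application executes it against the records. In this minimal sketch using Python's built-in sqlite3, the "generated" SQL is hard-coded where a real system would obtain it from an LLM, and the tiny table is invented for illustration:

```python
import sqlite3

# Toy in-memory stand-in for an EHR visits table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ed_visits (
    visit_id INTEGER, age INTEGER, disposition TEXT, triage_level INTEGER)""")
conn.executemany(
    "INSERT INTO ed_visits VALUES (?, ?, ?, ?)",
    [(1, 34, "admitted", 2), (2, 71, "discharged", 3),
     (3, 58, "admitted", 1), (4, 22, "discharged", 4),
     (5, 65, "admitted", 2)],
)

# In the tool-based setup, an LLM would translate the natural-language
# question ("How many patients were admitted?") into this SQL; it is
# hard-coded here for illustration.
generated_sql = "SELECT COUNT(*) FROM ed_visits WHERE disposition = 'admitted'"

# Executing the generated query against the data yields a deterministic
# answer, instead of asking the model to count rows in its own output.
(count,) = conn.execute(generated_sql).fetchone()
print(count)  # 3
```

The division of labor is the key design choice: the model handles language, the database engine handles arithmetic, so the answer's correctness no longer depends on the model's ability to count.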

Despite these successes, not all models fared well. LLMs optimized for speed and efficiency, such as distilled variants of DeepSeek, struggled to produce usable outputs even when provided with the ability to generate and run code. Furthermore, the Llama-3.1-8B model encountered major difficulties, failing to produce functional results in the majority of assessments and being ultimately excluded from further analysis. These discrepancies highlight the diverse capabilities within the current LLM ecosystem and caution against broad assumptions regarding their utility in structured data environments.

The study’s findings carry critical implications for the future deployment of LLMs in healthcare administration. Benjamin Glicksberg, one of the authors, emphasized that without integrating tool-based strategies—combining LLM-generated code with actual execution—large language models remain fundamentally unsuitable for standalone use in clinical administrative settings. Clinical workflows frequently involve complex structured data requiring absolute reliability and precision, conditions under which straightforward natural language query processing by LLMs falls short.

Moreover, the requirement for “agentic” approaches is underscored by this work. Agentic AI involves systems that act semi-autonomously, leveraging external tools and code execution capabilities to ensure results remain consistent and verifiable. By integrating LLMs with backend code execution engines, hospitals could dramatically accelerate administrative processes while maintaining data integrity. Such hybrid solutions may bridge the gap between cutting-edge AI capabilities and the stringent accuracy demands of healthcare operations.

This study shines a spotlight on the often-overlooked challenges of applying AI in clinical data environments. While the hype around LLMs centers on their conversational fluency and general knowledge, the ability to perform precise numerical computations and filtered data retrieval within complex EHR systems requires a fundamentally different kind of model reliability. The researchers’ meticulous experimental design and real-world data usage offer a vital reality check for the healthcare sector’s ongoing AI ambitions.

Lastly, the authors note that their work did not receive any external funding, and no competing interests were declared. The open-access publication ensures that the full details, along with extensive methodological descriptions and results, remain available to researchers, clinicians, and AI developers aiming to advance safe and effective AI integration into hospital administration.

Overall, these findings caution healthcare providers and AI developers alike to calibrate expectations around LLMs’ current abilities in administrative contexts. They also highlight the powerful potential unlocked by hybrid human-AI systems that combine natural language understanding with robust programming and execution frameworks. As digital healthcare continues to evolve, researchers and practitioners will need to navigate these complex trade-offs to harness AI’s benefits without compromising accuracy and trustworthiness.

Web References:
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001326

Subject of Research: Not applicable
Article Title: Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records
News Publication Date: 7-May-2026
References: Klang E, Sorin V, Korfiatis P, Sawant AS, Freeman R, Charney AW, et al. (2026) Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records. PLOS Digit Health 5(5): e0001326. DOI: 10.1371/journal.pdig.0001326

Keywords

Large Language Models, Electronic Health Records, Clinical Administration, Artificial Intelligence, GPT-4o, Tool-based AI, Chain-of-Thought Prompting, Healthcare Data Analytics, AI Reliability, Code Generation, Hospital Resource Management, Clinical Workflow Automation

Tags: administrative data tasks in hospitals, AI for patient load monitoring, AI language models in healthcare, AI performance on hospital resource allocation, challenges of AI in hospital administration, democratizing data access in healthcare, EHR data querying by non-technical staff, GPT-4o and Llama in medical data analysis, healthcare operational reporting with AI, large language models for electronic health records, limitations of LLMs in clinical workflows, natural language processing for healthcare data


Bioengineer.org © Copyright 2023 All Rights Reserved.
