In recent years, the rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has opened new frontiers in multiple fields, including healthcare. A study led by researchers at Penn State has now provided a rigorous evaluation of how AI-powered chatbots respond to everyday health-related inquiries posed by the general public. The study finds that these AI systems achieve an accuracy rate of approximately 76% when addressing routine health questions, a figure that highlights both the promise and the perils of deploying such technologies in real-world medical contexts.
The research focuses on the perspective of the average internet user, a group that frequently turns to AI as a modern-day symptom checker, much as Google was traditionally used for preliminary health information. This user-centered approach is critical because prior studies predominantly examined LLMs through expert or academic lenses, often overlooking practical consumer interactions. By focusing on typical health queries submitted by laypersons, the study offers vital insights into the effectiveness and safety of AI-based medical advice in daily life.
To gather authentic data reflecting real-world usage, the research team organized an innovative event known as the “Diagnose-a-thon” at Penn State. This competition attracted 34 participants spanning faculty, staff, and students across various academic levels. Participants generated a substantial dataset of 212 health-related prompts, encompassing both genuine and hypothetical conditions, crafted from patient and clinician viewpoints. They then queried four distinct state-of-the-art LLMs: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b. By allowing participants to select their preferred AI model without constraints, the study faithfully replicated the autonomous and diverse usage patterns found in natural settings.
An essential part of the study was a rigorous evaluation stage in which nine board-certified physicians assessed the treatments and information returned by the LLMs. The evaluation was comprehensive, rating both the clinical accuracy and the potential harm of the AI-generated answers on a six-point scale from very low to very high. This detailed scoring system showed how AI diagnostic responses vary across medical specialties and contexts, a level of granularity rarely seen in previous AI investigations.
The findings showed a variable performance landscape across medical disciplines. Obstetrics, gynecology, and otolaryngology yielded the highest levels of correct information with minimal risks, showcasing scenarios where LLMs currently excel. Conversely, fields such as internal medicine, neurology, and dermatology demonstrated more significant challenges for AI systems, where inaccuracies and higher harm potentials were more prevalent. These results underscore an important reality: certain specialized medical domains demand more caution when leveraging AI tools, especially if these tools are employed by untrained individuals.
One specific finding was that prompts between 60 and 250 characters long tended to produce more accurate AI responses. This suggests that message framing and prompt articulation play crucial roles in steering AI models toward clinically valid outputs. Highly specialized or narrowly focused questions, by contrast, posed difficulties, indicating that broad generalist models still face significant hurdles when addressing deeply technical or nuanced medical issues.
Beyond evaluating off-the-shelf AI models, the research team experimented with a novel augmentation approach by retraining the base LLMs using an extensive corpus of medical textbooks, clinical guidelines, and peer-reviewed literature typical of medical school curricula. The goal was to determine whether such domain-specific tuning could enhance clinical validity while reducing harmful outputs. Surprisingly, medical professionals and trainees reviewing these augmented models showed a preference for responses from the original Gemini and Llama bases over the retrained versions. No statistically significant preference was observed regarding ChatGPT’s base versus augmented models. This counterintuitive result suggests that current fine-tuning strategies may not straightforwardly translate into improved clinical communication by AI.
The implications of these findings are profound for the future integration of AI into healthcare delivery. As Dr. Jennifer Kraschnewski, a co-author of the study and a practicing physician, articulates, AI represents a transformative force with the potential to augment clinician capabilities rather than replace human doctors. The challenge lies in harnessing AI tools in ways that bolster medical professionals’ diagnostic processes, reduce cognitive burdens, and improve patient outcomes without exposing patients to the risks of AI errors in unsupervised contexts.
Crucially, the study emphasizes that despite satisfactory accuracy scores in the mid-70s percentage range, the AI models still exhibited an error rate exceeding 20%. This rate is approximately double that of human physicians and highlights the potential for AI to propagate misinformation that leads to harm if patients use it uncritically. These figures argue for cautious and responsible deployment of AI technologies in healthcare and underscore the necessity of preserving human clinical oversight.
The study also offers a nuanced view on AI’s evolving role: rather than supplanting the physician’s role, AI could serve as a catalyst to “upskill” clinicians by providing rapid evidence summaries, differential diagnosis suggestions, and decision support, streamlining care processes. The research community is thus encouraged to focus on developing AI systems tailored to professional use, with interfaces and interpretability tuned for clinical environments.
Penn State’s research ecosystem facilitated this multidisciplinary collaboration, bringing together expertise in informatics, intelligent systems, clinical medicine, and AI ethics. Their participatory research design, which mimics user autonomy and real-world interaction dynamics, sets a new methodological standard for evaluating AI systems in societally critical domains. It also expands the discourse on AI accountability and transparency by highlighting the tangible benefits and limitations observed when AI systems engage with health-related content.
Because AI tools are likely to remain part of healthcare, public education and digital literacy emerge as pivotal. The study’s co-authors advocate for initiatives that enhance consumer understanding of AI’s strengths and weaknesses in medical diagnosis. Such literacy efforts will empower users to critically appraise AI-generated advice, reducing overreliance and potential misuse.
In summary, this Penn State study, to be presented at the 2026 ACM Fairness, Accountability, and Transparency (FAccT) conference, marks a watershed moment in understanding how large language models intersect with everyday healthcare. Its findings support a dual narrative: AI carries tremendous promise to revolutionize medical diagnostics and patient care when stewarded responsibly, but it also harbors non-negligible risks, particularly if accessible without proper clinical guidance. As artificial intelligence advances, the path forward must balance innovation with prudence, ensuring these systems enhance rather than undermine the intricate art of medicine.
Subject of Research: Evaluation of large language models’ accuracy and safety in responding to everyday health-related queries by general users.
Article Title: Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
News Publication Date: 25-Jun-2026
Web References:
10.48550/arXiv.2506.13805
2026 ACM FAccT Conference
References:
The study data is derived from peer evaluations by board-certified physicians, augmented training on medical textbooks and peer-reviewed articles, and participatory crowdsourced clinical cases generated during the Diagnose-a-thon event hosted by Penn State’s Center for Socially Responsible Artificial Intelligence.
Keywords
Generative AI, Artificial Intelligence, Large Language Models, Healthcare, Medical Diagnosis, Clinical Accuracy, AI Ethics, Doctor-Patient Relationship, AI Safety, Medical Informatics, Healthcare Technology, AI in Medicine
Tags: AI and patient information accuracy, AI chatbot accuracy in health queries, AI in healthcare, AI medical advice safety, consumer-focused AI health tools, healthcare AI evaluation study, large language models for medical advice, limitations of AI in medicine, Penn State Diagnose-a-thon event, public engagement with health AI, real-world AI healthcare applications, symptom checking with AI



