
Evaluating AI Anatomy Segmentation Models Without Ground Truth Data

Bioengineer by Bioengineer
May 13, 2026
in Biology

In the rapidly advancing field of medical imaging, the advent of artificial intelligence (AI) has revolutionized how vast collections of scans are analyzed. Automated anatomy segmentation—where AI models label organs and structures in images such as chest CT scans—has become a cornerstone for enabling large-scale studies previously infeasible due to the need for painstaking manual annotation. However, as the number of segmentation models has multiplied, researchers have grappled with a fundamental challenge: how to objectively compare these AI tools in the absence of expert-verified ground truth.

A recent study published in the Journal of Medical Imaging has shed new light on this problem, proposing a robust and practical framework to evaluate concordance among different AI-based anatomy segmentation models without relying on expert annotations as a gold standard. This work centers on chest CT images sourced from the National Lung Screening Trial (NLST), a widely used public dataset for cancer research, ensuring high relevance and applicability to real-world clinical and research scenarios.

The dilemma stems from the nature of public datasets like NLST, which, despite containing thousands of imaging volumes, lack comprehensive organ and bone segmentations. Manual annotations for such intricate structures are prohibitively time-consuming and require highly skilled radiologists, rendering complete ground truth labeling impractical. AI models can generate these labels automatically, yet disparities arise because each model may use different terminology, boundary definitions, or anatomical inclusion criteria. Without a consensus or external standard, pinpointing the superior model has remained a vexing conundrum.

Addressing this, the investigators embraced a paradigm shift: they evaluated AI segmentation tools based on their agreement rather than absolute accuracy. The hypothesis is elegant—if independently developed models concur in labeling a structure, that concordance likely indicates a reliable and valid anatomical segmentation. Rather than seeking the elusive “correct” answer, the study quantifies where AI tools align and where they diverge on the same dataset.

Achieving direct model comparison necessitated a standardized baseline. The researchers selected six prominent open-source segmentation models, including TotalSegmentator (two versions), Auto3DSeg, MOOSE, MultiTalent, and CADS. Despite their differing original output formats and nomenclature, the team harmonized all results by converting them into an interoperable DICOM segmentation standard. Furthermore, they unified labels using the SNOMED-CT vocabulary—a widely accepted medical ontology—assigning uniform color codes and identifiers to anatomical regions. This harmonization enabled side-by-side visualization of segmentations from different models on the very same scan, facilitating accurate comparison.

To enhance accessibility, the study leveraged two powerful open-source platforms widely embraced in medical imaging research: OHIF Viewer, a browser-based tool, and 3D Slicer, a robust desktop application. They extended these viewers with bespoke integrations and plugins capable of displaying multiple segmentations simultaneously in three-dimensional and orthogonal two-dimensional views. This user-friendly interface allows researchers to interactively explore congruence and discrepancies among models for individual organs and structures with unprecedented ease.

The analytic phase focused on a carefully curated subset of 18 chest CT scans from different NLST participants. After filtering out partially imaged or inconsistently detected anatomical structures, the study concentrated on 24 key regions, including lung lobes, the heart, ribs, thoracic vertebrae, and the sternum. For each structure, the authors identified a “consensus” segmentation defined as the voxel set concurrently labeled by all models recognizing that anatomical part. Subsequent comparisons measured how each model’s output overlapped with this consensus region, employing metrics that quantify shape similarity and volumetric congruence.
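The consensus comparison described above can be sketched in a few lines of NumPy: the consensus mask is the voxelwise intersection of every model's binary mask for a structure, and each model is then scored by its overlap with that consensus. The Dice coefficient used here is a standard overlap metric chosen for illustration; the paper's exact metric set may differ.

```python
import numpy as np

def consensus_mask(masks):
    """Voxelwise intersection of binary masks from all participating models."""
    out = masks[0].astype(bool)
    for m in masks[1:]:
        out &= m.astype(bool)
    return out

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Toy example: three models segment the same structure on a tiny volume,
# with one model disagreeing on a single voxel.
rng = np.random.default_rng(0)
base = rng.random((8, 8, 8)) > 0.5
masks = [base, base.copy(), base.copy()]
masks[2][0, 0, 0] ^= True

cons = consensus_mask(masks)
scores = {f"model_{i}": dice(m, cons) for i, m in enumerate(masks)}
```

Because the consensus is an intersection, it is always contained in every model's mask, so a low Dice score against it flags a model that labels substantially more (or differently placed) tissue than its peers agree on.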

These quantitative results were further distilled into interactive plots enabling rapid identification of outlier models or scans exhibiting problematic segmentations. Notably, the team released a publicly accessible interactive website to disseminate these findings, inviting the broader research community to examine the detailed concordance metrics and underlying imaging data themselves, fostering transparency and collaborative refinement.

Results illuminated variable performance across structures. Lung segmentation demonstrated remarkable agreement, with high overlap and nearly indistinguishable boundaries across all models. This consistency highlights the maturity of lung segmentation technologies—likely a function of abundant training data and well-defined anatomical landmarks. In contrast, heart segmentations initially showed moderate concordance owing primarily to one outlier model adopting a narrower definition of the heart. Excluding this model markedly improved overall alignment among the remainder.

Bone structures revealed greater challenges. Four of the six models manifested frequent errors in rib and thoracic vertebrae labels, including merges of adjacent bones or misidentification of vertebral levels. Conversely, two models trained on distinct datasets produced notably more consistent and anatomically comprehensive segmentations. These subtleties eluded aggregate statistics but emerged clearly through simultaneous visual scrutiny, underscoring the indispensability of combined quantitative and qualitative evaluation techniques.

This investigation underscores a crucial insight: even highly cited AI segmentation models can harbor systematic weaknesses, particularly when trained on overlapping or limited data. It also validates a novel pathway for meaningful model assessment without the prohibitive cost of manual ground truth annotation. By integrating standardized atlases, ontology-driven label harmonization, automated voxelwise comparison, and interactive visualization, this framework provides a reproducible, scalable solution for evaluating medical imaging AI tools.

Beyond its immediate findings, this work promotes a vital cultural shift in biomedical AI research—from chasing a mythical single “best” model to embracing evidence-based decision-making informed by comparative strengths and weaknesses. The open availability of software, label mappings, and sample datasets offers the community an invaluable toolkit applicable not only to chest CT anatomy but extensible to other modalities and segmentation tasks.

As AI becomes integral to clinical workflows and population-scale studies alike, transparent evaluation frameworks like this will be indispensable. They empower data scientists, clinicians, and researchers to select segmentation models thoughtfully, gauge reliability, and appreciate limitations—ultimately enhancing the trustworthiness and impact of AI in healthcare.

In a landscape increasingly reliant on AI-generated annotations, the study by L. Giebeler et al. pioneers a path that balances rigor with practicality. Their approach bridges methodological divides, nurtures collaboration, and elevates the standard of medical image analysis through collective truth-seeking, even when classical ground truths remain elusive.

Subject of Research: Not applicable
Article Title: In search of truth: evaluating concordance of AI-based anatomy segmentation models
News Publication Date: 3-Apr-2026
Web References:

https://www.spiedigitallibrary.org/journals/journal-of-medical-imaging/volume-13/issue-06/062204/In-search-of-truth–evaluating-concordance-of-AI-based/10.1117/1.JMI.13.6.062204.full
http://dx.doi.org/10.1117/1.JMI.13.6.062204
References:
Giebeler L., et al., “In search of truth: evaluating concordance of AI-based anatomy segmentation models,” Journal of Medical Imaging, 13(6), 062204 (2026).
Image Credits: L. Giebeler et al.
Keywords: Artificial intelligence, Medical imaging, Anatomy, Anatomy segmentation, AI evaluation, Chest CT, National Lung Screening Trial, Open-source models, DICOM segmentation, SNOMED-CT, 3D Slicer, OHIF Viewer




Bioengineer.org © Copyright 2023 All Rights Reserved.
