Abstract
Aim:
Artificial intelligence (AI) has significantly influenced healthcare, enhancing diagnostic and therapeutic capabilities. This study evaluates the effectiveness of an AI-generated output within actual clinical environments, analyzing its precision compared to conventional interpretation techniques.
Methods:
A cross-sectional observational study assessed the reliability of the VELMENI AI platform in detecting dental issues on panoramic radiographs. Three hundred radiographs from the Sibar Institute of Dental Sciences were used, with four experienced readers trained on the AI platform. Each reader independently identified caries, restorations, and prostheses using the AI system. Diagnoses by dentists and the AI tool were compared, ensuring rigorous analysis and ethical standards.
Results:
This study examined the agreement between four human observers and an AI system in assessing caries, fixed prostheses, and restorations using Cohen’s weighted kappa. High reliability was found among the human observers, with the AI system demonstrating even greater consistency. The results were statistically significant, demonstrating strong agreement. Fleiss’ multi-rater kappa confirmed high overall agreement among all five raters. However, moderate agreement in caries assessment highlighted the need for enhanced training and guidelines.
Conclusions:
This study underscores AI’s potential in dental diagnostics, excelling in fixed prosthesis assessment while facing challenges in caries detection. Improved training and datasets are required for better clinician capabilities. The findings suggest AI-human collaboration is a promising future direction for dental diagnostics.
Keywords
Artificial intelligence, panoramic radiography, diagnosis, deep learning, fixed prosthesis, dental caries, restorations
Introduction
Accurate diagnosis is one of the most critical responsibilities in dental practice, requiring a combination of clinical expertise and advanced diagnostic tools. Dental panoramic radiography (DPR), also known as orthopantomography, is the most prevalent extraoral imaging technique, providing a two-dimensional representation of all teeth, the mandible, the maxilla (including maxillary sinuses), and temporomandibular joints [1, 2]. Its advantages include quick imaging, low radiation exposure, and minimal patient discomfort, making it a preferred diagnostic tool in many dental assessments. Moreover, DPR allows simultaneous imaging of multiple structures, enabling the detection of numerous physiological and pathological conditions with a lower radiation dose [1, 2].
However, the interpretation of orthopantomograms (OPGs) heavily depends on the operator’s knowledge and expertise. Factors such as image enlargement, geometric distortions, unequal magnification, and superimpositions can complicate analysis, potentially leading to misinterpretations and diagnostic errors [3]. Additionally, comprehensive analysis can be time-consuming and susceptible to evaluator bias due to varying levels of clinical experience [3]. High-quality radiographs are essential not only for accurate human diagnoses but also for the development of machine learning (ML) models that can assist dentists in their practice [4]. To address these challenges, automated diagnostic systems can significantly enhance diagnostic accuracy and efficiency in routine dental practice [4].
In recent years, artificial intelligence (AI) has significantly advanced healthcare, especially in medical imaging. By leveraging ML and deep learning (DL), AI enhances diagnostic accuracy and supports clinicians in making informed decisions. A key subset of AI is ML, which identifies patterns in data to predict outcomes [5].
DL, a specialized ML technique, employs deep neural networks (DNNs) that excel at analyzing complex data structures like medical images. These networks consist of interconnected neurons organized into layers, enabling them to recognize hierarchical features such as edges, shapes, and patterns more effectively than traditional ML models [6].
Convolutional neural networks (CNNs), a type of DL model, are particularly effective for medical image analysis. Designed to process large and complex image datasets, CNNs can automatically detect, segment, and classify patterns within 2D and 3D dental images. This automation improves diagnostic speed, accuracy, and reproducibility compared to manual interpretation [6]. Notably, AI-driven tools have shown high performance in DPR analysis, achieving accuracy rates of around 90% [7]. AI models can generate diagnostic reports within seconds, significantly reducing the average analysis time compared to traditional methods, which can exceed 8 minutes per image [8].
Furthermore, AI is composed of multiple subfields, including ML, DL, cognitive computing, natural language processing, robotics, expert systems, and fuzzy logic [9]. ML models allow for automated learning without explicit programming, aiming to enhance decision-making while reducing human involvement. AI systems can analyze existing data to predict potential future outcomes, offering valuable insights for improved diagnostics and clinical workflow efficiency [9].
Recent studies (2023–2024) have demonstrated the growing role of AI in radiographic interpretation. AI-powered software solutions, such as VELMENI AI, have shown promise in enhancing diagnostic consistency and accuracy. Ezhov et al. (2021) [10] underscore the potential of AI-driven diagnostic tools in dentistry, highlighting their increasing clinical adoption.
The primary objective of this study is to evaluate the reliability of VELMENI software in automatically identifying key dental features, including teeth, dental caries, implants, restorations, and fixed prostheses. By leveraging AI technology, VELMENI aims to improve diagnostic precision, support early detection of dental conditions, and enhance patient outcomes. Additionally, integrating AI tools can help dental professionals allocate more time to patient care by reducing the need for time-consuming manual analyses.
This research will assess VELMENI AI’s performance in real-world clinical settings, comparing its diagnostic accuracy to traditional methods. The findings are expected to contribute to the growing body of evidence supporting AI’s role in dentistry, promoting its broader adoption in clinical diagnostics.
Materials and methods
Study design
A cross-sectional observational study was conducted to assess the reliability of the VELMENI AI platform for automatically detecting dental caries, restorations, and fixed prostheses on panoramic radiographs. The study design received ethical approval from the Sibar Institute of Dental Sciences Institutional Ethical Committee (Approval No. Pr.172/IEC/SIBAR/2022-27/12/2022).
Study setting and participants
Image selection: Three hundred anonymized panoramic radiographs were retrieved from the archives of the Radiology Department at the Sibar Institute of Dental Sciences, Guntur. The selected radiographs were diagnostically acceptable and free from artifacts, and they included images from subjects aged over 15 years. Inclusion criteria required that the images provided a clear view of the dental structures for caries, restorations, and fixed prostheses detection.
Dataset characteristics: In the set of 300 panoramic radiographs, the AI prediction showed 754 instances of caries, 177 instances of restorations, and 244 instances of fixed prostheses.
Reader selection
Four experienced Oral Medicine and Radiology specialists, each with a minimum of two years of clinical experience, were included in the study. The radiologists underwent a comprehensive training session on using the VELMENI AI platform prior to the study, ensuring consistent application of the platform across all images.
Image interpretation
The anonymized panoramic radiographs were assigned to the four readers, who independently reviewed the images on the VELMENI AI platform. Each reader was tasked with identifying the presence of dental caries, fixed prostheses, and restorations. The radiographs were annotated to highlight the following (Figures 1 and 2):
The number of teeth with caries
The number of teeth with restorations
The number of teeth with fixed dental prostheses

Figure 1. Restoration and dental caries detection on a panoramic radiograph using the deep learning algorithm. Blue color indicates caries, pink color indicates restoration

Figure 2. Fixed prosthesis and dental caries detection on a panoramic radiograph using the deep learning algorithm. Blue color indicates caries, yellow color indicates fixed prosthesis
AI system utilization
The VELMENI AI platform, based on a CNN architecture, was utilized to automatically detect dental caries, restorations, and fixed dental prostheses in the panoramic radiographs. The platform is powered by VELMENI, the first dental AI to be approved by the FDA for panoramic, periapical, and bitewing radiographs.
Statistical analysis
The data were analyzed to assess:
Inter-rater reliability: The consistency of diagnoses among the raters was evaluated using Cohen’s weighted kappa for pairwise comparisons and Fleiss’ multi-rater kappa for assessing agreement across all five raters (the four readers and the AI system).
Agreement between dentists and AI: The level of agreement between the diagnoses made by the four dentists and those generated by the VELMENI AI system was evaluated using Cohen’s weighted kappa and accuracy, precision, recall, and F1 score for caries, restorations, and fixed prostheses detection.
Statistical software: The analysis was performed using SPSS Version 26 (IBM Corp, USA). The results were interpreted based on standard thresholds for kappa values, where a kappa value greater than 0.75 was considered excellent, 0.40 to 0.75 was considered moderate to good, and less than 0.40 indicated poor agreement.
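The detection metrics and kappa thresholds described above can be sketched as follows. This is a minimal illustration only: the study itself used SPSS, and the function names `detection_metrics` and `interpret_kappa`, the per-tooth binary labelling, and the toy labels are assumptions introduced here, not details taken from the study.

```python
def detection_metrics(reference, predicted):
    """Accuracy, precision, recall and F1 for binary labels (1 = finding present).

    `reference` is the label treated as ground truth (assumed here to be a
    reader's annotation) and `predicted` is the label being evaluated.
    """
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def interpret_kappa(kappa):
    """Apply the thresholds stated above: >0.75, 0.40 to 0.75, <0.40."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "moderate to good"
    return "poor"


# Hypothetical toy labels for eight teeth (not study data).
reference = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = detection_metrics(reference, predicted)
print(metrics)
print(interpret_kappa(0.837), "/", interpret_kappa(0.607))
```

The zero-denominator guards return 0.0 rather than raising, which is one common convention; a different choice may be appropriate when a category is absent from an image.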
Results
This study investigated inter-rater reliability among four human observers and an AI system using Cohen’s weighted kappa, focusing on evaluations of caries, fixed prostheses, and restorations.
The analysis included evaluations of fixed prostheses and restorations, revealing weighted kappa values ranging from 0.866 to 0.959 across various pairings (Tables 1 and 2). The findings indicate a high level of reliability among human observers, with the AI system demonstrating even greater consistency in its assessments. The statistical significance of these results was robust, as evidenced by p-values of less than 0.001, suggesting that the observed agreements are highly unlikely to be due to chance.
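For readers unfamiliar with the statistic, a linearly weighted Cohen’s kappa for two raters can be sketched as below. The sketch assumes that each rating is an integer count per radiograph (for example, the number of teeth with a fixed prosthesis) and that the integers themselves are treated as equally spaced ordinal categories; the function name `linear_weighted_kappa` and the toy ratings are hypothetical, and the reported values were computed in SPSS rather than with this code.

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b):
    """Cohen's kappa with linear disagreement weights |i - j|.

    kappa_w = 1 - (observed mean disagreement) / (expected mean disagreement),
    where the expectation assumes the two raters' marginals are independent.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed disagreement: average absolute difference between paired ratings.
    observed = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / n

    # Expected disagreement from the product of the marginal distributions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        abs(i - j) * ca * cb
        for i, ca in count_a.items()
        for j, cb in count_b.items()
    ) / (n * n)

    if expected == 0:
        return 1.0  # both raters used one identical value throughout
    return 1.0 - observed / expected


# Hypothetical per-image counts from two raters (not study data).
observer = [0, 1, 2, 3, 2, 1, 0, 3]
ai_tool = [0, 1, 2, 2, 2, 1, 1, 3]
kappa_w = linear_weighted_kappa(observer, ai_tool)
print(round(kappa_w, 3))
```

With linear weights, a difference of one tooth is penalized less than a difference of three, which is why the weighted statistic suits ordered counts.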
Table 1. Inter-rater reliability for fixed prosthesis (Cohen’s weighted kappa)
Ratings | Weighted kappa (a) | Asymptotic standard error (b) | Asymptotic z (c) | Asymptotic sig. | 95% asymptotic CI, lower bound | 95% asymptotic CI, upper bound |
---|---|---|---|---|---|---|
observer 1–observer 2 | 0.878 | 0.021 | 22.444 | < 0.001 | 0.837 | 0.918 |
observer 1–observer 3 | 0.866 | 0.022 | 22.178 | < 0.001 | 0.823 | 0.909 |
observer 1–observer 4 | 0.894 | 0.020 | 23.123 | < 0.001 | 0.855 | 0.934 |
observer 1–AI | 0.906 | 0.019 | 23.347 | < 0.001 | 0.868 | 0.944 |
observer 2–observer 3 | 0.922 | 0.017 | 23.318 | < 0.001 | 0.888 | 0.956 |
observer 2–observer 4 | 0.920 | 0.017 | 23.550 | < 0.001 | 0.886 | 0.954 |
observer 2–AI | 0.943 | 0.015 | 24.197 | < 0.001 | 0.913 | 0.974 |
observer 3–observer 4 | 0.935 | 0.015 | 23.894 | < 0.001 | 0.906 | 0.963 |
observer 3–AI | 0.939 | 0.014 | 24.141 | < 0.001 | 0.912 | 0.966 |
observer 4–AI | 0.959 | 0.012 | 24.725 | < 0.001 | 0.935 | 0.983 |
a The estimation of the weighted kappa uses linear weights; b value does not depend on either null or alternative hypotheses; c estimates the asymptotic standard error assuming the null hypothesis that weighted kappa is zero. Sig: significance
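The relationship between the tabulated columns can be checked with a short calculation. The sketch below assumes that the reported interval is the usual normal approximation (kappa plus or minus 1.96 standard errors) and that z is kappa divided by the standard error under the null hypothesis, as footnote c suggests; neither formula is stated explicitly in the text, and the helper names are hypothetical.

```python
Z_95 = 1.96  # rounded two-sided 95% normal quantile (assumed, not stated in the text)


def asymptotic_ci(kappa, std_error, z_crit=Z_95):
    """Normal-approximation interval: kappa +/- z_crit * standard error."""
    margin = z_crit * std_error
    return kappa - margin, kappa + margin


def implied_null_se(kappa, z_stat):
    """Standard error under the null (kappa = 0) implied by z = kappa / SE0."""
    return kappa / z_stat


# Values from the observer 1-observer 2 row of the table above.
lower, upper = asymptotic_ci(0.878, 0.021)
null_se = implied_null_se(0.878, 22.444)
print(f"95% CI: {lower:.3f} to {upper:.3f}; implied null SE: {null_se:.4f}")
```

Under these assumptions the bounds agree with the tabulated 0.837 and 0.918 to within rounding of the inputs, and the implied null standard error (about 0.039) differs from the tabulated 0.021, consistent with the two quantities being estimated under different hypotheses.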
Table 2. Inter-rater reliability for restorations (Cohen’s weighted kappa)
Ratings | Weighted kappa (a) | Asymptotic standard error (b) | Asymptotic z (c) | Asymptotic sig. | 95% asymptotic CI, lower bound | 95% asymptotic CI, upper bound |
---|---|---|---|---|---|---|
observer 1–observer 2 | 0.878 | 0.021 | 22.444 | < 0.001 | 0.837 | 0.918 |
observer 1–observer 3 | 0.866 | 0.022 | 22.178 | < 0.001 | 0.823 | 0.909 |
observer 1–observer 4 | 0.894 | 0.020 | 23.123 | < 0.001 | 0.855 | 0.934 |
observer 1–AI | 0.906 | 0.019 | 23.347 | < 0.001 | 0.868 | 0.944 |
observer 2–observer 3 | 0.922 | 0.017 | 23.318 | < 0.001 | 0.888 | 0.956 |
observer 2–observer 4 | 0.920 | 0.017 | 23.550 | < 0.001 | 0.886 | 0.954 |
observer 2–AI | 0.943 | 0.015 | 24.197 | < 0.001 | 0.913 | 0.974 |
observer 3–observer 4 | 0.935 | 0.015 | 23.894 | < 0.001 | 0.906 | 0.963 |
observer 3–AI | 0.939 | 0.014 | 24.141 | < 0.001 | 0.912 | 0.966 |
observer 4–AI | 0.959 | 0.012 | 24.725 | < 0.001 | 0.935 | 0.983 |
a The estimation of the weighted kappa uses linear weights; b value does not depend on either null or alternative hypotheses; c estimates the asymptotic standard error assuming the null hypothesis that weighted kappa is zero. Sig: significance
To further assess the overall inter-rater reliability among all five raters, Fleiss’ multi-rater kappa was employed for fixed prostheses and restorations, yielding a kappa value of 0.837 for each. This high value underscores the strong agreement among the raters, reinforcing the reliability of the assessments (Tables 3 and 4).
Table 3. Overall agreement for evaluation of fixed prosthesis (Fleiss’ multi-rater kappa)
Rating | Kappa | Asymptotic standard error | Asymptotic z | Asymptotic sig. | 95% asymptotic CI, lower bound | 95% asymptotic CI, upper bound |
---|---|---|---|---|---|---|
Overall agreement (a) | 0.837 | 0.012 | 67.055 | < 0.001 | 0.813 | 0.862 |
a Sample data contains 300 effective subjects and 5 raters. Sig: significance
Table 4. Overall agreement for restorations (Fleiss’ multi-rater kappa)
Rating | Kappa | Asymptotic standard error | Asymptotic z | Asymptotic sig. | 95% asymptotic CI, lower bound | 95% asymptotic CI, upper bound |
---|---|---|---|---|---|---|
Overall agreement (a) | 0.837 | 0.012 | 67.055 | < 0.001 | 0.813 | 0.862 |
a Sample data contains 300 effective subjects and 5 raters. Sig: significance
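Fleiss’ multi-rater kappa, used for the overall agreement figures above, can be sketched as follows. This is an illustrative implementation of the standard unweighted formula, assuming every subject is rated by the same number of raters and that ratings are treated as nominal categories; the function name `fleiss_kappa` and the toy ratings are hypothetical and do not reproduce the study data or the SPSS procedure.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of subjects, each a list with one label per rater."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of rater pairs that agree on this subject.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_subjects
    total = n_subjects * n_raters
    # Chance agreement from the pooled category proportions.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())

    if p_expected == 1.0:
        return 1.0  # only one category was ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical counts for four images, five raters each (not study data).
toy_ratings = [
    [2, 2, 2, 2, 2],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 2, 2],
    [3, 3, 3, 3, 3],
]
kappa_f = fleiss_kappa(toy_ratings)
print(round(kappa_f, 3))
```

Unlike the weighted pairwise statistic, this formulation gives no partial credit when two raters differ by a single tooth, so the two measures are not directly comparable in magnitude.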
In the caries evaluation specifically, almost perfect agreement was found between observer 1 and observers 2 and 4, with kappa values of 0.844 and 0.838, respectively. There was also substantial agreement between observer 1 and observer 3, reflected in a kappa value of 0.784. When comparing the human observers to the AI system, almost perfect agreement was evident across all pairings, with kappa values of 0.903, 0.903, 0.845, and 0.884 for observers 1, 2, 3, and 4, respectively, and all results were statistically significant (p < 0.001) (Table 5).
Table 5. Inter-rater reliability for caries (Cohen’s weighted kappa)
Ratings | Weighted kappa (a) | Asymptotic standard error (b) | Asymptotic z (c) | Asymptotic sig. | 95% asymptotic CI, lower bound | 95% asymptotic CI, upper bound |
---|---|---|---|---|---|---|
observer 1–observer 2 | 0.844 | 0.016 | 24.649 | < 0.001 | 0.814 | 0.875 |
observer 1–observer 3 | 0.784 | 0.019 | 22.818 | < 0.001 | 0.747 | 0.821 |
observer 1–observer 4 | 0.838 | 0.017 | 24.389 | < 0.001 | 0.804 | 0.872 |
observer 1–AI | 0.903 | 0.013 | 26.418 | < 0.001 | 0.877 | 0.928 |
observer 2–observer 3 | 0.789 | 0.020 | 22.862 | < 0.001 | 0.750 | 0.827 |
observer 2–observer 4 | 0.821 | 0.019 | 23.726 | < 0.001 | 0.784 | 0.858 |
observer 2–AI | 0.903 | 0.013 | 26.217 | < 0.001 | 0.877 | 0.928 |
observer 3–observer 4 | 0.787 | 0.019 | 22.617 | < 0.001 | 0.748 | 0.825 |
observer 3–AI | 0.845 | 0.018 | 24.510 | < 0.001 | 0.808 | 0.881 |
observer 4–AI | 0.884 | 0.016 | 25.381 | < 0.001 | 0.853 | 0.916 |
a The estimation of the weighted kappa uses linear weights; b value does not depend on either null or alternative hypotheses; c estimates the asymptotic standard error assuming the null hypothesis that weighted kappa is zero. Sig: significance
However, the analysis using Fleiss’ multi-rater kappa, which assessed overall agreement among the five raters evaluating 300 radiographic images in the caries assessment, revealed a kappa value of 0.607, indicating only moderate agreement (Table 6). This suggests some variability in diagnostic decision-making among the clinicians. Such variability underscores the necessity for improved training and clearer guidelines for clinicians in caries diagnosis.
Table 6. Overall agreement for caries (Fleiss’ multi-rater kappa)
Rating | Kappa | Asymptotic standard error | Asymptotic z | Asymptotic sig. | 95% asymptotic CI, lower bound | 95% asymptotic CI, upper bound |
---|---|---|---|---|---|---|
Overall agreement (a) | 0.607 | 0.008 | 79.324 | < 0.001 | 0.592 | 0.622 |
a Sample data contains 300 effective subjects and 5 raters. Sig: significance
Discussion
The advent of AI in dentistry marks a significant leap forward, promising to revolutionize diagnostic practices and elevate the standard of clinical care. This study investigated the performance of a novel AI tool designed to identify a range of dental pathologies, including caries, fixed prostheses, and restorations. Our findings demonstrate the remarkable accuracy and efficiency of this AI tool, highlighting its potential to transform everyday dental diagnostics.
The AI tool, developed using a DL model based on a CNN architecture, exhibited near-perfect sensitivity and precision in detecting teeth and identifying dental pathologies [3]. Importantly, the tool maintained consistent performance across images acquired from various dental imaging devices, indicating its robustness and adaptability for diverse clinical settings [11].
A key finding of this study was the AI system’s superior consistency in assessments compared to human observers. This aligns with emerging evidence suggesting that AI can serve as a powerful decision-support mechanism, particularly in regions facing a shortage of trained dental professionals. The high degree of pairwise agreement for fixed prostheses and restorations, with weighted kappa values ranging from 0.866 to 0.959 across all rater pairings and from 0.906 to 0.959 for the observer–AI pairings, further underscores the reliability and potential of AI in augmenting human diagnostic capabilities.
While the AI tool excelled in identifying well-defined structures such as fixed prostheses and restorations, consistent with findings by Ezhov et al. (2021) [10], challenges persist in the accurate detection of caries. This reflects the inherent difficulty in differentiating overlapping structures and navigating geometric distortions common in panoramic radiographs, as noted in previous research. To enhance caries detection accuracy, future efforts should focus on expanding AI training datasets to encompass a broader spectrum of caries presentations and refining DL algorithms to better discern subtle radiographic features [3].
Another important consideration is the application of AI in pediatric dental diagnostics. The anatomical and developmental variability in children’s dentition introduces unique challenges. The continuous growth and exfoliation of primary teeth, along with the mixed dentition phase, can impact AI model performance. AI systems trained primarily on adult radiographs may not generalize well to pediatric cases, leading to potential misinterpretations.
Furthermore, ethical and regulatory challenges must be carefully managed, particularly regarding data privacy, informed consent, and bias in AI models. Pediatric imaging datasets are often underrepresented in AI training, making it crucial to develop pediatric-specific models with diverse and well-validated datasets. Studies, such as Turosz et al. (2024) [12], emphasize the need for age-specific AI training datasets and validation protocols to enhance AI accuracy and reliability in younger patients.
To improve AI applicability in pediatric dentistry, future research should focus on dataset expansion, algorithmic refinement, and enhanced clinician-AI collaboration. Addressing these issues will help AI become a more effective tool in pediatric dental diagnostics while maintaining high ethical and clinical standards.
Apart from technological progress, the ethical aspects of AI implementation in healthcare require thorough evaluation. Issues such as AI bias, data privacy, and regulatory adherence, as highlighted by Albano et al. (2024) [9] and Hung et al. (2020) [13], call for proactive strategies to facilitate responsible AI adoption. Ensuring compliance with established data protection regulations, including HIPAA and GDPR, remains crucial. AI models are anticipated to provide accurate insights and support clinical decision-making. They have the potential to transform dentistry by improving patient care, fostering innovation, and advancing research. With the capability to process vast amounts of data and make well-informed decisions, AI can efficiently manage tasks of varying complexity, from routine procedures to intricate analyses. Additionally, ongoing research should prioritize the development of strategies to enhance AI interpretability and strengthen clinician confidence in automated diagnostic systems [9, 14].
This study acknowledges limitations in sample size and diversity, which may impact the generalizability of the findings. As highlighted in prior research, comprehensive and diverse training datasets are crucial for developing robust and adaptable AI models. Future studies should prioritize larger sample sizes, standardized training protocols, and the inclusion of diverse patient demographics and imaging techniques to validate and refine AI performance across a wider range of clinical scenarios.
In conclusion, this study provides compelling evidence for the transformative potential of AI in dental diagnostics. While challenges remain, particularly in caries detection and ensuring dataset diversity, the trajectory of AI in dentistry is undeniably promising. As AI technology continues to evolve, it is poised to become an indispensable tool for clinicians, enabling more accurate, efficient, and accessible diagnoses, ultimately leading to improved patient care and oral health outcomes.
Abbreviations
AI: artificial intelligence
CNNs: convolutional neural networks
DL: deep learning
DPR: dental panoramic radiography
ML: machine learning
Declarations
Acknowledgments
The authors would like to express their gratitude to VELMENI, Inc. and to all the readers who contributed to the study.
Author contributions
SY: Conceptualization, Investigation, Writing—original draft, Writing—review & editing. TC: Validation, Writing—original draft, Writing—review & editing. PRNN and SS: Validation, Writing—review & editing, Supervision. PK and SM: Validation, Writing—review & editing. PH: Writing—review & editing. All authors read and approved the submitted version.
Conflicts of interest
The authors declare no conflicts of interest to report regarding the present study.
Ethical approval
Sibar Institute of Dental Sciences Institutional Ethical Committee gave ethical approval. IEC Number: (Pr.172/IEC/SIBAR/2022-27/12/2022).
Consent to participate
According to the SIBAR Institute of Dental Sciences Institutional Ethical Committee, anonymised and de-identified images do not require patient consent.
Consent to publication
Not applicable.
Availability of data and materials
The datasets for this manuscript are not publicly available because of privacy and ethical restrictions. Requests for access to the datasets should be directed to the corresponding authors.
Funding
Not applicable.
Copyright
© The Author(s) 2025.
Publisher’s note
Open Exploration maintains a neutral stance on jurisdictional claims in published institutional affiliations and maps. All opinions expressed in this article are the personal views of the author(s) and do not represent the stance of the editorial team or the publisher.