Applying large language model for automated quality scoring of radiology requisitions using a standardized criteria

dc.authorid: 0009-0005-5314-2293
dc.authorid: 0000-0002-6989-6613
dc.authorid: 0000-0002-2150-4756
dc.authorid: 0000-0002-6157-6367
dc.authorid: 0000-0001-6824-4990
dc.contributor.author: Buyuktoka, Rasit Eren
dc.contributor.author: Surucu, Murat
dc.contributor.author: Erekli Derinkaya, Pelin Berfin
dc.contributor.author: Adibelli, Zehra Hilal
dc.contributor.author: Salbas, Ali
dc.contributor.author: Koc, Ali Murat
dc.contributor.author: Buyuktoka, Asli Dilara
dc.date.accessioned: 2026-01-24T12:30:53Z
dc.date.available: 2026-01-24T12:30:53Z
dc.date.issued: 2025
dc.department: Alanya Alaaddin Keykubat Üniversitesi
dc.description.abstract:
Objectives: To create and test a locally adapted large language model (LLM) for automated scoring of radiology requisitions based on the Reason for exam Imaging Reporting and Data System (RI-RADS), and to evaluate its performance against reference standards.
Materials and methods: This retrospective, two-center study included 131,683 radiology requisitions from two institutions. A Bidirectional Encoder Representations from Transformers (BERT)-based model was trained on 101,563 requisitions from Center 1 (including 1,500 synthetic examples) and externally tested on 18,887 requisitions from Center 2. The model's performance for two different classification strategies was evaluated against a reference standard created by three radiologists. Model performance was assessed using Cohen's kappa, accuracy, F1-score, sensitivity, and specificity with 95% confidence intervals.
Results: A total of 18,887 requisitions were evaluated in the external test set. External testing yielded an F1-score of 0.93 (95% CI: 0.912-0.943) and kappa = 0.88 (95% CI: 0.871-0.884). Performance was highest for the common categories RI-RADS D and X (F1 ≥ 0.96) and lowest for the rare categories RI-RADS A and B (F1 ≤ 0.49). When requisitions were grouped into three categories (adequate, inadequate, and unacceptable), overall model performance improved (F1-score = 0.97; 95% CI: 0.96-0.97).
Conclusion: The locally adapted BERT-based model demonstrated high performance and almost perfect agreement with radiologists in automated RI-RADS scoring, showing promise for integration into radiology workflows to improve requisition completeness and communication.
Key Points:
Question: Can an LLM accurately and automatically score radiology requisitions based on standardized criteria to address the challenge of incomplete information in radiological practice?
Findings: A locally adapted BERT-based model demonstrated high performance (F1-score 0.93) and almost perfect agreement with radiologists in automated RI-RADS scoring across a large, multi-institutional dataset.
Clinical relevance: LLMs offer a scalable solution for automated scoring of radiology requisitions, with the potential to improve workflow in radiology. Further improvement and integration into clinical practice could enhance communication, contributing to better diagnoses and patient care.
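For illustration, below is a minimal sketch of the kind of pipeline the abstract describes: fine-tuning a BERT-based sequence classifier on requisition free text, here via the Hugging Face transformers and datasets libraries. The checkpoint name, the five-grade RI-RADS label set, the CSV schema, and all hyperparameters are assumptions for the sketch; this record does not specify the authors' actual model or configuration.

```python
# Hypothetical sketch: fine-tune a BERT-based classifier to assign an
# RI-RADS grade to requisition text. Checkpoint, labels, file name, and
# hyperparameters are illustrative assumptions, not the study's setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["A", "B", "C", "D", "X"]  # RI-RADS grades named in the abstract

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

def tokenize(batch):
    # Requisition free text -> fixed-length token IDs for the encoder.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# Assumed CSV schema: a "text" column and an integer "label" column (0-4).
train_ds = load_dataset("csv", data_files="requisitions_train.csv")["train"]
train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="rirads-bert", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```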
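Similarly, the reported agreement and classification metrics (F1-score and Cohen's kappa with 95% CIs) can be reproduced in form with scikit-learn. The percentile bootstrap, the weighted F1 averaging, and the toy labels are assumptions; the record does not state which CI method or averaging the authors used.

```python
# Sketch of the evaluation step: point estimates plus percentile-bootstrap
# 95% CIs for F1 and Cohen's kappa. Toy labels only, not study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, seed=0):
    # Resample cases with replacement, recompute the metric each time,
    # and take the 2.5th/97.5th percentiles as the 95% CI.
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        stats.append(metric(y_true[idx], y_pred[idx]))
    return metric(y_true, y_pred), np.percentile(stats, [2.5, 97.5])

y_ref  = ["D", "X", "D", "C", "X", "D", "B", "X", "D", "C"]  # reference standard
y_pred = ["D", "X", "D", "C", "X", "C", "B", "X", "D", "D"]  # model output

f1, (f1_lo, f1_hi) = bootstrap_ci(
    y_ref, y_pred, lambda t, p: f1_score(t, p, average="weighted"))
kappa, (k_lo, k_hi) = bootstrap_ci(y_ref, y_pred, cohen_kappa_score)
print(f"F1 = {f1:.2f} (95% CI: {f1_lo:.2f}-{f1_hi:.2f}); "
      f"kappa = {kappa:.2f} (95% CI: {k_lo:.2f}-{k_hi:.2f})")
```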
dc.identifier.doi: 10.1007/s00330-025-11933-2
dc.identifier.issn: 0938-7994
dc.identifier.issn: 1432-1084
dc.identifier.pmid: 40836020
dc.identifier.scopus: 2-s2.0-105013751069
dc.identifier.scopusquality: Q1
dc.identifier.uri: https://doi.org/10.1007/s00330-025-11933-2
dc.identifier.uri: https://hdl.handle.net/20.500.12868/5508
dc.identifier.wos: WOS:001553780600001
dc.identifier.wosquality: Q1
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.indekslendigikaynak: PubMed
dc.language.iso: en
dc.publisher: Springer
dc.relation.ispartof: European Radiology
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WoS_20260121
dc.subject: Radiology
dc.subject: Workflow
dc.subject: Large language models
dc.subject: Electronic health records
dc.subject: Natural language processing
dc.title: Applying large language model for automated quality scoring of radiology requisitions using a standardized criteria
dc.type: Article
