Bridging Gaps in Ophthalmology Education Through Large Language Models

Shahrzad Gholami; Beth Wilson; Sarah Page; Daniel B. Mummert; Joseph Carr; Robert R. McNabb; Rahul Dodhia; Juan M. Lavista Ferres; Bill Weeks; Dale E. Fajardo

AJO International | August 2025

Purpose

To assess the performance of general-domain large language models (LLMs), particularly OpenAI’s Generative Pre-trained Transformer (GPT) models, within the American Academy of Ophthalmology (AAO) Self-Assessment Program, which is based on AAO’s Basic and Clinical Science Course.

Methods

We input 3357 questions into GPT-4o, GPT-4-Turbo, o1, and o3-mini via Microsoft’s Azure OpenAI Service using zero-shot and chain-of-thought (CoT) prompting. Questions with images were analyzed using the multimodal versions of GPT-4o and GPT-4.1. The performance of the LLMs was compared with that of 1371 unique residents who had previously participated in the program. Additionally, we compared performance on 1399 questions that included information on 3 question types: recall, interpretation, and decision-making or clinical management. Average accuracy rates were used to evaluate performance, and differences across categories were assessed for statistical significance.
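As a rough illustration of the difference between zero-shot and CoT prompting for multiple-choice questions, the sketch below builds both prompt variants and extracts a letter answer from a reply. It is not the authors' code: the `MCQ` class, the prompt wording, the `Answer: <letter>` convention, and the placeholder question are all assumptions made for illustration, and no model is actually called.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    options: dict  # maps option letter to option text, e.g. {"A": "..."}


def format_question(q: MCQ) -> str:
    # Render the stem followed by one lettered option per line.
    lines = [q.stem] + [f"{k}. {v}" for k, v in sorted(q.options.items())]
    return "\n".join(lines)


def zero_shot_prompt(q: MCQ) -> str:
    # Zero-shot: ask directly for the answer, with no worked examples.
    return format_question(q) + "\n\nReply with the single letter of the best option."


def cot_prompt(q: MCQ) -> str:
    # Chain-of-thought: ask for step-by-step reasoning before the answer.
    return (
        format_question(q)
        + "\n\nReason through the question step by step, then give the final "
        "answer on its own line in the form 'Answer: <letter>'."
    )


def parse_answer(reply: str, valid: set):
    # Accept a bare letter (zero-shot) or the last 'Answer: X' line (CoT).
    text = reply.strip()
    if text in valid:
        return text
    found = re.findall(r"Answer:\s*([A-Z])", text)
    if found and found[-1] in valid:
        return found[-1]
    return None


# Placeholder question; the content is illustrative only.
example = MCQ("Example question stem?", {"A": "first option", "B": "second option"})
```

Scoring a model then reduces to comparing `parse_answer(reply, set(q.options))` with the answer key for each question and averaging the matches.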

Results

o1 (CoT) was the most accurate model (95% confidence interval [CI]: 90.3%–92.1%), with accuracy ranging from 95.17% (general medicine) to 86.9% (cornea) and 91.1% accuracy on a synthesized sample test. It also outperformed residents on recall-type, interpretation-type, and decision-making or clinical management questions (95.7%, 85.3%, and 90.8%, respectively; P < 0.001). Third-year residents were more accurate than first-year or second-year residents (78.2%, 68.3%, and 74.9%, respectively). With multimodal inputs, adding images improved model accuracy, but all models still underperformed residents.
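The abstract does not state how the confidence interval was computed. As a minimal sketch, assuming a normal-approximation (Wald) interval for a proportion, an accuracy and its 95% CI can be derived from a count of correct answers as follows. The function name and the counts in the usage lines are hypothetical and are not taken from the study.

```python
import math


def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Return (accuracy, lower, upper) using a Wald interval.

    Assumption: independent questions and a normal approximation;
    z = 1.96 corresponds to a 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because the Wald interval can overshoot near the ends.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts for illustration only: 90 correct out of 100.
acc, lower, upper = accuracy_with_ci(90, 100)
print(f"accuracy={acc:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With larger question sets the interval narrows in proportion to the square root of the number of questions, which is why an interval a few percentage points wide is plausible for a test bank of several thousand items.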

Conclusions

The accuracy of LLMs continues to improve, with o1 (CoT) showing the highest overall performance. Multimodal inputs can enhance model accuracy, but current models still need improvement. LLMs show great potential in democratizing access to high-quality medical knowledge.