Session

Using ChatGPT-Generated Practice Exam Questions in Medical Education

Background

As AI use in education grows, its risks and benefits remain debated. Prior research has examined using Large Language Models (LLMs) to generate practice questions (PQs), reporting benefits such as time savings; concerns include the potential for inaccurate content.
Building on this, in 2023 we conducted a needs assessment of first-year medical students' study habits. All respondents reported using PQs, and 75% said they always use PQs when available. Our team therefore set out to develop PQs for medical students with an LLM, ChatGPT, using Kolb's Experiential Learning Theory as our framework. Our work aligns with “Active Experimentation” (using PQs to test one's knowledge) and “Reflective Observation” (using PQs to promote metacognition).
From this foundation, our pilot study had three goals: (1) examine the process of developing PQs using ChatGPT, (2) evaluate whether AI-generated PQs affected student exam grades, and (3) assess student satisfaction with AI-generated PQs.

Methods

Our institution follows an integrated, systems-based curriculum, organizing instruction into organ system–specific “blocks.” For this study (IRB#0077-23-EX), we selected the Circulatory and Respiratory Blocks due to historically lower mean exam scores and higher standard deviations, suggesting greater potential for improvement.
An upper-level medical student used ChatGPT to develop USMLE-style PQs, with answers and explanations, from course learning objectives for first-year medical students in the Class of 2027 ('27). This was an iterative process involving multiple rounds of review by faculty experts to ensure accuracy.
Finalized PQs were distributed via an online platform at least 24 hours before each exam. Students could complete the questions multiple times, receiving correct answers and explanations. The analysis does not account for repeated attempts by the same student.
We compared '27 exam performance with that of the Class of 2026 ('26), who received identical instruction but did not have PQs. We also analyzed intra-cohort differences within '27, comparing the exam scores of students who used the PQs at least once with those of students who did not. Statistical significance (p≤0.05) was assessed using equal-variance, two-sample t-tests. Student satisfaction was assessed using a Likert-type survey and analyzed with descriptive statistics.
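For illustration, a minimal sketch of this comparison in Python with SciPy appears below; the score lists are hypothetical placeholders, not study data.

    # Hedged sketch: equal-variance (Student's) two-sample t-test, as described above.
    # The score lists are hypothetical placeholders, not study data.
    from scipy import stats

    scores_26 = [82.0, 88.5, 79.0, 91.0, 86.0]   # hypothetical '26 exam scores
    scores_27 = [85.0, 90.0, 84.5, 92.0, 88.0]   # hypothetical '27 exam scores

    t_stat, p_value = stats.ttest_ind(scores_27, scores_26, equal_var=True)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant = {p_value <= 0.05}")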

Results and Discussion

In total, 211 practice questions were distributed to '27 across three Circulatory and two Respiratory Block exams. Utilization increased from 12.1% (n=16) for the first exam to 36.4% (n=48) for the last exam.
Early in PQ development, the absence of a standardized ChatGPT input resulted in inconsistent outputs. Student and faculty reviewers found errors such as incorrect or insufficient answers and explanations, resulting in the alteration or deletion of 5 of the initial PQs (14.7% of that set). Consequently, we developed a workflow to minimize errors in question and answer development.
After we implemented a standardized prompting workflow, fewer PQ revisions were required. This improvement may reflect ChatGPT's adaptive capabilities, in which ongoing user interaction enhances outputs. Increased platform familiarity likely also contributed, as more precise inputs produced higher-quality responses, consistent with previous findings.
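To illustrate what a standardized input can look like, the sketch below applies a fixed prompt template to a learning objective through the OpenAI Python API. This is an assumption for illustration only: the study used the ChatGPT interface directly, and the template wording, function name, and model choice are hypothetical.

    # Illustrative sketch only: the study used the ChatGPT interface directly.
    # The prompt template, function name, and model choice here are hypothetical.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT_TEMPLATE = (
        "Write one USMLE Step 1-style multiple-choice question with five answer "
        "options, identify the correct answer, and give a brief explanation, "
        "covering this learning objective: {objective}"
    )

    def generate_practice_question(objective: str) -> str:
        # One standardized call per learning objective keeps outputs consistent
        response = client.chat.completions.create(
            model="gpt-4",  # hypothetical model choice
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(objective=objective)}],
        )
        return response.choices[0].message.content

    # Example usage with a hypothetical learning objective
    print(generate_practice_question("Describe the cardiac conduction pathway."))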
There were no significant demographic differences between '26 (n=132) and '27 (n=132); however, the median overall GPA was significantly lower in '26 than in '27 (p=0.02). '27 scored significantly higher on average on the second and third Circulatory exams (p=0.02 and p<0.01, respectively). No significant differences were observed on the other three exams (p=0.11, p=0.90, and p=0.96, respectively). These differences may be due to PQ usage or may reflect stronger overall '27 performance, as suggested by their higher GPAs.
When comparing the five exam scores within '27 between students who used the PQs at least once and those who did not, no significant differences were observed (n=16, p=0.55; n=36, p=0.84; n=42, p=0.11; n=45, p=0.24; n=48, p=0.07, respectively). Although not statistically significant, students who used PQs at least once had consistently higher average exam scores (86.3, 89.5, 88.5, 85.0, and 87.1) than students who never used them (84.8, 89.2, 85.9, 82.8, and 84.7). Because initial n-values were low, we will explore whether a larger sample size yields significant differences in exam scores.
Students who used ChatGPT-generated PQs (n=35) agreed or strongly agreed that the questions improved their performance on the Circulatory (68.6%) and Respiratory (65.7%) examinations.
Limitations include implementation at a single institution within only two Blocks. Additionally, the results may overrepresent the number of students who actually utilized the PQs. Further research across multiple institutions and diverse curricula, with larger sample sizes, is needed, but one takeaway is clear: thorough educator review of AI-generated content is essential to ensure accuracy and clarity.

Jack Paradis

Medical Student, University of Nebraska Medical Center
