E4596. GPT Goes to School: Evaluating the Performance of GPT with Retrieval-Augmented Generation on the Final FRCR Part A
  1. Shu Wen Goh; Tan Tock Seng Hospital
  2. Andrew Makmur; National University Hospital
  3. Hsien Min Low; Tan Tock Seng Hospital
  4. Yonghan Ting; Tan Tock Seng Hospital
The Final Fellowship of the Royal College of Radiologists (FRCR) Part A (2A) examination, set by the Royal College of Radiologists (RCR), is a significant milestone in the education of radiology residents. GPT-4 has passed multiple examinations, including medical ones; however, hallucination remains a stumbling block to its use in such knowledge-based applications. Retrieval-augmented generation (RAG) is a recent technique that may improve the accuracy of large language models by grounding responses in sources such as textbooks. Here, we compare the performance of stock GPT-4 with that of GPT-4 augmented with radiology and FRCR 2A books.

Materials and Methods:
Stock GPT-4 and GPT-4 with RAG (RAG-GPT) were each tested on a set of 379 FRCR 2A questions drawn from a variety of sources. Questions were subdivided by subspecialty and by difficulty level according to Bloom’s Taxonomy. The overall score for each model was calculated, as was each model’s performance by subspecialty and difficulty level. If RAG-GPT could not provide an answer based on the provided context, the answer from stock GPT-4 was used. The pass mark was arbitrarily set at 60%, and the helpfulness of the retrieved context was rated on a 5-point Likert scale.
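The scoring procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `Question` record, its field names, and the helper functions are hypothetical, the four sample questions are invented for demonstration, and treating the 60% pass mark as "at least 60%" is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional

PASS_MARK = 0.60  # pass mark stated in the abstract; ">=" is an assumption


@dataclass
class Question:
    # All field names are hypothetical, chosen for illustration only.
    subspecialty: str
    level: int                  # difficulty level (Bloom's Taxonomy)
    correct: str                # answer key
    stock_answer: str           # answer from stock GPT-4
    rag_answer: Optional[str]   # None when RAG-GPT could not answer from context


def stock_answer(q: Question) -> str:
    return q.stock_answer


def final_rag_answer(q: Question) -> str:
    # Fallback rule from the methods: use the stock answer when RAG gives none.
    return q.rag_answer if q.rag_answer is not None else q.stock_answer


def score(questions, answer_of: Callable[[Question], str]):
    # Returns (number correct, number of questions).
    n_correct = sum(answer_of(q) == q.correct for q in questions)
    return n_correct, len(questions)


def score_by(questions, key, answer_of):
    # Groups questions by key(q), e.g. subspecialty or level, then scores each group.
    groups = defaultdict(list)
    for q in questions:
        groups[key(q)].append(q)
    return {k: score(v, answer_of) for k, v in groups.items()}


def passed(n_correct: int, total: int) -> bool:
    return n_correct / total >= PASS_MARK


# Invented toy data, not taken from the study.
toy = [
    Question("GI", 3, "A", "B", "A"),
    Question("GI", 1, "C", "C", None),     # RAG gave no answer: falls back to stock
    Question("Neuro", 2, "D", "D", "B"),
    Question("Neuro", 3, "E", "A", None),
]

overall_stock = score(toy, stock_answer)
overall_rag = score(toy, final_rag_answer)
rag_by_subspecialty = score_by(toy, lambda q: q.subspecialty, final_rag_answer)
rag_by_level = score_by(toy, lambda q: q.level, final_rag_answer)
```

Passing `lambda q: q.level` instead of `lambda q: q.subspecialty` gives the breakdown by difficulty level, so the same grouping helper covers both analyses named in the methods.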

Results:
Stock GPT-4 achieved a score of 242/379 (64%, pass). RAG-GPT achieved a near-identical score of 243/379 (64%). By subspecialty, RAG-GPT showed the greatest improvement in gastrointestinal/hepatobiliary questions, rising from the stock GPT-4 performance of 67.1% to 75.0%. By difficulty level, it showed the greatest improvement in level 3 (second-order thinking) questions, rising from 31.8% to 50.0%. The usefulness of the RAG context was rated at a mean of 3.6 on the 5-point scale; the mean rose significantly to 4.0 for correctly answered questions with a returned context (p < 0.01). The greatest increase in usefulness was in level 3 questions, from 3.8 to 4.2.

Conclusion:
These results demonstrate the depth of clinical reasoning required to pass an examination such as the FRCR 2A. Interestingly, RAG did not contribute significantly to the final score, though it frequently returned references that are useful for studying, particularly in questions where the context was deemed sufficient, and even when the model gave a wrong answer. RAG-GPT could therefore serve as an additional resource for learners by providing related study questions, explaining unanswered queries, or helping to evaluate resident-set questions for peer instruction. In addition, RAG-GPT's passing score could allow it to function as an impartial adjudicator, assessing whether questions are of an equitable degree of difficulty. In conclusion, stock GPT-4 has the potential to pass the FRCR 2A, and the addition of RAG could allow it to serve as an additional resource by suggesting further areas of reading. However, the integration of LLMs into medical education must be approached cautiously, ensuring that these tools are used responsibly and in conjunction with human supervision and guidance.