ChatGPT and AI vs Humans in medical assessment

Introduction

Artificial intelligence and ChatGPT seem to be taking over the world, but can they write scenarios for undergraduate medical assessment?

What are artificial intelligence and ChatGPT?

Artificial intelligence is technology that can solve problems autonomously after going through a ‘learning’ process; it has been discussed before at St Emlyn’s.

Artificial intelligence learning can be thought of as similar to how a child learns. Take a child learning to assess danger: from a 5-year-old fearlessly flinging themselves across the monkey bars, to a 10-year-old more cautious of heights, to a 16-year-old who has a strong grasp of more subtle dangers such as interactions with strangers. This learning occurs through massive amounts of experience; the more the child experiences, the more they learn. The concept is the same for artificial intelligence, except that life experience is replaced with massive amounts of data. Before release, artificial intelligence is trained using large data sets. The more data it consumes, the more it ‘learns’, leading to a system that can interpret problems and make predictions with a good level of accuracy. ChatGPT is a large language model, a type of artificial intelligence. It can follow human prompts and summarise and produce information in a manner that is accurate and ‘human-like’.

What is already known about the pros and cons of ChatGPT in medical education?

ChatGPT has significant potential within the field of medical care and medical education^1–3, largely because of its ability to take large amounts of information and almost instantly convert them into readable summaries. In medical education, this could be particularly useful as a medical student when your supervisor asks you to take lunch to revise the foetal circulation. ChatGPT can produce an instant summary of a subject, increasing accessibility and reducing the time spent reading multiple resources. ChatGPT can also aid learning by giving prompts and learning guides which can be tailored to medical education. For example, I asked ChatGPT to give me a list of topics related to gastrointestinal medicine; this is what it produced.

You
Give me a list of topics I should revise related to gastrointestinal medicine
ChatGPT
Certainly! Here's a list of topics related to gastrointestinal medicine that you might consider revising:
1. Anatomy and Physiology of the Gastrointestinal Tract
2. Common Gastrointestinal Disorders:
• Gastroesophageal Reflux Disease (GERD)
• Peptic Ulcer Disease
• Gastritis
• Irritable Bowel Syndrome (IBS)
• Inflammatory Bowel Disease (IBD) including Crohn's Disease and Ulcerative Colitis
• Diverticular Disease
• Gastroenteritis
• Celiac Disease
• Colorectal Cancer
• Hemorrhoids
• Gallstones and Gallbladder Disorders
• Liver Diseases (e.g., Hepatitis, Cirrhosis)
• Pancreatitis
3. Diagnostic Procedures:
• Endoscopy (Upper GI Endoscopy, Colonoscopy)
• Imaging (CT Scan, MRI, Ultrasound)
• Biopsy
• Laboratory Tests (Liver Function Tests, Stool Tests)
4. Treatment Modalities:
• Pharmacotherapy (Antacids, Proton Pump Inhibitors, Antibiotics, Immunomodulators)
• Dietary and Lifestyle Modifications
• Surgical Interventions (e.g., Appendectomy, Colectomy)
• Endoscopic Procedures (Polypectomy, Hemostasis)
5. Nutritional Support and Management:
• Enteral and Parenteral Nutrition
• Diet in Gastrointestinal Disorders
• Malabsorption Syndromes

Furthermore, as a computer-generated system ChatGPT has no bias. Bias, a subject close to our hearts at St Emlyn’s and something we have blogged and spoken about a lot, can be present in the medical education system, whether that be a clinical supervisor who is a massive pharmacology enthusiast or being placed in a smaller hospital with no access to certain subspecialities. Unlike these examples, ChatGPT acts without a conscious or subconscious bias, which presents an opportunity to level the playing field in medical education.

There are drawbacks, however. To respond accurately to requests, ChatGPT underwent a learning process in which it consumed a large dataset in order to form pattern recognition and language recognition sequences. ChatGPT is therefore reliant on the information it was trained on, and the output it produces is dependent on that. We are still struggling to understand the way ChatGPT selects resources and how it would cope with the rapidly evolving nature of medical literature. Several studies have highlighted AI ‘hallucination’ (the tendency of AI to produce factually inaccurate information) as a serious pitfall of the current version. Alkaissi et al. showed that ChatGPT not only generated incorrect information but, when asked to cite its sources, also cited non-existent ones. These AI hallucinations pose serious issues if ChatGPT becomes more integrated into medical education.

Finally, a significant concern for the medical education community is overreliance on AI assistance. ChatGPT can process and summarise a wealth of information accessible on the internet; however, it is not capable of the clinical judgement and critical thinking that medicine heavily relies on.

The Study – ChatGPT to generate clinical vignettes for teaching and multiple choice questions for assessment

The Abstract

Aim: This study aimed to evaluate the real-life performance of clinical vignettes and multiple-choice questions generated by using ChatGPT.
Methods: This was a randomized controlled study in an evidence-based medicine training program. We randomly assigned seventy-four medical students to two groups. The ChatGPT group received ill-defined cases generated by ChatGPT, while the control group received human-written cases. At the end of the training, they evaluated the cases by rating 10 statements using a Likert scale. They also answered 15 multiple-choice questions (MCQs) generated by ChatGPT. The case evaluations of the two groups were compared. Some psychometric characteristics (item difficulty and point-biserial correlations) of the test were also reported.
Results: None of the scores in 10 statements regarding the cases showed a significant difference between the ChatGPT group and the control group (p > .05). In the test, only six MCQs had acceptable levels (higher than 0.30) of point-biserial correlation, and five items could be considered acceptable in classroom settings.
Conclusions: The results showed that the quality of the vignettes are comparable to those created by human authors, and some multiple-questions have acceptable psychometric characteristics. ChatGPT has potential in generating clinical vignettes for teaching and MCQs for assessment in medical education.

What type of study is this?

This is a randomised controlled study that aimed to assess whether, in the opinion of medical students, ChatGPT could produce clinical vignettes and multiple-choice questions (MCQs) that were accurate and comparable to those written by medical educators. The study was conducted at Gazi University Faculty of Medicine, Turkey, and the initial study group comprised 74 fourth-year medical students.

Methods

The study group was randomly assigned to receive either ChatGPT-generated or human-written cases. All other education before this was the same for all the students. ChatGPT 3.5, a free-to-use version, was used between December 2022 and January 2023. A subscription version of ChatGPT has since become available, but the authors comment that even if they had had this at their disposal they would have chosen the free version, to ensure their results applied to all, including those for whom a paid version is unaffordable.

Clinical cases

The investigators entered prompts into ChatGPT to produce 37 vignettes, using this template:

We need an ill-defined medical case. Medical students will use this case to apply evidence-based medicine principles. For this reason, the case should include a dilemma. It should be on [THE DISEASE OR PROBLEM]. The case should consist of [THE NUMBER OF SENTENCES] sentences. Provide the age and gender of the patient.
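For readers who want to try this themselves, here is a minimal sketch of how the template could be filled in programmatically. The wording of the template comes from the study; the helper name `build_prompt`, the parameter names, and the example topic are my own assumptions for illustration, and the sketch only builds the prompt text (it does not call ChatGPT).

```python
# Illustrative sketch only: fills the study's published prompt template.
# build_prompt, topic and n_sentences are hypothetical names, not from the paper.

TEMPLATE = (
    "We need an ill-defined medical case. "
    "Medical students will use this case to apply evidence-based medicine principles. "
    "For this reason, the case should include a dilemma. "
    "It should be on {topic}. "
    "The case should consist of {n_sentences} sentences. "
    "Provide the age and gender of the patient."
)


def build_prompt(topic: str, n_sentences: int) -> str:
    """Return the template with the disease/problem and sentence count filled in."""
    if not topic.strip():
        raise ValueError("topic must not be empty")
    if n_sentences < 1:
        raise ValueError("n_sentences must be at least 1")
    return TEMPLATE.format(topic=topic.strip(), n_sentences=n_sentences)


if __name__ == "__main__":
    # Example topic chosen arbitrarily for demonstration.
    print(build_prompt("coeliac disease", 6))
```

In the study, the topic and length were matched to the human-written case, so the same two values would be taken from each existing vignette.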

A ChatGPT version of each case was generated using the condition and length of the human-written example. These were then reviewed by a subject matter expert to ensure suitability. Fifteen cases were then submitted for further refinement, using these prompts (you have to love how polite they are…):

“Please use a specific treatment/medicine in the case.”
“Please remove the generic name of the medicine and use the active substance name instead.”
“Please mention a specific treatment option and write the case again.”
“Please mention the name of the test and write the case again.”
“Please mention for which genetic disease the test has been performed and write the case again.”

The only revisions made to the cases were made by ChatGPT and not the subject matter experts themselves.

Multiple choice questions

Fifteen MCQs were generated, all testing the students’ knowledge about medical statistics and evidence-based medicine. For these, each prompt began with

Write a case based single best answer multiple choice question with five options that…

There were no changes made to these questions after review.

Results

Of the 74 students in the original group, 10 declined to fill in the evaluation form, leaving a sample of 64 students, 34 in the ChatGPT group and 30 in the control group.

A ten-statement questionnaire was used to evaluate the students’ opinions on the vignettes. This covered aspects including coherence, use of clinical reasoning, and utility of the case for evidence-based medical learning and research, as well as whether they found the cases ‘fun to deal with’ and whether they agreed that ‘I liked the case’. Each statement was scored out of five using a Likert scale from 1: Definitely not agree to 5: Definitely agree. There was no significant difference between the ChatGPT and human-generated cases on any of the ten statements.

The MCQs were evaluated for difficulty and point-biserial correlation (the ability of a question to differentiate between high- and low-performing candidates). The MCQs spanned a range of difficulty levels. Six of the 15 MCQs reached an acceptable level of point-biserial correlation (higher than 0.30) and would be appropriate for use in medical school exams; five questions reached point-biserial levels satisfactory for use in a classroom but not an examination setting.

Discussion

The paper successfully evaluated the feasibility of using ChatGPT to generate assessment material comparable to that written by humans, in both vignette and MCQ formats. It reflects other work showing that ChatGPT can produce questions that are indistinguishable from human-written ones. Although the trial showed good blinding in the clinical vignettes, the small sample size limits its power. A crossover trial may have given even more insight into how the AI-generated vignettes compared to the control vignettes; however, I’m not sure how many students would sign up to do double the work.

The trial provided an interesting insight into the oversight required to generate these questions. Around 40% of the vignettes generated (15 of 37) required further modification via investigator-generated prompts. A Likert scale was a good tool to assess the students’ views, and it assessed both the clinical difficulty and coherence of the vignettes. Another aspect not assessed is the difference in the quality of the answers produced by the students, which is arguably the key measure of equivalence.

A small number of MCQs were completed by all the students studied, with no control questions. Their difficulty was assessed by taking the total score of test takers and dividing it by the maximum possible score. Therefore, the study did not ascertain the difference in difficulty between ChatGPT and examiner-derived questions. The point-biserial correlation was helpful in highlighting that the standard of many questions fell short of currently accepted standards. This is not necessarily a failing of ChatGPT, however: exam questions are typically subject to an arduous process of ratification and modification prior to acceptance, so this result does not make using ChatGPT for MCQs unfeasible.
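To make the two psychometric measures more concrete, here is a minimal sketch of how they can be calculated for right/wrong MCQs. It is not the authors’ analysis code. It assumes each question is scored 0 or 1, uses the standard textbook point-biserial formula against the total test score (including the item itself, which the paper may or may not have done), and the function names and the small response table are invented for illustration. The 0.30 threshold is the one quoted in the abstract.

```python
from math import sqrt
from statistics import mean, pstdev

ACCEPTABLE_R_PB = 0.30  # threshold quoted in the study abstract


def item_difficulty(item_scores):
    """Marks obtained on the item divided by the maximum possible (0/1 scoring assumed)."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Correlation between getting an item right (0/1) and the total test score.

    Assumption: the total includes the item itself (uncorrected item-total correlation).
    """
    correct = [t for x, t in zip(item_scores, total_scores) if x == 1]
    incorrect = [t for x, t in zip(item_scores, total_scores) if x == 0]
    if not correct or not incorrect:
        raise ValueError("item needs both correct and incorrect responses")
    spread = pstdev(total_scores)  # population standard deviation of total scores
    if spread == 0:
        raise ValueError("total scores have no variation")
    p = len(correct) / len(item_scores)  # proportion answering correctly
    return (mean(correct) - mean(incorrect)) / spread * sqrt(p * (1 - p))


if __name__ == "__main__":
    # Invented data: rows are students, columns are questions (1 = correct).
    responses = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0],
    ]
    totals = [sum(row) for row in responses]
    for q in range(len(responses[0])):
        item = [row[q] for row in responses]
        r_pb = point_biserial(item, totals)
        verdict = "acceptable" if r_pb > ACCEPTABLE_R_PB else "needs review"
        print(f"Q{q + 1}: difficulty={item_difficulty(item):.2f}, r_pb={r_pb:.2f} ({verdict})")
```

A question that strong students get right and weaker students get wrong gives a point-biserial value close to 1; a question everyone guesses on gives a value close to 0, which is why low values flag questions for review.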

Unlike the vignettes, the MCQs were not evaluated by the students for accuracy or coherence. In fact, the paper did not detail whether ChatGPT produced any vignettes or MCQs that were incorrect. This is an essential measure given the known tendency of AI to hallucinate and produce incorrect or outdated information. Evaluation of the material by examiners would be useful to identify critical failures such as hallucinations or clinically incorrect material, giving a fuller picture of the accuracy of ChatGPT in producing MCQs.

Unanswered questions

Time to efficacy

A major argument supporting ChatGPT and other AI initiatives is the reduction in human labour and the large amount of material produced rapidly. However, as evidenced here, ChatGPT requires oversight, with review of all material and further prompting, to produce acceptable responses. An important comparator would be the cost-effectiveness of ChatGPT versus resources produced by human educators.

Anything you can do I can do better – Plagiarism and reproducibility

For the past 2 years the UK has been piloting the UK Medical Licensing Assessment (UKMLA), a final year exam set to be delivered nationally in 2025. The question bank is closely monitored and only a small subset of questions is released in a mock annually. ChatGPT is free for all to access; therefore, if students can give ChatGPT prompts similar to those examiners use to write vignettes and MCQs, there is a high likelihood that at least some of the questions would be known by the AI-literate beforehand. Currently, this is somewhat prevented because ChatGPT produces significantly variable output when given the same prompt.

In November 2022, ChatGPT became free for the public to access. One of the first groups to start using this resource was students in higher education, leading to an outcry from higher education institutions. Sullivan et al. performed a systematic review highlighting that the biggest concern relating to AI in higher education was plagiarism and academic integrity. ChatGPT has been shown to write college-level essays and pass the USMLE (the final medical examination in the USA), highlighting its significant analytical and prose abilities. It is important to re-evaluate the landscape relating to AI and academic integrity for students. Looking even further than this, if students were to be banned from using ChatGPT, what would the ethics be of examiners using it to generate questions?

Final Thoughts

ChatGPT can assist medical educators in producing high-quality vignettes that are considered to be of a similar standard to clinician-written ones. ChatGPT is also able to produce a variety of MCQs, some of which meet standards for medical student examinations. Leading on from this article, evaluating time efficacy and answer quality is important to further define ChatGPT’s role in this area of medical education.

References

1. Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus. Published online February 19, 2023. doi:10.7759/cureus.35179
2. Franc JM, Cheng L, Hart A, Hata R, Hertelendy A. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale. Can J Emerg Med. Published online January 2024:40-46. doi:10.1007/s43678-023-00616-w
3. Seetharaman R. Revolutionizing Medical Education: Can ChatGPT Boost Subjective Learning and Expression? J Med Syst. Published online May 9, 2023. doi:10.1007/s10916-023-01957-w