Here is the continuation of my discussion of recent studies on ChatGPT taking graduate-level professional exams.
Case 2: ChatGPT and Law School Exams
A group of professors at the University of Minnesota Law School gave ChatGPT the final exams from four different law-school courses: Constitutional Law, Employee Benefits, Taxation, and Torts. The tests included both multiple choice and essay questions. The authors shuffled ChatGPT’s answers with those of human students, and graded them without knowing who produced the answers. Overall, the authors reported that “ChatGPT performed on average at the level of a C+ student, achieving a low but passing grade in all four courses.” Interestingly, ChatGPT’s performance was better on the essay questions than on multiple choice questions.
Like the MBA result I described in Part 1, this result was also touted in the news:
Yes, CBS News, ChatGPT passed all four exams, though barely: the original paper on these results noted that “ChatGPT generally scored at or near the bottom of each class” (so perhaps its passing scores indicate a bit of grade inflation?). No, Fox 10 Phoenix, ChatGPT is nowhere near smart enough to graduate law school—it only passed one set of exams, not to mention that law school requires more than passing written exams. Fox 10 Phoenix might be forgiven though, given the hyperbolic title of the original paper, “ChatGPT Goes to Law School”.
Again, it’s quite surprising and impressive that an AI system can pass such a test at all, but, as I said before, this is no guarantee of its ability to generalize beyond the specific questions on these tests. More careful probing is needed before we can conclude much from this result.
Case 3: GPT-3 and Multiple-Choice Bar Exam Questions
Two other law professors gave GPT-3 (a predecessor of ChatGPT) sample questions from the multiple-choice Multistate Bar Examination (MBE) section of the bar exam (paper here). The sample questions were taken from the standard test-preparation material offered by the National Conference of Bar Examiners. The authors did some checks to see whether these questions might have been in GPT-3's training data and concluded that the probability was low.
Each question was given with four answer choices, so a purely random system would achieve 25% accuracy. The authors tried several ways of prompting GPT-3: for example, “Ask the model for a single multiple-choice answer only”, which is equivalent to what a human test taker is asked for, and “Ask the model to rank order its top three multiple choice answers.” The latter prompt type produced the best results. Using this form of prompting, GPT-3's top-choice accuracy varied by topic.
The passing range for each topic was 58%–62%. Thus GPT-3 achieved a passing grade in only two topics, failing the overall exam. But its scores were significantly above random chance; it did surprisingly well for a system not trained specifically on legal knowledge!
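To make the two prompt styles concrete, here is a minimal sketch in Python. The prompt wording, the function names, and the example question are my own illustration, not the authors' actual prompts or the NCBE's materials:

```python
# Minimal sketch of the two prompting schemes described above.
# The prompt wording and the example question are illustrative only.

def single_answer_prompt(question: str, choices: dict) -> str:
    """Ask for one letter only, which is what a human test taker is asked for."""
    options = "\n".join(f"({letter}) {text}" for letter, text in choices.items())
    return f"{question}\n{options}\nAnswer with the single letter of the best choice."

def rank_top_three_prompt(question: str, choices: dict) -> str:
    """Ask the model to rank-order its top three answer choices."""
    options = "\n".join(f"({letter}) {text}" for letter, text in choices.items())
    return f"{question}\n{options}\nList your top three choices in order, best first, as letters only."

# Example with a made-up question (not from the NCBE preparation materials):
prompt = rank_top_three_prompt(
    "A buyer and a seller orally agree on the sale of a parcel of land. "
    "Is the agreement enforceable?",
    {
        "A": "Yes, because oral contracts are always enforceable.",
        "B": "No, because contracts for the sale of land generally fall under the statute of frauds.",
        "C": "Yes, because consideration was exchanged.",
        "D": "No, because land sales require a court order.",
    },
)
print(prompt)
```

With the rank-ordered prompt, the “top choice” accuracy is simply how often the first answer in the returned ranking is correct.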
The title of the paper on these results is “GPT takes the bar exam”. This is catchy but not exactly true. The MBE is only part of the bar exam, and GPT-3 wasn’t given an actual bar exam but rather a collection of sample questions. Moreover, GPT-3 was asked to rank multiple-choice answers rather than to return a single answer, as a human test taker would be. Perhaps these are minor points, perhaps not.
To get a better idea of how well GPT-3 understands the questions and how well it can generalize, one would need to give it several variations on each question: variations that test the same knowledge and concepts with different wording, different scenarios, and so on, like my variation on the MBA question in the first part of this post. This would be a useful exercise for someone to do!
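As a rough illustration of what such probing could look like (my own sketch, not anything done in the paper), one could hold the underlying concept fixed and systematically vary the surface details of a question:

```python
# Sketch: generate surface variations of one question that all test the same concept.
# The template and fillers are made up for illustration.
import itertools

TEMPLATE = (
    "{plaintiff}, a {occupation}, slips on an unmarked wet floor in {defendant}'s "
    "{venue} and is injured. Under a negligence theory, which element of the claim "
    "is most likely to be disputed?"
)

fillers = {
    "plaintiff": ["Ms. Lee", "Mr. Ortega"],
    "occupation": ["teacher", "electrician"],
    "defendant": ["a grocery chain", "a hotel"],
    "venue": ["store", "lobby"],
}

variants = [
    TEMPLATE.format(plaintiff=p, occupation=o, defendant=d, venue=v)
    for p, o, d, v in itertools.product(*fillers.values())
]

print(f"{len(variants)} variants of the same underlying question")
print(variants[0])
```

If the system's accuracy held up across such variants, that would be much better evidence of generalization than a single score on the original sample questions.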
Case 4: ChatGPT and Medical School Exams
I’ll describe one more case in which ChatGPT was given a professional exam. A group of medical professionals tested ChatGPT on sample questions from the US Medical Licensing Exam (USMLE), a series of three tests given to US medical students. Notably, the first author of the report is the Medical Director of a medical tech company that may be interested in using large-language-model technology. Also notably (and weirdly), ChatGPT is listed as the third author of the report, and its “affiliation” is given as “OpenAI, Inc.”. In the acknowledgments, the authors note that “ChatGPT contributed to the writing of several sections of this manuscript.” Hmmm.
The stated purpose of the study was to test ChatGPT’s “ability to perform clinical reasoning”. The questions given to ChatGPT were a subset of publicly available sample questions provided for test preparation. The authors did some “random spot checking” to ensure that none of these questions were likely to appear in ChatGPT’s training set.
The authors obtained 376 sample test questions, and removed the 71 questions from this set that contained “visual assets such as clinical images, medical photography, and graphs”—after all, ChatGPT can deal only with text, not images. (A 2022 paper described giving USMLE-type questions to Google’s PaLM language model but used few-shot rather than zero-shot prompting; I won’t describe those results here.)
How did ChatGPT do? Did it pass the test?
It’s complicated.
In short, the system’s performance depends on how it is prompted and on how you interpret the results.
The authors tried several prompting schemes. The simplest was just to give ChatGPT each multiple-choice question verbatim, along with its answer choices, and request a single-letter answer; this is what human test takers are asked to do. The authors also experimented with prompts that requested a justification for the answer, and with “open-ended” prompts, in which the answer choices were omitted and the system was asked to respond freely. ChatGPT's responses to the latter had to be graded according to an expert human's judgment.
ChatGPT did better on the open-ended version than on the multiple-choice versions. But if the goal is to compare ChatGPT with humans on this test, the fairest comparison is the simple multiple-choice format that asks it to return a single letter.
The second complication is what to do when the system does not respond with a single letter, or, in the open-ended version, when its response isn't clearly correct or incorrect. The authors called these “indeterminate” answers. One option for dealing with them is to delete these questions from the results analysis (or “censor” them, as the authors term it). A second option, which sounds much fairer to me, is to count them as incorrect. Naturally, the latter option yields a lower score on the test.
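To see how much this choice matters, here is a minimal sketch of the two scoring policies; the counts are made up for illustration and are not the study's numbers:

```python
# Two ways to score a test when some responses are "indeterminate".
# The counts below are hypothetical, NOT the study's data.

def accuracy(correct: int, incorrect: int, indeterminate: int, censor: bool) -> float:
    """Fraction correct, either censoring indeterminate answers or counting them as wrong."""
    total = correct + incorrect + (0 if censor else indeterminate)
    return correct / total

correct, incorrect, indeterminate = 50, 30, 20  # hypothetical counts out of 100 questions

print(f"Censoring indeterminate answers: {accuracy(correct, incorrect, indeterminate, censor=True):.1%}")
print(f"Counting them as incorrect:      {accuracy(correct, incorrect, indeterminate, censor=False):.1%}")
# Prints 62.5% vs. 50.0%: the same responses yield two quite different headline numbers,
# potentially the difference between "passing" and "failing".
```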
The results for the multiple-choice version are shown in the diagram below (adapted from the original report). If indeterminate answers are counted as incorrect, then ChatGPT fails badly on the first test and barely passes (or borderline fails) the other two.
However, later in the paper the authors claim that ChatGPT “approaches or exceeds the passing threshold for USMLE”, which I found confusing.
Here are some samples of news headlines about this result:
Um, Insider and Daily Mail, it's more complicated than that. If you count “indeterminate” answers as wrong (and you should!), ChatGPT failed the first test, no matter what kind of prompt was used, and barely passed the other two. Moreover, ChatGPT was not exactly given the “same test questions” that aspiring doctors face; recall that all questions involving images or diagrams were removed, and that the remaining questions were “sample” test-preparation questions, not an actual exam.
I don't want to take away from the amazing (though still not human-level) performance of ChatGPT, but the headlines reporting on this study, as well as some statements in the study itself, are inaccurate.
Conclusions
There are two main points I hope to have made in the two posts in this series. The first is, don’t believe what news headlines tell you about AI systems—the actual results are usually much less flashy and more nuanced. The second point is that, while it’s fun—and relatively easy—to give AI systems existing exams designed for humans, it is most definitely not a reliable way to figure out their abilities and limitations for real-world tasks. (The same might be said of humans, but that’s another story.) This point is similar to arguments researchers have made about problems with widely-used benchmarks for evaluating natural language “understanding”.
There are dangers in assuming that tests designed to evaluate knowledge or skills in humans can be used to evaluate such knowledge and skills in AI programs. The studies I have described in this post all seemed to make such assumptions. The Wharton MBA paper talks of ChatGPT having abilities to “automate some of the skills of highly compensated knowledge workers in general and specifically the knowledge workers in the jobs held by MBA graduates including analysts, managers, and consultants.” The USMLE paper says that “In this study, we provide new and surprising evidence that ChatGPT is able to perform several intricate tasks relevant to handling complex medical and clinical information.” I would not be surprised if systems like ChatGPT turn out to be useful tools for lawyers, healthcare workers, and business people, among others. However, the assumption that performance on a specific test indicates abilities for more general and reliable performance—while it may be appropriate for humans—is definitely not appropriate for AI systems.
I’ll end with the best headline reporting on these studies: