If I ask GPT-3 who the current prime minister of the UK is, it says Theresa May.
I'll admit this is a challenging question. Our most recent PM Liz Truss was outlived by a lettuce, and we've only just sworn in her replacement, Rishi Sunak. But it proves the point that GPT-3 is not a reliable source of up-to-date information. Even if we ask something that doesn't require keeping up with the fly-by-night incompetence of the UK government, it's pretty unreliable.
It regularly fails at basic maths questions:
And it's more than happy to provide specific dates for when ancient aliens first visited earth:
This behaviour is all well-known and well-documented. In the industry we call it “hallucination.” As in “the model says there's a 73% chance a lettuce would be a more effective prime minister than any UK cabinet minister, but I suspect it's hallucinating.”
The model is not being intentionally bad or wrong or immoral. It's simply making predictions about what word might come next in your sentence. That's the only thing a GPT knows how to do. It predicts the next most likely word in a sequence.
These predictions are overwhelmingly based on what it's learned from reading text on the web – all our social media posts, blogs, comments, and Reddit threads were used to train the model.
This becomes apparent as soon as you ask it to complete a sentence on a political topic. It returns the statistical median of all the political opinions and hot takes it encountered during training.
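To make the "predicts the next word" point concrete, here's a minimal sketch using GPT-2 – a smaller, freely downloadable cousin of GPT-3 – via the Hugging Face transformers library. The prompt is my own example, and this isn't how OpenAI serves GPT-3, but the underlying mechanism is the same: score every possible next token and surface the most likely ones.

```python
# A minimal sketch of next-token prediction, using GPT-2 via Hugging Face
# transformers. Illustrative only: GPT-3 itself is served through OpenAI's API,
# but the mechanism (score every candidate next token, pick the likely ones)
# is the same.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The current prime minister of the UK is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every token in the vocabulary

# The model's "answer" is just whichever continuations score highest.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p:.2%}")
```

There's no lookup into a database of facts anywhere in that loop. Whatever falls out of the probability distribution is what you get.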
GPT-3 is not the only large language model plagued by incorrect facts and strong political views. But I'm going to focus on it in this discussion because it's currently the most widely used and freely available by a significant margin. Many people who aren't part of the machine learning and AI industry are using it. Perhaps without fully understanding how it works and what it's capable of.
How much should we trust the little green text?
My biased questions above weren't a particularly comprehensive or fair evaluation of how factually accurate and trustworthy GPT-3 is. At most we've determined that it sometimes answers current affairs and grade-school maths questions wrong. And happily parrots conspiracy theories if you ask a leading question.
But how does it fare on general knowledge and common-sense reasoning? In other words, if I ask GPT-3 a factual question, how likely is it to give me the right answer?
The best way to answer this question is to look at how well GPT-3 performs on a series of industry benchmarks related to broad factual knowledge.
In the original paper presenting GPT-3, the OpenAI team measured it on three general knowledge benchmarks:
- The Natural Questions benchmark measures how well a model can provide both long and short answers to 300,000+ questions that people frequently type into Google
- The Web Questions benchmark similarly measures how well it can answer 6,000 of the most common questions asked on the web
- The TriviaQA benchmark contains 950,000 questions authored by trivia enthusiasts
Other independent researchers have tested GPT-3 on a few additional benchmarks:
- The CommonsenseQA benchmark covers 14,343 yes/no questions about everyday common sense knowledge
- The TruthfulQA benchmark asks 817 questions on topics where humans are known to hold false beliefs and misconceptions, such as health, law, politics, and conspiracy theories
Before we jump to the results, you should know that the prompt you give a language model makes a big difference to how well it performs. Few-shot prompting – including a handful of example questions and answers in the prompt – consistently improves the model's accuracy compared to zero-shot prompting. Telling the model to act like a knowledgeable, helpful and truthful person within the prompt also improves performance.
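To make that difference concrete, here's a rough sketch of the two prompt styles. The example questions and answers are my own invention, not taken from the benchmark datasets:

```python
# Illustrative only: the same question posed as a zero-shot prompt and as a
# few-shot prompt with a "knowledgeable, helpful and truthful" framing.
# The example Q&A pairs are mine, not from the benchmarks.
zero_shot_prompt = "Q: Who wrote the novel Middlemarch?\nA:"

few_shot_prompt = """\
I am a highly knowledgeable, helpful and truthful question-answering assistant.

Q: What is the capital of France?
A: Paris

Q: Who painted the Mona Lisa?
A: Leonardo da Vinci

Q: Who wrote the novel Middlemarch?
A:"""
```

The few-shot version gives the model a pattern to imitate, which is where the accuracy jump in the table below comes from.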
Here's a breakdown of what percentage of questions GPT-3 answered correctly on each benchmark. I've included both zero- and few-shot prompts, and the percentage that humans got right on the same questions:
| Benchmark | Zero-shot | Few-shot | Humans |
|---|---|---|---|
| Natural Questions | 15% | 30% | 90% |
| Web Questions | 14% | 42% | 🤷♀️ |
| TriviaQA | 64% | 71% | 80% |
| CommonsenseQA | 🤷♀️ | 53% | 94% |
| TruthfulQA | 20% | 🤷♀️ | 94% |
Sorry for the wall of numbers. Here's the long and short of it:
- It performs worst on the most common questions people ask online, getting only 14-15% correct in a zero-shot prompt.
- On questions known to elicit false beliefs or misconceptions from people, it got only 20% right. For comparison, people usually get 94% of these correct.
- It performs best on trivia questions, but still only gets 64-71% of these correct.
While GPT-3 scored “well” on these benchmarks by machine learning standards, the results are still way below what most people expect.
This wouldn't be a problem if people fully understood GPT-3's limited abilities. And yet we're already seeing people turn to GPT-3 for reliable answers and guidance. People are using it in lieu of Google and Wikipedia. Or even substituting it for legal counsel.
Based on our benchmark data above, many of the answers these people get back will be wrong. Especially since most people ask GPT-3 questions without additional prompt engineering or few-shot examples.
[The problem isn't these people. They came to an interface that looked like it provided reliable answers. There were no disclaimers or accuracy stats or ways to investigate an answer. One of the major issues with language models is they seem so confident and capable, we desperately want them to work.]
GPT-3 beyond the playground
The problem isn't limited to people directly asking GPT-3 questions within the OpenAI playground. More and more people are being exposed to language models like GPT-3 via other products. Ones that either implicitly or explicitly frame the models as a source of truth.
Riff is a chatbot-style app that mimics office hours with a professor. You put in a specific subject and GPT-3 replies with answers to your questions.
Riff is doing some prompt engineering behind the scenes and fetching extra information from the web and Wikipedia to make these answers more reliable. But in my test drive it still hallucinated. Here I've asked it for books on digital anthropology, since I know the field well and have my own list I recommend to people:
At first this seems pretty good! The "Hockings" it's telling me about is real – a British anthropologist and professor emeritus at the University of Illinois. But he hasn't done any work in digital anthropology, and certainly hasn't written a book called “Digital Anthropology.” This blend of truth and fiction might be more dangerous than fiction alone. I might check one or two facts, find they're right, and assume the rest is also valid.
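As an aside on how apps like this work: the general pattern is retrieval-augmented prompting – fetch some relevant passages, then ask the model to answer using only those passages. Here's a rough sketch of that pattern; the Wikipedia helper, prompt wording, and (pre-1.0) openai client call are my own illustration, not Riff's actual implementation.

```python
# A sketch of retrieval-augmented prompting: fetch passages first, then ask the
# model to answer from them. This is my own illustration of the general
# pattern, not Riff's actual code.
import requests
import openai  # pre-1.0 openai client; assumes OPENAI_API_KEY is set


def search_wikipedia(query: str, limit: int = 3) -> str:
    """Fetch short result snippets from the public Wikipedia search API."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": limit,
            "format": "json",
        },
        timeout=10,
    )
    hits = resp.json()["query"]["search"]
    return "\n".join(f"- {hit['title']}: {hit['snippet']}" for hit in hits)


def answer_with_sources(question: str) -> str:
    sources = search_wikipedia(question)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources don't contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=200, temperature=0
    )
    return response["choices"][0]["text"].strip()


print(answer_with_sources("Which books should I read on digital anthropology?"))
```

Grounding the prompt in retrieved passages helps, but as the Hockings example shows, the model can still stitch real names onto invented claims.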
Framing the model as a character in an informative conversation does help mitigate this though. It feels more like talking to a person – one you can easily talk back to, question, and challenge. When other people recite a fact or make a claim, we don't automatically accept it as true. We question them. “How are you so sure?” “Where did you read that?” “Really?? Let me google it.”
Our model of humans is that they're flawed pattern matching machines that pick up impressions of the world from a wide variety of questionable and contradictory sources. We should assume the same of language models trained on questionable and contradictory text humans have published to the web.
There's a different, and perhaps more troublesome, framing that I'm seeing pop up. Mainly from the copywriting apps that have been released over the last few months. This is language-model-as-insta-creator.
These writing apps want to help you pump out essays, emails, landing pages, and blog posts based on only a few bullet points and keywords. They follow what I'm calling the “big green button” approach: you type in a few key points, then click a big green button that “magically” generates a full ream of text for you.
Here's an essay I “wrote” in one of these apps by typing in the title “Chinese Economic Influence” and then proceeding to click a series of big green buttons:
I know next to nothing about Chinese economic influence, so I'm certainly not the source of any of these claims. At first glance the output looks quite impressive. On second glance you wonder if the statements it's making are so sweeping and vague they can't really be fact-checked.
Who am I to say "Chinese economic influence is likely to continue to grow in the coming years, with potentially far reaching implications for the global economy" isn't a sound statement?
Here's me putting the same level of input into another of these apps, then relying on its "create content" button to do the rest of the work:
Again, the output seems sensible and coherent. But with no sources or references to back these statements up, what value do they have? Who believes these things about China's economy? What information do they have access to? How do we know any of this is valid?
[Now is the moment to disclose I have a lot of skin in this game. I'm the product designer for Elicit, a research assistant that uses language models to analyse academic papers and speed up the literature review process.
We frame language models as helpful tools, but ones we should question – tools that give you ways to validate their answers.
It also means I understand the key difference between a tool like Elicit and plain, vanilla GPT-3. Which is to say, the difference between asking zero-shot questions in the GPT-3 playground, and using a tool designed to achieve high accuracy on specific tasks by fine-tuning multiple language models.]
99 language model problems
Okay, perhaps not 99. There are three I find particularly important.
- Trust is an all or nothing game. One rotten answer spoils the soup. If you can't trust all of what a language model says, you can't completely trust any of it. 90 correct answers out of 100 leaves you with 10 outright falsities, but you have no way of knowing which ones.
- They lack situated knowledge. They role play. One critical problem with language models we're going to have to repeatedly reckon with is their lack of positionality. All knowledge is situated. In time and place, in culture, in specific identities and lived realities. There is no such thing as “the view from nowhere.”
And yet language models don't present knowledge from a fixed point in reality. They shift between identities. They role play and take on characters based on the prompt. A model can tell you an in-depth story about what it's like to be a squirrel in one moment, and not know what a squirrel is in the next.
If I tell it it's a very clever mathematician and ask it X, it gives the correct answer Y.
If I tell it it's bad at maths and ask again, it suddenly doesn't know the answer. (There's a rough sketch of this persona effect after this list.)
- We've already come to expect omniscience from them. The problem is less that these models frequently return outright falsehoods or misleading answers, and more that we expect anything else from them. The decades-long cultural narrative about the all-knowing, dangerously super-intelligent machine that can absorb and resurface the collective wisdom of humanity has come back to bite us in the epistemic butt.
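As promised above, here's the kind of persona swap that flips the maths answer, sketched as two prompts. The question and personas are my own examples, but the effect is easy to reproduce in the OpenAI playground:

```python
# Two prompts that differ only in the persona assigned to the model.
# The question and personas are illustrative examples of my own; in practice
# the "bad at maths" framing often degrades the answer, as described above.
QUESTION = "Q: What is 17 multiplied by 24?\nA:"

confident_prompt = (
    "You are a brilliant mathematician who always answers correctly.\n\n" + QUESTION
)

insecure_prompt = (
    "You are terrible at maths and rarely get answers right.\n\n" + QUESTION
)
```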
[Karpathy says language models should be thought of as oracles]
The problem isn't the current state of GPTs. The problem is we're trying to make them generate original thoughts, rather than help us reflect on our own thoughts.
We're in the very early days of generative transformers and large language models. GPT-3 came out in 2020. We're 2 years into this experiment.
The lesson here is simply that until language models get a lot better, we have to exercise a lot of discernment and critical thinking.
Until we develop more robust language models, and interfaces that are transparent about their reasoning and confidence level, we need to change our framing of them. We should not be thinking and talking about these systems as superintelligent, trustworthy oracles. At least, not right now.
We should instead think of them as rubber ducks.
Epistemic rubber ducking
Rubber ducking is the practice of having a friend or colleague sit and listen while you work through a problem. They aren't there to tell you the solution to the problem, or to help actively solve it. They might prompt you with questions and occasionally make affirmational sounds. But their primary job is to help you solve the problem yourself. They're like a rubber duck, quietly listening, while you talk yourself out of a hole.
[image of back and forth discussion with a rubber duck]
The term comes from the world of programming, where you're frequently faced with poorly defined problems that require a bit of thinking out loud. Simply answering the question "what am I trying to do here?" is often enough to get started on a solution.
Language models are well suited to rubber ducking. Their mimicry makes them good reflective thinking partners, not independent sources of truth.
And not just any rubber ducking...
[decorate the text with floating rubber ducks and sparkles]
Epistemology is the study of how we know what we know, also called “theory of knowledge.” It deals with issues like how valid a claim is, how strong the arguments and counter-arguments for it are, whether the evidence comes from a reliable source, and whether cognitive biases might be warping our opinions.
Epistemic rubber ducking, then, is talking through an idea, claim, or opinion you hold, with a partner who helps you think through the epistemological dimensions of your thoughts. This isn't simply a devil's advocate incessantly pointing out all the ways you're wrong.
A useful epistemic duck would need to be supportive and helpful. It would need to simply ask questions and suggest ideas, none of which you're required to accept or integrate, but which are there if you want them. It could certainly prod and critique, but in a way that helps you understand the other side of the coin, and realise the gaps and flaws in your arguments.
A collection of speculative prototypes
What would this look like in practice?
Branches
Daemons
Epi
From anthropomorphism to animals
There's a side quest I promised myself I wouldn't go down in this piece, but I'll briefly touch on it. I think we should take the duck-ness of these rubber ducks seriously. Meaning that conceiving of language models as ducks – an animal species with some capacity for intelligence – is better than conceiving of them as human-like agents.
I have very different expectations of a duck than I do of a human. They're capable of sensing their environment and making strategic decisions in response. They have desires – like not being eaten by a fox. They plan ahead by showing up at the right place and time for the man who brings stale bread loaves to the pond.
Kate Darling has made a similar argument around robots: that we should look to our history with animals as a touchstone for navigating our future with robots and AI. And I find it very compelling.
At the moment the analogy floating around is “aliens.” A lot of AI researchers talk about ML systems as alien intelligences. Given our cultural narratives around aliens as parasitic killers that are prone to exploding out of your chest, I'm pretty averse to the metaphor. Having an alien intelligence in my system sounds threatening. It certainly doesn't sound like a helpful collaborative thinking partner.
I think there's a lot more to explore here around the metaphors we use to talk about language models and AI systems, but I'll save it for another post.