On Detecting Whether Text was Generated by a Human or an AI Language Model
Machines are now able to generate text that is hard to distinguish from that generated by humans. This gives rise to many potential problems; for example, enabling efficient automation of misinformation, spam, impersonation, student cheating on writing and coding homework, and so on. How can society deal with the upcoming deluge of machine-generated text? This is an urgent problem, but fortunately there is a lot of promising research in the works aimed at enabling people to detect when a text has been generated by a machine.
Researchers have been designing methods to detect machine creations since the dawn of generative AI (which is much longer in the past than many people realize). But as the machines—in particular, large language models (LLMs) such as GPT-3 and ChatGPT—get increasingly better at mimicking humans, detection gets correspondingly harder.
Last weekend I read two recent papers that describe different, promising approaches to detection. The first one, “DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature”, proposes a method to determine if a text passage was generated by a particular LLM. The second, “A Watermark for Large Language Models”, proposes a method for “watermarking” text as part of the text-generation process, so that users can easily check whether a watermark is present. It’s interesting to note that the authors of both papers are graduate students, postdocs, and faculty at universities, not employees of the big tech companies that are creating LLMs. A lot of innovation in AI is still coming out of academia!
Both papers are fairly technical, but the underlying ideas are pretty simple. I’ll share these ideas with you here. First I’ll give a bit of background which is needed to understand these ideas.
How LLMs Work
LLMs like GPT-3 are deep neural networks, that is, neural networks with many layers of “neurons” connected by billions of weighted links. Given an input text “prompt”, what these systems do, in essence, is compute a probability distribution over a “vocabulary”: the list of all words (or actually parts of words, or tokens) that the system knows about. The vocabulary is given to the system by the human designers. GPT-3, for example, has a vocabulary of about 50,000 tokens.
For simplicity, let’s forget about “tokens” and assume that the vocabulary consists of exactly 50,000 English words. Then, given a prompt, such as “To be or not to be, that is the”, the system encodes the words of the prompt as real-valued vectors, and then does a layer-by-layer series of computations, whose penultimate result is 50,000 real numbers, one for each vocabulary word. These numbers are (for obscure reasons) called “logits”. The system then turns these numbers into a probability distribution with 50,000 probabilities, each representing the probability that the corresponding word is the next one to come in the text. For the prompt “To be or not to be, that is the”, presumably the word “question” would have a high probability. That is because LLMs have learned to compute these probabilities by being shown massive amounts of human-generated text. Once the LLM has generated the next word (say, “question”), it adds that word to its initial prompt and recomputes all the probabilities over the vocabulary. At this point, the word “Whether” would have very high probability, assuming that Hamlet, along with all quotes and references to that speech, was part of the LLM’s training data.
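To make the logits-to-probabilities step concrete, here is a minimal sketch of the standard “softmax” conversion. The four-word vocabulary and the logit values are made up for illustration; a real model would produce about 50,000 logits.

```python
import math

def softmax(logits):
    """Turn a list of real-valued logits into probabilities that sum to 1."""
    biggest = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and made-up logits for the prompt
# "To be or not to be, that is the"
vocab = ["question", "answer", "banana", "problem"]
logits = [9.1, 4.2, -1.0, 5.0]

probs = softmax(logits)
most_likely = vocab[probs.index(max(probs))]
print(most_likely, probs)
```

The word with the largest logit ends up with the largest probability, and the gaps between logits determine how lopsided the distribution is.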
The Probability of a Text Passage
The DetectGPT paper relies on the notion of the probability of a text passage, as computed by a particular LLM. This probability can be computed no matter who wrote the passage—human or machine.
Suppose your text passage is “To be or not to be, that is the question”. The probability of the passage, according to the specific LLM, is just the product of the probabilities it computes for each word, given the previous words in the passage. That is, the model can tell you the probability of starting a sentence with the word “To” (as opposed to any of the other 49,999 words in the vocabulary). Then it can tell you the probability of generating “be”, given the prompt “To”. Then it can tell you the probability of generating the word “or”, given the prompt “To be”. And so on. These are called “conditional probabilities”: each word’s probability is conditioned on the previous words.
In this case, there are ten words (for simplicity I’ll ignore punctuation), so you just need to multiply the ten conditional probabilities together. Let’s call this product Probability(S), where S is the sentence “To be or not to be, that is the question”.
In practice, each individual conditional probability is a very small number (we have 50,000 possible words after all!) and multiplying them together gives a vastly smaller number. To avoid dealing with horribly small numbers (which can cause problems even for computers), people usually use logarithms.
Remember that Probability(S) is actually a product of the conditional probabilities of the words in S. As you may have forgotten from a long-ago math class, log Probability(S) is the sum of the logs of each individual conditional probability. This yields a much more tractable number. This number is called the “log probability” of a text passage.
(If you want to try this out, OpenAI’s GPT-3 Playground lets you view log probabilities of generated text.)
Thinking in logarithms of probabilities is basically the same as thinking in probabilities: If the probability of S is larger than the probability of some other sentence S’ then log(Probability(S)) is also larger than log(Probability(S’)). And comparing probabilities is really what we care about here.
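Here is a small sketch of why logarithms help. The ten conditional probabilities are invented for illustration (they are not output from any real model), and the second example uses a hypothetical 100-word passage to show what happens when the raw product gets too small for a computer to represent.

```python
import math

# Made-up conditional probabilities for the ten words of
# "To be or not to be, that is the question"
cond_probs = [0.01, 0.2, 0.3, 0.5, 0.9, 0.95, 0.6, 0.9, 0.95, 0.99]

probability = math.prod(cond_probs)                      # product of the ten numbers
log_probability = sum(math.log(p) for p in cond_probs)   # sum of their logs

# A hypothetical 100-word passage where every word has probability 0.00001:
long_passage = [1e-5] * 100
underflowed = math.prod(long_passage)                    # too small for a float
long_log_prob = sum(math.log(p) for p in long_passage)   # still a usable number

print(probability, log_probability)
print(underflowed, long_log_prob)
```

The product for the long passage collapses to zero, while the sum of logs stays a perfectly ordinary (negative) number.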
The first paper, “DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature”, proposes a method called DetectGPT for deciding if a text passage was generated by a particular “source model”, such as, say, GPT-3. Even though GPT is explicitly named in the title, the method could be applied to any LLM.
The method is based on a simple hypothesis. Suppose you have a text passage; let’s call it T. You want to determine if it was written by a human or generated by, say, GPT-3. The hypothesis says that if T was generated by GPT-3 and you write several paraphrases of T with the same meaning, GPT-3 will give those paraphrases lower log probability than it gives T. The hypothesis further says that if T was written by a human, paraphrases of T won’t, on average, have lower log probability than T, according to GPT-3.
In short, the hypothesis says that, if GPT-3 generated T, then T will be more probable (according to GPT-3) than any semantically equivalent “perturbation” (i.e., paraphrase) of T. And that the same won’t be true if T was written by a human.
The authors test this hypothesis for several large language models (including versions of GPT-3). In particular, the authors take a large number of text passages, some written by humans, some by the LLM, and generate a large set of paraphrases of each passage (using a different text generator to fill in words to generate the paraphrases). For each LLM being tested, the authors use that LLM to compute the log probabilities of each text passage and its paraphrases. The hypothesis predicts that, on average, for the passages generated by the LLM being tested, there will be a bigger difference between the log probabilities of an original passage and its paraphrases than for those between an original human-generated passage and its paraphrases.
And... the authors find that their hypothesis is supported by the experiments! They found that in most cases, their method was able to distinguish between human-written and LLM-generated text over 95% of the time.
So, in practice, if you want to detect if a given text passage was generated by a particular LLM or a human, you just need to generate a bunch of paraphrases of it (and this can be done automatically), and compare the log probabilities the LLM gives the original and the paraphrases. If there’s a big-enough difference, then the passage was probably generated by the LLM in question.
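The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_log_prob` is a hypothetical stand-in for asking a real LLM for a log probability, the scores in `TOY_SCORES` are invented, and the threshold of 1.0 is arbitrary rather than a value from the paper.

```python
from statistics import mean

def perturbation_gap(log_prob, passage, paraphrases):
    """Log probability of the original minus the average over its paraphrases."""
    return log_prob(passage) - mean(log_prob(p) for p in paraphrases)

def looks_machine_generated(log_prob, passage, paraphrases, threshold=1.0):
    """Flag the passage if the gap is 'big enough' (threshold is an assumption)."""
    return perturbation_gap(log_prob, passage, paraphrases) > threshold

# Invented log probabilities standing in for a real model's output.
TOY_SCORES = {
    "machine text": -20.0, "machine para 1": -26.0, "machine para 2": -24.0,
    "human text": -30.0, "human para 1": -29.5, "human para 2": -30.5,
}

def toy_log_prob(text):
    return TOY_SCORES[text]

machine_gap = perturbation_gap(
    toy_log_prob, "machine text", ["machine para 1", "machine para 2"])
human_gap = perturbation_gap(
    toy_log_prob, "human text", ["human para 1", "human para 2"])
print(machine_gap, human_gap)
```

In this toy setup the machine-generated passage sits well above its paraphrases, while the human-written one does not, which is the pattern the hypothesis predicts.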
But there are, of course, some possible worries.
First, the detection method assumes that you have access to the log probabilities of texts, computed by the model in question. This information is publicly accessible, e.g., for GPT-3 using OpenAI’s API, but not for ChatGPT. So what can you do if you don’t have access to the LLM’s computed log probabilities? Or, even more problematic, what if you don’t know which (if any) LLM might have generated a given piece of text?
All is not lost. It turns out that all you need is access to some LLM’s computed log probabilities for a given text, not necessarily those of the LLM that generated the text. After all, pretty much any LLM, whether created by Google, OpenAI, or another organization, will be using a similar neural-network architecture and will have been trained on basically the same training data (huge swaths of online text). The DetectGPT paper shows that, for a machine-generated passage, while the best job of detection is done using probabilities computed by the LLM that generated the text, using other models to compute the probabilities will typically also do a reasonable job of detection.
Another possible worry is that a human “adversary” trying to pass off a passage as human-written might simply take a machine-generated passage and rewrite it slightly, as a way to bypass detection. Will this minor human effort indeed defeat the detection method? The DetectGPT authors experimented with this scenario, and showed that the human would have to change quite a bit of the text to make this work; for instance, they showed that the detection method still works pretty well even if nearly a quarter of the text has been rewritten.
A human trying to bypass the detector could plausibly use another language model to rewrite more of the text. This might work. But not only does this take extra effort, there is always the possibility that, as LLMs are used to rewrite text, the text will start to go down in quality.
In short, a determined human trying to fool the detection algorithm will likely be able to succeed, but at the cost of time and text quality. And detection methods like DetectGPT will undoubtedly increase in effectiveness in the future with additional research. The result will be the familiar “arms race” we’ve seen in the past (in online spam, search engine optimization, and so on) between people trying to game technologies and technologies attempting to detect such gaming.
Watermarking Generated Text
Watermarks have been used for millennia to establish ownership, timestamps, and other information in books and other “hardcopy” objects. Digital watermarking (embedding a signal in digital information) dates back several decades. Digital watermarks are easier to insert in images and videos than in text, due to the more continuous and spatial nature of visual information. How would one insert a watermark into text generated by an LLM?
In the paper “A Watermark for Large Language Models”, the authors propose a relatively simple method. The idea is that the creators of the LLM would add a “watermark” signal to any generated text passage, such that the meaning and quality of the passage is not altered by the signal, the signal can easily be detected without needing any access to the LLM that generated it, and the signal cannot be easily removed by simple modifications to the text.
The watermarking method proposed by the authors is quite simple to understand. First, the authors describe an even simpler (though flawed) method, which they call Text Generation with Hard Blacklist. Given a prompt, the LLM, as usual, computes a probability distribution over the model’s vocabulary. The Hard Blacklist watermarking method then takes the last word of the prompt and uses it to generate a seed for a random-number generator. The random-number generator is used to randomly select a set of vocabulary words as a “blacklist”—words that cannot be generated as the next word. The LLM is now allowed to sample only from the non-blacklisted words. This process is repeated as each new word is generated (and added to the new “prompt” for the next word).
Now, suppose that you have a text passage T, and you want to know if T has been watermarked using the Hard Blacklist method. Assuming you know what the vocabulary is, for each new word that is generated, you can recover the blacklist using the same seed and random number generator that was applied to create the blacklist. A sufficiently long human-generated text will almost certainly use some of the blacklisted words, whereas the LLM-generated text will not. (The assumption throughout the paper is that the makers of the LLM will provide open-source watermark detection software that will allow you to do this easily for any generated text.)
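Here is a toy sketch of hard-blacklist generation and detection. Everything specific in it is an assumption for illustration: the 100-word vocabulary, a blacklist covering half the vocabulary, seeding the random-number generator directly with the previous word, and a uniform random choice standing in for sampling from a real LLM.

```python
import random

VOCAB = [f"w{i}" for i in range(100)]  # hypothetical toy vocabulary

def blacklist_for(previous_word, fraction=0.5):
    """Recreate the blacklist that follows a given word (seeded by that word)."""
    rng = random.Random(previous_word)
    return set(rng.sample(VOCAB, int(len(VOCAB) * fraction)))

def generate_watermarked(first_word, length, rng):
    """Generate words while never picking one from the current blacklist."""
    words = [first_word]
    for _ in range(length):
        banned = blacklist_for(words[-1])
        allowed = [w for w in VOCAB if w not in banned]
        words.append(rng.choice(allowed))  # stand-in for sampling from the LLM
    return words

def count_blacklisted(words):
    """Detection: how many words appear on the blacklist set by their predecessor?"""
    return sum(words[i] in blacklist_for(words[i - 1])
               for i in range(1, len(words)))

watermarked = generate_watermarked("w0", 200, random.Random(0))
plain_rng = random.Random(1)
unwatermarked = [plain_rng.choice(VOCAB) for _ in range(201)]

print(count_blacklisted(watermarked), count_blacklisted(unwatermarked))
```

Note that the detector never touches the generator itself; it only needs the vocabulary and the rule for turning the previous word into a blacklist.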
The main flaw in this approach is that having a hard blacklist of prohibited words will typically reduce the quality of the generated text. Can a similar idea be applied but without compromising on text quality?
The authors of the paper propose such an idea: Text Generation with Soft Blacklist. This is the same as the hard blacklist version, but instead of banning the blacklist words, this method simply reduces their probability of being generated.
To see how this can be done, remember that given a prompt to the LLM, the LLM computes a set of “logits”—numbers associated with each item in the model’s vocabulary. The model then turns these logits into probabilities. The soft-blacklisting idea is to increase the values of the logits corresponding to the non-blacklisted words, and then generate a new probability distribution over the words in the vocabulary. Non-blacklisted words will now be generated with higher probability by the LLM. Again, this process is repeated for each new word the system generates.
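A minimal sketch of the soft version follows: add a fixed amount to the logits of non-blacklisted words, then convert to probabilities as usual. The logit values, the boost of 2.0, and the choice of which positions are blacklisted are all made up for illustration.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    biggest = max(logits)
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_blacklist_probs(logits, blacklisted, boost=2.0):
    """Raise the logit of every non-blacklisted position, then apply softmax."""
    boosted = [x if i in blacklisted else x + boost
               for i, x in enumerate(logits)]
    return softmax(boosted)

# Four equally likely words; positions 0 and 1 are blacklisted.
even_probs = soft_blacklist_probs([1.0, 1.0, 1.0, 1.0], blacklisted={0, 1})

# One word with a huge logit that happens to be blacklisted (position 0).
peaked_probs = soft_blacklist_probs([20.0, 1.0, 1.0, 1.0], blacklisted={0})

print(even_probs)
print(peaked_probs)
```

When the words start out equally likely, the boost shifts most of the probability to the non-blacklisted ones. When one word is overwhelmingly likely to begin with, the same boost barely changes anything, even if that word is on the blacklist.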
There are some parameters that need to be decided in order to make this approach work well, including the size of the blacklist, the amount to increase the logit corresponding to each non-blacklisted word, and so on.
The authors experimented with many different parameter settings, and generally found that the method worked quite well—in text generated by a watermarking LLM, the added watermark could be detected with high confidence.
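One way to make “high confidence” concrete is to count how many words in a passage are non-blacklisted and ask how surprising that count would be if the text had no watermark at all. The sketch below uses a standard one-proportion z-test for this. I am assuming the blacklist covers a fixed fraction of the vocabulary (one half here), the counts are invented, and the paper’s own statistic may differ in detail.

```python
import math

def watermark_z_score(non_blacklisted_count, total_words, blacklist_fraction=0.5):
    """How many standard deviations the count is above what chance predicts."""
    p = 1.0 - blacklist_fraction              # chance a random word is allowed
    expected = p * total_words
    spread = math.sqrt(total_words * p * (1.0 - p))
    return (non_blacklisted_count - expected) / spread

# Invented counts for two 200-word passages.
z_unmarked = watermark_z_score(100, 200)   # about half the words are allowed
z_marked = watermark_z_score(180, 200)     # far more allowed words than chance

print(z_unmarked, z_marked)
```

A score near zero is what ordinary text would produce; a score many standard deviations above zero is very unlikely without a watermark.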
The detection failures were mostly on text that had been “memorized” by the LLM, that is, text where the probability of a particular next word was very high (as in “To be or not to be, that is the...”, where the word “question” is going to be overwhelmingly more probable than any other word), so that the decrease in blacklist probabilities was not enough to overcome the large initial probabilities of some words.
Like the DetectGPT method described above, the watermarking method is susceptible to human “adversaries” trying to avoid detection in passing off machine-generated text. For example, an adversary could modify the generated text by manually adding blacklisted words to (or manually deleting non-blacklisted words from) the text to fool the watermark detector, or by using a non-watermarking LLM to paraphrase the text. Like the methods for defeating DetectGPT that I mentioned above, such “attacks” might defeat watermark detection, but at the cost of human effort and quality of the generated text.
Unlike DetectGPT, which analyzes a text after it has been generated by an LLM, watermarking requires adoption by the creators of the LLM, who must modify their product to add such watermarks. OpenAI (creator of GPT-3 and ChatGPT) is already known to be working on approaches to watermarking the output of their LLMs.
Here I’ve described two possible approaches to detecting machine-generated text: DetectGPT and watermarking. It might be that these (and other methods) could be combined for even better detection abilities. And researchers will continue to come up with better methods. But of course human ingenuity can go both ways—we’ll no doubt see new ingenious ways to thwart detection methods, and the subsequent arms races that we’ve seen so many times between humans and AI systems. While these kinds of detection technologies are extremely important, technologies alone are not going to solve all the social problems we’re facing with generative AI. Like all other society-changing technologies, some kind of society-wide regulation is going to be essential, if we can ever agree on what that should look like. In the meantime, it will be fun to keep an eye on what I expect will be rapid progress in the LLM detection (and detection evasion) space.
February 1, 2023: One important addition to this discussion. A detector for LLM-generated text can have two kinds of errors: false negatives (the text was judged as human-written, but actually was machine-generated) and false positives (the text was judged as machine-generated but was actually human-written). Current detectors make both kinds of errors, but false positives have the potential to be very harmful — for example, if a (human) student turns in an essay, and it is wrongly judged to have been written by a machine. The student might wrongly be accused of cheating. Thus we have to use these detectors with care, and with the knowledge that they might be wrong.
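To pin down the two kinds of errors, here is a small sketch that computes both rates from a labeled set of passages. The six labels and the detector’s guesses are invented for illustration.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false_positive_rate, false_negative_rate).

    Labels are "human" or "machine". A false positive is human-written
    text judged to be machine-generated; a false negative is the reverse.
    """
    pairs = list(zip(true_labels, predicted_labels))
    guesses_for_humans = [p for t, p in pairs if t == "human"]
    guesses_for_machines = [p for t, p in pairs if t == "machine"]
    fpr = sum(p == "machine" for p in guesses_for_humans) / len(guesses_for_humans)
    fnr = sum(p == "human" for p in guesses_for_machines) / len(guesses_for_machines)
    return fpr, fnr

# Invented example: four human-written essays, two machine-generated ones.
truth = ["human", "human", "human", "human", "machine", "machine"]
guess = ["human", "machine", "human", "human", "machine", "human"]

false_positive_rate, false_negative_rate = error_rates(truth, guess)
print(false_positive_rate, false_negative_rate)
```

In this made-up case one of the four human authors is wrongly flagged, which is exactly the kind of error that could lead to a false accusation of cheating.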
February 1, 2023: One of the authors of the DetectGPT paper has created a publicly accessible demo. Try it out here.