Hallucinations are generated outputs from large language models (LLMs) that sound plausible but are factually incorrect or ignore the relevant context. Like human dreams, they resemble reality but are not real.
Hallucinations are seen as a current weakness of LLMs such as GPT-3 and GPT-4 (used in ChatGPT) and give rise to the risk that users will rely on output that leads them astray.
So, what can be done about them, and are they fatal to the use of LLMs by professionals such as lawyers?
Examples of hallucinations
The recent report of a US lawyer who relied on the output of ChatGPT without verification has brought the tendency of generative AI systems to hallucinate into sharp focus. The lawyer in question copied output from ChatGPT into a court filing, citing various cases in support of his argument. The trouble is, those cases do not exist. A hearing is scheduled for 8 June at which the unfortunate lawyer will need to show cause why he should not be sanctioned.
Closer to home, it was recently reported that a regional Australian mayor was preparing to sue OpenAI after ChatGPT falsely named him as a guilty party in a foreign bribery scandal.
The outputs behind these incidents were hallucinations: entirely fictitious, made-up material. For some, reports of hallucinations have strengthened their resolve that LLMs are simply not good enough, are not a threat, and are not useful.
Why do LLMs hallucinate?
Some have expressed the view that LLMs hallucinate because they use statistical methods to predict the next token in a sequence. While it is true that LLMs generate text by statistical prediction, the reason for this limitation is more complex than that.
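To make the "next token" point concrete, here is a toy sketch of how sampling a continuation works. The candidate strings and scores below are invented for illustration; a real model scores tens of thousands of sub-word tokens conditioned on everything seen so far, but the mechanism is the same: it picks what is statistically likely, with no check against reality.

```python
import math
import random

# Toy illustration of next-token prediction. The candidate continuations and
# their scores are invented for this example only.
logits = {
    "Smith v Jones [2004] HCA 12": 2.1,   # plausible-sounding, possibly fictitious
    "a citation I can verify":     1.9,
    "I am not sure":               0.4,
}

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

probs = softmax(logits)
choice = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs)
print("sampled continuation:", choice)
```

Nothing in this process consults a database of real cases; plausibility is a function of the training data, not of truth.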
Yann LeCun, a key figure in the field of deep learning, has expressed his views on the issue. He believes that the fundamental flaw that leads to hallucinations is that “Large language models have no idea of the underlying reality that language describes,” adding that most human knowledge is nonlinguistic. What we, humans, read forms a small part of our understanding of the world around us and the total of human ‘knowledge.’ We also see, we hear, we smell, and we are part of what is going on around us.
In other words, while ChatGPT can do an extraordinary job of following your instructions to, for example, write a short story, its understanding is limited both by the fact that it is not (yet) multimodal and by the fact that it lacks a comprehensive understanding of the world around it. When you ask ChatGPT to draft a letter, it does not ‘understand’ you, the letter, its effect, or how the letter relates to the world around you.
That doesn’t mean that ChatGPT isn’t extraordinarily useful; it just isn’t you. And it highlights the need for you to bring your own understanding of the world to bear and to review the output you are given.
How to combat hallucinations
There are ways to reduce the likelihood of hallucinations or to control their effects. These broadly fall into what can be done at the LLM level (i.e. by the developers of models), what can be done by system designers, and what can be done by individual users, summarised in the tables below.
In this article, I won’t look at what can be done at the LLM level, as this is not our area of focus. Needless to say, a great deal of effort is going into improvements at this level. For example, research suggests modifying the beam search algorithm to favour tokens supported by the input, which can help constrain the generation of hallucinated output.
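As a rough illustration of that idea (and only an illustration; the published approaches are considerably more involved), the sketch below adds a score bonus to candidate tokens that appear in the source text, nudging the ranking toward grounded output.

```python
# Simplified sketch of "input-supported" decoding: candidate tokens that
# appear in the source text receive a score bonus before the decoder picks
# the next token. This illustrates the idea, not any published method.

def rescore(candidates, source_tokens, bonus=2.0):
    """candidates: list of (token, log_prob) pairs proposed by the model."""
    source = set(source_tokens)
    return [
        (tok, logp + bonus if tok in source else logp)
        for tok, logp in candidates
    ]

# The model slightly prefers an unsupported token, but rescoring flips the
# ranking toward the token that actually appears in the input document.
candidates = [("fabricated", -1.0), ("quoted", -1.2)]
source_tokens = ["the", "quoted", "passage", "from", "the", "contract"]
print(sorted(rescore(candidates, source_tokens), key=lambda c: -c[1]))
```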
Table A: What can be done by systems that integrate LLMs
Human-in-the-loop review is not a scalable solution and carries the least prospect of being a long-term answer. That is not to say that humans won’t remain ‘in the loop’; I am referring to reliance on the method itself.
Post-processing is complex and difficult, but it yields results. Over time we will have better post-processing options, and indeed there will be billion-dollar businesses built on post-processing.
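By way of a simple example of what post-processing can look like in a legal setting, the sketch below pulls citation-like strings out of a model's output and flags anything that cannot be matched against a trusted list. The regex, the reference list, and the flagged case name are all placeholders invented for illustration; a production system would query a proper citator or case-law database.

```python
import re

# Minimal post-processing check: extract citation-like strings from the
# model's output and flag anything not found in a trusted reference list.
# The regex is deliberately rough and the known-case list is a placeholder.
known_cases = {
    "Mabo v Queensland (No 2) [1992] HCA 23",
}

CITATION_RE = re.compile(r"[A-Z]\w+ v [A-Z][\w .()]+ \[\d{4}\] [A-Z]+ \d+")

def flag_unverified_citations(model_output: str) -> list[str]:
    cited = CITATION_RE.findall(model_output)
    return [c for c in cited if c not in known_cases]

output = (
    "See Mabo v Queensland (No 2) [1992] HCA 23 and "
    "Smith v Jones [2019] NSWSC 999."   # the second case is invented
)
print(flag_unverified_citations(output))
# ['Smith v Jones [2019] NSWSC 999']
```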
Larger context windows are coming soon and, when used correctly, will have the greatest short-term impact on the performance of LLMs. Our experimentation with Anthropic’s models has shown how useful larger context windows are: Claude is exceptionally good at examining longer texts and extracting relevant quotes. OpenAI has its 32k model available to a small group of users. Most LLM providers are limited by GPU availability at this stage (good news for NVIDIA shareholders).
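Until very large context windows are broadly available, a common workaround is to chunk long documents so each piece fits within the model's limit. The sketch below is one way to do that; `ask_llm` is a placeholder for whichever provider's API you use, and the word-count budget is a crude stand-in for real token counting.

```python
# Rough sketch of working within a fixed context window: split a long
# document into overlapping chunks, query each chunk for relevant quotes,
# then combine the answers. `ask_llm` is a placeholder, not a real API call.

def chunk_text(text: str, max_words: int = 3000, overlap: int = 200) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your chosen LLM API here")

def extract_relevant_quotes(document: str, question: str) -> list[str]:
    answers = []
    for chunk in chunk_text(document):
        prompt = (
            "Using only the text below, quote any passages relevant to the "
            "question. If nothing is relevant, reply NONE.\n\n"
            f"Question: {question}\n\nText:\n{chunk}"
        )
        answers.append(ask_llm(prompt))
    return [a for a in answers if a.strip() != "NONE"]
```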
Table B: What can be done by individual users
The first two mitigations are critical and will remain critical for some time to come. Over time, prompt engineering may become less important, as the purpose of LLMs is to understand intent and generate relevant output.
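For individual users, much of the value of prompt engineering today comes from grounding the model in your own material and giving it explicit room to decline. The template below is one illustrative way to phrase that; the wording is an example only, not a guarantee against hallucination.

```python
# Illustrative prompt template for a user-level mitigation: supply the source
# material, instruct the model to rely on it alone, and give it explicit
# permission to say it cannot answer. Example wording only.
PROMPT_TEMPLATE = """You are assisting a lawyer. Answer the question using
ONLY the source material provided below, and quote the source where possible.
If the material does not contain the answer, reply "I cannot answer from the
material provided" rather than guessing.

Source material:
{source}

Question:
{question}
"""

print(PROMPT_TEMPLATE.format(
    source="[paste the relevant clause or judgment extract here]",
    question="What notice period does the termination clause require?",
))
```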
Conclusion
Hallucinations in large language models such as GPT-3 and GPT-4 underscore the current limitations and vulnerabilities of AI systems. While these flaws can lead to negative outcomes, as evidenced by the incidents mentioned earlier, they do not spell the end for the use of LLMs in professional settings. Instead, they emphasise the urgent need for a multi-pronged approach to mitigate these effects and prevent the misuse or misunderstanding of AI outputs.
The crux of the issue rests on the realisation that LLMs, no matter how sophisticated, cannot yet grasp the totality of human experience and knowledge. The solution, then, may lie not solely in refining the algorithms but also in changing our approach to using them.
As we continue to push the boundaries of AI capabilities and integrate them more deeply into our professional and personal lives, a clear-eyed understanding of both their possibilities and pitfalls will be essential.