ChatGPT Doesn’t Think. It Just Guesses the Next Best Word.

LLMs and Language

Isn’t that what we…?

ChatGPT and LLMs (Large Language Models, of which ChatGPT is an exemplar) have taken the world by storm and triggered a hype cycle equivalent to the original dot-com boom. Almost everyone has tried ChatGPT (200 million active users and counting), and EVERYONE has seen the results of LLMs just by using Google. But while ChatGPT generated an enormous amount of buzz, it's been slower (at least in media time) to change the world. As the hype cycle crests and people sense an impending crash into the trough of disillusionment, critics have begun to find tragic flaws in what these tools do. LLMs have become notorious for making stuff up — and the reason, say critics, is simple. LLMs don't think; they just build sentences by finding the next likeliest word and spitting stuff out. With sufficient training, this allows them to mimic content and often to produce what amount to straightforward distillations of digital knowledge. But that, say critics, is all they can do.

This isn't entirely wrong, though anyone who has used ChatGPT will surely recognize that it isn't quite right either. LLMs aren't like that party game where you go around a circle and each person adds a word — building sentences that quickly spin off the rails into absurdity. That's partly because LLMs — unlike people playing party games — aren't trying to be absurd, but it's also because LLMs (like people having actual conversations) use your question and a vast amount of content to set a context for a reply. Every word in a dialog becomes part of the context for an answer and has a potentially profound influence on the exact shape and nature of the response. I don't think anyone who has used ChatGPT — even in contexts where it fails regularly — can help but be impressed with how extraordinarily good it is at mimicking human writing.
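To make the "next likeliest word" mechanic concrete, here is a minimal sketch of autoregressive generation using the publicly available GPT-2 model through the Hugging Face transformers library. This is not how ChatGPT is actually built or served; the model, prompt, and greedy decoding are just illustrative assumptions. But it shows the loop the critics are describing: score every possible next token, pick one, append it to the context, and repeat.

```python
# A minimal sketch of next-token generation, assuming the Hugging Face
# `transformers` and `torch` packages are installed. GPT-2 stands in for a
# modern LLM; the prompt is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The reason LLMs seem to understand you is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                     # generate twenty tokens, one at a time
        logits = model(input_ids).logits[0, -1]              # scores for the next token only
        next_id = torch.argmax(logits).reshape(1, 1)         # greedy: take the likeliest token
        input_ids = torch.cat([input_ids, next_id], dim=1)   # the new token becomes context

print(tokenizer.decode(input_ids[0]))
```

Notice that the only thing the loop ever does is choose the next token given everything that came before, which is exactly why every word of a dialog can reshape the answer.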

But if all it’s doing is filling in the next best word, how exactly is that possible?

Oddly, the mismatch isn't between our perception of LLMs and how they work, it's between our perception of human writing and human thinking. The history of modern computing is a history of machines gradually becoming better (often much, much better) than humans in one domain after another. This isn't always a result of machines replicating human ways of thinking. Sometimes it's a case of computer scientists and engineers developing algorithms that take advantage of the vastly faster processing of digital computers to achieve results in a manner that is fundamentally different from the way humans solve problems. Computers can add and subtract at speeds faster than humans could ever accomplish (faster than humans could ever read the numbers), but they don't do mathematical calculations the way humans do.

Similarly, when computers first began to beat human players in chess, it wasn't because computers were playing chess the way grandmasters do. I will always remember a quote from an IBM lead who explained that their machine essentially used exhaustive search to look many plies ahead in a game, evaluating countless positions many moves deep. "Exhaustive search," he said, "means never having to say you're sorry."
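To see the shape of that approach, here is a toy sketch of depth-limited exhaustive (minimax) search. The position interface it assumes (legal_moves, apply, evaluate, is_terminal) is hypothetical, standing in for whatever board representation a real engine uses, and a real engine adds pruning, quiescence, and a carefully tuned evaluation. The core idea, though, is just this: try everything to a fixed depth and score the leaves.

```python
# A toy sketch of depth-limited exhaustive (minimax) search. The position
# interface used here (legal_moves, apply, evaluate, is_terminal) is
# hypothetical, standing in for a real engine's board representation.
def minimax(position, depth, maximizing):
    """Score `position` by searching every legal move `depth` plies ahead."""
    if depth == 0 or position.is_terminal():
        return position.evaluate()               # static score of the leaf position
    scores = (minimax(position.apply(m), depth - 1, not maximizing)
              for m in position.legal_moves())
    return max(scores) if maximizing else min(scores)

def best_move(position, depth):
    """Pick the move whose subtree looks best for the side to move."""
    return max(position.legal_moves(),
               key=lambda m: minimax(position.apply(m), depth - 1, maximizing=False))
```

Even this toy version makes the point the IBM quote hints at: the playing strength lives almost entirely in evaluate() and in how deep the machine can afford to look.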

That isn't the way great chess players do it. Despite their well-known ability to think many moves ahead, top chess players ruthlessly prune the moves they consider, based on their profound understanding of the game. Show a chess player a board position from a real game for a second and they will memorize it and can easily reproduce it. That's awesome. But it's illuminating to know that if you simply place pieces at random on the board, chess players aren't significantly better than anyone else at memorizing the piece locations. It's only within the logic of the game that they can achieve great results.

This reliance on exhaustive search and computing power made it easy for people to suggest that a computer wasn’t really playing chess. An exhaustive search chess algorithm must still embed a fairly sophisticated knowledge of the game (after all, it has to be able to evaluate each position), but despite its computational superiority it really isn’t an especially good chess player.

This all changed with the advent of deep learning systems. When the latest chess (or Go or StarCraft) programs beat the best human players, they do it without using anything like exhaustive search. Instead, what these deep learning tools have done is use the massive computational superiority of computers to train themselves on the game in a manner that largely duplicates the way humans learn.

With LLMs, the skeptics are replaying this argument, suggesting that what LLMs do is essentially what exhaustive search chess algorithms do. They use the massive computational superiority of digital computers to replicate the results of a thinking process, but they don't capture the interesting part of how that process works. People, the skeptics will insist, do not just find the next best word to plug into their sentences.

But is that true?

If there's one thing we don't think very well about, it's how we think. Human thought has been idealized, glorified and put on a pedestal. We are the "rational" animal. It is our reasoning that distinguishes us. We have — each of us — the power of moral choice. We think of ourselves as constantly thinking over everything, finding logical connections and making inferences and going wrong only because emotions and desire get in the way.

This is, as we should surely know by now, completely wrong. We are not rational animals. We are animals that have to learn and work to be rational. Our powers of moral choice — like all our powers of choice — must be learned and can easily be corrupted or destroyed. Behavioral economists have shown how often our thinking involves shortcuts, economies of effort, and the shortcomings of quickly trained neural networks. We get things wrong all the time because our thinking is spotty, inconsistent, imperfectly trained, and often flawed at a foundational level. Time and again, when neuroscience has learned something about how we think, it has matched the way thinking actually feels — yet it has utterly conflicted with our pre-existing folk notions of how thinking works.

LLMs look like the same story all over again. If you've ever paid close attention to the way people actually talk and write, finding the next best word is a pretty compelling description of how it feels and sounds. If you're human, you've had all of these experiences:

· Starting to say something and finding you can’t remember the word you want

· Being surprised by what you just said

· Learning what you think by saying something

· Switching contexts accidentally and saying a related but wrong word

· Finding yourself knowing what you want to write and completely at a loss for how to write it

· Having the experience of writing a sentence and feeling it form as you type it

None of this would make much sense if our thinking was as fully baked as we like to pretend.

What's more, if you listen to what people actually say, you'll find that they make things up all the time — confidently stating things that they don't really know and that may well not be true. I recently saw an article that dinged the latest version of ChatGPT for only getting half the answers on a medical diagnosis test correct. I have no idea (nor did the writer of the article) how a typical GP would fare on that test — but the thing the writer really complained about was how ChatGPT could offer clear and confident justifications for its misdiagnoses. It made me wonder if the author had ever actually talked with a GP. In my experience, the less expert the doctor, the more confident their diagnosis.

I don't mean to pick on doctors. Everybody does this. My wife is a master of the confident assertion of what is simply not so. People tend to attach bad motives and intentions to this, but ChatGPT has shown us that it isn't necessarily a matter of intention. We are not particularly good at policing the way our minds work when we generate words, and this is probably true even in cases where we are fully aware that what we are saying is an…untruth. It's just one part of our minds (speech generation) running ahead of the regulatory parts.

Most people also maintain a rich inner dialog. The purpose and importance of inner speech is debated, but it's clear that one of the most common experiences we have with it is hearing ourselves say something that we immediately reject or that seems completely new. Inner speech plays a role in consciousness and in higher-level System 2 thinking. Language isn't all there is to thought, but language is clearly an important tool for thinking.

Which brings me to what is wrong with ChatGPT and other current LLMs. When you split language generation out from the rest of our thinking mechanisms, the result can feel half-baked. LLMs aren't very good at self-checking because they don't have multiple ways of solving a problem or additional mechanisms for error correction. Nor do they have a well-defined sense of the quality of their sources, either factually or stylistically. That doesn't mean they aren't writing like humans; it probably means they are.

I'm not certain that LLMs work the way our minds do. There's usually more than one way to skin a cat. But unlike exhaustive search, the way LLMs work isn't obviously different from the way our minds generate language, and the problems with LLMs make me more — not less — confident that they are doing kind of the same thing we are. Whatever the final truth turns out to be about how we generate language, I am confident that it will turn out to be messy, flawed and in some respects half-baked.

Does this mean that OpenAI and their ilk have solved general intelligence? They have, after all, built something that can pass the Turing Test and may well replicate the way we generate words. I don't think the answer is yes, but I do think they've solved one of the biggest and hardest pieces of the puzzle. We don't think with just one tool, and our minds do more than generate sentences. Yet with the advent of true learning systems like those built by Google's DeepMind and language systems like ChatGPT, the outlines of a generally intelligent system are coming into focus.

It’s probably kind of surprising that you could solve the language problem and still not solve the whole “general thinking” problem — but everything about the slow march to understanding cognition has been surprising.

We don’t think the way we thought.

It may also be true that language generation, on its own, is less useful than people imagined. Still, ChatGPT’s 200 million users aren’t being forced to use generative AI. Language generation is a big part of general intelligence and an extremely useful capability. Nearly all our interactions with computing in the future will probably be done through LLM-like interfaces.

Nor is it a huge mystery where LLM builders will go next. Regulatory functions (by which I mean internalized model functions that check the model's own output, not government agencies) that can spot and correct hallucinations are an obvious focus for current AI research. Just as with self-driving cars, being as good as the average human isn't sufficient for an LLM to be successful. We talk to call-center agents because we have to, not because we want to. For a service like ChatGPT to win and keep paying users, it must be consistently better than your average person and faster than just clicking through links. It already is better and faster most of the time — but in that "most" lurks a lot of uncertainty. Removing the uncertainty is job #1 for AI builders.
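One common shape for that kind of regulatory layer (this is a generic sketch, not a description of any vendor's actual pipeline) is a second pass that critiques the first: generate a draft answer, ask a checker to flag unsupported claims, and regenerate or refuse when the check fails. The generate_answer and critique_answer functions below are hypothetical stand-ins for calls to whatever models you have available.

```python
# A generic sketch of a "generate, then check" loop for catching hallucinations.
# `generate_answer` and `critique_answer` are hypothetical stand-ins for calls
# to real models; no specific vendor API is implied.
from dataclasses import dataclass

@dataclass
class Critique:
    ok: bool            # did the checker find the draft acceptable?
    issues: list[str]   # unsupported or contradicted claims it flagged

def generate_answer(question: str, feedback=None) -> str:
    """Draft an answer; on a retry, the previous critique is passed back in."""
    raise NotImplementedError  # call your generator model here

def critique_answer(question: str, draft: str) -> Critique:
    """Check the draft against sources or a second model and flag problems."""
    raise NotImplementedError  # call your checker model here

def answer_with_checking(question: str, max_attempts: int = 3) -> str:
    feedback = None
    for _ in range(max_attempts):
        draft = generate_answer(question, feedback)
        critique = critique_answer(question, draft)
        if critique.ok:
            return draft                  # the regulatory pass is satisfied
        feedback = critique.issues        # try again, steered by the critique
    return "I'm not confident enough in my answer to give one."
```

Whether the checker is a retrieval system, a second model, or the same model prompted differently is exactly the sort of design question current research is working through.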

And, of course, we think in a variety of ways with a variety of tools. A general intelligence tool will necessarily combine multiple problem-solving approaches including language generation. That work, too, is well underway. OpenAI is already delivering thinking tools that incorporate additional ways of problem-solving to improve performance on tasks with more procedural, structural and logical elements (like computer programming). The latest tools from OpenAI program better than most programmers and solve scientific problems as well as many PhDs. They aren’t just LLMs — they’ve integrated language generation with additional AI tools.

Even then, AI systems will give us a lukewarm product (at least when it comes to writing content) — good enough to provide better Google Search results but not good enough to threaten human standards of excellence.

Perhaps we’d be wise to stop there.

Unfortunately, the path to better isn't all that hard to see. The generative output from LLMs reads like something written by committee because, in a sense, it was. With huge, broad-based training, a model will learn to write in a way that reflects the means and medians of human writing. This isn't the way great writers or speakers learn. They are massively over-influenced by specific works. Writers learn by deep over-training on specific works, ideas and styles.

Though there are challenges, there's no fundamental reason why LLMs can't do the same thing. Human learning tends to be very fast and driven by heavy reinforcement. Machine learning tends to work best (right now) with very large trainings and gentler reinforcement. Writing with great style, however, is harder to generalize across a lot of material. Nor is it like learning to play Go, since there's no obvious way to decide whether a given strategy "won." There isn't enough great writing to train slowly on, and what exists isn't itself consistent. AI researchers will have to figure out how to get models to overweight quality in ways that come closer to what humans do to achieve real style.
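One way to picture "overweighting quality" (purely a sketch of the idea, not anyone's published recipe) is to bias what the model sees during fine-tuning: score each example for quality and sample the high-scoring ones far more often than a uniform pass over the data would.

```python
# A sketch of quality-weighted sampling for fine-tuning. The corpus and the
# quality scores are hypothetical; in practice the hard part is producing
# scores that actually track great writing.
import random

corpus = [
    {"text": "A workmanlike forum post ...",   "quality": 0.2},
    {"text": "A sharp, well-edited essay ...", "quality": 0.9},
    {"text": "A beloved short story ...",      "quality": 1.0},
]

def sample_batch(corpus, batch_size, temperature=0.25):
    """Draw a training batch, heavily favoring high-quality examples.

    Lower `temperature` sharpens the preference for quality; at high values
    sampling approaches the uniform, write-by-committee regime.
    """
    weights = [example["quality"] ** (1.0 / temperature) for example in corpus]
    return random.choices(corpus, weights=weights, k=batch_size)

batch = sample_batch(corpus, batch_size=4)
print([round(example["quality"], 2) for example in batch])
```

Whether the right lever is sampling, loss weighting, or reinforcement from human judgments of style is an open question, but none of those levers is exotic.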

For better or worse, that doesn’t seem like such a hard mountain to climb.

We face a world where machines may well be able to write and speak better than even the best of us can. People still play chess and Go, of course. But those are things we can do against each other. It will be harder to write and paint when people only want to read or see what machines have done. Perhaps we will simply decide to ignore what machines can do in these realms and punish with the most extreme social opprobrium those who use the machine for help. Every genius in that version of the future would live with the taint of suspicion that they “used the machine.”

It is probably the case that current-gen AI systems tuned for specific tasks are already better than the average doctor, lawyer, engineer and programmer. Those are apex professions. Even with today's tools, there aren't many jobs where a determined push to use AI wouldn't yield results at least as good as the average professional's. That push is going to come wherever governments don't prevent it — which is why programmers probably have more to worry about than Teamsters.

If this seems depressing, I can only say, "Cheer up, Brian!" After all, even in a world where machines can do every job better than we can, there will still be those confidently proclaiming that machines do not think — they just find the next best thing to do.

