Are LLMs just token predictors?
What “just” hides: queryable compression, inner voices, and loops.
I heard the line recently in a talk about agentic software - the practical kind of talk where everything is a loop, and every loop needs a test.
LLMs are just token predictors.
In one sense, yes.
At the interface, a language model takes a sequence of tokens and produces a probability distribution over the next. That’s the clean description. The one-line label.
Anchor (technical): At inference time, the base model’s whole move is a forward pass through fixed weights (no weight updates during inference) that turns the current token context into next-token probabilities (logits → distribution). Anything that looks like deliberation is built by looping that step. Anything that looks like memory is added around it.
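If you want that anchor as something runnable, here is a toy sketch of the loop - nothing more than "context in, distribution out, append, repeat." The `model` callable is a stand-in for a real forward pass, not any particular library's API.

```python
import numpy as np

def softmax(logits):
    # Shift for numerical stability, then normalize into a probability distribution.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def generate(model, tokens, n_steps):
    """Loop the single move: one forward pass per step, no weight updates, no hidden state."""
    for _ in range(n_steps):
        logits = model(tokens)              # fixed weights turn the context into logits
        probs = softmax(logits)             # logits -> next-token distribution
        next_token = int(np.argmax(probs))  # greedy choice here; sampling comes later
        tokens = tokens + [next_token]      # the only "memory" is the growing context
    return tokens
```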
But the word that matters in that sentence isn’t token. It’s just.
“Just” is what we say when we’re done looking. It’s a way of naming a mechanism and declaring the depths irrelevant.
And the problem is: next-token prediction only works at all if something deeper has already happened.
It only works if history has been compressed into a terrain. If that terrain can be interrogated. If constraint can be held onto long enough to feel a bit like thought.
So I want to do something simple, and perhaps a little uncommon.
Let’s take it entirely at face value: predict the next token, and follow it downward until the hidden machinery starts to show. Not magic. Not metaphor as a way to skip the machinery. Just the strange fact that compression, when it’s queryable, can start to resemble something like a mind.
The entry
There, at the gate, what a modern language model does is simple to state:
It takes a sequence of tokens and produces a distribution over the next token.
That statement isn’t a dismissal; it’s an interface descriptor.
But a descriptor isn’t an explanation.
“Vision is photons hitting the retina” is true. It’s also what you say when you don’t intend to talk about perception.
So let’s grant the entry and ask our first honest question:
What has to be inside a system for next-token prediction to hold together for pages, not seconds?
Compression that keeps the shape
Training doesn’t fill a library. It sculpts a landscape.
A model is hammered by examples until it learns that which tends to hold: what tends to follow, imply, contradict, resolve, qualify. Not as stored sentences, but as bias.
That bias lives as geometry in a high-dimensional space: patterns of association, hierarchy, constraint. When it’s working, you can feel it. The output doesn’t just sound grammatical. It stays on a track.
And that’s where “just” starts to fail.
Because compression isn’t only loss. It’s what remains after you’ve demanded, relentlessly, that the system keep what matters and throw away the rest.
A useful sentence, for now:
The model doesn’t store the world. It stores the statistical shape of how a described-world tends to unfold.
Sidebar: a quick note on “coherence.”
The word didn’t arrive with chatbots.
In NLP, “coherence” has meant discourse coherence for a long time: the glue that makes sentences hang together as a text rather than a pile. There’s a whole lineage of work trying to model or score it (local coherence, entity-based models, sentence ordering, story structure).
In other corners of ML, “coherence” shows up as a metric term too - topic coherence in topic modeling, consistency measures in generation. And in physics the word has its own sharp meaning - phase relationships that persist - but that’s a different essay.
Only recently did it leak into everyday conversation as a vague vibe-check, and then start to reappear everywhere as one of those bridge words that can connect the dots between disparate fields.
In this essay, at least at the surface, when I use the idea, I mean the old technical thing: constraint that persists across spans.
Prediction, generation, and the controlled use of chance
“Okay,” someone says. “But it’s still predicting. And it still has randomness.”
Yes. And that’s the point.
During inference, the model produces a distribution: a weighted set of next-token possibilities. There are different ways to step forward from that distribution.
If you always pick the single most likely token, you often get something correct and dead. Not always. But frequently.
If you sample, you’re not sprinkling chaos. You’re letting the system explore a constrained space without collapsing onto the obvious path too early.
Temperature, top-k, top-p - these are knobs on exploration. They open or tighten the corridor. But the corridor is still carved by training and context.
So “randomness” isn’t noise. It’s bounded wandering through a learned landscape.
That’s why it can feel creative. Not because the model is free. Because it’s constrained in a way that still leaves room.
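To make those knobs concrete, here is a toy sampler under the same assumptions as before - `logits` is the raw output of one forward pass, and the specific defaults are illustrative, not anyone’s recommendation.

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Bounded wandering: reshape the distribution, shrink the corridor, then sample."""
    rng = rng or np.random.default_rng()

    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()

    # Top-k: keep only the k most likely tokens.
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]

    # Top-p (nucleus): shrink further to the smallest prefix whose mass reaches p.
    cumulative = np.cumsum(probs[keep])
    keep = keep[: int(np.searchsorted(cumulative, top_p)) + 1]

    # Renormalize over the surviving corridor and take one step.
    corridor = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=corridor))
```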
And yes. This is also where the “hallucination” story lives.
Not as a gotcha. As a predictable failure mode of the same machinery.
When the context is thin, when the question outruns what’s grounded, when the constraints don’t actually determine a single safe path forward, the model will still do its job: it will keep the structure intact.
Sometimes that means it lays a plank across a gap. A fluent bridge over missing support.
And the probabilities aren’t calibrated to truth, so the model can sound certain simply because a completion is statistically typical.
It isn’t trying to deceive. It isn’t “low confidence” in any human sense. It’s completing under constraint, and occasionally the constraint is mostly style.
Professionally, you guard against that by adding what the base model lacks: retrieval, memory, tools, verification, refusal policies, loop-closures that force the system to pay rent in the world as it is before it speaks with authority.
(And yes: sometimes this is just teaching the system to say no; a lesson many of us could learn from.)
In other words: hallucinations aren’t evidence that prediction is shallow. They’re evidence that prediction, by itself, is not enough.
Queryable compression
A zip file is compressed. A rock is compressed history. Neither is especially interesting until you can interrogate it.
What makes a language model different isn’t only that it’s compressed. It’s that its compression is conditional.
A prompt isn’t merely input. It’s a question posed to a learned landscape.
Press here and the system gives you explanations. Press there and it gives you counterexamples. Press again and it rewrites itself in a different voice, from a different angle, under a different set of constraints.
This is why “autocomplete” is both accurate and misleading. Yes: it completes. But what it’s really doing is answering a question:
Given this context, what’s the next insight that keeps the structure intact?
A phrase we can keep:
Distilling the statistical shape of a world into a medium where it can be interrogated.
That’s the fossil engine. Not a fossil you chisel. A fossil you ask.
The stack around the model
One boundary we’ll need repeatedly.
A base model - the trained weights - really is a learned constraint field plus a forward pass. It has no native persistence. No episodic continuity. No real-world reach.
But that isn’t what people are actually interacting with anymore.
They’re interacting with a system:
a model
plus a context window
plus retrieval
plus memory
plus tools
plus critics / verifiers / guardrails
plus artifacts that persist
plus a human shaping what matters
This is not philosophy. It’s engineering.
And it changes what the thing can do.
At some point it becomes hard to say what the “model” can do, because the real unit of competence is often the loop.
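A caricature of that loop, with every collaborator as a hypothetical placeholder (`retrieve`, `tools`, `critic`, `memory` are whatever your stack provides, not any particular framework), might look like this:

```python
def answer(question, model, retrieve, tools, critic, memory, max_rounds=3):
    """The unit of competence: a bare predictor wrapped in grounding, checking, and persistence."""
    context = memory.recall(question) + retrieve(question)    # ground the prompt
    draft = model(question, context)                          # the bare forward passes
    for _ in range(max_rounds):
        verdict = critic(question, draft, context)            # verification / guardrails
        if verdict.ok:
            break
        if verdict.tool_call:
            context = context + tools.run(verdict.tool_call)  # pay rent in the world
        draft = model(question, context, feedback=verdict.notes)
    memory.store(question, draft)                             # persistence across turns
    return draft
```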
A bridge: from completion to deliberation
Once you build systems like this, something begins to happen almost automatically.
You stop treating the model as a one-shot answer machine. You let it take a run at the problem, look at what it produced, and try again under sharper constraints.
That loop is the first real opening. Not because anything mystical happened in the weights. Because you became the second pass.
You became the critic. The editor. The person who can say: closer, now tighten it.
And if you’ve lived inside that collaboration, you know what it feels like: the system isn’t only producing text anymore. It’s producing candidates, and you’re learning to steer by constraint rather than by vibe.
Because iteration is a critical part of how you get reliability out of a probabilistic generator.
Then comes the step where the floor shifts a little: once you’ve discovered that a loop is what turns completion into something like deliberation, you can stop requiring the human to be the loop - you can fold the loop inward.
You ask the system to draft. Then to interrogate its own draft against the constraints. Then to rewrite. Sometimes to generate multiple paths and compare. Sometimes to propose objections and answer them.
The point isn’t any one technique. The point is architectural:
A process that used to happen across turns - human prompt, model output, human correction - can be staged inside the system before anything is shown.
And once you see it that way, the “inner voice” stops sounding like a metaphor. It starts sounding like an interface you didn’t realize you’d designed.
That is what I mean by a private workspace. A place where candidate moves can be proposed, tested against constraints, and revised before anything is committed publicly.
And once a system is doing that, once it’s using its own predictions as raw material for further prediction, you get something that looks, from the outside, like an inner voice.
Not a ghost in the machine. Not mysticism. Just next-token prediction running in a loop, using language to probe its own constraint-field before it speaks.
The inner voice as workspace
In practice, that private workspace means the answer you see is rarely the first thing the system produces.
Instead there’s an internal drafting stage, sometimes explicit, sometimes hidden, where it explores, checks, and reshapes its own output before it commits.
In those systems, the model is effectively writing text that it will later read - sometimes literally as hidden tokens, sometimes as internal candidates or drafts the user never sees - but in either case it still steers what comes next.
Sometimes that workspace looks like:
making a rough pass, then tightening
generating multiple candidates, then selecting
creating a plan, then executing
running a self-critique, then revising
Implementation details vary. The function doesn’t.
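One minimal version of that workspace - the same predictor playing drafter, critic, and rewriter in turn, with `model` again a hypothetical text-in, text-out callable - is only a few lines:

```python
def deliberate(task, model, constraints, rounds=2):
    """Fold the loop inward: draft, interrogate the draft against constraints, rewrite, repeat."""
    draft = model(f"Draft a response to: {task}")
    for _ in range(rounds):
        critique = model(
            f"Check this draft against these constraints: {constraints}\n"
            f"Draft: {draft}\n"
            "List the concrete problems."
        )
        draft = model(
            f"Rewrite the draft to fix these problems.\n"
            f"Problems: {critique}\n"
            f"Draft: {draft}"
        )
    return draft  # only this final pass is ever shown
```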
A workspace like that means the system can query its own compressed structure before it speaks.
And the moment it can do that, the phrase “just token prediction” starts to feel like describing a city by naming the concrete it’s built from.
Because now prediction isn’t only output. It’s substrate. It’s what the system uses to think.
Closing loops: when constraint acquires continuity
Once you give the system persistence - memory stores, evolving projects, durable artifacts - the past starts pressing on the future.
Once you give it tools - search, code, databases - it can reach out and correct itself against the world.
Once you add critics and verifiers, it starts behaving less like free association and more like guided exploration.
None of that implies consciousness.
But it does imply something that looks familiar:
Continuity.
Not “a soul.” A continuity of constraints. A style that persists. A trajectory that can be followed. A system that becomes legible across time.
Hold that.
Because once you can see continuity emerging from constraint in a machine, it becomes harder to pretend you’ve never seen that pattern anywhere else.
Somewhere, very close to home, you already have all of this: a private workspace, a control layer, parts of you that are “just” predictors, until you notice what that really means.
The first bridge: a control layer that holds shape
If you want the smallest, most familiar place to begin, begin with something you can notice in yourself.
When a problem is easy, you don’t narrate. You act.
But when a problem is hard, when there are competing constraints, when the next move isn’t obvious, you recruit a different layer.
You hold context. You rehearse. You compare. You slow the whole situation down until it becomes inspectable.
That layer isn’t “the mind.” It’s a control system inside the mind.
Neuroscience has many names for the circuitry involved, but the functional picture is stable: there are networks whose job is to maintain task context, keep goals active, suppress distractions, and coordinate other systems long enough for deliberate choice to happen.
This is the first brain-slice that begins to echo the loops we built around the model.
(A small historical echo we can keep in our pocket: “neural networks” were named that way on purpose. Not because they are neurons, but because the original inspiration really was biological. We’ll use that carefully, if at all.)
A language model, by itself, is a forward pass. A control network, by itself, is not a person.
But when you wrap each one in loops, when you let the system carry constraints forward, revise, and commit, something legible starts to appear.
Not a ghost. A shape.
The workspace you live inside
Most of us experience that control layer, when it makes itself audible, as inner speech.
A sentence forming in the dark. A rehearsal. A small courtroom. A draft we don’t publish.
It’s easy to assume the voice is the thought. It isn’t.
It’s a tool.
Inner speech is what happens when the brain recruits language machinery to probe its own uncertain state. It’s where ambiguity gets pinned down into words long enough to be examined.
And it’s also where we make a very common mistake.
We locate the self where the words are.
Because that’s where it feels like authorship lives. That’s where “I” seems to speak.
But the voice is not the author. It’s an interface.
Plenty of thinking happens before words appear. Perception is already predicting. Memory is already biasing interpretation. Motor systems are already preparing actions. Emotion is already shaping salience.
The words arrive late, when the system needs an explicit handle.
You can feel this if you watch it closely. Sometimes the conclusion is already there, and the inner voice is only the act of making it explainable. Sometimes the voice is genuinely exploratory. And sometimes it’s just the brain practicing a story it hopes will be true.
In other words: inner speech is a workspace. Not a throne.
The mistaken location of the self
Here’s a line I want to hold onto, and I mean it in a specific, clinical sense, not as a sweeping claim about human potential:
People don’t lose intelligence so much as they lose continuity of self.
When certain control circuits are impaired, when planning, inhibition, and sustained context fall apart, what disappears is not “raw cognition.” It’s the ability to keep a trajectory. To carry a goal across time. To remain the same person from one moment to the next.
From the inside, that can feel like a damaged self.
But the deeper lesson is almost the opposite.
The self was never a single location. It was always an emergent continuity. A stable ridge that appears when many subsystems stay aligned long enough.
You are not a patch of cortical neurons. You are the loop that holds.
Back to the model: weights versus framework
This is why the “LLMs are just token predictors” line bothers me in a particular way.
It’s not that it’s false. It’s that it commits the same boundary error we commit about ourselves.
We point at the weights and pretend we’ve described the system.
We point at the control layer and call it the self.
We point at a model and call it the whole agent.
In both cases, the thing that produces legible behavior is what happens when you close loops around that base.
A trained model plus memory plus tools plus evaluation behaves differently than the bare forward pass.
A brain’s predictive machinery plus control networks plus a body plus a world behaves differently than any one module.
The most interesting questions often live at the boundaries.
So pull back one click.
A mind is not one predictor
Yes. Pull back one click more, and the “just” collapses again.
The brain is not a single model. It’s an ecology.
Vision predicts. Hearing predicts. Your motor system predicts the consequences of motion. Your interoceptive system predicts the body. Your social cognition predicts other minds.
Much of what you call perception is the brain’s best guess under constraint. Much of what you call memory is the residue of past constraint shaping future guesswork.
The control layer doesn’t replace this. It coordinates it.
And when language is recruited, it becomes a special kind of probe: not because words are truer than sensation, but because words can be held steady long enough to deliberate.
They are compressions you can put your finger on.
They turn the private state into an object the system can manipulate.
Once again, that’s the resonance with all of these model-based systems: compression becomes useful when it becomes interrogable.
But an echo isn’t an identity. Before we pull back again, I want to pin down what I’m not claiming.
A note on what I am not claiming
So here’s what I am not claiming in this essay.
I’m not claiming that language models are conscious.
I’m not claiming a model is a brain.
I’m not claiming “prediction” is all a mind is.
And I’m not claiming the universe is literally a computer.
Nor am I taking a stand on where the observer “sits” inside our own circuitry.
I’m tracing a recurring shape:
history → compression → constraint → interrogation → deliberation → loop-closure → continuity
If that shape holds up, then “just token prediction” isn’t wrong. It’s incomplete.
And once you notice that incompleteness in machines, it becomes harder to ignore it everywhere else.
So let’s take the simplest, most familiar instance of the shape: memory.
Memory, again - but cleaner
We tend to imagine memory as storage. A vault. A library. A set of snapshots.
But in both brains and models, the more faithful description is simpler and stranger:
Memory is what the past is allowed to do to the future.
In brains, experience reshapes what comes easily next. In models, training reshapes what comes likely next.
The past persists as a thumb on the scale.
Once you see memory that way, it stops being a special faculty and starts looking like a repeatable trick.
Which means we can shift the scale again, carefully, without changing the core idea.
Carry it to a system with no mind at all, and the pattern still holds: evolution.
Evolution: memory with no one home
At the next scale, evolution starts to look less like a separate topic and more like the same story at a different resolution.
Evolution doesn’t remember organisms. It doesn’t preserve experiences. It doesn’t store a past.
It biases a future.
Selection is a filter that runs across generations. Things that work don’t get archived. They get repeated.
A genome isn’t a diary - it’s a compression. A summary of what didn’t die.
Anchor (technical): Evolution is differential reproduction under constraint. The “memory” is the changed distribution of traits that persists forward.
If that sounds too abstract, make it tactile.
A desert plant “remembers” water scarcity as spines. A prey animal “remembers” predators as eyes that panic early. A human “remembers” social complexity as a cortex that won’t stop modeling other minds.
No mind required at that scale. Just history settling into form.
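If you want to watch history settle into form, a toy simulation is enough - one heritable trait, an assumed environmental optimum, and nothing else:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=1000)  # one trait value per individual

def generation(pop, optimum=2.0):
    """Differential reproduction under constraint: fitness falls off with distance
    from an assumed optimum; offspring inherit their parent's trait, with noise."""
    fitness = np.exp(-(pop - optimum) ** 2)
    parents = rng.choice(pop, size=pop.size, p=fitness / fitness.sum())
    return parents + rng.normal(scale=0.1, size=pop.size)

for _ in range(50):
    population = generation(population)

# No individual "remembers" anything. The memory is the shifted trait distribution.
print(round(float(population.mean()), 2))  # drifts toward the optimum across generations
```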
But once minds show up inside the system, history learns a new trick: it can start writing itself outside the body, into physical artifacts that can be revisited, shared, and interrogated.
Culture: constraint that can be queried
Then the pattern learns a new trick.
Once humans show up, memory stops being only biological. It steps outside the skull.
We build constraints into artifacts. We store them in language, rituals, laws, tools, institutions, code.
And those constraints aren’t inert. They can be interrogated.
And this is where the “fossil engine” returns with teeth.
A legal system is accumulated history that answers conditional questions. Not because it’s alive. Because it has structure.
“What happens if I do this?”
You can ask that question of a culture. You can ask it of a codebase. You can ask it of a constitution.
Not in poetry. In practice.
One of civilization’s real tricks is that we learned how to make history usable.
And the deepest version of that trick is physical: traces written into the world whether anyone intends them or not.
Time: the bill for keeping history honest
If memory is the past constraining the future, physics is the most unforgiving form of it.
In the microscopic laws, much is reversible in principle. In the macroscopic world we inhabit, reversals are almost never seen.
Why?
Because interactions don’t merely happen. They proliferate. They spread correlations outward. They write traces into degrees of freedom we don’t track.
A glass shatters and the information about that shattering disperses into heat, sound, microscopic motions. To “unshatter” it, you would need an implausible coordination of an astronomical number of parts.
That is what irreversibility is. Not a prohibition. A practical impossibility born from combinatorics.
Anchor (technical): The second law is statistical. Entropy increases because there are vastly more high-entropy microstates than low-entropy ones. The arrow is what typicality looks like.
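The combinatorics is checkable in a few lines - a hundred coins standing in for particles in the two halves of a box:

```python
from math import comb

# Macrostate: "k heads out of 100 coins". Microstate count: C(100, k).
N = 100
for k in (0, 10, 25, 50):
    print(k, comb(N, k))

# C(100, 0) = 1, while C(100, 50) is about 1e29.
# The even split isn't enforced by any force; it's simply where almost all
# the microstates live. The arrow is what typicality looks like.
```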
So time’s arrow isn’t a new fundamental law stapled onto physics. It’s what record-writing costs.
History becomes harder to undo the more widely it is written.
And once you notice that record-writing is doing the same kind of work as memory, turning past interaction into future constraint, it’s hard not to see the rhyme elsewhere.
The wider pattern
Which is why taking “just token prediction” at face value starts to feel solidly incomplete.
Not because prediction is trivial. Prediction is critical. But because prediction is only one face of a much deeper pattern:
history compresses into constraint; constraint becomes interrogable - a handle you can pull; that handle enables deliberation; and deliberation, when looped, lays down continuity.
The model does it in silicon. A brain does it in cortex. Evolution does it in populations. Culture does it in artifacts. Physics does it in irreversible traces.
Different machinery. Same shape.
And the temptation, at this point, is to mistake that shape for an answer.
What I’m trying to protect
This is the point where it becomes easy to reach for sweeping gestures. To claim the universe is a computer. To claim models are alive. To pretend the mystery is solved.
That’s not what I want.
I’m avoiding any crutch. My underlying purpose is my own exploration of truth, and I want something stricter than getting lost in the mysterious.
And I want the reader to feel the depth of a mechanism without escaping into mythology. To feel awe without fog.
Because “just” is not humility. It’s impatience.
And if we can unlearn that impatience, in this very modern example, then we might also unlearn it in the places where it actually matters:
in how we describe minds, in how we treat memory, in how we understand agency, in how we locate the self.
So let me end where we began; only now the depth is visible.
Final synthesis
So yes.
LLMs predict the next token.
And that’s the gate.
It only works when a system has already become a distilled world; when the past has been compressed into a terrain that can be queried.
Once you see that, the word “just” stops doing any work.
You start asking one of the questions that has always mattered most:
What does history become when it is forced to fit inside a finite state, inside any finite mechanism?
Sometimes it becomes weights. Sometimes it becomes cortex. Sometimes it becomes genes. Sometimes it becomes law. Sometimes it becomes the scaffolding of what comes next.
And sometimes, when enough traces accumulate, it becomes the felt continuity of being something at all.
Topics to explore (and a few trailheads)
I’m not pretending this is a bibliography.
It’s a map of the underlying neighborhoods this essay touched, some I’ve read closely, some I only know by reputation and osmosis, and some I’m still working my way through.
Think of this as trailheads. Some I’ve walked slowly, some I’ve only glimpsed.
The point is direction, not authority. If you want to go deeper, these are a few doors.
Language models: transformers, scaling, and what “next token” really means
Attention Is All You Need: Vaswani et al. (2017)
Scaling Laws for Neural Language Models: Kaplan et al. (2020)
Decoding and the controlled use of chance
The Curious Case of Neural Text Degeneration (top‑p / nucleus sampling): Holtzman et al. (2019)
Loop-closure around the base model: retrieval, tools, verification
Retrieval‑Augmented Generation: Lewis et al. (2020)
Coherence as a technical term (not a vibe)
“Centering” / discourse coherence (a classic thread): Grosz, Joshi & Weinstein (1995)
Entity‑based coherence models: Barzilay & Lapata (2008)
Cognitive control and inner speech
PFC as context maintenance / control: Miller & Cohen (2001)
Inner speech as a cognitive tool: Alderson‑Day & Fernyhough (2015)
Evolution as memory without a mind
The Selfish Gene: Dawkins (1976)
Records, irreversibility, and time’s arrow
Irreversibility and Heat Generation in the Computing Process: Landauer (1961)
Quantum Darwinism (redundancy, records, and “objectivity”): Zurek (2009)
From Eternity to Here: Carroll (2010)


