Picking up language from experience

[Figure: representation of the Pulsar model visualizing the ARC-AGI LS20 game environment]

Most language models learn the world backwards. They begin with text — a record that other minds produced after they already understood the things the text describes — and try to recover, from the statistics of that record, the understanding that generated it. This works remarkably well for problems that live inside text. It works less well the moment a system has to act in, or reason about, a world it has never perceived.

At Serotonin we are building language the other way around. Our systems acquire language the way it was first acquired anywhere: as a layer that settles over experience once there is enough experience for it to describe.

A picture is worth a thousand words

The core bottleneck of text is that a sentence is a compression of something. "The glass slipped off the counter and shattered" stands in for a dense, continuous event: the slow tilt, the fall, the acceleration, the impact, the spray of fragments, the sound. A reader who has handled glass decompresses that sentence effortlessly, because the words point at structures they already hold. A system trained only on text inherits the pointers but not the structures. It learns which words tend to follow "shattered"; it does not learn what shattering is.

This is a modern form of the symbol grounding problem. Distributional learning over a large corpus captures the co-occurrence geometry of words with real fidelity, and that geometry encodes a great deal — because the humans who wrote the corpus were themselves grounded. But the meaning a text model recovers is meaning held secondhand. It is a shadow cast by other people's experience, and like a shadow it flattens. The places where text models are brittle — physical reasoning, counting, spatial relationships, cause and effect, anything that depends on knowing how the world actually moves — are the places where the missing dimension shows.
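To make "co-occurrence geometry" concrete, here is a deliberately tiny sketch of distributional learning: counting which word follows which in a toy corpus. The corpus and the function name follower_counts are illustrative assumptions, not anything from our systems, and a real language model learns far richer statistics than bigram counts. The point is only that the counter ends up knowing what tends to follow "shattered" without holding any representation of shattering.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "the glass slipped and shattered on the floor",
    "the vase fell and shattered on the tiles",
    "the window shattered into fragments",
]

def follower_counts(sentences):
    """Count, for every word, which words appear immediately after it."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

follows = follower_counts(corpus)

# The most frequent continuation of "shattered" in this corpus.
likely_next = follows["shattered"].most_common(1)[0][0]
print(likely_next, dict(follows["shattered"]))
```

Everything this counter knows about "shattered" is a pointer to other words. Nothing in it corresponds to the event itself.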

What grounding requires

No child learns language from a corpus. A child moves through a world that persists, that occupies space, and that changes over time. They push objects and watch them fall. They occlude one thing with another and find it still there when the occluder moves. They act, observe the consequence, and quietly update an internal model of how things behave. Words arrive late to this process, and when they arrive they attach to a model that is already richly structured. The word does not build the concept; it labels one that perception has already isolated.

World models

So we build the understanding first. A Serotonin world model represents the world not as a stack of static frames but as something that endures and transforms through time: three spatial dimensions plus one temporal dimension along which everything plays out. Object permanence, motion, contact, deformation, cause preceding effect, the affordances a scene offers an agent: these are properties of the world, and a model that only sees snapshots cannot represent them cleanly.

The model is trained the way a child's expectations are trained: by prediction. Where a text model learns to anticipate the next token, ours learns to anticipate the next perceptual state: what the world will look like a moment from now, what will happen if the agent acts, what should follow from what came before. Prediction under this pressure forces the model to discover the regularities that make the future predictable at all. A falling object, a rolling object, an object about to topple: each becomes a reusable latent structure in the model, isolated and stable, before any word is attached to it.
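As a rough illustration of learning by next-state prediction, the sketch below trains a small linear predictor to anticipate the next (height, velocity) of an object in free fall. It is a toy under stated assumptions, not our architecture: the two-number state, the linear model, the time step DT, and the names true_step, predict, and train are all hypothetical. The learner is never told the rule of gravity; it only sees states and what followed them, and the regularity emerges from reducing prediction error.

```python
import random

DT = 0.1   # time step in seconds (assumed for this toy)
G = 9.81   # gravitational acceleration in m/s^2

def true_step(height, velocity):
    """Ground-truth dynamics (free fall); the learner only sees its outputs."""
    return height + velocity * DT, velocity - G * DT

def predict(weights, height, velocity):
    """Linear guess at the next state from features [height, velocity, 1]."""
    features = (height, velocity, 1.0)
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def train(steps=5000, lr=0.1, seed=0):
    """Stochastic gradient descent on squared next-state prediction error."""
    rng = random.Random(seed)
    weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # one row per predicted quantity
    for _ in range(steps):
        h, v = rng.uniform(0.0, 2.0), rng.uniform(-2.0, 2.0)
        target = true_step(h, v)       # what actually happened next
        guess = predict(weights, h, v)  # what the model expected
        for i in range(2):
            error = guess[i] - target[i]
            for j, f in enumerate((h, v, 1.0)):
                weights[i][j] -= lr * error * f
    return weights

weights = train()

# Ask the trained model what follows an object at rest, one metre up.
next_h, next_v = predict(weights, 1.0, 0.0)
print(next_h, next_v)
```

After training, the weight rows approximate the update rule for falling, which is a very small example of a regularity being isolated by prediction before any word is attached to it.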

Open-endedness

A child is not handed a fixed dataset and a fixed labeling scheme. Neither are our systems. They explore, set their own goals, and seek out the situations that are most informative to them at a given moment. Novelty is the signal that drives learning forward. New vocabulary grounds as the agent encounters the experiences that vocabulary refers to, and the process does not terminate. There is no point at which language acquisition is declared finished, because there is no point at which experience is.
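One simple way to picture novelty as a learning signal is an agent that always revisits whichever situation last surprised it most. The sketch below is an assumption-laden toy, not our training loop: the three situations, their hidden outcomes, the use of absolute prediction error as the novelty score, and the function name explore are all hypothetical.

```python
def explore(hidden_outcomes, rounds=30, lr=0.5):
    """Greedy novelty seeking: visit the situation with the largest last surprise."""
    estimates = {name: 0.0 for name in hidden_outcomes}  # the agent's predictions
    # Situations never visited count as maximally novel.
    novelty = {name: float("inf") for name in hidden_outcomes}
    visits = []
    for _ in range(rounds):
        choice = max(novelty, key=novelty.get)
        # Surprise is the gap between what happened and what was expected.
        surprise = hidden_outcomes[choice] - estimates[choice]
        estimates[choice] += lr * surprise   # learn from the experience
        novelty[choice] = abs(surprise)      # remember how surprising it was
        visits.append(choice)
    return estimates, visits

# Hypothetical situations, each with an outcome the agent must learn to predict.
situations = {"rolling": 2.0, "falling": 9.8, "toppling": 5.0}
estimates, visits = explore(situations)
print(visits[:6])
print(estimates)
```

The agent samples everything once, then spends its time where its predictions are worst, and moves on as each situation stops being surprising. In a fixed toy world like this one the novelty eventually runs out; in an open-ended world it does not.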

This open-ended, experiential regime is what lets grounding scale. Rather than curating every concept a system should hold, we build an agent that goes and acquires them — and acquires the words for them in the same motion.

What we are seeing

The behavior this produces is recognizable. Terms grounded in one setting carry over to settings the system has not been trained on, because they were never tied to the training setting in the first place — they were tied to events. The system can describe situations it was never given words for by composing grounded parts. Its language and its predictions stay consistent with each other, because both draw on a single underlying model of the world rather than on two separate competencies that have to be kept in alignment.

None of this is finished work, and we are deliberate about not overstating it. Grounding is hard, open-ended learning is harder, and a 4D world model is an expensive thing to build well. But the direction has held up under pressure where the text-first approach tends to fray, and it has done so for a reason we find clarifying: meaning was never in the words. It was in the experience the words were standing in for. We are building the experience first, and letting the language follow.