The world is more detailed and more intricate than we ever fully take in. We reach it through perception, which is bounded by the range and resolution of our senses and registers only part of what is there. Over perception we lay a symbolic order: words, numbers, models, measurements, and the like. We call something red, then name its shade, then fix it in a color space; we encode the world in species, units, equations, and so on. Much of what we call ‘knowledge’ lives in this top layer, in part because symbols are compact and easy to share.
A representation holds only what was put into it. It can be recombined, scaled, and refined, but none of that, on its own, adds what was never captured; such things generally have to come from the world. When what a task needs is already in the symbols, the problem is largely amenable to data and scale. When it is not, work inside the representation tends to stop paying off, and what is missing, whether a property never measured or a behavior only crudely modeled, usually has to be sought in the world.
The symbolic gap
Nearly every symbolic system leaves a residue it cannot express. An alphabet has sounds it cannot write; a language has distinctions it has no words for. Even vision is not a photograph: an RGB image is a uniform grid of samples, while the eye samples unevenly, concentrating acuity at the fovea. Neither the image nor the eye is the scene itself; each keeps part of it and discards the rest, in its own way. Omission is part of what makes a representation useful; a description that left nothing out would be as unwieldy as what it describes. What matters is which omissions it makes, and whether the task depends on what was left out.
Robot learning
These gaps show up concretely in robot learning, where a machine acts on the world through such a stack of perception and symbols, and where the gaps can carry a measurable cost. It helps to separate that stack into three layers, less a theory than a convenience for naming where things come apart. $\Phi$ is the physical world the robot acts in but does not observe directly, including contact, friction, deformation, surface properties, and lighting. $P$ is perception, the lossy pipeline of cameras, encoders, and force-torque sensors that samples $\Phi$. $\Sigma$ is the representation the policy computes over: tensors, latents, tokens, simulator state. A simulator lives in $\Sigma$ and approximates $\Phi$ as well as its numerics allow.
```mermaid
graph TB
    subgraph PhiBox[" "]
        F["<b>Φ: the real</b><br/>contact, friction,<br/>deformation, lighting,<br/>surface properties<br/><i>inaccessible</i>"]
    end
    subgraph PBox[" "]
        Per["<b>P: perception</b><br/>cameras, IMU, encoders,<br/>force/torque sensors<br/><i>lossy</i>"]
    end
    subgraph SigmaBox[" "]
        Sym["<b>Σ: symbolic</b><br/>tensors, latents, tokens,<br/>action chunks, sim state<br/><i>what the policy computes over</i>"]
    end
    Sym -->|"action via actuators"| F
    F -->|"sensing"| Per
    Per -->|"encoding"| Sym
    style F fill:#1a1a1a,stroke:#86efac,stroke-width:2px,color:#86efac
    style Per fill:#1a1a1a,stroke:#60a5fa,stroke-width:2px,color:#60a5fa
    style Sym fill:#1a1a1a,stroke:#fbbf24,stroke-width:2px,color:#fbbf24
    style PhiBox fill:none,stroke:none
    style PBox fill:none,stroke:none
    style SigmaBox fill:none,stroke:none
    classDef default font-family:Source Code Pro
```
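The three layers can be made concrete with a toy sketch. Everything in it is hypothetical and chosen only for illustration: the `WorldState` fields, the `perceive` and `encode` functions, and the 1 cm quantization step are assumptions, not part of any real robot stack. It shows one thing: two world states that differ in a property the sensors do not sample (here, friction) collapse to the same point in $\Sigma$, so nothing computed over $\Sigma$ can tell them apart.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WorldState:
    """Phi: the physical state (hypothetical fields, for illustration only)."""
    object_x: float  # object position in metres
    friction: float  # surface friction coefficient; no sensor below reads it


def perceive(state: WorldState, resolution: float = 0.01) -> Tuple[float, ...]:
    """P: a lossy sensor that quantizes position and ignores friction."""
    quantized_x = round(state.object_x / resolution) * resolution
    return (round(quantized_x, 6),)


def encode(observation: Tuple[float, ...]) -> Tuple[float, ...]:
    """Sigma: what the policy computes over. Here, just the observation."""
    return observation


dry = WorldState(object_x=0.503, friction=0.9)
wet = WorldState(object_x=0.503, friction=0.2)

sigma_dry = encode(perceive(dry))
sigma_wet = encode(perceive(wet))

# The two states differ in Phi but are identical in Sigma.
print("same in Phi:  ", dry == wet)
print("same in Sigma:", sigma_dry == sigma_wet)
```

Any policy that is a function of `sigma_dry` alone must act identically on both surfaces; distinguishing them would require changing `perceive`, not the policy.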
A prominent case is contact. A simulator does model it, but only approximately, and tends to handle foot-on-ground contact better than finger-on-object contact. So legged-locomotion policies can be trained in simulation and carried over to hardware, and they can be small, around 1.4 million parameters (Radosavovic et al., 2023). Contact-rich manipulation is far harder to train this way, and tends to fall back on demonstrations collected by teleoperation. These demonstrations are still lossy, but they are sampled at higher fidelity, closer to the real than any model of it (Brooks, 1990). They are also collected in real time and are costly, often amounting to person-years of operator effort, which is part of why contact looms large in the field.
One temptation is to avoid that cost by staying inside the model and fusing what is already there: a multiphysics solver coupling several equations, or a network folding many channels into one representation. But fusion mostly recombines what was already encoded, and does not by itself add a component that was not there. What a model lacks tends to come from the world, through a new kind of measurement or a real interaction, rather than from rearranging what it already holds.
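One way to state this limit, as a sketch and not a proof, borrows the data-processing inequality from information theory. It assumes that, at a single time step, the layers form a Markov chain, so that the representation depends on the world only through the sensor readings, and that fusion is some function $f$ applied to $\Sigma$ alone. Here $P$ and $\Sigma$ stand for the readings and the encoded state treated as random variables, and $I(\cdot\,;\cdot)$ denotes mutual information.

$$
\Phi \to P \to \Sigma \to f(\Sigma)
\quad\Longrightarrow\quad
I\bigl(\Phi; f(\Sigma)\bigr) \;\le\; I(\Phi; \Sigma) \;\le\; I(\Phi; P)
$$

Under those assumptions, no choice of $f$ raises the information about $\Phi$ above what $\Sigma$ already holds. A new sensor or a real interaction is different in kind: it changes $P$ itself, and so changes the right-hand side of the bound.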
Scale and span
Scale is powerful where a representation already spans the task: it fills the space in, and is often hard to beat. Language shows this especially clearly, since text already carries most of what a language model needs. This is the bitter-lesson bet (Sutton, 2019), that general methods with scale and data tend to win out, and web pretraining carries it as far as robots: a model can pick a rock when the task calls for an improvised hammer, with no robot data teaching it the concept (RT-2; Brohan et al., 2023). But that web pretraining seems to transfer recognition more than execution: what a scene contains, not how to act in it. Execution turns on the fidelity of the dynamics a policy learns from, whether from a faithful simulator or from real data.
Where a dimension is genuinely missing, scale appears to do much less. A policy that must tell a wet surface from a dry one needs information its cameras may not carry; a visible sheen is a weak proxy for wetness, not a measurement of it, and more images only give more of that proxy, not the measurement. Telling a genuinely absent property from a merely faint one is hard, and scale sometimes recovers what looked missing, so it is easy to call something absent that was only hard to detect. Our measures of progress are representations too, and they can drift from what they track. A benchmark score, how often a model succeeds on the test, can stay high while the capability behind it lags: when the test’s scenes are meaningfully changed (an object moved, an instruction reworded), that success can fall sharply, though newer policies hold up better (LIBERO-PRO; Zhou et al., 2025). Such a score reflects the benchmark more than the task it stands for. What decides whether scale pays off, in the end, is the fit between the data and the task.
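The wet-versus-dry case can be sketched numerically. The simulation below is a toy under loud assumptions: the sheen score and the moisture reading are modeled as Gaussians with made-up means and noise levels, `sample_scene` and the moisture sensor are hypothetical, and the classifier is a single threshold. It is meant only to illustrate the shape of the argument: more training scenes do not lift a weak proxy past its ceiling, while a channel that measures the property does.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is repeatable

SHEEN_GAP = 1.0       # assumed: class means one noise std apart (weak proxy)
MOISTURE_GAP = 1.0    # assumed: same gap between classes...
MOISTURE_NOISE = 0.1  # ...but with far less noise (a direct measurement)


def sample_scene():
    """Return (is_wet, sheen, moisture) for one hypothetical scene."""
    is_wet = random.random() < 0.5
    sheen = random.gauss(SHEEN_GAP if is_wet else 0.0, 1.0)
    moisture = random.gauss(MOISTURE_GAP if is_wet else 0.0, MOISTURE_NOISE)
    return is_wet, sheen, moisture


def fit_threshold(values, labels):
    """One-feature classifier: threshold at the midpoint of the class means."""
    wet = [v for v, y in zip(values, labels) if y]
    dry = [v for v, y in zip(values, labels) if not y]
    return (sum(wet) / len(wet) + sum(dry) / len(dry)) / 2


def accuracy(threshold, values, labels):
    """Fraction of scenes where 'value above threshold' matches the label."""
    hits = sum((v > threshold) == y for v, y in zip(values, labels))
    return hits / len(labels)


def evaluate(n_train, channel, held_out):
    """Fit on n_train fresh scenes using one channel (1 = sheen, 2 = moisture)."""
    train = [sample_scene() for _ in range(n_train)]
    threshold = fit_threshold([s[channel] for s in train], [s[0] for s in train])
    return accuracy(threshold, [s[channel] for s in held_out], [s[0] for s in held_out])


held_out = [sample_scene() for _ in range(20000)]
sheen_acc = {n: evaluate(n, 1, held_out) for n in (100, 1000, 10000)}
moisture_acc = evaluate(100, 2, held_out)

# Best achievable accuracy from sheen alone under the assumed Gaussians.
sheen_ceiling = NormalDist().cdf(SHEEN_GAP / 2)

for n, acc in sheen_acc.items():
    print(f"sheen only, {n:>5} training scenes: {acc:.3f}")
print(f"sheen ceiling (analytic):          {sheen_ceiling:.3f}")
print(f"moisture sensor, 100 scenes:       {moisture_acc:.3f}")
```

Under these assumptions the sheen-only accuracy stays near its analytic ceiling however many scenes are used, while a hundred scenes with the moisture channel are enough to separate the classes almost perfectly. Real sensors are not this clean, and as the text notes, a proxy that looks weak can sometimes be sharpened by scale; the sketch covers only the case where it cannot.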
Closing remarks
Underneath the specifics the claim is general. We act on the world through representations, and a representation gives back essentially what was put into it; scale and refinement work within what has already been captured, while what was never captured generally has to come from the world. Robot learning makes this concrete and expensive: what a simulator approximates too crudely, or a sensor does not record, usually has to be bought in real interaction. What this asks for is not less empiricism but more of the right kind: build the representation to hold what the task depends on, and then scale along it.
Related ideas
The thought is old, and it recurs in many vocabularies. Korzybski’s map–territory distinction names the gap between a description and the thing described, and Borges pressed it to a limit: a map so exact that it covers the territory point for point, and is abandoned as useless. Kant drew a sharper line, between appearances and the thing-in-itself we never reach directly, and Lacan called the part that resists symbolization the Real. Wiener’s Cybernetics treated information as a quantity of its own, not reducible to matter or energy, and the data-processing inequality of Shannon’s information theory set a limit on it: no processing recovers information a signal never carried. In artificial intelligence the same concern runs through Harnad’s symbol grounding problem, which asks how symbols can mean anything if they are defined only through other symbols; Brooks’s argument that a robot does better to treat the world as its own model than to carry an internal copy of it; and Moravec’s paradox, that the sensorimotor skills we share with animals have proved far harder to reproduce than abstract reasoning. Polanyi’s tacit knowledge, what we can do but cannot state, is the same gap seen from the other side.
References
- Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
- Brooks, R. (1990). Elephants Don’t Play Chess. Robotics and Autonomous Systems.
- Zhou, X. et al. (2025). LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization.
- Radosavovic, I. et al. (2023). Real-World Humanoid Locomotion with Reinforcement Learning.
- Sutton, R. (2019). The Bitter Lesson.