I ran an fMRI on LLMs: a concept is a direction, not a region

TL;DR: I've been running an "fMRI for LLMs": capturing the full internal activations of dense open models (Qwen2.5-7B, Gemma-2-9B, Gemma-4-12B) and applying neuroscience methods to map how meaning is organized. The headline result, confirmed causally and across all three models: a concept is not stored in a region of neurons; it is a single direction in activation space.

Direction, not a region

In the brain, categories live in localized regions (faces → fusiform face area). LLMs are the opposite.

Distributed, superposed code. A 10-way category linear probe decodes far above chance (Gemma-2 0.97, Qwen 0.80), yet the "most selective" units do not replicate across two random halves of the stimuli (overlap ≈ 0.00 to 0.05). There is no findable "animal neuron."

Causal proof. Ablating the 20 most selective units changes downstream category accuracy by ~0 (same as removing 20 random units). But ablating one distributed direction collapses it: mean ΔAUC up to +0.52 (Qwen). True in all 3 models.

So category is localized to one direction, but that direction is spread across ~2000 of 3584 neurons, and which neurons carry it is not reproducible. Localization is in vector space, not anatomy.

The residual stream is a shared additive bus. Injecting a concept direction at N consecutive layers equals injecting N× the magnitude at one layer: ratio = 1.00 for every N. The stream literally sums contributions across layers.

Only relative magnitude codes. Scaling the whole residual by 0.25× to 4× produces zero output change (RMSNorm divides it out). Scaling only the component along the concept direction produces a clean monotonic concept shift. Meaning = the projection along a direction, not the vector's length.

Under strict controls (120 stimuli/category, an architecture-matched untrained twin, word-grouped splits so no frame leaks across train/test):

A concept is essentially rank-1: one direction, present at every depth (decodable layer-span: trained 1.0 vs untrained 0.0). Narrow in width, broad in depth.
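The contrast in the "Causal proof" paragraph above (removing a few selective units versus removing one distributed direction) can be illustrated on synthetic data. This is a minimal sketch, not the post's harness: the sizes `d`, `n`, `sep`, the planted direction `u`, and the difference-of-means reader are all assumptions chosen for illustration, and the direction is known by construction here rather than estimated from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sep = 512, 2000, 3.0  # hypothetical sizes, not the post's models

# Planted concept direction, spread evenly over every unit (assumption).
u = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)

def sample(count):
    """Binary category whose only class signal lies along u."""
    y = rng.integers(0, 2, size=count)
    x = rng.normal(size=(count, d))          # isotropic noise
    x += sep * (2 * y - 1)[:, None] * u      # shift each class along +/- u
    return x, y

x_train, y_train = sample(n)
x_test, y_test = sample(n)

# Downstream reader: difference-of-class-means linear probe, fit once.
mu1 = x_train[y_train == 1].mean(axis=0)
mu0 = x_train[y_train == 0].mean(axis=0)
w, mid = mu1 - mu0, 0.5 * (mu1 + mu0)

def accuracy(x):
    """Accuracy of the fixed reader on (possibly ablated) test activations."""
    return float(np.mean(((x - mid) @ w > 0) == (y_test == 1)))

def ablate_units(x, idx):
    """Zero out the given unit indices."""
    out = x.copy()
    out[:, idx] = 0.0
    return out

def ablate_direction(x, direction):
    """Project out a single direction from every activation vector."""
    v = direction / np.linalg.norm(direction)
    return x - np.outer(x @ v, v)

top20 = np.argsort(-np.abs(w))[:20]             # "most selective" units
rand20 = rng.choice(d, size=20, replace=False)  # random control

results = {
    "baseline": accuracy(x_test),
    "top-20 units ablated": accuracy(ablate_units(x_test, top20)),
    "random 20 units ablated": accuracy(ablate_units(x_test, rand20)),
    "one direction ablated": accuracy(ablate_direction(x_test, u)),
}
for name, acc in results.items():
    print(f"{name:>24}: {acc:.3f}")
```

In this toy setup the unit ablations leave the reader nearly intact because each unit carries only a small share of the signal, while projecting out the single direction removes all of it. Whether a real model behaves this way is the empirical question the post addresses.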
Concepts coexist additively. One shared probe reads each category as well as a dedicated probe (retention 1.00): they're linearly superposed and read in parallel.

Direction is the whole code. A nonlinear MLP probe fails to beat a single linear direction (gap ≤ 0 in all models), even with 1200 stimuli. "Meaning = direction" isn't an approximation; it's the code.

| Property | Brain | Dense LLM | Verdict |
| --- | --- | --- | --- |
| Small-worldness / rich-club hubs | yes | yes (σ up to 12.8) | match |
| Network modularity Q | 0.30 to 0.50 | 0.09 to 0.23, rising each generation | partial |
| Category-selective regions | yes (FFA/PPA) | no (distributed direction) | differ |
| Topographic maps (retinotopy etc.) | yes | no (~20 to 40× below cortex) | differ |
| Cross-model universality (CKA) | n/a | 0.69 to 0.77, cross-family | Platonic convergence |

Two bonus results worth flagging:

- Steerability is predicted by encoding dimensionality (r ≈ −0.83): concepts packed into ~1 direction (numbers, colors) steer cleanly; high-dimensional concepts resist.
- A wiring-cost penalty makes a small transformer more modular (ΔQ > 0 in 4/4 seeds, with a non-monotonic sweet spot): direct evidence that the brain's modularity is partly a consequence of physical embedding constraints that transformers normally lack.

The harness has an adversarial verification gate, and several appealing hypotheses died in it: "abstraction velocity predicts capability" was rejected on a clean 5-point Qwen ladder; the flashy "60× more localized in SAE features" shrank to a modest 2.4× under a gold-standard pretrained Gemma Scope SAE; cross-model feature-level universality is only partial. Reported as nulls, not spun.

Method: dense models scanned on Apple Silicon (MPS), neuroscience-style analysis pipeline (linear probes, RSA/CKA, functional connectome graphs, causal patching, SAEs, steering). Every number is traceable to a data file.

Feedback welcome.
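For readers unfamiliar with the cross-model universality (CKA) figures above, here is a minimal sketch of how such a score can be computed. The post does not say which CKA variant it used; this assumes linear CKA on two activation matrices recorded for the same stimuli, and the array names and sizes are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_stimuli, n_features).

    The two matrices must share rows (same stimuli) but may differ in width.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
n_stimuli, width = 200, 32  # hypothetical sizes

model_a = rng.normal(size=(n_stimuli, width))

# A rotated, rescaled copy of model_a: same geometry, different basis.
q, _ = np.linalg.qr(rng.normal(size=(width, width)))
model_b = 3.0 * model_a @ q

# An unrelated representation with a different width.
model_c = rng.normal(size=(n_stimuli, 48))

cka_same = linear_cka(model_a, model_b)
cka_unrelated = linear_cka(model_a, model_c)
print(f"rotated copy: {cka_same:.3f}")
print(f"unrelated:    {cka_unrelated:.3f}")
```

Linear CKA is invariant to rotations and uniform rescaling of either representation, so it compares the geometry of two models without requiring their individual neurons to line up, which fits the post's claim that the code lives in directions rather than in specific units.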
This story was reported by Dev.to.



