ASSC 2026 · Association for the Scientific Study of Consciousness
A unifying consciousness theory for reasoning in LLMs
We build an explicit global workspace inside a pretrained large language model (LLM) by augmenting the Transformer. The workspace improves multi-step reasoning and increases the synergy between the model's components.
The idea
Architecture
A capacity-limited spotlight selects salient components, writes them sparsely to a shared workspace, and broadcasts the result back to all components; this select-write-broadcast cycle is repeated across layers. Unlike prior workspace networks, which are trained from scratch, ours runs on a pretrained model's KV cache, and we measure the integration it produces.
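The select, write, broadcast cycle can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the poster's implementation: the function name `workspace_step`, the salience projection `w_score`, the top-k spotlight, and the additive broadcast are all hypothetical choices, and the sketch does not model the KV cache.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def workspace_step(components, workspace, w_score, k):
    """One select -> write -> broadcast step (illustrative only).

    components: (n, d) array, one row per component (e.g. a head output).
    workspace:  (d,) current shared workspace state.
    w_score:    (d,) hypothetical salience projection (an assumption).
    k:          spotlight width, i.e. how many components may write.
    """
    n, d = components.shape
    assert 1 <= k <= n, "spotlight width must be between 1 and n"
    # Salience of each component: a learned score plus its match to the
    # current workspace content (both terms are assumptions of this sketch).
    scores = components @ w_score + components @ workspace / np.sqrt(d)
    # Capacity-limited spotlight: only the k most salient components write.
    top = np.argsort(scores)[-k:]
    weights = np.zeros(n)
    weights[top] = softmax(scores[top])
    # Sparse write: the workspace is a weighted sum of the selected rows.
    new_workspace = weights @ components
    # Broadcast: every component receives the same workspace content.
    updated = components + new_workspace[None, :]
    return updated, new_workspace, weights

def run_layers(components, w_score, k, num_layers):
    """Repeat the step once per layer, carrying the workspace forward."""
    workspace = np.zeros(components.shape[1])
    for _ in range(num_layers):
        components, workspace, _ = workspace_step(components, workspace, w_score, k)
    return components, workspace
```

Setting `k` equal to the number of components corresponds to a dense broadcast in this sketch, while a small `k` corresponds to a narrow spotlight.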
Results · reasoning
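Across four backbones, broadcasting selected information improves multi-step reasoning: +GW has the highest five-task average on every backbone, though not on every individual task. The workspace does functional work, which makes the access claim concrete.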
| Method | GSM8K | SVAMP | GSM-Hd | LogiQA | Gaokao | AVG |
|---|---|---|---|---|---|---|
| **Llama-3.2 1B** | | | | | | |
| SFT | 33.00 | 42.00 | 8.00 | 28.90 | 23.90 | 27.16 |
| BT | 35.30 | 43.70 | 8.20 | 28.60 | 25.60 | 28.28 |
| +GW | 35.60 | 46.70 | 7.70 | 29.50 | 25.40 | 28.98 |
| **Llama-3.2 3B** | | | | | | |
| SFT | 53.98 | 64.67 | 14.33 | 30.26 | 31.05 | 38.86 |
| BT | 57.09 | 71.33 | 14.63 | 30.26 | 30.20 | 40.70 |
| +GW | 58.30 | 69.67 | 15.85 | 31.49 | 30.48 | 41.16 |
| **Llama-3.1 8B** | | | | | | |
| SFT | 13.12 | 23.33 | 3.11 | 27.19 | 28.21 | 18.99 |
| BT | 20.85 | 39.33 | 4.93 | 27.04 | 26.50 | 23.73 |
| +GW | 21.76 | 40.00 | 5.38 | 26.57 | 25.07 | 23.76 |
| **Qwen3-0.6B** | | | | | | |
| SFT | 53.70 | 68.30 | 20.30 | 27.50 | 26.80 | 39.32 |
| BT | 54.80 | 68.70 | 21.20 | 27.20 | 26.50 | 39.68 |
| +GW | 55.00 | 70.00 | 21.10 | 27.50 | 27.10 | 39.94 |
SFT: supervised fine-tuning · BT: Bottlenecked Transformer · +GW: global workspace (dense broadcast). AVG is the mean over the five tasks; best per column shaded.
Averaged over the standard LM benchmarks · 355–356M params · 20B tokens · matched across TF / BT / BGT.
We observe. Accuracy rises with spotlight width and is highest at full broadcast. A narrow spotlight still recovers most of the accuracy at far lower compute.
Results · integration
Synergy is the information in the whole minus the sum of its parts, over balanced head partitions. We measure a synergy proxy across heads and compare across models, where more positive means higher synergy. Our model (BGT) tends toward higher synergy.
Figure legend: SFT baseline · BT · BGT, with 95% CI; axis direction: less separable →. The models remain redundancy-dominated, so the gain is a relative shift.
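To make "the whole minus the sum of its parts" concrete, here is a small discrete sketch. It is an assumption-laden illustration rather than the proxy used on the poster: it treats synergy as the mutual information between all sources and a target, minus the information each half carries alone, averaged over balanced bipartitions. The function names and the choice of target variable are ours.

```python
import itertools
import math
from collections import Counter

def mutual_information(pairs):
    """I(A;B) in bits from a list of equally weighted (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (pa[a] * pb[b]))
        for (a, b), c in joint.items()
    )

def whole_minus_sum(samples, part_a, part_b):
    """I(whole; y) - I(part_a; y) - I(part_b; y) for one bipartition.

    samples: list of (x, y), where x is a tuple of source values.
    """
    def restrict(part):
        return [(tuple(x[i] for i in part), y) for x, y in samples]
    whole = mutual_information(samples)
    return (whole
            - mutual_information(restrict(part_a))
            - mutual_information(restrict(part_b)))

def balanced_bipartitions(n):
    """Yield each split of range(n) into two equal halves once (n even)."""
    for a in itertools.combinations(range(n), n // 2):
        if 0 in a:  # pin index 0 to the first half to avoid mirror duplicates
            yield a, tuple(i for i in range(n) if i not in a)

def synergy_proxy(samples):
    """Average whole-minus-sum over all balanced bipartitions."""
    n = len(samples[0][0])
    vals = [whole_minus_sum(samples, a, b) for a, b in balanced_bipartitions(n)]
    return sum(vals) / len(vals)

# Two toy systems with two binary sources each.
xor_samples = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]
copy_samples = [((b, b), b) for b in (0, 1)]
```

In this sketch `synergy_proxy(xor_samples)` is about +1 bit, because neither source alone predicts the XOR target, while `synergy_proxy(copy_samples)` is about -1 bit, because both sources carry the same bit. The second case is what a redundancy-dominated system looks like under this measure.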
We observe. Synergy and performance are not monotonically related. Accuracy peaks within a few steps and then collapses if the processor keeps iterating, even as raw synergy keeps climbing.
Discussion
Specialised modules approximate a factorised, mean-field posterior that drops the dependencies between them. Synergy is the part of the information that no subset of modules captures. The workspace re-couples a few modules to recover it and to escape the local optima that factorised inference gets stuck in, at an energy cost.
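For readers less familiar with mean-field inference, the factorisation can be written as follows. This is the textbook form, offered as a sketch; the symbols $x$ (input), $z_i$ (latent of module $i$) and $q_i$ are our notation, not the poster's.

```latex
% x: input; z_i: latent of module i. Notation is illustrative.
\begin{align}
  q(z_1,\dots,z_n \mid x) &= \prod_{i=1}^{n} q_i(z_i \mid x)
  && \text{(modules independent given } x\text{)} \\
  \log p(x) - \mathrm{ELBO}(q) &= \mathrm{KL}\!\left(q(z \mid x)\,\middle\|\,p(z \mid x)\right) \ge 0
  && \text{(gap left by the approximation)}
\end{align}
```

When the true posterior has strong dependencies between modules, no product of independent factors can match it, so the gap in the second line stays positive.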
Trained on the model's own loss, the workspace keeps what the latent tells us about the output while discarding input detail (data processing inequality). Compress the input, keep the output: this is the condition for generalisation.
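One way to write this trade-off is the standard information-bottleneck form. We assume here that this is the intended reading; the symbols $X$ (input), $Z$ (workspace latent), $Y$ (output) and the trade-off weight $\beta$ are our notation.

```latex
% X: input, Z: workspace latent, Y: output; Z is computed from X alone.
\begin{align}
  Y \leftrightarrow X \leftrightarrow Z \;&\Longrightarrow\; I(Z;Y) \le I(X;Y)
  && \text{(data processing inequality)} \\
  \min_{p(z \mid x)} \;& I(X;Z) - \beta\, I(Z;Y), \qquad \beta > 0
  && \text{(compress the input, keep the output)}
\end{align}
```

The first line says the latent cannot know more about the output than the input does; the second rewards a latent that approaches that ceiling while retaining as little of the input as possible.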
Compression and synergy are orthogonal: how much of the input survives, versus how it is organised across modules. The workspace does both. We see synergy rise where broadcast helps; we have not yet shown it is the cause.
Open questions
We have the testbed. Tell us what experiment would convince you.
Stay in the loop
Leave your email and we will send the preprint, the full results, and the codebase the moment they are out. No spam, one message.