ASSC 2026 · Association for the Scientific Study of Consciousness

The Global Key-Value Workspace
A consciousness-inspired workspace as a testbed for GWT and IIT

Zafeirios Fountas1, Frederico Wieser1, Adnan Oomerjee1,2, Martin Benfeghoul1,2, Jun Wang1
1Huawei Noah's Ark Lab, London  ·  2AI Centre, Department of Computer Science, UCL
An explicit global workspace on a pretrained model's KV cache improves reasoning and increases the synergy between its components.
01

A global workspace in an LLM, and a testbed

We implement the mechanism Global Workspace Theory describes inside a pretrained LLM, and connect it to IIT through synergy. Because the workspace is a running system whose parts we can switch on and off, it becomes a testbed for both theories, allowing the clean interventions that the brain rarely permits. We use it to study how the workspace changes reasoning, integration, and generalisation.

02

How the workspace works

[Figure: backbone LLM KV cache → 1 · SELECT → 2 · SPARSE WRITE → Shared Workspace → 3 · BROADCAST → backbone LLM KV cache]

A capacity-limited spotlight selects salient components, writes them sparsely to a shared workspace, and broadcasts the result back to all components, iterated across layers. Unlike prior workspace networks trained from scratch, ours runs on a pretrained model's KV cache. It is derived as an information bottleneck: trained on the model's own loss, it keeps what the latent carries about the output while compressing what it carries about the input. Compress the input, keep the output.
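The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the poster's implementation: each component is reduced to a plain vector, `w_score` is a hypothetical salience projection, `k` is the capacity limit, and the write uses a softmax over the selected scores. How the real system reads from and writes to the KV cache is not specified here.

```python
import math

def workspace_step(components, w_score, k):
    """One select -> sparse write -> broadcast pass (illustrative only).

    components: list of n vectors (lists of d floats), one per component.
    w_score:    hypothetical salience projection, a list of d floats.
    k:          capacity limit, how many components the spotlight admits.
    """
    # 1. SELECT: score every component and keep the k most salient.
    scores = [sum(c_j * w_j for c_j, w_j in zip(c, w_score)) for c in components]
    selected = sorted(range(len(components)), key=lambda i: scores[i])[-k:]

    # 2. SPARSE WRITE: only the selected components contribute to the
    #    workspace, weighted by a softmax over their scores.
    top = max(scores[i] for i in selected)
    exp_s = [math.exp(scores[i] - top) for i in selected]
    total = sum(exp_s)
    d = len(components[0])
    workspace = [
        sum(e / total * components[i][j] for e, i in zip(exp_s, selected))
        for j in range(d)
    ]

    # 3. BROADCAST: every component, selected or not, receives the same content.
    updated = [[c_j + ws_j for c_j, ws_j in zip(c, workspace)] for c in components]
    return updated, workspace, selected


# Toy usage: three components, a spotlight of width 1, repeated as a
# stand-in for iterating across layers.
state = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for _ in range(2):
    state, ws, picked = workspace_step(state, [0.0, 1.0, 2.0], k=1)
```

The additive broadcast is the simplest choice for a sketch; a gated or attention-based read-back would fit the same three-step structure.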

03

Broadcasting improves reasoning

Method          GSM8K   SVAMP   GSM-Hd  LogiQA  Gaokao  AVG

Llama-3.2 1B
  SFT           33.00   42.00    8.00   28.90   23.90   27.16
  BT            35.30   43.70    8.20   28.60   25.60   28.28
  +GW           35.60   46.70    7.70   29.50   25.40   28.98

Llama-3.2 3B
  SFT           53.98   64.67   14.33   30.26   31.05   38.86
  BT            57.09   71.33   14.63   30.26   30.20   40.70
  +GW           58.30   69.67   15.85   31.49   30.48   41.16

Llama-3.1 8B
  SFT           13.12   23.33    3.11   27.19   28.21   18.99
  BT            20.85   39.33    4.93   27.04   26.50   23.73
  +GW           21.76   40.00    5.38   26.57   25.07   23.76

Qwen3-0.6B
  SFT           53.70   68.30   20.30   27.50   26.80   39.32
  BT            54.80   68.70   21.20   27.20   26.50   39.68
  +GW           55.00   70.00   21.10   27.50   27.10   39.94

Across four backbones, broadcasting selected information (+GW) gives the highest average score for every backbone, although the gains are not uniform across individual benchmarks. This is access, the consequence GNWT names: the broadcast makes information globally available and usable.

04 · HEADLINE

A narrow spotlight, at a fraction of the cost

[Figure (illustrative · sweep in progress): accuracy (vs dense) against compute / parameters as spotlight width grows from 5% to 100% (dense). Annotations: ~20% of heads ≈ dense performance at ~10× fewer parameters; beyond that point, compute is spent for no accuracy gain.]

A selective workspace using about 20% of heads matches the full dense version, at roughly 10× fewer parameters and lower compute. The capacity limit is not a sacrifice. Preliminary; the width sweep against the full-width model is in progress.

05

Where Global Workspace Theory meets IIT

[Figure: SFT, BT, BGT with 95% CI; axis: less separable →]

Synergy is information in the whole that no sum of parts carries, and broadcasting shifts representations toward it. This is integration, the consequence IIT names, produced by the mechanism GWT describes. The representations remain redundancy-dominated, so this is a relative shift, not net synergy.
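A toy case makes the definition concrete. The sketch below uses XOR, where neither input alone says anything about the output but the pair determines it, and scores it with whole-minus-sum, I(X1,X2;Y) − I(X1;Y) − I(X2;Y). That measure is an assumption chosen for brevity and is not necessarily the estimator used for the results here; it is known to go negative when redundancy outweighs synergy, which is why a redundancy-dominated system can still show a relative shift toward synergy.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """I(A;B) in bits, treating each (a, b) pair in the list as equally likely."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in p_ab.items()
    )

# Two fair binary "components" and their XOR as the target.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

i_x1 = mutual_information([(x1, y) for x1, _, y in samples])
i_x2 = mutual_information([(x2, y) for _, x2, y in samples])
i_whole = mutual_information([((x1, x2), y) for x1, x2, y in samples])

# Whole-minus-sum: positive when the joint carries more than the parts add up to.
wms_synergy = i_whole - (i_x1 + i_x2)
print(i_x1, i_x2, i_whole, wms_synergy)
```

Each part alone carries 0 bits about the output and the pair carries 1 bit, so all of the information here is synergistic.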

06

Integration rises to a peak, then falls

[Figure: performance and synergy against processor iterations →, with a halt point marked]

Iterating the workspace lifts synergy and performance to a peak, then over-integration reduces both. This predicts an over-integration limit and a natural stopping rule.
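One way such a stopping rule could look is a peak detector on the tracked signal. This is a hypothetical sketch, not the rule used in this work: the function name, the `patience` parameter, and the choice of synergy as the monitored quantity are all assumptions, and the numbers in the usage line are made up for illustration.

```python
def halt_at_peak(signal_per_iteration, patience=1):
    """Return the index of the iteration to stop at (hypothetical rule).

    Stops once the signal has failed to improve for `patience` consecutive
    iterations, and returns the best iteration seen up to that point.
    """
    best_idx, best_val, stale = 0, float("-inf"), 0
    for i, value in enumerate(signal_per_iteration):
        if value > best_val:
            best_idx, best_val, stale = i, value, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx


# Made-up rise-then-fall curve; the rule stops at the peak (index 2).
stop = halt_at_peak([0.1, 0.3, 0.5, 0.4, 0.2])
```

A larger `patience` tolerates noisy dips before the true peak, at the cost of running extra iterations past it.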

07 · DISCUSSION

Why the workspace helps · hypothesis

[Figure: axes module A → and module B →; the mean-field path ends at a local optimum; the workspace couples modules and reaches the global optimum]

Re-coupling. Specialised modules approximate a factorised, mean-field posterior that drops the dependencies between them. Synergy is the part no subset captures. The workspace re-couples a few to recover it and escape the local optima that factorised inference gets stuck in, at an energy cost.

[Figure: input X → workspace → output Y; I(X;Z) ↓ (compress), I(Z;Y) kept (keep)]

Compression. Trained on the model's own loss, the workspace keeps what the latent tells us about the output while discarding input detail (data processing inequality). Compress the input, keep the output: the condition for generalisation.
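In symbols, the compression account matches the standard information bottleneck objective. The form below is the textbook Lagrangian, written with Z for the workspace latent; the trade-off weight β is an assumed free parameter, and whether the workspace is trained against exactly this objective is not stated here.

```latex
% Information bottleneck: compress the input, keep the output.
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta \, I(Z;Y), \qquad \beta > 0

% Z is computed from X alone, so Y \to X \to Z is a Markov chain and the
% data processing inequality bounds what the latent can retain:
I(Z;Y) \;\le\; I(X;Y)
```

The first term shrinks I(X;Z), the input detail the latent carries, while the second rewards I(Z;Y), the part that predicts the output.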

Compression and synergy are orthogonal: how much of the input survives, versus how it is organised across modules. The workspace does both. We see synergy rise where broadcast helps; we have not yet shown it is the cause.

08

Open questions for this room

  • Are Global Workspace Theory and IIT actually rival accounts, or two views of one mechanism?
  • Why is the workspace capacity-limited, and when should a bottleneck help beyond efficiency (distractors, interference, task-switching)?
  • Does the workspace ignite, all-or-none, as GNWT predicts?
  • What is the principled halting rule?
  • Is the mean-field and bottleneck account of why the workspace helps correct, trivial, or new?

We have the testbed. Help us design the experiment.

09 · WHAT WE DO NOT CLAIM

No machine consciousness. A relative synergy shift, not net synergy. Small-scale and preliminary: not yet tested at frontier scale or for robustness.