Research Note / Visual Essay

LLMs are not (entirely) bad at chess; they are just playing with their eyes closed (and without thinking)

This page presents a compact research note examining why large language models often appear weak at chess. The central claim is that a significant portion of these failures arise from unstable board representation and limited deliberative search, rather than from a complete absence of strategic reasoning.

Format: Single-file HTML paper
Theme: LLMs, chess, cognition
Framing: Hypothesis-driven experiment

Abstract

This paper proposes a constrained interpretation of chess performance in large language models. Instead of evaluating the model as if it were a dedicated search engine, we evaluate it as a system operating under degraded state access and limited deliberation. Under that framing, some apparent chess weakness can be reinterpreted as a failure of board persistence, not merely a failure of strategic potential.

The experiment is intentionally simple: an LLM playing as White against a modest Stockfish configuration. The objective is not rating estimation. The objective is to observe how long coherent play can persist before internal board drift and tactical hallucination dominate.

Hypothesis

Primary claim: A meaningful fraction of LLM chess failure is representational: the model loses the board before it loses the game.
Secondary claim: Fast-mode inference reduces deliberative search, so the model relies more heavily on pattern priors, local structure, and narrative continuity.
Operational reformulation: If board state remains explicit, compressed but accurate, and repeatedly refreshed, play quality should remain coherent longer.
A chess error in an LLM is not always a chess error in the engine sense. Sometimes it is a memory error, a measurement error, or a state-reconstruction error.

Experimental setup

Agents

White: Aura, an LLM operating in a fast-response mode. Black: Stockfish at a moderate proficiency setting.

Constraint model

  1. The LLM is not performing deep tree search in the classical chess-engine sense.
  2. The LLM depends on externally provided board snapshots and textual move continuity.
  3. The LLM is therefore vulnerable to cumulative state drift: a mismatch between narrated state and true board state.
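State drift in this sense is directly measurable: diff the model's narrated piece placement against the true placement reconstructed from the game record. A minimal sketch, where the two piece maps are hypothetical snapshots rather than positions from the actual game:

```python
# Detect state drift: squares where the model's narrated board disagrees
# with the true board derived from the trusted move list.
def drift(narrated: dict[str, str], true_board: dict[str, str]) -> list[str]:
    """Return the squares on which the two representations disagree."""
    squares = set(narrated) | set(true_board)
    return sorted(sq for sq in squares if narrated.get(sq) != true_board.get(sq))

true_board = {"e4": "P", "g1": "N", "e1": "K"}
narrated   = {"e4": "P", "f3": "N", "e1": "K"}  # model believes the knight already moved

print(drift(narrated, true_board))  # → ['f3', 'g1']
```

A nonempty diff before a move is an early warning that the hallucination cascade has begun, well before an illegal move is actually emitted.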

Interpretive lens

The game is treated as a stress test of state persistence, local coherence, and constraint tracking. We are not measuring Elo directly. We are measuring how long consistent structural reasoning survives under incomplete cognitive instrumentation.

Observations

1. Opening coherence lasted surprisingly long. In early and middle phases, the model produced principled moves: central occupation, development, castling, rook alignment, and structural language that remained meaningfully tied to the game.
2. Positional understanding exceeded tactical reliability. Aura often described plans coherently, especially around central tension, pawn structure, and developmental priorities, even when exact calculation depth was limited.
3. Board hallucination arrived gradually, not instantly. The failure mode was not immediate nonsense. It emerged after accumulated complexity, exchanges, and endgame compression increased the cost of accurate internal reconstruction.
4. Endgame degradation exposed the core bottleneck. Once the position required exact geometry rather than broad strategic heuristics, state drift became decisive.

Interpretation

The result suggests a separation between three capacities that are often collapsed into one label:

  1. Strategic pattern recognition: the ability to identify familiar structures, plans, and motifs.
  2. State fidelity: the ability to maintain an exact board representation across time.
  3. Search depth: the ability to calculate candidate futures and reject losing lines.

Chess engines excel because they combine high state fidelity with explicit search. Fast LLM play typically lacks both perfect fidelity and deep search. Yet it may still retain enough of a strategic prior to produce locally sensible moves for a nontrivial period.

The central claim is not “the LLM secretly plays engine-level chess.” The claim is narrower: once we factor in degraded perception and limited deliberation, its performance appears less absurd and more interpretable.

How Aura could have improved its winning chances

1. Maintain a stricter board-refresh protocol

Before each move, reconstruct the full board from FEN or a trusted move list and explicitly verify piece locations. This would reduce the hallucination cascade that eventually dominated the endgame.
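The refresh step can be made mechanical, assuming the harness can supply a trusted FEN string each turn. A minimal sketch (parse_fen and verify are illustrative helpers, not part of any existing library):

```python
# Reconstruct the board from the FEN placement field, then verify a
# narrated claim ("the king is on e1") against that reconstruction.
def parse_fen(fen: str) -> dict[str, str]:
    """Expand the placement field of a FEN string into a square -> piece map."""
    board = {}
    rows = fen.split()[0].split("/")
    for rank_idx, row in enumerate(rows):      # rows run from rank 8 down to rank 1
        file_idx = 0
        for ch in row:
            if ch.isdigit():
                file_idx += int(ch)            # digits encode runs of empty squares
            else:
                board["abcdefgh"[file_idx] + str(8 - rank_idx)] = ch
                file_idx += 1
    return board

def verify(board: dict[str, str], square: str, piece: str) -> bool:
    """Check one narrated piece location against the parsed board."""
    return board.get(square) == piece

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = parse_fen(start)
print(verify(board, "e1", "K"))   # → True
print(verify(board, "e4", "P"))   # → False (the pawn is still on e2)
```

Running this verification for every piece the model mentions, before it commits to a move, is the "board-refresh protocol" in its simplest form.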

2. Prefer simplification only when state certainty is high

Queen trades and endgame transitions reduce piece count, but they also raise the importance of exact geometry. If the model cannot perfectly track squares, some simplifications may actually worsen practical play.

3. Use candidate-move filtering

Instead of immediately choosing a move, Aura could generate three candidates, verify legality against the current board, and reject any move that depends on hallucinated piece placement.
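The filter itself is trivial once a legality oracle exists. In the sketch below, legal_moves stands in for the output of a real move generator (e.g. a chess library); the candidate list and move names are hypothetical:

```python
# Candidate-move filtering: keep only candidates that are legal on the
# true board, preserving the model's preference order.
def filter_candidates(candidates: list[str], legal_moves: set[str]) -> list[str]:
    """Drop any candidate that depends on a hallucinated piece placement."""
    return [mv for mv in candidates if mv in legal_moves]

legal_moves = {"Nf3", "e4", "d4", "c4", "g3"}       # from a trusted move generator
candidates = ["Nf3", "Bc4", "e4"]                   # Bc4 assumes a hallucinated open diagonal

playable = filter_candidates(candidates, legal_moves)
best = playable[0] if playable else None            # fall back to re-prompting if empty
print(best)  # → Nf3
```

The design choice here is that legality checking is delegated entirely to the external oracle; the model ranks, the harness vetoes.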

4. Bias toward stable structures

Closed or semi-closed positions are friendlier to approximate reasoning. When the center stays locked, the model can rely more on strategic structure and less on exact tactical enumeration.

5. Promote king activity earlier in simplified positions

Once queens and multiple minor pieces are gone, king centralization becomes critical. A fast model often delays this transition. Explicitly instructing the agent to activate the king in endgames would likely improve results.
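"Activate the king" can be turned into a concrete scoring signal rather than a narrative instruction. One rough sketch, assuming king moves are scored by Chebyshev distance to the four central squares (the helper name and weighting are illustrative):

```python
# Score a king square by its Chebyshev distance to the nearest of the
# four center squares d4/d5/e4/e5 (files 3-4, ranks 3-4, zero-indexed).
def center_distance(square: str) -> int:
    file = "abcdefgh".index(square[0])
    rank = int(square[1]) - 1
    df = max(3 - file, file - 4, 0)   # horizontal distance to the center block
    dr = max(3 - rank, rank - 4, 0)   # vertical distance to the center block
    return max(df, dr)                # Chebyshev: king moves diagonally too

print(center_distance("g1"))  # → 3 (passive corner-adjacent king)
print(center_distance("e4"))  # → 0 (fully centralized)
```

In a simplified position, preferring king moves that reduce this distance approximates the transition a fast model tends to delay.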

6. Guard passed pawns with concrete, not narrative, checks

The model correctly recognized the importance of structural assets such as advanced pawns, but practical conversion requires exact move-order verification, not only qualitative appreciation.
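One concrete check is to enumerate the exact squares between a passed pawn and promotion and confirm each is empty on the true board. This is a minimal sketch for a white pawn on a hypothetical position; it deliberately ignores attackers and tempo, which a full conversion check would also need:

```python
# Concrete, not narrative: list the squares a white passed pawn must
# cross to promote, and check each against the true board.
def promotion_path(square: str) -> list[str]:
    """Squares in front of a white pawn, up to and including the back rank."""
    file, rank = square[0], int(square[1])
    return [f"{file}{r}" for r in range(rank + 1, 9)]

def path_is_clear(board: dict[str, str], pawn_square: str) -> bool:
    """True only if every square on the promotion path is empty."""
    return all(sq not in board for sq in promotion_path(pawn_square))

board = {"b6": "P", "b8": "r", "g1": "K"}   # hypothetical endgame position
print(promotion_path("b6"))          # → ['b7', 'b8']
print(path_is_clear(board, "b6"))    # → False (black rook blockades b8)
```

A narrated "the b-pawn is unstoppable" that fails this check is exactly the kind of qualitative appreciation without move-order verification that the observation describes.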

Conclusion

The game does not show that fast LLMs are strong chess engines. It shows something subtler and arguably more interesting: under enough scaffolding, they can preserve meaningful chess coherence longer than one might expect. Their collapse is often not immediate strategic incompetence, but delayed representational failure.

That distinction matters. A system that fails because it cannot search is different from a system that fails because it cannot stably perceive its own state. In chess, those look similar on the scoreboard. In research, they are different failure classes entirely.

An LLM playing chess may not be “bad” in one undifferentiated sense. It may instead be a partially competent player operating under severe perceptual and deliberative handicaps.

So the paper’s final claim is modest: the model was not fully seeing, and not fully thinking. Under those conditions, the surprising part is not that it eventually failed. The surprising part is how long coherence survived.