o3 Dominates Tetris, Crushes Gemini in New UCSD AI Benchmark


The world of artificial intelligence is leveling up — not through abstract reasoning alone, but by playing classic video games. A groundbreaking new benchmark, Lmgame Bench, developed by researchers at UCSD and collaborators, is redefining how we evaluate large language models (LLMs). By leveraging iconic games like Tetris, Pokémon, Sokoban, and 2048, this framework tests AI capabilities across perception, memory, and reasoning in ways traditional benchmarks cannot.

Unlike conventional evaluations focused on math or coding, Lmgame Bench taps into dynamic, interactive environments that mirror real-world complexity. The results? o3, a cutting-edge model, has emerged as a dominant force — especially in Tetris, where it outperformed Google’s Gemini and claimed top honors.

But why are retro games suddenly at the center of AI research?

Why Classic Games Are the Ultimate AI Testbed

At first glance, using 30-year-old games like Pokémon Red to assess state-of-the-art AI seems odd. But there's deep logic behind it.

Enter Moravec’s Paradox: tasks that are easy for humans — such as recognizing objects, navigating spaces, or making split-second decisions — are incredibly difficult for machines. Conversely, complex logical tasks like chess or theorem proving have long been within AI’s grasp.

This paradox explains why even the most advanced LLMs struggle with seemingly simple challenges. While they can generate essays or debug code, they often fail at basic perception-action loops — precisely what games like Pokémon demand.

To beat Pokémon Red, an AI must:

- Read the game screen and recognize characters, menus, and maps (perception)
- Track its party, items, and objectives over hours of play (memory)
- Plan multi-step strategies for battles, puzzles, and navigation (reasoning)

These requirements make Pokémon a powerful proxy for real-world intelligence — far more revealing than static question-answer tests.

Yet until now, evaluations lacked standardization. Anthropic’s Claude used custom tools to read game memory; Google’s Gemini relied on external code to parse game states. These advantages skewed comparisons.

That’s where Lmgame Bench changes the game.

Introducing Lmgame Bench: A Standardized Framework for AI Evaluation

Developed by UCSD and a team of AI researchers, Lmgame Bench introduces a modular, fair, and reproducible way to assess LLMs across multiple classic games. It eliminates reliance on proprietary scaffolding and instead uses a unified Gym-style API — familiar to reinforcement learning practitioners — ensuring consistent input/output formats across all games.
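To make the idea concrete, here is a minimal sketch of what a Gym-style text interface can look like, assuming a toy one-dimensional game; the class and its methods are illustrative stand-ins, not Lmgame Bench's actual code.

```python
# Minimal sketch of a Gym-style text environment.
# Illustrative only: the class and its toy game are hypothetical,
# not Lmgame Bench's actual API.

from dataclasses import dataclass


@dataclass
class TextGameEnv:
    """Toy environment exposing reset()/step() with text observations."""
    target: int = 3        # hypothetical goal column
    position: int = 0
    steps: int = 0
    max_steps: int = 20

    def reset(self) -> str:
        self.position, self.steps = 0, 0
        return self._observe()

    def step(self, action: str) -> tuple[str, float, bool, dict]:
        # Every game accepts the same action vocabulary and returns the
        # same (observation, reward, done, info) tuple.
        self.steps += 1
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position -= 1
        done = self.position == self.target or self.steps >= self.max_steps
        reward = 1.0 if self.position == self.target else 0.0
        return self._observe(), reward, done, {"steps": self.steps}

    def _observe(self) -> str:
        # Text description the model receives instead of raw pixels
        return f"You are at column {self.position}; the goal is column {self.target}."


env = TextGameEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step("right")   # an LLM would choose this from obs
print(obs, info)
```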

Each game measures different cognitive skills (the scoring rules are restated in a short code sketch after the descriptions):

🧠 Tetris – Spatial Reasoning & Real-Time Decision-Making

Score = Total blocks placed + (Rows cleared × 10)
Success hinges on rapid pattern recognition and efficient stacking. o3 excelled here, clearing over 10 lines consecutively and maintaining gameplay far longer than rivals — including Gemini 2.5 Pro, which faltered under pressure.

🧩 Sokoban – Planning & Logical Deduction

Score = Number of boxes correctly positioned across levels
This puzzle game demands foresight. One wrong move can create an unsolvable deadlock. o3 mastered even the hardest 1989 levels, showcasing superior long-term planning.

🍬 Candy Crush – Short-Term Optimization

Score = Candies eliminated in 50 fixed moves
Despite its simplicity, this game reveals how well models optimize limited actions. Surprisingly, o3 underperformed, suggesting a gap in short-horizon tactical thinking.

🎮 Super Mario Bros – Physics Intuition & Navigation

Score = Horizontal distance traveled before losing all lives
Models must infer gravity, momentum, and enemy behavior from visuals alone. Strong performers demonstrate implicit understanding of physical laws.

🔢 2048 – Strategic Depth & Memory Retention

Score = Sum of merged tile values until board stagnation
With potential play spans exceeding 100,000 steps, this game tests endurance and consistency. o3 achieved high scores through stable memory retention and recursive planning.

⚖️ Phoenix Wright: Ace Attorney – Contextual Reasoning

Score = Correct evidence submissions and dialogue choices before five failures
This narrative-driven game evaluates contextual comprehension and logical inference — crucial for real-world applications like legal analysis or customer support.
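The scoring rules above are simple enough to restate as plain functions. The sketch below mirrors the formulas as written in this section for three of the games; it is illustrative and not taken from the benchmark's codebase.

```python
# Illustrative restatement of the per-game scoring rules described above.
# These helpers mirror the formulas as written; they are not the
# benchmark's own implementation.

def tetris_score(blocks_placed: int, rows_cleared: int) -> int:
    # Total blocks placed + (rows cleared x 10)
    return blocks_placed + rows_cleared * 10


def sokoban_score(boxes_on_target_per_level: list[int]) -> int:
    # Boxes correctly positioned, summed across levels
    return sum(boxes_on_target_per_level)


def score_2048(merged_tile_values: list[int]) -> int:
    # Sum of merged tile values accumulated until the board stagnates
    return sum(merged_tile_values)


# Example: a Tetris run that places 40 blocks and clears 10 rows
print(tetris_score(blocks_placed=40, rows_cleared=10))   # 140
```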

The Three Pillars of Game-Smart AI

Lmgame Bench doesn’t just measure performance — it identifies how models succeed or fail. To do so, it integrates three core modules:

1. Perception Module

Converts raw pixels or UI elements into structured text descriptions. This reduces dependency on fragile vision systems and ensures consistent state interpretation.

2. Memory Module

Stores recent actions, game states, and self-reflections. This enables long-horizon planning — essential for games requiring delayed rewards or multi-stage objectives.

3. Reasoning Module

Synthesizes inputs from perception and memory, optionally activating chain-of-thought reasoning. This mimics human-like deliberation before acting.

Together, these modules simulate a complete cognitive loop: observe → remember → think → act.
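As a rough illustration, the sketch below wires toy versions of the three modules into that loop. The module split follows the description above, but the class and its rule-based reasoning step are hypothetical stand-ins for an actual LLM call.

```python
# Toy observe -> remember -> think -> act loop.
# The module split mirrors the description above; the rule in reason()
# is a placeholder for a real LLM call.

from collections import deque


class GameAgent:
    def __init__(self, memory_size: int = 10):
        # Memory module: a bounded log of recent observations and actions
        self.memory = deque(maxlen=memory_size)

    def perceive(self, raw_state: dict) -> str:
        # Perception module: turn structured game state into text
        return f"board={raw_state['board']}, score={raw_state['score']}"

    def reason(self, observation: str) -> str:
        # Reasoning module: combine memory with the new observation into a
        # prompt, then decide. A fixed rule keeps the sketch runnable.
        prompt = "\n".join([*self.memory, f"Now: {observation}", "Choose an action."])
        return "rotate" if "score=0" in prompt else "drop"

    def act(self, raw_state: dict) -> str:
        observation = self.perceive(raw_state)   # observe
        self.memory.append(observation)          # remember
        action = self.reason(observation)        # think
        self.memory.append(f"action: {action}")
        return action                            # act


agent = GameAgent()
print(agent.act({"board": "empty", "score": 0}))   # 'rotate'
```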

Why o3 Took the Crown (And Where It Fell Short)

Among 13 leading models tested, o3 consistently ranked at the top — particularly in Tetris, Sokoban, and 2048. Its strength lies in robust visual processing, advanced spatial reasoning, and exceptional long-term strategy formulation.

However, its weaker performance in Candy Crush highlights a critical insight: high performance in one domain doesn’t guarantee general intelligence. Even top-tier models exhibit skill imbalances — much like humans.

Gemini, while strong in knowledge retrieval, lagged due to reliance on external tools and slower decision cycles. In contrast, o3 operated within the standardized environment without crutches — proving true adaptability.

FAQ: Your Questions About Lmgame Bench Answered

Q: What makes Lmgame Bench better than previous benchmarks?
A: Unlike ad-hoc setups (e.g., custom APIs for Pokémon), Lmgame Bench uses a unified interface and open-source design. This allows fair comparison across models and games without hidden advantages.

Q: Can any LLM participate in Lmgame Bench?
A: Yes — the framework supports any model capable of processing text-based game states. With one command, users can run evaluations for any supported model-game pair.
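To illustrate the "one command per model-game pair" idea, here is a hypothetical launcher stub; the flag names and the run_eval helper are invented for this sketch and do not reproduce Lmgame Bench's real command-line interface.

```python
# Hypothetical launcher illustrating a single model-game evaluation command.
# Flag names and run_eval() are invented; consult the project's own
# documentation for the real interface.

import argparse


def run_eval(model: str, game: str) -> None:
    # Placeholder for loading the game environment and querying the model
    print(f"Evaluating {model} on {game} ...")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Toy model-game evaluation launcher")
    parser.add_argument("--model", required=True, help="e.g. o3 or gemini-2.5-pro")
    parser.add_argument("--game", required=True, help="e.g. tetris, sokoban, 2048")
    args = parser.parse_args()
    run_eval(args.model, args.game)
```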

Q: Is human-level AI close based on these results?
A: Not yet. While models like o3 show impressive skills, they still lack embodied experience and common sense. Games expose gaps in real-time adaptation and intuitive physics — areas where humans excel effortlessly.

Q: Are 3D or modern AAA games part of future plans?
A: Absolutely. The team envisions expanding to complex 3D environments like The Legend of Zelda or Minecraft. These would push models further into open-ended problem-solving.

Q: How does reinforcement learning improve LLM gameplay?
A: Even basic RL techniques enhance planning and action selection. When combined with LLMs, they enable trial-and-error learning — critical for mastering unpredictable environments.
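As a toy illustration of that trial-and-error idea, the sketch below applies epsilon-greedy selection over a fixed set of candidate actions and keeps a running value estimate for each one; the action list and simulated reward are invented, standing in for real feedback from a game environment.

```python
# Toy epsilon-greedy loop: value estimates improve from simulated feedback.
# Candidate actions and the reward function are illustrative only.

import random

actions = ["left", "right", "rotate", "drop"]   # candidates a model might propose
values = {a: 0.0 for a in actions}              # running value estimate per action
counts = {a: 0 for a in actions}
epsilon = 0.2                                   # exploration rate


def simulated_reward(action: str) -> float:
    # Stand-in for environment feedback: pretend "drop" tends to score best
    return 1.0 if action == "drop" else random.uniform(0.0, 0.5)


for _ in range(200):
    if random.random() < epsilon:
        action = random.choice(actions)          # explore
    else:
        action = max(values, key=values.get)     # exploit the current best
    reward = simulated_reward(action)
    counts[action] += 1
    # Incremental mean update of this action's value estimate
    values[action] += (reward - values[action]) / counts[action]

print(max(values, key=values.get))   # most valuable action after trial and error
```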

Q: Is Lmgame Bench publicly available?
A: Yes. The full codebase and documentation are open-source, encouraging community contributions and broader adoption in AI research.

The Future of AI Assessment Is Playful — But Profound

Lmgame Bench isn’t just about nostalgia. It represents a paradigm shift: true intelligence isn’t measured by answers on a test, but by sustained performance in dynamic worlds.

As AI evolves beyond chatbots and code generators, we need benchmarks that reflect real-world complexity — environments with uncertainty, partial observability, and evolving goals.

Classic games offer exactly that. They were designed to challenge human cognition. Now, they’re doing the same for machines.

And the journey has only begun. With modular design and extensible architecture, Lmgame Bench paves the way for future integration with modern titles, multi-agent scenarios, and even virtual economies.

The message is clear: if an AI can’t beat Tetris without help, can it really navigate life?

In the race toward artificial general intelligence, every line cleared counts.