Show HN: A business sim where humans beat GPT-5 by 9.8x
By sumit_psp
Skyfall AI created Mini Amusement Parks (MAPs), a RollerCoaster Tycoon-style business-simulator benchmark for evaluating whether AI agents can manage real business operations under stochastic events, incomplete information, and resource constraints. Testing revealed that humans outperformed GPT-5 agents by 9.8x, with AI systems failing at long-term planning, maintenance prioritization, and handling randomness, suggesting that current LLMs lack the operational intelligence needed for true AI CEO capabilities.
Key Points
- Mini Amusement Parks (MAPs) is a RollerCoaster Tycoon-style business simulator designed to test whether AI agents can coherently operate a business with stochastic events, incomplete information, and resource constraints
- Humans outperformed GPT-5 agents by 9.8x, with even optimized models reaching <10% of human performance despite documentation, tool use, and sandbox practice modes
- Current LLM agents fail consistently due to: chasing flashy upgrades over profitable ones, ignoring maintenance/staffing/restocking, overreacting to noise, lacking long-term planning, and poor temporal reasoning
- LLMs can use tools effectively but cannot manage complex systems where randomness, time horizons, and spatial constraints matter — core requirements for real business operations
- True AI CEO capability requires foresight, risk modeling, temporal reasoning, causal understanding, prioritization under uncertainty, and adaptive planning — capabilities where current models fundamentally break
- The benchmark intentionally favored AI models with full documentation, step-by-step interfaces, sandbox exploration, extra observations, and multiple prompting strategies to ensure fair testing
- Sandbox training and practice modes often made agent performance worse, suggesting that LLMs struggle to transfer learning across dynamic, uncertain environments
- Current narratives about LLMs replacing CEOs or running entire companies are unsupported; an AI CEO must demonstrate operational intelligence, not just conversational ability or chain-of-thought reasoning
- The benchmark serves as a foundation for understanding what AI systems actually need to achieve enterprise-level decision-making and operational intelligence
- Community participation is invited to beat the models, critique the benchmark, and engage in honest discussion about realistic AI CEO capabilities versus marketing hype
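To make the failure modes above concrete, here is a minimal toy sketch of the kind of loop a tycoon-style benchmark runs: revenue collection, deterministic wear, random breakdowns, and a maintenance policy that spends limited cash. This is not the MAPs implementation — every function name, probability, and price below is invented for illustration — but it shows why "boring" maintenance prioritization under stochastic events matters for long-run profit.

```python
import random

def simulate_tick(cash, rides, rng):
    """Advance the park one step: collect revenue, decay ride condition,
    and occasionally trigger a random breakdown (the stochastic event)."""
    revenue = 0.0
    for ride in rides:
        if ride["condition"] > 0.2:          # badly worn rides earn nothing
            revenue += ride["ticket_price"] * ride["popularity"]
        ride["condition"] -= 0.05            # deterministic wear each tick
        if rng.random() < 0.1:               # 10% chance of a breakdown event
            ride["condition"] = max(0.0, ride["condition"] - 0.5)
    return cash + revenue, rides

def maintain(cash, rides, cost_per_repair=50.0):
    """Greedy maintenance policy: repair the worst rides first while cash
    lasts -- the unglamorous prioritization the post says agents neglect."""
    for ride in sorted(rides, key=lambda r: r["condition"]):
        if cash >= cost_per_repair and ride["condition"] < 0.5:
            ride["condition"] = 1.0
            cash -= cost_per_repair
    return cash

# Run a short episode with a fixed seed so results are reproducible.
rng = random.Random(0)
cash = 200.0
rides = [{"ticket_price": 5.0, "popularity": 20, "condition": 1.0},
         {"ticket_price": 8.0, "popularity": 10, "condition": 1.0}]
for _ in range(50):
    cash, rides = simulate_tick(cash, rides, rng)
    cash = maintain(cash, rides)
```

An agent that skips `maintain` to save cash for a flashy new ride looks fine for a few ticks, then watches revenue collapse as conditions decay — exactly the short-horizon trap the benchmark reportedly exposes.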