Show HN: A business sim where humans beat GPT-5 by 9.8x
By sumit_psp
Skyfall AI created Mini Amusement Parks (MAPs), a RollerCoaster Tycoon-style business-simulator benchmark for evaluating whether AI agents can manage real business operations under stochastic events, incomplete information, and resource constraints. Testing revealed that humans outperformed GPT-5 agents by 9.8x, with AI systems failing at long-term planning, maintenance prioritization, and handling randomness, suggesting that current LLMs lack the operational intelligence needed for true AI CEO capabilities.
Key Points
- Mini Amusement Parks (MAPs) is a RollerCoaster Tycoon-style business simulator designed to test whether AI agents can coherently operate a business with stochastic events, incomplete information, and resource constraints
- Humans outperformed GPT-5 agents by 9.8x, with even optimized models reaching <10% of human performance despite documentation, tool use, and sandbox practice modes
- Current LLM agents fail consistently due to: chasing flashy upgrades over profitable ones, ignoring maintenance/staffing/restocking, overreacting to noise, lacking long-term planning, and poor temporal reasoning
- LLMs can use tools effectively but cannot manage complex systems where randomness, time horizons, and spatial constraints matter — core requirements for real business operations
- True AI CEO capability requires foresight, risk modeling, temporal reasoning, causal understanding, prioritization under uncertainty, and adaptive planning — capabilities where current models fundamentally break
- The benchmark intentionally favored AI models with full documentation, step-by-step interfaces, sandbox exploration, extra observations, and multiple prompting strategies to ensure fair testing
- Sandbox training and practice modes often made agent performance worse, suggesting that LLMs struggle to transfer learning across dynamic, uncertain environments
- Current narratives about LLMs replacing CEOs or running entire companies are unsupported; an AI CEO must demonstrate operational intelligence, not just conversational ability or chain-of-thought reasoning
- The benchmark serves as a foundation for understanding what AI systems actually need to achieve enterprise-level decision-making and operational intelligence
- Community participation is invited to beat the models, critique the benchmark, and engage in honest discussion about realistic AI CEO capabilities versus marketing hype
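To make the failure modes above concrete, here is a minimal toy sketch of the kind of loop a tycoon-style benchmark runs: revenue collection, deterministic wear, random breakdowns, and a maintenance policy that spends limited cash. This is not the MAPs implementation — every function name, probability, and price below is invented for illustration — but it shows why "boring" maintenance prioritization under stochastic events matters for long-run profit.

```python
import random

def simulate_tick(cash, rides, rng):
    """Advance the park one step: collect revenue, decay ride condition,
    and occasionally trigger a random breakdown (the stochastic event)."""
    revenue = 0.0
    for ride in rides:
        if ride["condition"] > 0.2:          # badly worn rides earn nothing
            revenue += ride["ticket_price"] * ride["popularity"]
        ride["condition"] -= 0.05            # deterministic wear each tick
        if rng.random() < 0.1:               # 10% chance of a breakdown event
            ride["condition"] = max(0.0, ride["condition"] - 0.5)
    return cash + revenue, rides

def maintain(cash, rides, cost_per_repair=50.0):
    """Greedy maintenance policy: repair the worst rides first while cash
    lasts -- the unglamorous prioritization the post says agents neglect."""
    for ride in sorted(rides, key=lambda r: r["condition"]):
        if cash >= cost_per_repair and ride["condition"] < 0.5:
            ride["condition"] = 1.0
            cash -= cost_per_repair
    return cash

# Run a short episode with a fixed seed so results are reproducible.
rng = random.Random(0)
cash = 200.0
rides = [{"ticket_price": 5.0, "popularity": 20, "condition": 1.0},
         {"ticket_price": 8.0, "popularity": 10, "condition": 1.0}]
for _ in range(50):
    cash, rides = simulate_tick(cash, rides, rng)
    cash = maintain(cash, rides)
```

An agent that skips `maintain` to save cash for a flashy new ride looks fine for a few ticks, then watches revenue collapse as conditions decay — exactly the short-horizon trap the benchmark reportedly exposes.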