Tutorial • Advanced

Reproduce Claude's agentic search benchmark scores in the Messages API

Jun 2026 • Evals • Tools

Build a Messages API harness that reproduces published DeepSearchQA and BrowseComp scores, using programmatic tool calling, server-side compaction, and task budgets.

July 1, 2026 • cookbook

This cookbook shows how to reproduce Claude's published agentic search benchmark scores (DeepSearchQA, BrowseComp) using the Messages API with programmatic tool calling, server-side compaction, and task budgets. The key is harness configuration: a handful of API parameters become critical once an agent makes 30+ tool calls across hundreds of thousands of tokens. By the end, you'll have an agentic search loop that matches Claude's official benchmark performance, and you'll understand why each configuration choice matters for long-horizon tasks.

Key Points

• Use programmatic tool calling (PTC) so Claude calls web_search and web_fetch from inside a code_execution sandbox. Page bodies stay in the sandbox and only summaries return to the conversation, which reduces context bloat.
• Configure adaptive thinking and max effort so Claude plans thoroughly and works harder on complex research tasks.
• Set a task budget (3M tokens recommended) to cap total spend and prevent runaway agent loops.
• Enable server-side compaction with a 200K-token trigger to summarize conversation history automatically while preserving the original question and key findings.
• Mark web_search and web_fetch as callable only from code_execution, with response_inclusion set to 'excluded', to keep raw results out of the main conversation.
• Use extended timeouts (60-minute read timeout) for streaming calls, since in-sandbox tool execution may run for a long time before any result comes back.
• Hold the grader model fixed (Claude Opus 4.6) across all runs so scores stay comparable when you test different model versions.
• Wrap each question in a prompt template that asks for planning, multiple search rounds, and a final answer in <result> tags for clean extraction and grading.
• Map the source benchmark schemas (DeepSearchQA, BrowseComp) to three required fields: problem, answer, and answer_type.
• Expect a full 900-question benchmark to cost several hundred dollars and take a few hours at moderate concurrency; a 3-question demo run costs a few dollars and takes 5-10 minutes.
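The schema mapping, prompt template, and <result> extraction described in the key points can be sketched as plain Python. This is a minimal illustration, not the cookbook's own code: the template wording, the function names (to_record, build_prompt, extract_result), and the source column names in the usage note are all assumptions. Only the three field names and the <result> tag come from the text above.

```python
import re
from typing import Optional

# Hypothetical wording: the source only says the template asks for planning,
# multiple search rounds, and a final answer in <result> tags.
PROMPT_TEMPLATE = (
    "Answer the question below. First write a short plan, then run several "
    "rounds of searching, and finish with your final answer inside "
    "<result></result> tags.\n\n"
    "Question: {problem}"
)

# The three fields the key points list as required.
REQUIRED_FIELDS = ("problem", "answer", "answer_type")

_RESULT_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL | re.IGNORECASE)


def to_record(row: dict, column_map: dict, defaults: Optional[dict] = None) -> dict:
    """Map one source-benchmark row onto the required fields.

    column_map maps a required field name to the source column holding it;
    defaults supplies a value for any field the source schema lacks.
    """
    defaults = defaults or {}
    record = {}
    for field in REQUIRED_FIELDS:
        source_col = column_map.get(field)
        if source_col is not None and source_col in row:
            record[field] = row[source_col]
        elif field in defaults:
            record[field] = defaults[field]
        else:
            raise KeyError(f"no source column or default for {field!r}")
    return record


def build_prompt(problem: str) -> str:
    """Wrap a benchmark question in the prompt template."""
    return PROMPT_TEMPLATE.format(problem=problem)


def extract_result(text: str) -> Optional[str]:
    """Return the contents of the last <result> block, or None if absent."""
    matches = _RESULT_RE.findall(text)
    return matches[-1].strip() if matches else None
```

For example, to_record({"question": "Q?", "gold": "42"}, {"problem": "question", "answer": "gold"}, {"answer_type": "Single Answer"}) yields a record with all three fields. Taking the last <result> block rather than the first is a design choice: it tolerates a response that mentions the tag while planning before giving its final answer.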

Concepts

Monitoring • MCP Servers • Skills & Tools • Research • Agent Teams • Tool Use • Automation • Coding Workflows

Artifacts (4)

Tool Configuration (python, config)

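TOOLS = [
  # The sandbox Claude writes code in; the other tools are called from here.
  {
    "type": "code_execution_20260521",
    "name": "code_execution"
  },
  # Search is callable only from the sandbox, and its raw results are
  # excluded from the main conversation.
  {
    "type": "web_search_20260318",
    "name": "web_search",
    "max_uses": 10_000,
    "allowed_callers": ["code_execution_20260521"],
    "response_inclusion": "excluded"
  },
  # Fetch follows the same pattern, so page bodies stay in the sandbox.
  {
    "type": "web_fetch_20260318",
    "name": "web_fetch",
    "max_uses": 10_000,
    "max_content_tokens": 1_000_000,
    "allowed_callers": ["code_execution_20260521"],
    "response_inclusion": "excluded"
  }
]

# Beta flags for server-side compaction and task budgets.
BETAS = [
  "compact-2026-01-12",
  "task-budgets-2026-03-13"
]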
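Because the context savings depend on every search and fetch tool being restricted to the sandbox and excluded from the conversation, it can help to assert that pattern before launching a long run. The checker below is a hypothetical helper (check_ptc_tools is not part of the anthropic SDK); it only inspects the tool dictionaries and makes no API call. The TOOLS list repeats the configuration above so the block is self-contained.

```python
def check_ptc_tools(tools, sandbox_name="code_execution"):
    """Return a list of problems; an empty list means the config fits the pattern.

    Every tool other than the sandbox must list the sandbox's type as its only
    allowed caller and must set response_inclusion to "excluded".
    """
    sandbox_types = [t["type"] for t in tools if t.get("name") == sandbox_name]
    if not sandbox_types:
        return [f"no {sandbox_name!r} tool declared"]
    sandbox_type = sandbox_types[0]

    problems = []
    for tool in tools:
        name = tool.get("name")
        if name == sandbox_name:
            continue
        if tool.get("allowed_callers") != [sandbox_type]:
            problems.append(f"{name}: allowed_callers should be [{sandbox_type!r}]")
        if tool.get("response_inclusion") != "excluded":
            problems.append(f"{name}: response_inclusion should be 'excluded'")
    return problems


# Same shape as the tool configuration in this cookbook.
TOOLS = [
    {"type": "code_execution_20260521", "name": "code_execution"},
    {
        "type": "web_search_20260318",
        "name": "web_search",
        "max_uses": 10_000,
        "allowed_callers": ["code_execution_20260521"],
        "response_inclusion": "excluded",
    },
    {
        "type": "web_fetch_20260318",
        "name": "web_fetch",
        "max_uses": 10_000,
        "max_content_tokens": 1_000_000,
        "allowed_callers": ["code_execution_20260521"],
        "response_inclusion": "excluded",
    },
]

if __name__ == "__main__":
    for problem in check_ptc_tools(TOOLS):
        print(problem)
```

Deriving the expected caller from the declared sandbox tool, instead of hard-coding the version string, keeps the check valid if the sandbox tool type is updated later.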

Request Configuration (python, config)

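# Adaptive thinking: Claude decides how much to plan on each turn.
THINKING = {
  "type": "adaptive"
}

# Max effort, with a 3M-token task budget to stop runaway agent loops.
OUTPUT_CONFIG = {
  "effort": "max",
  "task_budget": {
    "type": "tokens",
    "total": 3_000_000
  }
}

# Token count at which server-side compaction summarizes the history.
COMPACTION_TRIGGER = 200_000

# Instructions for the compaction summary, so the original question
# survives every compaction.
COMPACT_INSTRUCTIONS = (
  "Your summary MUST begin by restating the user's ORIGINAL QUESTION "
  "verbatim and in full, wrapped in <original_question> and "
  "</original_question> tags. Then summarize..."
)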
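COMPACT_INSTRUCTIONS requires every summary to open with the original question, verbatim, inside <original_question> tags. If your harness can see the compaction summaries (an assumption: the configuration above does not show how they are surfaced), a small check like the one below can flag a summary that dropped or paraphrased the question. The function name summary_preserves_question is hypothetical.

```python
import re

# The summary must *begin* with the tagged question; leading whitespace is tolerated.
_OPENING_RE = re.compile(
    r"\s*<original_question>(.*?)</original_question>", re.DOTALL
)


def summary_preserves_question(summary: str, question: str) -> bool:
    """True if the summary opens with the question, verbatim, in the tags."""
    match = _OPENING_RE.match(summary)
    if match is None:
        return False
    # Compare after trimming surrounding whitespace only; any other
    # difference counts as a paraphrase and fails the check.
    return match.group(1).strip() == question.strip()
```

The comparison is deliberately strict. A question that was reworded during compaction changes what the agent is answering, which is the failure the instructions are written to prevent.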

Client Setup (python, script)

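import anthropic
from dotenv import load_dotenv

# Load ANTHROPIC_API_KEY (and any other settings) from a local .env file.
load_dotenv()

MODEL = "claude-sonnet-5"
# Keep the grader fixed so scores stay comparable across model versions.
GRADER_MODEL = "claude-opus-4-6"

client = anthropic.Anthropic(
    max_retries=20,
    timeout=anthropic.Timeout(
        5.0,           # default for any phase not overridden below (connect)
        read=3600.0,   # 60 minutes: in-sandbox tool runs can be slow to return
        write=600.0,
        pool=600.0
    )
)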
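Full benchmark runs are quoted at a few hours "at moderate concurrency", so the harness needs to run several questions in parallel and survive individual failures. The sketch below uses the standard library's ThreadPoolExecutor; run_all, the run_one callback, and the default of 8 workers are all assumptions for illustration, and run_one stands in for whatever function drives one question through the agent loop.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(questions, run_one, max_workers=8):
    """Run run_one(question) for each question with bounded concurrency.

    Results come back in input order. A failure on one question is recorded
    as an error entry and does not abort the remaining questions.
    """
    results = [None] * len(questions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember each future's position so results can be put back in order.
        futures = {pool.submit(run_one, q): i for i, q in enumerate(questions)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = {"ok": True, "output": future.result()}
            except Exception as exc:
                results[index] = {"ok": False, "error": repr(exc)}
    return results


if __name__ == "__main__":
    # Stub in place of a real agent call, to show the result shape.
    def fake_run(question):
        return f"answer to {question}"

    for item in run_all(["q1", "q2", "q3"], fake_run, max_workers=2):
        print(item)
```

Recording errors per question, instead of letting the first exception propagate, means a single failed question late in a long run does not discard the results already collected.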

Installation Command (bash, command)

pip install -U "anthropic>=0.111.0" pandas python-dotenv