tutorialadvanced
Reproduce Claude's agentic search benchmark scores in the Messages API Jun 2026 • Evals Tools Build a Messages API harness that reproduces published DeepSearchQA and BrowseComp scores, using programmatic tool calling, server-side compaction, and task budgets.
cookbook
View original on cookbookThis cookbook demonstrates how to reproduce Claude's published agentic search benchmark scores (DeepSearchQA, BrowseComp) using the Messages API with programmatic tool calling, server-side compaction, and task budgets. The key is proper harness configuration—API parameters that become critical for agents running 30+ tool calls across hundreds of thousands of tokens. By following this guide, you'll build an agentic search loop that matches Claude's official benchmark performance and understand why each configuration choice matters for long-horizon tasks.
Key Points
- •Use programmatic tool calling (PTC) to enable Claude to call web_search and web_fetch from within a code_execution sandbox, reducing context bloat by keeping page bodies in-sandbox and returning only summaries
- •Configure adaptive thinking mode and max effort output to enable Claude to plan thoroughly and work harder on complex research tasks
- •Set task budgets (3M tokens recommended) to control total computational spend and prevent runaway agent loops
- •Enable server-side compaction with a 200K token trigger to automatically summarize conversation history while preserving the original question and key findings
- •Mark web_search and web_fetch as callable only from code_execution with response_inclusion set to 'excluded' to keep results out of the main conversation flow
- •Use extended timeout settings (60-minute read timeout) for streaming calls that may run long in-sandbox tool execution before returning results
- •Hold the grader model fixed (Claude Opus 4.6) across all test runs to ensure score comparability when testing different model versions
- •Wrap questions in a prompt template that asks for planning, multiple search rounds, and final answers in <result> tags for clean extraction and grading
- •Map source benchmark schemas (DeepSearchQA, BrowseComp) to required fields: problem, answer, and answer_type
- •Expect full 900-question benchmarks to cost several hundred dollars and take a few hours at moderate concurrency; demo runs with 3 questions cost a few dollars and take 5-10 minutes
Found this useful? Add it to a playbook for a step-by-step implementation guide.
Workflow Diagram
Start Process
Step A
Step B
Step C
Complete
Concepts
Artifacts (4)
Tool Configurationpythonconfig
TOOLS = [
{
"type": "code_execution_20260521",
"name": "code_execution"
},
{
"type": "web_search_20260318",
"name": "web_search",
"max_uses": 10_000,
"allowed_callers": ["code_execution_20260521"],
"response_inclusion": "excluded"
},
{
"type": "web_fetch_20260318",
"name": "web_fetch",
"max_uses": 10_000,
"max_content_tokens": 1_000_000,
"allowed_callers": ["code_execution_20260521"],
"response_inclusion": "excluded"
}
]
BETAS = [
"compact-2026-01-12",
"task-budgets-2026-03-13"
]Request Configurationpythonconfig
THINKING = {
"type": "adaptive"
}
OUTPUT_CONFIG = {
"effort": "max",
"task_budget": {
"type": "tokens",
"total": 3_000_000
}
}
COMPACTION_TRIGGER = 200_000
COMPACT_INSTRUCTIONS = (
"Your summary MUST begin by restating the user's ORIGINAL QUESTION "
"verbatim and in full, wrapped in <original_question> and "
"</original_question> tags. Then summarize..."
)Client Setuppythonscript
import anthropic
from dotenv import load_dotenv
load_dotenv()
MODEL = "claude-sonnet-5"
GRADER_MODEL = "claude-opus-4-6"
client = anthropic.Anthropic(
max_retries=20,
timeout=anthropic.Timeout(
5.0,
read=3600.0,
write=600.0,
pool=600.0
)
)Installation Commandbashcommand
pip install -U "anthropic>=0.111.0" pandas python-dotenv