tutorial • intermediate

The site reliability agent

Feb 2026 • Claude Agent SDK • Agent Patterns

Build an incident response agent with read-write MCP tools for autonomous diagnosis, remediation, and post-mortem documentation.

March 8, 2026 • cookbook

This cookbook demonstrates building an autonomous SRE incident response agent using the Claude Agent SDK with read-write MCP tools for safe infrastructure access. The agent can investigate incidents by querying metrics and logs, diagnose root causes, apply remediations by editing configs and restarting services, and document post-mortems. The pattern uses a subprocess-based MCP server with scoped tool access, clear tool descriptions, and human-in-the-loop workflows to enable autonomous yet controlled incident response.
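
To make the subprocess pattern concrete, here is a minimal sketch of a tool server that speaks newline-delimited JSON-RPC 2.0 over stdin/stdout and answers the MCP `tools/list` and `tools/call` methods. It is not the cookbook's server: the tool description text, the stubbed `get_service_health` body, and the helper names `handle` and `serve` are assumptions for illustration, and the initialization handshake and notifications are omitted.

```python
import json
import sys

# Hypothetical tool definition; the description and schema are illustrative only.
TOOLS = [
    {
        "name": "get_service_health",
        "description": (
            "Return a health summary (error rate, latency, pool usage) for one "
            "service. Use this first when triaging an alert."
        ),
        "inputSchema": {
            "type": "object",
            "properties": {"service": {"type": "string"}},
            "required": ["service"],
        },
    }
]


def get_service_health(service: str) -> str:
    # Stub: a real server would query Prometheus here.
    return f"{service}: no data (stub)"


def handle(request: dict) -> dict:
    """Dispatch one JSON-RPC 2.0 request and build the matching response."""
    method, req_id = request.get("method"), request.get("id")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "get_service_health":
            return {"jsonrpc": "2.0", "id": req_id,
                    "error": {"code": -32602, "message": "Unknown tool"}}
        text = get_service_health(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one JSON-RPC request per line and write one response per line."""
    for line in stdin:
        if line.strip():
            stdout.write(json.dumps(handle(json.loads(line))) + "\n")
            stdout.flush()
```

Calling `serve()` at the bottom of the file turns it into a process the agent can launch and talk to over pipes. Because the only channel is stdin/stdout, the agent never shares memory or credentials with the tool code, which is the isolation the pattern is after.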

Key Points

• Build autonomous incident response agents that investigate, diagnose, remediate, and document incidents without human intervention
• Use MCP (Model Context Protocol) tools with restricted directory scoping, command allowlists, and validation hooks to safely grant agents write access to infrastructure
• Clear, detailed tool descriptions in JSON Schema are more effective than elaborate prompts for driving autonomous agent behavior and tool selection
• Synthesize multiple production signals (metrics, logs, alerts, configs) to build coherent diagnoses that single data sources cannot reveal
• Implement human-in-the-loop workflows that separate investigation phases from remediation phases, allowing humans to control when agents act autonomously
• Run the MCP tool server as a separate subprocess connected via stdin/stdout JSON-RPC for isolation and clean separation of concerns
• Instrument services with Prometheus metrics (error rates, latency, connection pools) to provide real-time visibility for incident detection
• Extend the agent with external platforms (PagerDuty, Confluence) by conditionally activating tools when API keys are configured
• Use Docker Compose for local infrastructure simulation during agent development and testing before deploying to production
• Structure tools across categories (Prometheus queries, infrastructure commands, diagnostics, documentation) to organize agent capabilities logically
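
The directory scoping and command allowlist mentioned above might be sketched as two small validation helpers that a write-capable tool calls before touching anything. The root directory, the allowlisted command prefixes, and the function names are hypothetical; the cookbook's actual policy and hook scripts may differ.

```python
import shlex
from pathlib import Path

# Hypothetical policy values; the cookbook's actual scope and allowlist may differ.
ALLOWED_ROOT = Path("config").resolve()
ALLOWED_COMMANDS = {("docker", "compose"), ("docker", "ps"), ("docker", "logs")}


def resolve_scoped_path(user_path: str, root: Path = ALLOWED_ROOT) -> Path:
    """Resolve user_path under root, rejecting anything that escapes it."""
    candidate = (root / user_path).resolve()
    # Catches both "../" traversal and absolute paths (Python 3.9+).
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{user_path!r} is outside {root}")
    return candidate


def validate_command(command: str) -> list[str]:
    """Split a command line; allow it only if its first two tokens are listed."""
    argv = shlex.split(command)
    if tuple(argv[:2]) not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command!r}")
    return argv
```

Returning the split argument list lets the caller pass it to `subprocess.run` without `shell=True`, so shell metacharacters such as `;` or `&&` are never interpreted by a shell.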

Concepts

Monitoring • MCP Servers • Skills & Tools • Deployment • Agent Teams • Tool Use • Security • Automation

Artifacts (5)

Environment Setup (config)

ANTHROPIC_API_KEY=your-key-here

Python Dependencies Installation (bash command)

pip install claude-agent-sdk httpx python-dotenv

Agent Initialization (python script)

import os
import sys
from pathlib import Path
from dotenv import load_dotenv
from claude_agent_sdk import (
    ClaudeAgentOptions,
    query,
    AssistantMessage,
    TextBlock,
    ToolUseBlock,
    ResultMessage,
)

load_dotenv()
if not os.environ.get("ANTHROPIC_API_KEY"):
    raise ValueError("ANTHROPIC_API_KEY not set. Add it to a .env file.")

MODEL = "claude-opus-4-6"

Infrastructure Setup Script (python script)

import os
import subprocess
import sys

# infra_setup.py generates:
# - config/docker-compose.yml (PostgreSQL, API server, traffic generator, Prometheus)
# - config/prometheus.yml (scrape config for metrics collection)
# - config/api-server.env (environment variables including DB_POOL_SIZE)
# - services/api_server.py (FastAPI app with Prometheus instrumentation)
# - scripts/traffic_generator.py (continuous load generation)
# - hooks/ (safety validation scripts)

# Fail early if the script is missing, then run it with the current interpreter.
assert os.path.exists("infra_setup.py"), "infra_setup.py not found"
subprocess.run([sys.executable, "infra_setup.py"], check=True)
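
Once the generated stack is running, a metrics tool can read from Prometheus through its HTTP API. The sketch below builds an instant query against `/api/v1/query` and flattens the returned vector into a dictionary. It assumes Prometheus is reachable on its default port at `http://localhost:9090`, and it uses the standard library rather than `httpx` to stay dependency-free; the function names are illustrative, not the cookbook's.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Prometheus default port


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for Prometheus's /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict) -> dict[str, float]:
    """Map each series' sorted label set to its sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    series = {}
    for item in payload["data"]["result"]:
        labels = ",".join(f"{k}={v}" for k, v in sorted(item["metric"].items()))
        # Each sample is [unix_timestamp, "value as string"].
        series[labels] = float(item["value"][1])
    return series


def query_metrics(promql: str) -> dict[str, float]:
    """Run an instant query and return {labels: value}. Needs a live Prometheus."""
    with urllib.request.urlopen(build_query_url(promql), timeout=5) as resp:
        return parse_instant_vector(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without a running Prometheus instance.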

MCP Server Tool Categories (template)

Tool Server Categories:

1. Prometheus Tools:
   - query_metrics: Query time-series metrics
   - list_metrics: Discover available metrics
   - get_service_health: Get health summaries

2. Infrastructure Tools:
   - read_config_file: Read configuration files
   - edit_config_file: Modify configuration files
   - run_shell_command: Execute Docker/system commands
   - get_container_logs: Inspect container logs

3. Diagnostics Tools:
   - get_logs: Retrieve application logs
   - get_alerts: Check alert history
   - get_recent_deployments: Track deployment changes
   - execute_runbook: Run structured playbooks

4. Documentation Tools:
   - write_postmortem: Document incident post-mortems
   - (Optional) PagerDuty integration
   - (Optional) Confluence integration
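
One way to combine the categories above with a human-in-the-loop workflow is to give the agent a different tool allowlist in each phase, and to add the optional integrations only when their credentials are present. In the sketch below, the server name `sre`, the read/write split, the environment variable names, and the PagerDuty and Confluence tool names are all assumptions for illustration. The `mcp__<server>__<tool>` form follows the naming the Agent SDK uses for MCP tools.

```python
import os

SERVER_NAME = "sre"  # hypothetical MCP server name

# Assumed split: which tools only observe and which can change state.
READ_ONLY_TOOLS = [
    "query_metrics", "list_metrics", "get_service_health",
    "read_config_file", "get_container_logs",
    "get_logs", "get_alerts", "get_recent_deployments",
]
WRITE_TOOLS = [
    "edit_config_file", "run_shell_command",
    "execute_runbook", "write_postmortem",
]

# Hypothetical env var and tool names for the optional integrations.
OPTIONAL_TOOLS = {
    "PAGERDUTY_API_KEY": ["pagerduty_create_incident"],
    "CONFLUENCE_API_TOKEN": ["confluence_publish_page"],
}


def allowed_tools(phase: str, env=os.environ) -> list[str]:
    """Return fully qualified tool names permitted in the given phase."""
    if phase not in ("investigate", "remediate"):
        raise ValueError(f"unknown phase: {phase!r}")
    names = list(READ_ONLY_TOOLS)
    if phase == "remediate":
        names += WRITE_TOOLS
        # Activate an integration only when its credential is configured.
        for key, tools in OPTIONAL_TOOLS.items():
            if env.get(key):
                names += tools
    return [f"mcp__{SERVER_NAME}__{name}" for name in names]
```

A run would start with `allowed_tools("investigate")`, and a human would review the diagnosis before a second run is launched with `allowed_tools("remediate")`.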