tutorial · intermediate

Build an SRE incident response agent with Claude Managed Agents

Apr 2026 • Agent Patterns, Observability

A webhook-triggered responder that investigates logs and runbooks with a custom Skill, fixes infrastructure code, and gates the PR behind a human-approval custom tool, with the full audit trail in the Console.

April 9, 2026 · cookbook

This tutorial demonstrates building a webhook-triggered SRE incident response agent using Claude Managed Agents that automatically investigates production alerts, consults runbooks, proposes infrastructure fixes via pull requests, and gates merging behind human approval. The agent combines built-in sandbox tools (bash, read, edit) with custom tools for PR management and human-in-the-loop approval, providing complete audit trails in the Anthropic Console. The example uses mocked PagerDuty, GitHub, and Datadog integrations to focus on agent patterns, with guidance for swapping in real services.
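As a minimal sketch of the webhook-to-session handoff, the function below renders an alert payload as the text of the first user message. The payload fields and the helper name `alert_to_user_message` are assumptions for illustration. They are not PagerDuty's actual schema or part of the tutorial's code.

```python
import json


def alert_to_user_message(payload: dict) -> str:
    """Render a webhook alert payload as the text of the first user message."""
    # Sorting keys keeps the message identical if the same webhook is retried.
    body = json.dumps(payload, indent=2, sort_keys=True)
    return f"PagerDuty alert received:\n{body}"


# Hypothetical fixture: field names and values are illustrative only.
sample_alert = {
    "incident_id": "INC-0001",
    "service": "checkout-api",
    "severity": "critical",
    "summary": "Pod restarts exceeding threshold",
}

message = alert_to_user_message(sample_alert)
print(message)
```

Passing the whole payload through, rather than a hand-picked subset, is one way to let the same agent handle alert types nobody anticipated.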

Key Points

• Upload Skills to encode team conventions and runbook knowledge: the agent sees a one-line description first and reads the full content only when relevant
• Combine three tool types: built-in agent_toolset_20260401 (bash, read, grep, edit) for sandbox investigation, custom Skills for domain knowledge, and custom tools (open_pull_request, request_approval, merge_pull_request) for external system integration
• Gate destructive actions behind human approval: the agent calls request_approval() and proceeds to merge_pull_request() only if the human approves, preventing autonomous infrastructure changes
• Trigger agent sessions from webhook payloads (PagerDuty alerts): each alert becomes the first user message, so one agent handles any incident type
• Mount investigation resources (logs, infra repo, runbooks) via the Files API as environment resources so they're available in the sandbox at expected paths
• Follow a structured workflow: read logs → identify failure signature → find root cause → edit infrastructure code → open PR with unified diff → request human approval → merge only if approved
• Keep fixes minimal and focused: the system prompt explicitly instructs the agent not to refactor unrelated config, reducing risk and review burden
• Use the Anthropic Console for observability: every step, tool call, and decision is automatically recorded, giving a full audit trail for compliance and debugging
• Use mocked external services (PagerDuty, GitHub, Datadog) during development to focus on agent patterns; swap in real integrations by replacing the mock handlers with actual API calls
• Set ANTHROPIC_API_KEY as the only required credential; no external service keys are needed during local development with fixtures
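The approval gate described above can also be checked after the fact. The sketch below assumes you have exported a session's custom tool calls as an ordered list of `(tool_name, result)` tuples. That export format and the function name `merges_were_approved` are hypothetical, not something the Console provides in this shape.

```python
def merges_were_approved(tool_calls):
    """Return True if every merge_pull_request follows an approved request_approval.

    tool_calls is a list of (tool_name, result) tuples in call order.
    """
    approved = False
    for name, result in tool_calls:
        if name == "open_pull_request":
            # A new PR needs its own approval; earlier approvals do not carry over.
            approved = False
        elif name == "request_approval":
            approved = result == "approved"
        elif name == "merge_pull_request":
            if not approved:
                return False
            # One approval authorises one merge.
            approved = False
    return True


good_run = [
    ("open_pull_request", "pr_number=1"),
    ("request_approval", "approved"),
    ("merge_pull_request", "merged"),
]
bad_run = [
    ("open_pull_request", "pr_number=1"),
    ("request_approval", "rejected"),
    ("merge_pull_request", "merged"),
]

print(merges_were_approved(good_run))
print(merges_were_approved(bad_run))
```

A check like this could run in CI against recorded sessions to catch prompt regressions that weaken the gate.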

Concepts

Monitoring MCP Servers Skills & Tools Deployment Agent Teams Tool Use Security Automation Coding Workflows

Artifacts (3)

runbook_skill.md · markdown · config

---
name: incident-runbooks
description: How to triage production incidents using the team runbooks.
---

# Incident runbooks

When an alert references a service, locate that service's recent logs and identify the failure signature (the repeating error class, exit code, or status pattern). Consult the team runbooks before proposing any fix. Runbooks are organised by failure signature — for example `oom.md`, `5xx.md`, `latency.md`. Each one lists the triage steps for that class of failure and the configuration that usually needs to change. Any fix to infrastructure code must be opened as a pull request that cites the runbook you followed. Do not patch live resources directly.
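To make "identify the failure signature" concrete, here is a rough sketch that counts pattern hits in log lines and maps the most frequent signature to a runbook filename in the `oom.md` / `5xx.md` / `latency.md` style. The regular expressions, the log lines, and both function names are assumptions for illustration. In the tutorial the agent does this step itself with the sandbox tools.

```python
import re
from collections import Counter

# Hypothetical patterns: tune these to the log formats your services emit.
SIGNATURE_PATTERNS = {
    "oom": re.compile(r"OOMKilled|out of memory|exit code 137", re.IGNORECASE),
    "5xx": re.compile(r"\bstatus[=: ]+5\d\d\b", re.IGNORECASE),
    "latency": re.compile(r"timeout|deadline exceeded|slow request", re.IGNORECASE),
}


def identify_signature(log_lines):
    """Return the most frequent failure signature in the logs, or None."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in SIGNATURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def runbook_for(signature):
    """Map a signature to its runbook filename, e.g. 'oom' -> 'oom.md'."""
    return f"{signature}.md"


# Illustrative log lines, not taken from the tutorial's fixtures.
logs = [
    "2026-04-09T10:00:01Z checkout-api container terminated: OOMKilled",
    "2026-04-09T10:00:05Z checkout-api restarted, exit code 137",
    "2026-04-09T10:00:09Z checkout-api GET /cart status=502",
]

signature = identify_signature(logs)
print(signature, runbook_for(signature))
```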

sre_agent_creation.py · python · script

import json
import os
import time
from pathlib import Path
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()
MODEL = os.getenv("COOKBOOK_MODEL", "claude-opus-4-6")

# Step 1: Upload runbook skill
RUNBOOK_SKILL = """---
name: incident-runbooks
description: How to triage production incidents using the team runbooks.
---

# Incident runbooks
When an alert references a service, locate that service's recent logs and identify the failure signature.
Consult the team runbooks before proposing any fix.
Runbooks are organised by failure signature — for example `oom.md`, `5xx.md`, `latency.md`.
Each one lists the triage steps for that class of failure and the configuration that usually needs to change.
Any fix to infrastructure code must be opened as a pull request that cites the runbook you followed.
Do not patch live resources directly."""

skill = client.beta.skills.create(
    display_title="incident-runbooks",
    files=[("incident-runbooks/SKILL.md", RUNBOOK_SKILL.encode(), "text/markdown")],
)
print(f"skill: {skill.id} (version {skill.latest_version})")

# Step 2: Create the agent
SRE_SYSTEM_PROMPT = """You are an on-call SRE agent. Each user message is a PagerDuty alert payload.
Triage it to root cause and ship the minimal safe fix.
The session workspace contains the recent logs, the infrastructure repo, and the team runbooks for the alerting service.
Explore it to find what you need.

Workflow for every alert:
1. Read the logs and identify the failure signature.
2. Find the root cause in the infrastructure repo, save a copy of the original file, edit it in place, then produce a unified diff with `diff -u`.
3. open_pull_request(title, body, diff) with the fix.
4. request_approval(summary) and wait for the human's decision.
5. Only if the result is "approved", merge_pull_request(pr_number). Otherwise stop and report.

Never call merge_pull_request unless request_approval returned "approved".
Keep the fix minimal — do not refactor unrelated config."""

agent = client.beta.agents.create(
    name="cookbook-sre-responder",
    model=MODEL,
    system=SRE_SYSTEM_PROMPT,
    skills=[{"type": "custom", "skill_id": skill.id, "version": skill.latest_version}],
    tools=[
        {
            "type": "agent_toolset_20260401",
            "default_config": {
                "enabled": True,
                "permission_policy": {"type": "always_allow"},
            },
            "configs": [
                {"name": "web_search", "enabled": False},
                {"name": "web_fetch", "enabled": False},
            ],
        },
        {
            "type": "custom",
            "name": "open_pull_request",
            "description": "Open a pull request against the infra repo with the proposed fix.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "body": {"type": "string"},
                    "diff": {"type": "string", "description": "Unified diff of the change."},
                },
                "required": ["title", "body", "diff"],
            },
        },
        {
            "type": "custom",
            "name": "request_approval",
            "description": "Ask the on-call human to approve the proposed PR before merging.",
            "input_schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
            },
        },
        {
            "type": "custom",
            "name": "merge_pull_request",
            "description": "Merge an approved pull request.",
            "input_schema": {
                "type": "object",
                "properties": {"pr_number": {"type": "integer"}},
                "required": ["pr_number"],
            },
        },
    ],
)
print(f"agent: {agent.id} v{agent.version}")
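The three custom tools are declared to the agent but run on your side. One way to mock them is sketched below, under the assumption that your session loop hands you a tool name and an input dict, then sends the returned string back as the tool result. The class `MockIncidentTools`, its `dispatch` method, and the return strings are hypothetical; the sketch deliberately makes no Managed Agents API calls.

```python
import json


class MockIncidentTools:
    """In-memory stand-ins for the three custom tools; nothing leaves the process."""

    def __init__(self, ask_human):
        # ask_human is a callable(summary) returning "approved" or "rejected".
        self.ask_human = ask_human
        self.pull_requests = {}
        self.last_decision = None

    def open_pull_request(self, title, body, diff):
        pr_number = len(self.pull_requests) + 1
        self.pull_requests[pr_number] = {
            "title": title, "body": body, "diff": diff, "merged": False,
        }
        # A new PR invalidates any earlier approval.
        self.last_decision = None
        return json.dumps({"pr_number": pr_number})

    def request_approval(self, summary):
        self.last_decision = self.ask_human(summary)
        return self.last_decision

    def merge_pull_request(self, pr_number):
        # Host-side guard: refuse unless a human approved, whatever the model asks.
        if self.last_decision != "approved":
            return "refused: no approval on record"
        if pr_number not in self.pull_requests:
            return f"refused: unknown PR #{pr_number}"
        self.pull_requests[pr_number]["merged"] = True
        return f"merged PR #{pr_number}"

    def dispatch(self, name, tool_input):
        """Route a custom tool call by name to its handler."""
        handlers = {
            "open_pull_request": self.open_pull_request,
            "request_approval": self.request_approval,
            "merge_pull_request": self.merge_pull_request,
        }
        if name not in handlers:
            return f"error: unknown tool {name}"
        return handlers[name](**tool_input)


# Illustrative run with an auto-approving human and placeholder PR content.
tools = MockIncidentTools(ask_human=lambda summary: "approved")
opened = json.loads(tools.dispatch(
    "open_pull_request",
    {"title": "Example fix", "body": "Follows oom.md", "diff": "--- a\n+++ b\n"},
))
decision = tools.dispatch("request_approval", {"summary": "Example fix"})
result = tools.dispatch("merge_pull_request", {"pr_number": opened["pr_number"]})
print(decision, result)
```

Enforcing the approval check in the handler as well as in the system prompt means a merge cannot go through on the model's say-so alone.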

agent_tools_schema.json · json · config

{
  "tools": [
    {
      "type": "agent_toolset_20260401",
      "description": "Built-in sandbox tools for investigation",
      "capabilities": ["bash", "read", "grep", "edit", "diff"]
    },
    {
      "type": "custom",
      "name": "open_pull_request",
      "description": "Open a pull request against the infra repo with the proposed fix",
      "input_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "body": {"type": "string"},
          "diff": {"type": "string", "description": "Unified diff of the change"}
        },
        "required": ["title", "body", "diff"]
      }
    },
    {
      "type": "custom",
      "name": "request_approval",
      "description": "Ask the on-call human to approve the proposed PR before merging",
      "input_schema": {
        "type": "object",
        "properties": {
          "summary": {"type": "string"}
        },
        "required": ["summary"]
      }
    },
    {
      "type": "custom",
      "name": "merge_pull_request",
      "description": "Merge an approved pull request",
      "input_schema": {
        "type": "object",
        "properties": {
          "pr_number": {"type": "integer"}
        },
        "required": ["pr_number"]
      }
    }
  ]
}
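Before a mock or real handler acts on a custom tool call, it can be worth checking the input against the declared `input_schema`. The sketch below covers only the `required` and `type` keywords used in the schemas above. It is a hand-rolled illustration under that assumption, not a full JSON Schema validator, and `validate_tool_input` is a hypothetical helper.

```python
# Only the JSON types that appear in the schemas above are mapped.
JSON_TYPES = {"string": str, "integer": int, "object": dict}


def validate_tool_input(input_schema, tool_input):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for key in input_schema.get("required", []):
        if key not in tool_input:
            problems.append(f"missing required field: {key}")
    properties = input_schema.get("properties", {})
    for key, value in tool_input.items():
        spec = properties.get(key)
        if spec is None:
            continue  # unknown fields are ignored in this sketch
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is int and isinstance(value, bool):
            problems.append(f"{key}: expected integer")
        elif expected is not None and not isinstance(value, expected):
            problems.append(f"{key}: expected {spec['type']}")
    return problems


merge_schema = {
    "type": "object",
    "properties": {"pr_number": {"type": "integer"}},
    "required": ["pr_number"],
}

print(validate_tool_input(merge_schema, {"pr_number": 7}))
print(validate_tool_input(merge_schema, {"pr_number": "7"}))
print(validate_tool_input(merge_schema, {}))
```

For production use, a maintained JSON Schema library would be a safer choice than extending this by hand.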