Agent Daily
tutorial • intermediate

Managed Agents tutorial: prompt versioning and rollback
Apr 2026 • Agent Patterns • Evals

Server-side prompt versioning — create v1, evaluate against a labeled test set, ship v2, detect a regression, and roll back by pinning sessions to version 1. Covers agents.update, version pinning on sessions.create, and where the review gate moves when prompts are not code.

View original on cookbook

This tutorial demonstrates server-side prompt versioning and rollback for Managed Agents, enabling PMs to update agent prompts without code deployments. It covers creating an agent (v1), evaluating it against a labeled test set, shipping an updated prompt (v2), detecting performance regressions, and rolling back by pinning sessions to a specific version. The workflow replaces traditional code-based prompt management with immutable versioned prompts that can be quickly reverted if issues arise.

Key Points

  • Server-side prompt versioning eliminates the need for code PRs, CI runs, and deployments when updating agent behavior—agents.update creates immutable versions automatically
  • Every agent.create and agents.update produces a new version number; sessions pin to specific versions via {"type": "agent", "id": ..., "version": ...} on sessions.create
  • Evaluation workflow: score v1 against a labeled test set (ground truth), ship v2 with prompt changes, compare performance metrics across versions to detect regressions
  • Rollback is instantaneous—redirect callers to the old version by pinning sessions to v1 without any deployment or rebuild required
  • Review gates shift from code diffs to version numbers in config; production callers should always pin explicit versions, keeping change control in your SDLC while enabling rapid iteration
  • The triage helper pattern: open a session with pinned version, send a ticket, poll events.list until session.status_idle, parse agent.message events for the JSON verdict
  • Immutable versions sit on the server with no traffic until promoted; cheap to create, safe to test, and reversible without operational overhead
  • Per-team accuracy scoring enables data-driven decisions on whether to promote a new version or roll back based on performance degradation
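
The pinning shape from the points above can be sketched as a small helper that builds the `agent` parameter for `sessions.create`. The helper name is ours; the `{"type": "agent", ...}` dict shape comes from the tutorial:

```python
def pinned_agent_ref(agent_id: str, version: int) -> dict:
    """Build the `agent` parameter for sessions.create, pinned to one version.

    Hypothetical convenience wrapper; only the returned dict shape is
    taken from the tutorial itself.
    """
    if version < 1:
        raise ValueError("agent versions start at 1")
    return {"type": "agent", "id": agent_id, "version": version}

# Production callers pin explicitly, e.g. to roll back to v1:
rollback_ref = pinned_agent_ref("agent_abc123", 1)
print(rollback_ref)  # {'type': 'agent', 'id': 'agent_abc123', 'version': 1}
```

Centralizing the pin in one helper keeps the rollback to a single-argument change rather than a hunt through every `sessions.create` call site.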


Workflow Diagram

[Diagram: linear workflow from start through intermediate steps to completion, with a quality gate before promotion]

Artifacts (5)

setup_and_imports (python script)
import json
import os
import time
from collections import defaultdict
from pathlib import Path
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("ANTHROPIC_API_KEY"):
    raise RuntimeError("Set ANTHROPIC_API_KEY in your environment or .env file")

MODEL = os.environ.get("COOKBOOK_MODEL", "claude-sonnet-4-6")
client = Anthropic()
create_agent_v1 (python script)
env = client.beta.environments.create(name="ticket-triage-env")
ENV_ID = env.id

V1_SYSTEM = """You are a support-ticket triage agent for a usage-billed API product.
Read the ticket and respond with ONLY a single line of raw JSON (no code fences, no prose):
{"team": "<billing|auth|api-platform|dashboard>", "priority": "<P1|P2|P3>"}
Route based on the customer's actual problem, not surface keywords."""

agent = client.beta.agents.create(
    name="ticket-triage",
    model=MODEL,
    system=V1_SYSTEM
)
AGENT_ID = agent.id
print(f"env={ENV_ID} agent={AGENT_ID} v{agent.version}")
triage_and_score_functions (python script)
def triage(version: int, ticket: dict) -> dict:
    """Run one ticket through a pinned agent version and return its verdict."""
    session = client.beta.sessions.create(
        agent={
            "type": "agent",
            "id": AGENT_ID,
            "version": version
        },
        environment_id=ENV_ID,
    )
    try:
        prompt = "Subject: " + ticket["subject"] + "\n\n" + ticket["body"]
        client.beta.sessions.events.send(
            session.id,
            events=[{
                "type": "user.message",
                "content": [{"type": "text", "text": prompt}]
            }],
        )
        deadline = time.time() + 60
        while time.time() < deadline:
            events = client.beta.sessions.events.list(session.id).data
            if events and events[-1].type == "session.status_idle":
                break
            time.sleep(1)
        else:
            raise TimeoutError(f"session {session.id} did not idle within 60s")
        
        agent_events = [e for e in events if e.type == "agent.message"]
        reply = "".join(
            b.text for e in agent_events for b in e.content
            if getattr(b, "type", "") == "text"
        )
        return json.loads(reply)
    finally:
        try:
            client.beta.sessions.archive(session.id)
        except Exception:
            pass

def score(version: int) -> dict:
    """Evaluate every ticket in the module-level `tickets` test set against
    the given version and return per-team [correct, total] counts."""
    hits = defaultdict(lambda: [0, 0])  # team -> [correct, total]
    for t in tickets:
        pred = triage(version, t)
        hits[t["team"]][1] += 1
        if pred.get("team") == t["team"]:
            hits[t["team"]][0] += 1
    return dict(hits)
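
A minimal regression check over two `score()` results might look like the following. The helper name and zero tolerance are our own choices; the `{team: [correct, total]}` shape matches what `score()` returns above:

```python
def regressions(old: dict, new: dict, tolerance: float = 0.0) -> dict:
    """Return teams whose accuracy dropped from old to new by more than tolerance."""
    dropped = {}
    for team, (correct, total) in old.items():
        old_acc = correct / total
        new_correct, new_total = new.get(team, (0, 1))
        new_acc = new_correct / new_total
        if old_acc - new_acc > tolerance:
            dropped[team] = (old_acc, new_acc)
    return dropped

# v1 numbers from this tutorial, paired with a hypothetical v2 result:
v1 = {"api-platform": [5, 5], "auth": [5, 5], "billing": [4, 5], "dashboard": [5, 5]}
v2 = {"api-platform": [5, 5], "auth": [3, 5], "billing": [5, 5], "dashboard": [5, 5]}
print(regressions(v1, v2))  # {'auth': (1.0, 0.6)}
```

An empty result means no team regressed and v2 is safe to promote; any entries name the teams whose accuracy dropped and argue for pinning callers back to v1.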
update_agent_v2 (python script)
V2_SYSTEM = V1_SYSTEM + (
    "\n\nROUTING RULE: If the ticket text mentions API usage, rate limits, quotas, "
    "or request volume, route to api-platform. Apply this rule before any other "
    "consideration; do not second-guess it based on the rest of the ticket."
)

agent = client.beta.agents.update(
    AGENT_ID,
    version=agent.version,
    system=V2_SYSTEM
)
print(f"agent {AGENT_ID} now at v{agent.version}")
evaluation_workflow (workflow)
1. Create Agent v1
   - Define system prompt for ticket triage
   - agents.create returns version: 1

2. Load Test Set
   - Load labeled tickets with ground truth routing
   - 20 tickets across 4 teams (api-platform, auth, billing, dashboard)

3. Score v1
   - Run each ticket through pinned version 1
   - Pin sessions with {"type": "agent", "id": AGENT_ID, "version": 1}
   - Calculate per-team accuracy
   - v1 results: api-platform 5/5, auth 5/5, billing 4/5, dashboard 5/5

4. Ship v2
   - Update system prompt with new routing rule
   - agents.update creates version: 2
   - No deployment required

5. Score v2
   - Run same test set through pinned version 2
   - Compare accuracy metrics

6. Detect Regression
   - If v2 performance degrades, prepare rollback

7. Rollback (if needed)
   - Pin production sessions to version 1
   - No code changes, no deployment
   - Instant revert
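
One way to keep the review gate in config, as the key points suggest: read the pinned version from an environment variable, so a rollback is a one-line config change rather than a deploy. The variable name and default below are our own convention, not part of the API:

```python
import os

def session_params(agent_id: str, env_id: str) -> dict:
    """Assemble sessions.create kwargs with the version pinned via config.

    AGENT_PINNED_VERSION is a hypothetical config knob; flipping it from
    "2" back to "1" is the whole rollback -- no code change, no rebuild.
    """
    version = int(os.environ.get("AGENT_PINNED_VERSION", "1"))
    return {
        "agent": {"type": "agent", "id": agent_id, "version": version},
        "environment_id": env_id,
    }

os.environ["AGENT_PINNED_VERSION"] = "1"  # rollback: pin callers back to v1
params = session_params("agent_abc123", "env_xyz")
print(params["agent"]["version"])  # 1
```

Because the pin lives in config, the review that used to happen on a code diff happens on this one value instead, keeping change control in the SDLC while the revert itself stays instant.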