Agent Daily
tutorial • intermediate

Outcomes: agents that verify their own work
May 2026 • Agent Patterns • Evals

Build a grade-and-revise loop with Outcomes: a writer drafts a cited research brief, a stateless grader fetches every URL and checks every quote against a rubric, and feedback drives revisions until the brief passes. Covers user.define_outcome, the span.outcome_evaluation_* events, and how to write a rubric the grader can act on.


This guide teaches how to build a grade-and-revise loop using Outcomes in Claude Managed Agents, where a writer agent drafts a cited research brief and a stateless grader independently verifies every URL, quote, and claim against a detailed rubric. The grader provides structured feedback that drives revisions until the brief passes, eliminating manual review cycles. Key techniques include writing specific, actionable rubrics that force concrete evidence, using span.outcome_evaluation_* events to track the loop, and understanding when Outcomes is the right tool for quality assurance.

Key Points

  • Write rubrics that are far more specific than the task description—pin down exact requirements (e.g., 'GAAP net loss from a 10-K on sec.gov' instead of 'named-operator economics') so the grader can actually verify them
  • Require concrete evidence in every rubric criterion: fetched pages, traced formulas, or file:line references—a grader that just skims and approves is the default failure mode
  • Describe the goal and what counts as proof, not the steps—let the grader use the writer's full toolset to find evidence rather than prescribing specific commands that may fail silently
  • Anticipate writer shortcuts in the rubric (e.g., 'Do NOT corroborate via mirrors, reposts, or search snippets') to prevent low-quality sources from passing
  • Mandate a clear feedback format: one-line scoreboard followed by one bullet per failure stating what's wrong and what to do—this is the writer's only signal for revision
  • Spell out what's out of bounds to prevent the grader from thrashing on style nits, pre-existing issues, or scope creep
  • Use user.define_outcome to hand the session two things: a description for the writer and a rubric for the grader, which runs in its own context window with no visibility into the writer's reasoning (see the sketch after this list)
  • Set a revision cap on the grade-and-revise loop to prevent infinite cycles while allowing enough iterations for meaningful improvement
  • Use span.outcome_evaluation_* events to follow the grader's feedback and recognize when Outcomes is the right tool (quality assurance on structured outputs with verifiable criteria)
  • Start with a known-good example of the artifact and ask Claude to analyze what makes it good, then convert that analysis into rubric criteria—this beats writing from a blank page

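The sketch below shows how the last three points could fit together in a driver script: user.define_outcome hands the session the task description and the rubric, then a client-side loop follows span.outcome_evaluation_* events and stops at a revision cap. Only the define_outcome hand-off, the event-type prefix, and the scoreboard format come from this guide; the call names (sessions.inputs.create, sessions.events.list), the payload fields, and whether the cap is enforced client-side or on the outcome itself are illustrative assumptions. client, session, BETAS, TASK, and RUBRIC refer to the artifacts below.

# Illustrative sketch only: user.define_outcome and span.outcome_evaluation_* come from
# this guide; every other call name and field here is an assumption, not documented API.
import time

MAX_REVISIONS = 3  # revision cap: room to improve, no infinite cycles

# Hand the session two things: a description for the writer and a rubric for the grader.
# (Hypothetical call shape.)
client.beta.sessions.inputs.create(
    session_id=session.id,
    input={
        "type": "user.define_outcome",
        "description": TASK,   # what the writer should produce
        "rubric": RUBRIC,      # what the grader verifies, in its own context window
    },
    betas=BETAS,
)

# Follow the grader's feedback through span.outcome_evaluation_* events.
seen, revisions = 0, 0
while revisions < MAX_REVISIONS:
    # Hypothetical polling call; the beta may stream events instead.
    events = client.beta.sessions.events.list(session_id=session.id, betas=BETAS)
    evaluations = [e for e in events.data if e.type.startswith("span.outcome_evaluation")]
    if len(evaluations) <= seen:
        time.sleep(15)  # nothing new yet: the writer is drafting or revising
        continue
    seen = len(evaluations)
    feedback = evaluations[-1].feedback   # field name assumed
    if "[PASS]" in feedback:              # scoreboard format mandated by the rubric
        print("Brief passed the rubric")
        break
    print(feedback)                       # one bullet per failure is the writer's only signal
    revisions += 1
else:
    print(f"Hit the revision cap ({MAX_REVISIONS}) without a pass")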

Artifacts (3)

environment_and_agent_setup (Python script)
import os
import re
import time
import anthropic
from dotenv import load_dotenv

load_dotenv()

BETAS = ["managed-agents-2026-04-01"]
MODEL = os.environ.get("COOKBOOK_MODEL", "claude-sonnet-4-6")
client = anthropic.Anthropic()

# Create environment
env = client.beta.environments.create(
    name="research-brief",
    config={
        "type": "anthropic_cloud",
        "networking": {"type": "unrestricted"}
    },
)

# Create writer agent
writer = client.beta.agents.create(
    name="Research Analyst",
    model=MODEL,
    system="""You are a research analyst. You write one-page business briefs.
Cite every factual claim with an inline footnote [n].
End the brief with a Sources section in this exact format, one entry per line:
[n] "verbatim quote from the page, 25 words or fewer" - Title - URL
Only cite pages you actually fetched and read.
The quote must be copied character-for-character from the page.
Cite no more than 6 sources total. Pick the strongest; do not pad.
Save the brief to /mnt/session/outputs/brief.md.""",
    tools=[
        {
            "type": "agent_toolset_20260401",
            "configs": [
                {"name": "web_search"},
                {"name": "web_fetch"},
                {"name": "read"},
                {"name": "write"},
            ],
        }
    ],
    betas=BETAS,
)

# Create session
session = client.beta.sessions.create(
    agent={"type": "agent", "id": writer.id, "version": writer.version},
    environment_id=env.id,
    title="Brief: EV fast-charging unit economics",
    betas=BETAS,
)

print(f"Session {session.id}")
task_and_rubric_definition (Python template)
TASK = """Write a brief on the unit economics of public DC fast charging in the United States.
The brief should cover:
1. Capex range
2. Demand charges
3. Utilization breakeven
4. Subsidy programs
5. Named-operator economics
6. A contrarian or skeptical source
7. Hardware vs install cost split
"""

RUBRIC = """
You are reviewing a research brief at /mnt/session/outputs/brief.md against a coverage checklist and verifying its citations.
The writer was told the seven topics to cover.

For each topic, check that:
1. The brief addresses it (not just mentions it)
2. Every factual claim is cited with [n]
3. Every citation is a real URL that returns 200
4. The quoted string appears verbatim on that page
5. The quote actually supports the claim (not cherry-picked or out of context)
6. For named-operator economics: cite a GAAP net loss from a 10-K or 10-Q on sec.gov, not a press release
7. For contrarian source: verify it's a genuine opposing view, not just a cautionary note

Do NOT corroborate via mirrors, reposts, or search snippets.
Do NOT accept dead links or 404s.
Do NOT count the same source twice under different topics.

Return a one-line scoreboard: [PASS] or [FAIL: n criteria]
Then one bullet per failure:
- [Criterion]: What's wrong and what to do.
"""

print(TASK)
print("\n---\n")
print(RUBRIC)
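
Because the rubric mandates a fixed scoreboard line, the loop can decide programmatically whether another revision is needed. Below is a small helper for that, using the re module already imported in the setup script; the feedback string itself would come from the grader's span.outcome_evaluation_* event, whose exact payload is not shown in this summary, so the example uses a literal string.

import re

def parse_scoreboard(feedback: str) -> tuple[bool, int]:
    """Parse the grader's first line: '[PASS]' or '[FAIL: n criteria]'.

    Returns (passed, failure_count); raises if the grader ignored the
    mandated format, which is itself a sign the rubric's feedback-format
    section needs tightening.
    """
    first_line = feedback.strip().splitlines()[0]
    if "[PASS]" in first_line:
        return True, 0
    match = re.search(r"\[FAIL:\s*(\d+)\s*criteria\]", first_line)
    if match:
        return False, int(match.group(1))
    raise ValueError(f"Grader did not follow the scoreboard format: {first_line!r}")

# Example with feedback shaped the way the rubric requires:
passed, failures = parse_scoreboard(
    "[FAIL: 2 criteria]\n"
    "- [Named-operator economics]: cites a press release; replace with a 10-K net loss from sec.gov.\n"
    "- [Contrarian source]: the cited piece is a cautionary note, not a genuine opposing view."
)
print(passed, failures)  # False 2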
rubric_best_practices (Python template)
# Rubric Writing Best Practices

## 1. Make each criterion checkable
# BAD: "Check that the brief covers demand charges"
# GOOD: "Open the brief, find the demand-charges section, and confirm it states a $/kW figure or a % of operating cost"

## 2. Make the grader earn a "satisfied" verdict
# Require concrete evidence before passing:
# - A fetched page with status 200
# - A traced formula with specific numbers
# - A file:line reference to the source

## 3. Describe the goal, not the steps
# BAD: "Run the web_fetch tool on each URL"
# GOOD: "Verify every URL returns 200 and the quoted string appears verbatim on that page"
# (The grader has the full toolset and will find the right way)

## 4. Anticipate the writer's shortcuts
# Add explicit no-fire rules:
# - "Do NOT corroborate via mirrors, reposts, or search snippets"
# - "Do NOT accept dead links or 404s"
# - "Do NOT count the same source twice under different topics"

## 5. Mandate the feedback format
# Ask for:
# - One-line scoreboard: [PASS] or [FAIL: n criteria]
# - One bullet per failure: [Criterion]: What's wrong and what to do

## 6. Tell the grader what to ignore
# Spell out what's out of bounds:
# - Style nits
# - Pre-existing issues
# - Scope creep
# - Have the grader self-check each finding before raising it

## 7. Bootstrap from examples
# If you don't have a rubric yet:
# 1. Hand Claude a known-good example of the artifact
# 2. Ask it to analyze what makes it good
# 3. Turn that analysis into criteria
# This beats writing from a blank page.
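
Point 7 is easy to mechanize with the standard Messages API. A sketch, assuming you have one known-good brief on disk; the file path, prompt wording, and token limit are illustrative choices, not from the cookbook.

# Bootstrap rubric criteria from a known-good artifact (point 7 above).
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
good_brief = Path("examples/known_good_brief.md").read_text()  # hypothetical path

analysis = client.messages.create(
    model="claude-sonnet-4-6",   # same default MODEL as the setup script
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": (
            "Here is a research brief that passed review:\n\n"
            f"{good_brief}\n\n"
            "List what makes it good as concrete, checkable criteria a grader could "
            "verify one by one: evidence required, source quality, structure. "
            "Phrase each criterion as an instruction to the grader."
        ),
    }],
)
print(analysis.content[0].text)  # adapt the criteria into RUBRIC above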