Agent Daily · discussion · advanced

Ask HN: Burning $100K/week on LLM tokens – what are you doing to cut costs?

By seenkitty on Hacker News

A high-volume LLM user spending $103K/week on AI coding agents shares cost-reduction strategies and asks for battle-tested approaches. Current tactics include prompt caching (~30% savings), model tiering (Haiku for simple tasks, Sonnet for complex), and aggressive history truncation. The user is exploring token compression, intelligent context routing, and self-hosted models as next steps, with all traffic routed through an OpenClaw gateway so that cost-optimization middleware can be injected at a single point.

Key Points

  • Implement prompt caching (Anthropic/OpenAI) for 30% cost reduction on cache hits—most effective for repeated system prompts and tool traces
  • Use model tiering strategy: deploy smaller models (Haiku) for routine tasks like linting; reserve larger models (Sonnet) for complex code generation
  • Aggressively truncate conversation history to reduce context window bloat—a primary driver of token waste in multi-turn agent interactions
  • Evaluate token-level compression middleware (LLMLingua, similar tools) at the API gateway layer before requests leave your infrastructure
  • Implement smart context routing: send only relevant context to each agent call instead of broadcasting full conversation state to every provider
  • Consider self-hosted models for the long tail of simple, repetitive tasks to eliminate per-token costs entirely
  • Centralize LLM routing through a single gateway (OpenClaw) to create a choke point for cost optimization middleware and multi-provider load balancing
  • Monitor tool trace overhead—long traces from agent execution are a major hidden cost driver; consider summarizing or filtering traces before re-sending
  • Measure cache hit rates and adjust caching strategy; caching only helps when the same prompts/context are reused across multiple requests
  • Benchmark real-world savings from each optimization—theoretical improvements often underperform at scale due to cache miss rates and routing overhead
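The model-tiering point above can be sketched as a small router. The task categories and model names here are illustrative placeholders, not identifiers from the thread:

```python
# Minimal sketch of a model-tiering router. Task categories and the model
# ids routed to are illustrative assumptions, not a documented API.

SIMPLE_TASKS = {"lint", "format", "rename"}       # cheap, mechanical work
COMPLEX_TASKS = {"codegen", "refactor", "debug"}  # needs a stronger model

def pick_model(task_type: str) -> str:
    """Route routine work to a small model, complex work to a larger one."""
    if task_type in SIMPLE_TASKS:
        return "claude-haiku"    # placeholder model id for the small tier
    if task_type in COMPLEX_TASKS:
        return "claude-sonnet"   # placeholder model id for the large tier
    return "claude-sonnet"       # unknown tasks default to the safer, larger tier
```

Defaulting unknown task types to the larger tier trades a little cost for quality; a stricter gateway could instead log unknown types and force explicit classification.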




Artifacts (3)

LLM Cost Optimization Checklist (config)
# LLM Token Cost Optimization Strategy

## Immediate Wins (30-40% savings)
- [ ] Enable prompt caching on Anthropic/OpenAI for system prompts
- [ ] Implement model tiering: Haiku for linting/simple tasks, Sonnet for code gen
- [ ] Set aggressive history truncation (keep last N turns only)
- [ ] Monitor cache hit rates; target >60% for cached prompts

## Medium-term (40-60% savings)
- [ ] Integrate token compression middleware (LLMLingua) at gateway
- [ ] Build context router: analyze task type → send minimal relevant context
- [ ] Implement tool trace summarization before re-sending to API
- [ ] A/B test smaller models (Haiku) for more task types

## Long-term (60%+ savings)
- [ ] Self-host models for long-tail simple tasks (linting, formatting)
- [ ] Build local embedding cache to avoid redundant context lookups
- [ ] Implement request deduplication at gateway level
- [ ] Consider fine-tuned smaller models for domain-specific tasks

## Monitoring
- Track cost per task type (linting vs. code gen vs. debugging)
- Monitor cache hit rates by prompt type
- Measure latency impact of compression/routing optimizations
- Set cost alerts by provider and model tier
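The "aggressive history truncation" item in the checklist can be sketched as follows. The message shape (`{"role", "content"}`) follows the common chat-API convention, and the turn count is a tuning knob, not a recommendation:

```python
# Sketch of aggressive history truncation: keep the system prompt plus only
# the last N conversational turns, dropping older context-window bloat.

def truncate_history(messages, keep_last_turns=4):
    """Return system messages plus the tail of the conversation.

    Assumes one turn = one user message + one assistant message, so the
    tail is keep_last_turns * 2 non-system messages.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last_turns * 2:]
```

Note that blind truncation can drop tool results an agent still needs; pairing this with the trace-summarization item above mitigates that.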
Gateway Middleware Architecture (template)
# OpenClaw Gateway Middleware Stack

```
Incoming Request
    ↓
[1] Request Deduplication
    - Hash request signature
    - Return cached response if identical request in-flight
    ↓
[2] Token Compression
    - Apply LLMLingua or similar
    - Target: 20-40% token reduction
    ↓
[3] Context Router
    - Analyze task type from request
    - Extract only relevant context from knowledge base
    - Drop redundant system prompts if cached
    ↓
[4] Prompt Cache Lookup
    - Check Anthropic/OpenAI cache for matching prefix
    - Route to cached endpoint if hit
    ↓
[5] Model Selection
    - Route simple tasks → Haiku
    - Route complex tasks → Sonnet
    - Route specialized tasks → Self-hosted
    ↓
[6] Provider Routing
    - Load balance across Claude, GPT, Gemini
    - Prefer cheapest provider for equivalent quality
    ↓
API Call → Response
```

## Implementation Priority
1. Prompt cache lookup (easiest, ~30% savings)
2. Model tiering (straightforward, ~15% savings)
3. History truncation (quick win, ~10% savings)
4. Token compression (moderate effort, ~20% savings)
5. Context routing (high effort, ~25% savings)
6. Self-hosting (long-term, 50%+ for long tail)
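Step [1] of the middleware stack, request deduplication, can be sketched in a few lines. The in-memory dict stands in for a real TTL store (e.g. Redis), and `call_provider` is a placeholder for the upstream API call:

```python
import hashlib
import json

# Sketch of gateway-level request deduplication: identical recent requests
# return the cached response instead of triggering a new provider call.
# An in-memory dict stands in for a production TTL cache.

_cache: dict[str, str] = {}

def request_signature(payload: dict) -> str:
    """Hash a canonical JSON encoding so key order doesn't change the hash."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def dedup_call(payload: dict, call_provider) -> str:
    """Forward to the provider only if this exact payload hasn't been seen."""
    key = request_signature(payload)
    if key not in _cache:
        _cache[key] = call_provider(payload)
    return _cache[key]
```

In production the cache entry would carry a short TTL and an "in-flight" marker so concurrent identical requests await one upstream call rather than each missing the cache.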
Cost Tracking Query (sql script)
-- SQL-style cost tracking for multi-provider LLM usage

SELECT 
    task_type,
    model_used,
    COUNT(*) as request_count,
    SUM(input_tokens) as total_input_tokens,
    SUM(output_tokens) as total_output_tokens,
    SUM(cost_usd) as total_cost,
    AVG(cost_usd) as avg_cost_per_request,
    SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) as cache_hits,
    ROUND(100.0 * SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) / COUNT(*), 2) as cache_hit_rate,
    SUM(CASE WHEN compression_applied THEN compressed_tokens ELSE input_tokens END) as effective_tokens
FROM llm_requests
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY task_type, model_used
ORDER BY total_cost DESC;

-- Identify cost anomalies
SELECT 
    task_type,
    DATE(timestamp) as date,
    SUM(cost_usd) as daily_cost,
    AVG(input_tokens) as avg_context_size,
    MAX(input_tokens) as max_context_size
FROM llm_requests
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY task_type, DATE(timestamp)
HAVING SUM(cost_usd) > (SELECT AVG(daily_cost) * 1.5 FROM (
    SELECT DATE(timestamp), SUM(cost_usd) as daily_cost FROM llm_requests GROUP BY DATE(timestamp)
) t)
ORDER BY daily_cost DESC;
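For pipelines without a SQL store, the seven-day rollup above can be reproduced in application code. This sketch assumes each request is a dict mirroring a row of the hypothetical `llm_requests` table:

```python
from collections import defaultdict

# Sketch of the per-(task_type, model) rollup from the SQL query above,
# computed in plain Python over request records.

def rollup(records):
    """Aggregate request count, total cost, and cache hit rate per group."""
    groups = defaultdict(lambda: {"requests": 0, "cost": 0.0, "hits": 0})
    for r in records:
        g = groups[(r["task_type"], r["model_used"])]
        g["requests"] += 1
        g["cost"] += r["cost_usd"]
        g["hits"] += 1 if r["cache_hit"] else 0
    for g in groups.values():
        g["cache_hit_rate"] = round(100.0 * g["hits"] / g["requests"], 2)
    return dict(groups)
```

Feeding the resulting per-group cache hit rates into the >60% target from the checklist closes the measurement loop the thread recommends.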