Ask HN: Burning $100K/week on LLM tokens – what are you doing to cut costs?
By seenkitty on Hacker News
A high-volume LLM user spending ~$103K/week on AI coding agents shares cost-reduction strategies and asks for battle-tested approaches. Current tactics include prompt caching (~30% savings), model tiering (Haiku for simple tasks, Sonnet for complex ones), and aggressive history truncation. The user is exploring token compression, intelligent context routing, and self-hosted models as next steps, with all traffic centralized at an OpenClaw gateway where cost-optimization middleware can be injected.
Key Points
- Implement prompt caching (Anthropic/OpenAI) for ~30% cost reduction on cache hits; most effective for repeated system prompts and tool traces
- Use a model-tiering strategy: deploy smaller models (Haiku) for routine tasks like linting and reserve larger models (Sonnet) for complex code generation
- Aggressively truncate conversation history to curb context-window bloat, a primary driver of token waste in multi-turn agent interactions
- Evaluate token-level compression middleware (LLMLingua or similar tools) at the API gateway layer before requests leave your infrastructure
- Implement smart context routing: send only the relevant context to each agent call instead of broadcasting the full conversation state to every provider
- Consider self-hosted models for the long tail of simple, repetitive tasks to eliminate per-token costs entirely
- Centralize LLM routing through a single gateway (OpenClaw) to create a choke point for cost-optimization middleware and multi-provider load balancing
- Monitor tool-trace overhead: long traces from agent execution are a major hidden cost driver; consider summarizing or filtering traces before re-sending
- Measure cache hit rates and adjust the caching strategy; caching only helps when the same prompts/context are reused across multiple requests
- Benchmark real-world savings from each optimization; theoretical improvements often underperform at scale due to cache-miss rates and routing overhead
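The model-tiering point above can be sketched as a small router. This is a minimal illustration, not the poster's implementation: the task labels, token threshold, and tier names (`"claude-haiku"`, `"claude-sonnet"`) are all placeholder assumptions, not real model IDs.

```python
# Sketch of a model-tiering router. Task labels, the token threshold,
# and the returned tier names are illustrative placeholders.

SIMPLE_TASKS = {"lint", "format", "rename"}       # routine, low-stakes work
COMPLEX_TASKS = {"codegen", "refactor", "debug"}  # needs the stronger model

def pick_model(task_type: str, context_tokens: int) -> str:
    """Route cheap tasks to a small model, everything else to a large one."""
    if task_type in SIMPLE_TASKS and context_tokens < 8_000:
        return "claude-haiku"   # placeholder tier name, not a real model ID
    return "claude-sonnet"      # placeholder tier name

print(pick_model("lint", 2_000))     # small-model tier
print(pick_model("codegen", 2_000))  # large-model tier
```

Note that large contexts fall through to the big model even for "simple" tasks; a long lint job can still need strong long-context handling.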
Artifacts (3)
LLM Cost Optimization Checklist (config)
# LLM Token Cost Optimization Strategy
## Immediate Wins (30-40% savings)
- [ ] Enable prompt caching on Anthropic/OpenAI for system prompts
- [ ] Implement model tiering: Haiku for linting/simple tasks, Sonnet for code gen
- [ ] Set aggressive history truncation (keep last N turns only)
- [ ] Monitor cache hit rates; target >60% for cached prompts
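The "keep last N turns only" item can be sketched in a few lines. Assumptions: messages use the common `{"role", "content"}` chat shape, and the system prompt is preserved so provider-side prompt caching can still hit on the stable prefix.

```python
# Minimal "keep last N turns" history truncation. The system prompt is
# kept intact so a cached prompt prefix still matches.

def truncate_history(messages, keep_turns=4):
    """Keep system messages plus the last `keep_turns` user/assistant turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_turns:]

history = [{"role": "system", "content": "You are a coding agent."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(10)]
print(len(truncate_history(history)))  # 5: system prompt + last 4 turns
```

A production version would truncate by token count rather than turn count, using the provider's tokenizer.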
## Medium-term (40-60% savings)
- [ ] Integrate token compression middleware (LLMLingua) at gateway
- [ ] Build context router: analyze task type → send minimal relevant context
- [ ] Implement tool trace summarization before re-sending to API
- [ ] A/B test smaller models (Haiku) for more task types
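The trace-summarization item above could start as simply as eliding the middle of long tool outputs. This is a naive sketch, not LLMLingua: the character budget is an assumption, and a real version would count tokens and possibly summarize with a small model instead.

```python
# Illustrative tool-trace shrinking before re-sending to the API:
# keep the head and tail of a long tool output and elide the middle.
# 'budget' is a character budget here; production code would count tokens.

def summarize_trace(trace: str, budget: int = 400) -> str:
    if len(trace) <= budget:
        return trace
    half = budget // 2
    omitted = len(trace) - 2 * half
    return trace[:half] + f"\n... [{omitted} chars omitted] ...\n" + trace[-half:]

long_output = "x" * 10_000
print(len(summarize_trace(long_output)) < 500)  # True: shrunk to ~budget size
```

Head-and-tail works well for logs and stack traces, where the error and the final state usually carry the signal.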
## Long-term (60%+ savings)
- [ ] Self-host models for long-tail simple tasks (linting, formatting)
- [ ] Build local embedding cache to avoid redundant context lookups
- [ ] Implement request deduplication at gateway level
- [ ] Consider fine-tuned smaller models for domain-specific tasks
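The deduplication item can be sketched as hash-and-cache at the gateway. Assumptions: an in-memory dict stands in for a shared store (e.g. Redis, which a multi-worker gateway would need), and the TTL is illustrative.

```python
# Gateway-level request deduplication sketch: hash the canonicalized
# request body and reuse a recent response for identical requests.

import hashlib, json, time

_cache: dict[str, tuple[float, str]] = {}  # key -> (timestamp, response)
TTL_SECONDS = 60.0

def request_key(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def call_llm(payload: dict, backend) -> str:
    key = request_key(payload)
    hit = _cache.get(key)
    if hit and time.monotonic() - hit[0] < TTL_SECONDS:
        return hit[1]                        # duplicate: skip the paid call
    response = backend(payload)              # the real (costly) API call
    _cache[key] = (time.monotonic(), response)
    return response

calls = []
backend = lambda p: calls.append(p) or f"resp-{len(calls)}"
payload = {"model": "haiku", "prompt": "lint this file"}
print(call_llm(payload, backend), call_llm(payload, backend))  # one backend call
```

Only deduplicate requests that are safe to replay (deterministic or idempotent); sampling with temperature > 0 intentionally varies, so deduplication changes behavior there.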
## Monitoring
- Track cost per task type (linting vs. code gen vs. debugging)
- Monitor cache hit rates by prompt type
- Measure latency impact of compression/routing optimizations
- Set cost alerts by provider and model tier

Gateway Middleware Architecture (template)
# OpenClaw Gateway Middleware Stack
```
Incoming Request
↓
[1] Request Deduplication
- Hash request signature
- Return cached response if identical request in-flight
↓
[2] Token Compression
- Apply LLMLingua or similar
- Target: 20-40% token reduction
↓
[3] Context Router
- Analyze task type from request
- Extract only relevant context from knowledge base
- Drop redundant system prompts if cached
↓
[4] Prompt Cache Lookup
- Check Anthropic/OpenAI cache for matching prefix
- Route to cached endpoint if hit
↓
[5] Model Selection
- Route simple tasks → Haiku
- Route complex tasks → Sonnet
- Route specialized tasks → Self-hosted
↓
[6] Provider Routing
- Load balance across Claude, GPT, Gemini
- Prefer cheapest provider for equivalent quality
↓
API Call → Response
```
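The stack above can be modeled as an ordered chain of stages, each taking and returning a request dict. This is a toy skeleton under assumed field names (`task`, `prompt`, `model`); the stage bodies are stubs standing in for real compression and routing logic.

```python
# Toy version of the middleware stack: stages applied in diagram order.

def compress(req):                     # [2] token compression (stub)
    req["prompt"] = req["prompt"].strip()
    return req

def route_model(req):                  # [5] model selection (stub)
    req["model"] = "haiku" if req.get("task") == "lint" else "sonnet"
    return req

PIPELINE = [compress, route_model]     # add dedup, context routing, etc. here

def handle(req):
    for stage in PIPELINE:
        req = stage(req)
    return req

out = handle({"task": "lint", "prompt": "  fix imports  "})
print(out["model"], out["prompt"])  # haiku fix imports
```

Keeping each stage a plain function makes it cheap to A/B test a stage (swap it out of `PIPELINE`) and to measure per-stage latency, which matters for the benchmarking point above.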
## Implementation Priority
1. Prompt cache lookup (easiest, ~30% savings)
2. Model tiering (straightforward, ~15% savings)
3. History truncation (quick win, ~10% savings)
4. Token compression (moderate effort, ~20% savings)
5. Context routing (high effort, ~25% savings)
6. Self-hosting (long-term, 50%+ for long tail)

Cost Tracking Query (sql script)
```sql
-- Weekly cost tracking for multi-provider LLM usage (PostgreSQL syntax)
SELECT
    task_type,
    model_used,
    COUNT(*) AS request_count,
    SUM(input_tokens) AS total_input_tokens,
    SUM(output_tokens) AS total_output_tokens,
    SUM(cost_usd) AS total_cost,
    AVG(cost_usd) AS avg_cost_per_request,
    SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) AS cache_hits,
    ROUND(100.0 * SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) / COUNT(*), 2) AS cache_hit_rate,
    SUM(CASE WHEN compression_applied THEN compressed_tokens ELSE input_tokens END) AS effective_tokens
FROM llm_requests
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY task_type, model_used
ORDER BY total_cost DESC;
```
```sql
-- Identify cost anomalies: task-days running over 1.5x the average daily spend
SELECT
    task_type,
    DATE(timestamp) AS date,
    SUM(cost_usd) AS daily_cost,
    AVG(input_tokens) AS avg_context_size,
    MAX(input_tokens) AS max_context_size
FROM llm_requests
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY task_type, DATE(timestamp)
HAVING SUM(cost_usd) > (
    SELECT AVG(daily_cost) * 1.5
    FROM (
        SELECT DATE(timestamp) AS day, SUM(cost_usd) AS daily_cost
        FROM llm_requests
        GROUP BY DATE(timestamp)
    ) t
)
ORDER BY daily_cost DESC;
```