Cost Optimization
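ModelReins keeps AI jobs cheap by routing each one to the right provider. Local models have no per-job API cost; you pay only for your own hardware. Cloud models cost from well under a cent to about twelve cents per job for the job types listed below. The router picks the cheapest option that meets the quality bar.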
Approximate costs per job
These are approximate costs based on typical API pricing for common job sizes (500-2000 token input, 200-800 token output). Your actual costs depend on prompt length, output length, and provider pricing.
| Job Type | Provider / Model | Approx Cost | Notes |
|---|---|---|---|
| Simple classification | Claude Haiku | ~$0.001 | Yes/no, category, sentiment |
| Test case generation | Claude Haiku | ~$0.003 | Generates 5-10 test cases |
| Code review (single file) | Claude Sonnet | ~$0.02 | Detailed review with suggestions |
| Complex analysis | Claude Opus | ~$0.12 | Multi-file reasoning |
| Summarization | Gemini Flash | ~$0.0005 | Fastest cheap cloud option |
| Extraction (structured) | OpenAI GPT-4o-mini | ~$0.002 | JSON extraction |
| Any job | Ollama / LM Studio | $0.00 | Local models — your hardware only |
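The figures in the table come down to token counts multiplied by per-token prices. Below is a minimal sketch of that arithmetic, assuming prices are quoted per million tokens. The function name is hypothetical, and the prices in the usage lines are placeholders chosen for illustration, not any provider's actual rates.

```python
def estimate_job_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Return the estimated cost in dollars for a single job."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices (assumed, not real rates): $1.00 per million input
# tokens and $5.00 per million output tokens, at the top of the job-size
# range quoted above (2000 tokens in, 800 tokens out).
cloud_cost = estimate_job_cost(2000, 800, 1.00, 5.00)
print(f"cloud: ${cloud_cost:.4f}")

# A local model has no per-token price, so the same job has no API cost.
local_cost = estimate_job_cost(2000, 800, 0.0, 0.0)
print(f"local: ${local_cost:.4f}")
```

Plugging in your provider's current price sheet and your own typical token counts gives a better estimate than the table.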
When to use local vs cloud
Use local models (Ollama, LM Studio) when:
- The task is formulaic: classification, extraction, reformatting, templated generation.
- You don’t want a third-party AI provider in the loop: the prompt and completion stay in the local inference process on your hardware.
- Volume is high: hundreds of similar jobs where per-job cost adds up.
- Quality requirements are moderate: you need “good enough,” not “best possible.”
- You’re iterating: rapid prompt development where you’ll run the same job dozens of times.
Use cloud models when:
- The task requires complex reasoning: multi-step analysis, architectural decisions, subtle code bugs.
- Output quality is critical: customer-facing content, security audits, compliance analysis.
- The job needs a large context window: 100k+ tokens of input.
- You need specific capabilities: vision, function calling, structured output guarantees.
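The cloud criteria above can be read as a simple rule: if any one of them applies, prefer cloud; otherwise stay local. The sketch below encodes that rule. `JobProfile` and `prefer_cloud` are hypothetical names for illustration and are not part of ModelReins.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical description of a job; not a ModelReins type."""
    complex_reasoning: bool = False
    quality_critical: bool = False
    input_tokens: int = 0
    needs_special_capability: bool = False  # e.g. vision, function calling


# Large-context threshold taken from the "100k+ tokens" criterion above.
LARGE_CONTEXT_TOKENS = 100_000


def prefer_cloud(job: JobProfile) -> bool:
    """Return True if any of the cloud criteria applies to the job."""
    return (
        job.complex_reasoning
        or job.quality_critical
        or job.input_tokens >= LARGE_CONTEXT_TOKENS
        or job.needs_special_capability
    )


print(prefer_cloud(JobProfile(input_tokens=1_500)))      # False
print(prefer_cloud(JobProfile(quality_critical=True)))   # True
```

In practice the flags would come from job tags or from whoever submits the job; the point is only that a single cloud criterion is enough to justify the extra cost.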
Routing strategies for cost
Strategy 1: Tiered routing (recommended)
Use the saddle’s effort tiers to route by complexity. Trivial jobs go to local workers, quick jobs go to local workers or a cheap cloud model, and standard and above go to cloud.
| Effort Tier | Typical routing |
|---|---|
| trivial | Local (Ollama) |
| quick | Local or cheap cloud (Haiku) |
| standard | Cloud (Haiku/Sonnet) |
| deep | Cloud (Sonnet) |
| critical | Best available (Opus) |
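The tier table can be expressed as a lookup from effort tier to an ordered list of candidate targets, cheapest first. This is a sketch only: the target labels (`"local"`, `"haiku"`, and so on) and the function name are illustrative assumptions, not ModelReins configuration keys.

```python
# Candidate targets per effort tier, cheapest first, mirroring the table above.
# The labels are illustrative; they are not real ModelReins identifiers.
TIER_ROUTES = {
    "trivial": ["local"],
    "quick": ["local", "haiku"],
    "standard": ["haiku", "sonnet"],
    "deep": ["sonnet"],
    "critical": ["opus"],
}


def candidates_for(tier: str) -> list[str]:
    """Return the ordered candidate targets for an effort tier."""
    try:
        return TIER_ROUTES[tier]
    except KeyError:
        raise ValueError(f"unknown effort tier: {tier!r}") from None


print(candidates_for("quick"))     # ['local', 'haiku']
print(candidates_for("critical"))  # ['opus']
```

Rejecting unknown tiers outright, rather than defaulting to some target, avoids accidentally sending a mistyped tier to an expensive model.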
Strategy 2: Local-first with cloud fallback
The saddle’s local mode forces all dispatch to registered Ollama/LM Studio workers. If no local worker is available, dispatch returns 503: no silent cloud fallback, no surprise bills.
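The fail-closed behavior described here can be illustrated with a small model of the dispatch decision. This is not the saddle's implementation; `Worker` and `dispatch_local_only` are hypothetical names, and the only facts taken from the text are that local mode considers local workers alone and answers 503 when none is available.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker record, for illustration only."""
    name: str
    kind: str        # "local" or "cloud"
    available: bool


def dispatch_local_only(workers: list[Worker]) -> tuple[int, str]:
    """Return (status, detail). Cloud workers are never considered."""
    for worker in workers:
        if worker.kind == "local" and worker.available:
            return 200, worker.name
    # Fail closed: report the outage instead of falling back to cloud.
    return 503, "no local worker available"


fleet = [
    Worker("ollama-1", "local", available=False),
    Worker("haiku", "cloud", available=True),
]
print(dispatch_local_only(fleet))  # (503, 'no local worker available')
```

A caller that sees the 503 can retry later or resubmit outside local mode, which keeps any cloud spend an explicit decision.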
Strategy 3: Mixed fleet
Run both local and cloud workers. The router picks based on job tags, effort tier, and worker availability. Most of your bill will come from the small percentage of jobs that actually need cloud reasoning.
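One way to picture mixed-fleet selection is: filter the fleet by availability, tags, and effort tier, then take the cheapest worker that remains. The sketch below follows that idea under stated assumptions. `FleetWorker`, `pick_worker`, and the `max_tier` field are hypothetical, and the per-job costs reuse the rough estimates from the cost table; the actual router may weigh these inputs differently.

```python
from dataclasses import dataclass, field
from typing import Optional

TIER_ORDER = ["trivial", "quick", "standard", "deep", "critical"]


@dataclass
class FleetWorker:
    """Hypothetical worker record, for illustration only."""
    name: str
    max_tier: str            # highest effort tier this worker should handle
    cost_per_job: float      # rough estimate in dollars
    tags: set[str] = field(default_factory=set)
    available: bool = True


def pick_worker(
    fleet: list[FleetWorker],
    tier: str,
    required_tags: frozenset[str] = frozenset(),
) -> Optional[FleetWorker]:
    """Return the cheapest available worker that satisfies tier and tags."""
    needed = TIER_ORDER.index(tier)
    eligible = [
        w for w in fleet
        if w.available
        and TIER_ORDER.index(w.max_tier) >= needed
        and required_tags <= w.tags   # every required tag must be present
    ]
    return min(eligible, key=lambda w: w.cost_per_job, default=None)


fleet = [
    FleetWorker("ollama", "quick", 0.0),
    FleetWorker("haiku", "standard", 0.003),
    FleetWorker("sonnet", "deep", 0.02, tags={"code-review"}),
]
print(pick_worker(fleet, "trivial").name)   # ollama
print(pick_worker(fleet, "deep").name)      # sonnet
print(pick_worker(fleet, "critical"))       # None
```

With a fleet like this, cheap jobs land on the local worker by default and only the jobs that exceed its tier reach a paid model.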
- Start local, add cloud incrementally. Run everything on Ollama first. Identify which jobs actually need cloud quality, and only route those.
- Haiku is almost always enough. For tasks that need cloud providers, Haiku handles most cases at a fraction of the cost of larger models.
- Review your job history. The dashboard shows recent jobs with assigned workers. Look for expensive jobs that could be downgraded.
- Fan-out costs multiply. Dispatching to 3 workers at once runs 3 jobs. Use fan-out for comparisons, not for routine work.
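The fan-out point is plain multiplication, shown here with the single-file code review estimate from the cost table. The function name is hypothetical, and the sketch assumes every worker in the fan-out costs about the same per job.

```python
def fan_out_cost(cost_per_job: float, worker_count: int) -> float:
    """Return the total cost of dispatching one job to several workers."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    return cost_per_job * worker_count


# ~$0.02 per code review (table estimate), fanned out to 3 workers.
print(f"${fan_out_cost(0.02, 3):.2f}")  # $0.06
```

Across hundreds of routine jobs, that factor of three applies to the whole bill, which is why fan-out is best reserved for comparisons.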