Cost Optimization

ModelReins keeps AI job costs low by routing each job to the right provider. Local models incur no per-job API fees; you pay only for your own hardware. Cloud models typically cost pennies or less per job. The router picks the cheapest option that meets the quality bar.

These are approximate costs based on typical API pricing for common job sizes (500-2000 token input, 200-800 token output). Your actual costs depend on prompt length, output length, and provider pricing.

| Job Type | Provider / Model | Approx. Cost | Notes |
| --- | --- | --- | --- |
| Simple classification | Claude Haiku | ~$0.001 | Yes/no, category, sentiment |
| Test case generation | Claude Haiku | ~$0.003 | Generates 5-10 test cases |
| Code review (single file) | Claude Sonnet | ~$0.02 | Detailed review with suggestions |
| Complex analysis | Claude Opus | ~$0.12 | Multi-file reasoning |
| Summarization | Gemini Flash | ~$0.0005 | Fastest cheap cloud option |
| Extraction (structured) | OpenAI GPT-4o-mini | ~$0.002 | JSON extraction |
| Any job | Ollama / LM Studio | $0.00 | Local models (your hardware only) |
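To see how prompt and output length drive per-job cost, here is a minimal sketch of the usual token-based arithmetic. The function name and both prices are hypothetical values chosen for illustration; they are not any provider's actual rates, so substitute the current figures from your provider's pricing page.

```python
def estimate_job_cost(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok):
    """Estimate the USD cost of one job from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical prices for illustration only (not real provider pricing).
EXAMPLE_INPUT_PRICE = 1.00   # USD per million input tokens (assumed)
EXAMPLE_OUTPUT_PRICE = 5.00  # USD per million output tokens (assumed)

# The two ends of the "common job size" range quoted above.
small_job = estimate_job_cost(500, 200, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
large_job = estimate_job_cost(2000, 800, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)

print(f"small job: ${small_job:.4f}")
print(f"large job: ${large_job:.4f}")
```

Under these assumed prices the largest common job costs four times as much as the smallest, which is why the figures in the table are only approximate.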

Use local models (Ollama, LM Studio) when:

  • The task is formulaic: classification, extraction, reformatting, templated generation.
  • You don’t want a third-party AI provider in the loop: the prompt and completion stay in the local inference process on your hardware.
  • Volume is high: hundreds of similar jobs where per-job cost adds up.
  • Quality requirements are moderate: you need “good enough,” not “best possible.”
  • You’re iterating: rapid prompt development where you’ll run the same job dozens of times.

Use cloud models when:

  • The task requires complex reasoning: multi-step analysis, architectural decisions, subtle code bugs.
  • Output quality is critical: customer-facing content, security audits, compliance analysis.
  • The job needs a large context window: 100k+ tokens of input.
  • You need specific capabilities: vision, function calling, structured output guarantees.

Strategy 1: Route by effort tier

Use the saddle’s effort tiers to route by complexity. Trivial and quick jobs go to local workers. Standard and above go to cloud.

| Effort Tier | Typical routing |
| --- | --- |
| trivial | Local (Ollama) |
| quick | Local or cheap cloud (Haiku) |
| standard | Cloud (Haiku/Sonnet) |
| deep | Cloud (Sonnet) |
| critical | Best available (Opus) |
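The tier table can be read as a lookup from effort tier to an ordered list of candidate targets. The sketch below mirrors the table; the target labels (`"local"`, `"haiku"`, and so on) and the `candidates_for` helper are hypothetical and do not reflect ModelReins configuration keys.

```python
# Ordered candidate targets per effort tier, mirroring the table above.
# The labels are illustrative placeholders, not real configuration values.
TIER_ROUTING = {
    "trivial": ["local"],
    "quick": ["local", "haiku"],
    "standard": ["haiku", "sonnet"],
    "deep": ["sonnet"],
    "critical": ["opus"],
}


def candidates_for(tier, local_available=True):
    """Return the candidate targets for a tier, dropping 'local' if no local worker is up."""
    if tier not in TIER_ROUTING:
        raise ValueError(f"unknown effort tier: {tier!r}")
    return [t for t in TIER_ROUTING[tier] if t != "local" or local_available]


print(candidates_for("quick"))
print(candidates_for("quick", local_available=False))
print(candidates_for("critical"))
```

Note that a `trivial` job with no local worker yields an empty list in this sketch; what to do then (queue, fail, or escalate) is a policy choice left to the caller.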

Strategy 2: Local-first with cloud fallback

The saddle’s local mode forces all dispatch to registered Ollama/LM Studio workers. If no local worker is available, dispatch returns HTTP 503: no silent cloud fallback, no surprise bills.
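The local-mode behavior described above amounts to "pick an available local worker or fail loudly." Here is a minimal sketch of that rule. It is not the saddle's actual code: the worker dictionary shape (`name`, `available`) and the function name are assumptions made for illustration.

```python
from typing import Optional

OK = 200
SERVICE_UNAVAILABLE = 503


def dispatch_local_only(local_workers: list[dict]) -> tuple[int, Optional[str]]:
    """Return (status, worker_name); never falls back to a cloud provider."""
    for worker in local_workers:
        if worker.get("available"):
            return OK, worker["name"]
    # No local worker can take the job: report it instead of routing to cloud.
    return SERVICE_UNAVAILABLE, None


workers = [
    {"name": "ollama-1", "available": False},
    {"name": "lmstudio-1", "available": True},
]

print(dispatch_local_only(workers))
print(dispatch_local_only([]))
```

Failing with an explicit status keeps spending predictable: a caller that sees 503 can retry later or deliberately resubmit the job to a cloud route.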

Run both local and cloud workers. The router picks based on job tags, effort tier, and worker availability. Most of your bill will come from the small percentage of jobs that actually need cloud reasoning.
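A quick worked example shows why a small share of cloud jobs dominates the bill. The per-job costs come from the approximate figures listed earlier on this page; the job counts are hypothetical.

```python
# Approximate per-job costs in USD, taken from the cost table on this page.
COST_PER_JOB = {"local": 0.0, "haiku": 0.003, "sonnet": 0.02}

# Hypothetical monthly job mix: 90% local, 10% cloud.
job_mix = {"local": 900, "haiku": 80, "sonnet": 20}

bill = {route: count * COST_PER_JOB[route] for route, count in job_mix.items()}
total = sum(bill.values())
cloud_jobs = sum(count for route, count in job_mix.items() if route != "local")
cloud_share_of_jobs = cloud_jobs / sum(job_mix.values())

for route, cost in bill.items():
    print(f"{route:>6}: {job_mix[route]:4d} jobs, ${cost:.2f}")
print(f"total: ${total:.2f}; cloud jobs are {cloud_share_of_jobs:.0%} of volume")
```

In this mix the 20 Sonnet jobs (2% of volume) account for $0.40 of the $0.64 total, so the most effective lever is checking whether those jobs really need that tier.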

  • Start local, add cloud incrementally. Run everything on Ollama first. Identify which jobs actually need cloud quality, and only route those.
  • Haiku is almost always enough. For tasks that need cloud providers, Haiku handles most cases at a fraction of the cost of larger models.
  • Review your job history. The dashboard shows recent jobs with assigned workers. Look for expensive jobs that could be downgraded to a cheaper model or a local worker.
  • Fan-out costs multiply. Dispatching to 3 workers at once runs 3 jobs. Use fan-out for comparisons, not for routine work.
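The fan-out tip is simple multiplication, sketched below with the approximate single-file code review cost from the table on this page. The job volume and the helper name are hypothetical.

```python
def fan_out_cost(per_job_cost, workers, jobs):
    """Total cost when every job is dispatched to `workers` workers at once."""
    return per_job_cost * workers * jobs


CODE_REVIEW_COST = 0.02  # approximate USD per job, from the cost table
JOBS = 1_000             # hypothetical monthly volume

single = fan_out_cost(CODE_REVIEW_COST, workers=1, jobs=JOBS)
triple = fan_out_cost(CODE_REVIEW_COST, workers=3, jobs=JOBS)

print(f"single dispatch: ${single:.2f}")
print(f"3-way fan-out:   ${triple:.2f}")
```

This assumes all three workers cost the same per job; mixing a local worker into the fan-out lowers the total, but every cloud worker added still bills a full job.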