Cost Optimization

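ModelReins keeps AI jobs cheap by routing each one to the right provider. Local models have no per-job API cost; you pay only for your own hardware. Cloud models cost from a fraction of a cent to roughly ten cents per job, depending on the model. The router picks the cheapest option that meets the quality bar.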

Approximate costs per job

These estimates assume typical API pricing and common job sizes (500-2000 input tokens, 200-800 output tokens). Your actual costs depend on prompt length, output length, and current provider pricing.

Job Type	Provider / Model	Approx Cost	Notes
Simple classification	Claude Haiku	~$0.001	Yes/no, category, sentiment
Test case generation	Claude Haiku	~$0.003	Generates 5-10 test cases
Code review (single file)	Claude Sonnet	~$0.02	Detailed review with suggestions
Complex analysis	Claude Opus	~$0.12	Multi-file reasoning
Summarization	Gemini Flash	~$0.0005	Fastest cheap cloud option
Extraction (structured)	OpenAI GPT-4o-mini	~$0.002	JSON extraction
Any job	Ollama / LM Studio	$0.00	Local models — your hardware only
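
The figures in the table come down to simple token arithmetic. Below is a minimal sketch, assuming the provider bills per million input and output tokens. The function name and the prices are placeholders for illustration, not any provider's actual rates; substitute current numbers from your provider's pricing page.

```python
def estimate_job_cost(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Return the estimated cost in dollars for a single job."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices in dollars per million tokens, NOT real provider rates.
EXAMPLE_INPUT_PRICE = 1.00
EXAMPLE_OUTPUT_PRICE = 5.00

# A job at the upper end of the size range quoted above.
cost = estimate_job_cost(2000, 800, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"${cost:.4f} per job, ${cost * 1000:.2f} per 1,000 jobs")
```

Multiplying by expected volume is usually the more useful number: a per-job cost that looks negligible can still dominate the bill at thousands of jobs.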

When to use local vs cloud

Use local models (Ollama, LM Studio) when:

The task is formulaic: classification, extraction, reformatting, templated generation.
You don’t want a third-party AI provider in the loop: the prompt and completion stay in the local inference process on your hardware.
Volume is high: hundreds of similar jobs where per-job cost adds up.
Quality requirements are moderate: you need “good enough,” not “best possible.”
You’re iterating: rapid prompt development where you’ll run the same job dozens of times.

Use cloud models when:

The task requires complex reasoning: multi-step analysis, architectural decisions, subtle code bugs.
Output quality is critical: customer-facing content, security audits, compliance analysis.
The job needs a large context window: 100k+ tokens of input.
You need specific capabilities: vision, function calling, structured output guarantees.

Routing strategies for cost

Strategy 1: Tiered routing (recommended)

Use the saddle’s effort tiers to route by complexity. Trivial jobs go to local workers, quick jobs go to local workers or a cheap cloud model, and standard and above go to cloud.

Effort Tier	Typical routing
trivial	Local (Ollama)
quick	Local or cheap cloud (Haiku)
standard	Cloud (Haiku/Sonnet)
deep	Cloud (Sonnet)
critical	Best available (Opus)
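
The tier table above can be pictured as a plain lookup. The sketch below is illustrative only: the dictionary layout, the route labels, and the function name are assumptions, not ModelReins' actual configuration format.

```python
from typing import Dict, List

# Hypothetical mapping of effort tier to candidate routes, cheapest first.
# It mirrors the table above; it is not the saddle's real config schema.
TIER_ROUTES: Dict[str, List[str]] = {
    "trivial": ["local"],
    "quick": ["local", "haiku"],
    "standard": ["haiku", "sonnet"],
    "deep": ["sonnet"],
    "critical": ["opus"],
}


def candidates_for(tier: str) -> List[str]:
    """Return the candidate routes for an effort tier, cheapest first."""
    try:
        return TIER_ROUTES[tier]
    except KeyError:
        raise ValueError(f"unknown effort tier: {tier!r}") from None


print(candidates_for("quick"))
```

Rejecting unknown tiers with an error, rather than defaulting to some route, keeps a typo in a tier name from quietly sending jobs to an expensive model.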

Strategy 2: Local-first with cloud fallback

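The saddle’s local mode forces all dispatch to registered Ollama/LM Studio workers. If no local worker is available, dispatch returns 503 instead of silently falling back to cloud, so there are no surprise bills. Any fallback to a cloud worker is then an explicit decision on your side, not something the router does for you.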
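
The local-mode behavior can be modeled in a few lines. This is a toy model of the rule described above, not the saddle's implementation or API: the Worker class, the function name, and the 200 status for a successful dispatch are assumptions. Only the 503 comes from the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Worker:
    name: str
    kind: str        # "local" or "cloud" (hypothetical labels)
    available: bool


def dispatch_local_mode(workers: List[Worker]) -> Tuple[int, Optional[str]]:
    """Return (status, worker_name); cloud workers are never considered."""
    for worker in workers:
        if worker.kind == "local" and worker.available:
            return 200, worker.name  # 200 is an assumed success status
    # No local worker is free: fail loudly instead of using a cloud worker.
    return 503, None


fleet = [Worker("ollama-1", "local", False), Worker("haiku", "cloud", True)]
status, chosen = dispatch_local_mode(fleet)
print(status, chosen)
```

In this model the available cloud worker is ignored even though it could take the job, which is the point of local mode: the caller sees the failure and decides what to do next.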

Strategy 3: Mixed fleet

Run both local and cloud workers. The router picks based on job tags, effort tier, and worker availability. Most of your bill will come from the small percentage of jobs that actually need cloud reasoning.
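
One way to picture mixed-fleet selection is "filter by availability, tier, and tags, then take the cheapest". The sketch below rests on that assumption; the real router's selection logic is not documented here, and the class, field names, tags, and worker names are all hypothetical. The per-job costs are the approximate figures from the cost table.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

TIER_ORDER = ["trivial", "quick", "standard", "deep", "critical"]


@dataclass
class FleetWorker:
    name: str
    max_tier: str                      # highest effort tier this worker takes
    cost_per_job: float                # approximate dollars per job
    tags: Set[str] = field(default_factory=set)
    available: bool = True


def pick_worker(workers: List[FleetWorker], job_tier: str,
                job_tags: Set[str]) -> Optional[FleetWorker]:
    """Return the cheapest available worker that can handle the job."""
    needed = TIER_ORDER.index(job_tier)
    eligible = [
        w for w in workers
        if w.available
        and TIER_ORDER.index(w.max_tier) >= needed
        and job_tags <= w.tags         # worker must carry every job tag
    ]
    return min(eligible, key=lambda w: w.cost_per_job, default=None)


fleet = [
    FleetWorker("ollama-1", "quick", 0.00, {"code"}),
    FleetWorker("haiku", "standard", 0.003, {"code"}),
    FleetWorker("opus", "critical", 0.12, {"code", "vision"}),
]
print(pick_worker(fleet, "trivial", {"code"}).name)
print(pick_worker(fleet, "deep", {"code"}).name)
```

With this fleet, the trivial job lands on the free local worker and only the deep job reaches the expensive model, which is the cost profile described above.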

Tips

Start local, add cloud incrementally. Run everything on Ollama first. Identify which jobs actually need cloud quality, and only route those.
Haiku is usually enough. For tasks that need a cloud provider, Haiku handles most cases at a fraction of the cost of larger models: roughly $0.001-$0.003 per job in the table above, versus ~$0.02 for Sonnet and ~$0.12 for Opus.
Review your job history. The dashboard shows recent jobs with their assigned workers. Look for expensive jobs that could be downgraded to a cheaper model or a local worker.
Fan-out costs multiply. Dispatching to 3 workers at once runs 3 jobs. Use fan-out for comparisons, not for routine work.
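
To see how fan-out multiplies the bill, here is a rough calculation using the approximate per-job figures from the cost table. The function name and the 1,000-job volume are illustrative, and the table figures were quoted for different job types, so treat the result as an order-of-magnitude estimate.

```python
from typing import Dict, List

# Approximate per-job costs in dollars, taken from the cost table.
APPROX_COST: Dict[str, float] = {"haiku": 0.003, "sonnet": 0.02, "opus": 0.12}


def fan_out_cost(models: List[str], jobs: int = 1) -> float:
    """Total cost of sending each of `jobs` jobs to every model in `models`."""
    per_job = sum(APPROX_COST[m] for m in models)
    return per_job * jobs


single = fan_out_cost(["sonnet"], jobs=1000)
triple = fan_out_cost(["haiku", "sonnet", "opus"], jobs=1000)
print(f"single worker: ${single:.2f}, three-way fan-out: ${triple:.2f}")
```

At these figures, 1,000 jobs cost about $20 on Sonnet alone and about $143 when fanned out to all three models, so fan-out is best reserved for comparisons.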