LLM Cost Optimization — A Practical Engineer's Guide

The cost of running an LLM app is rarely the limiting factor in v1. It often becomes the limiting factor in v10, when usage scales and the bill suddenly looks like it could swallow your runway. Here are the levers that bring the bill down without sacrificing quality — and the discipline of measuring before you optimize.

Where the money goes

Audit a single end-to-end interaction. Count the tokens going in and out, the number of model calls, and the model size used at each step. The result usually surprises people: most cost is in the largest model handling work that a smaller one could do.

Make a flame graph of cost per interaction. The biggest box is your optimization target.

Lever 1: Right-size each call

Don't use your top-tier model for everything. A typical pipeline has:

A router or classifier → cheap, fast, small model (Haiku, GPT-4o-mini, Gemini Flash)
A retrieval reformulation step → small model
A summarization step → mid-tier model
The main reasoning step → top-tier model
A formatting/cleanup step → small model

The art is putting the expensive model only where it earns its keep. For most pipelines, only 1-2 steps actually need the top model. The rest can be 5-10x cheaper.

Lever 2: Prompt caching

Most providers offer prompt caching where the cached portion costs a fraction of normal input tokens (typically 10-25% of full price). Structure prompts so the long, stable parts — system prompt, tool definitions, retrieved context that repeats — come first.

Cache hit rates above 70% are achievable and dramatically reduce cost. Some teams report 80%+ cost reduction from caching alone.

The trick: keep your system prompt stable. Resist the urge to interpolate per-request data into the system prompt; put it in the user message instead. Cache hit requires byte-exact prefixes.

Lever 3: Output length

Output tokens cost 3-5x input tokens. Be specific about response length in your system prompt. Use structured outputs to avoid verbose reasoning that doesn't help the user. "Answer in one sentence" can cut response cost by 80%.

When the model has a chance to be verbose, it will be. Constrain it.

Lever 4: Embed instead of call

For repeated lookups — "is this a duplicate question?", "which topic does this map to?", "is this query similar to one we've seen?" — a cached embedding lookup is 100x cheaper than an LLM call.

Use embeddings for similarity work; reserve the LLM for generation. Many "agent" steps that look like LLM calls are actually classification tasks an embedding search can do at a fraction of the cost.

Lever 5: Batching

If your workload is asynchronous (overnight processing, batch annotation, data labeling), use the batch API. 50% discount, slower turnaround. Major providers all offer this; few teams use it.

The friction is operational — batching requires you to think in batches, not requests. The savings are usually worth it for any non-real-time workload.

Lever 6: Stop generating early

Use stop tokens, structured outputs, or early-termination logic to avoid paying for tokens you don't need. Especially relevant for yes/no classification framed as full responses.

If you only need the first 200 tokens of an answer, set max_tokens to 200. Don't let the model decide.

Lever 7: Cache outputs aggressively

If a non-trivial fraction of your queries are duplicates or near-duplicates, cache the output. Hash the input, store the output. On a hit, skip the LLM entirely.

For semantic dedup, use embeddings + a similarity threshold. For high-volume support assistants, semantic caching can reduce calls by 30-50%.

Lever 8: Right-size your context

Retrieving 50 chunks "to be safe" sounds prudent and costs real money. Retrieve fewer, rerank for quality, only pass the top results to the model. A well-tuned retriever with 5 chunks beats a sloppy one with 50.

What to measure

Cost per user interaction (track this weekly)
Cost per model in the pipeline (catches creep where someone swapped a small model for a big one)
Cache hit rate (target 60%+, ideally 80%+)
Average output tokens per call (catches verbosity drift)
Cost per successful task (interactions that succeed are the right denominator, not raw calls)

If you're not tracking these, you can't optimize them. Build the dashboard before the bill scares you into action.

What NOT to do

Premature optimization — get the system working at any cost first. Optimize once you have evals and traffic.
Switching to a smaller model without re-evaluating — cheaper is only better if it still passes your eval set. Run the evals after every swap.
Caching without invalidation — stale cached outputs are worse than fresh expensive ones. Have a clear invalidation strategy.
Optimizing the wrong call — measure first. The intuition is often wrong about where the cost is.

The discipline

Cost optimization for LLM apps is engineering, not magic. Measure, identify the largest box, change one thing, re-measure, repeat. Most teams that complain about LLM costs haven't done the measurement yet.

Built into the JoinAI MasterClass: a full week on production observability and cost engineering, with real dashboards across the three deployed agents.

Cost Optimization Strategies for LLM-Powered Apps