When a team tells us their AI product is “too expensive,” we almost never solve it by switching models. We solve it in the retrieval layer — by chunking better, ranking better, and admitting when retrieval has failed before we ever call the model.
The cost of a bad retrieval is not the tokens. It is the user trust you spend on a hallucinated answer.
Three rules we keep returning to
- Measure retrieval quality on your own data. Public benchmarks are useful for picking a vendor, useless for picking a configuration.
- Make retrieval failures legible. If the system can’t find relevant context, the UI must say so. Never paper over it with the model.
- Optimise the surface, not the score. A retrieval system with 88% recall and a great refusal UX outperforms one with 94% recall that bluffs.
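The first two rules can be sketched in a few lines. This is a minimal illustration, not our harness: the `Hit` shape, the recall@k definition, and the `0.35` threshold are all illustrative assumptions about a retriever that returns similarity-scored hits.

```python
"""Sketch: recall@k on your own labeled queries, plus an explicit
refusal gate so weak retrieval is surfaced instead of papered over.
All names and the threshold value are illustrative assumptions."""
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float  # retriever similarity, assumed normalised to 0..1

def recall_at_k(retrieved: list[list[Hit]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    found = sum(
        1 for hits, rel in zip(retrieved, relevant)
        if rel & {h.doc_id for h in hits[:k]}
    )
    return found / len(relevant)

def answer_or_refuse(hits: list[Hit], threshold: float = 0.35) -> str:
    """Refuse explicitly when retrieval is weak, rather than letting the model bluff."""
    if not hits or hits[0].score < threshold:
        return "NO_CONTEXT"  # UI shows "we couldn't find anything relevant"
    return "GENERATE"  # safe to call the model with the retrieved context
```

The point of `answer_or_refuse` is that the refusal is a first-class outcome the UI can render, which is what makes the 88%-recall-with-refusals system beat the 94%-recall system that bluffs.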
In a future entry, we will write more about the evaluation harness we use, including how we run the retrieval, generation, and end-to-end suites independently.