At AI Findr, we help eCommerce brands turn traditional search experiences into assisted shopping journeys and personal shopper-style conversations. Because AI Findr is a SaaS product serving many different businesses, a core capability of the platform is letting each business connect almost any product-search API to our agent workflow.
That flexibility accelerated adoption, but it also created a recurring problem in some tenant flows. Different stores returned different payload shapes. Many product-search tool responses included far more fields than the model actually needed to reason about the user's intent. And some LLM responses still carried too much UI-oriented structure instead of a minimal rendering contract. As that context piled up, token spend and latency went up with it.
In one production path, the same query dropped from 34,302 to 10,100 total tokens and from 28.16s to 6.76s after we redesigned the workflow around normalized product contracts, minimal output schemas, context whitelisting, and ID-based product hydration.
This post breaks down those architecture decisions and why they improved both unit economics and user experience.
From Flexible Integrations to Production Constraints
As a SaaS product, we are not integrating one clean product catalog. We need to support many different businesses, each with its own search API, payload shape, and data quality quirks. Unless those differences are normalized early, integration flexibility turns into token overhead, latency, and frontend fragility.
Where Latency Was Really Coming From
The bottleneck was not just the model. It was everything we let flow through it.
In some tenant flows, the LLM was getting too much product data, output schemas that were too broad, and response structures that were too close to UI rendering.
The model was doing too much. More context in, more structure out, more latency and cost.
The Four Optimizations
1. Post-Process Raw Search Tool Responses
The first problem was the raw search response itself. Large client payloads can quickly flood an agent’s context window and make it slow, expensive, or unusable.
We solved that with configurable Go-template post-processing per client: clean the raw response, keep only the fields we actually need, and apply the compact TOON format before passing anything downstream.
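As a sketch of that post-processing step, assuming a hypothetical client payload with illustrative field names (`sku`, `title`, `price_cents`, `image_url`), a per-client Go template can strip a raw search response down to the fields the agent actually uses. The TOON encoding applied afterward in production is omitted here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// clientTemplate is a per-client Go template that keeps only the
// fields the agent needs. The field names are illustrative; each
// client configures its own template for its own payload shape.
const clientTemplate = `{{range .results}}{{.sku}} | {{.title}} | {{.price_cents}}
{{end}}`

// postProcess cleans a raw search response with the client's template
// before anything reaches the model, dropping image URLs, internal
// scores, and other fields with no reasoning value.
func postProcess(raw, tmplText string) (string, error) {
	var payload map[string]any
	if err := json.Unmarshal([]byte(raw), &payload); err != nil {
		return "", err
	}
	tmpl, err := template.New("post").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := tmpl.Execute(&b, payload); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	raw := `{"results":[{"sku":"A1","title":"Trail Shoe","price_cents":8999,"image_url":"https://example.com/a1.jpg","internal_score":0.91}]}`
	out, err := postProcess(raw, clientTemplate)
	if err != nil {
		panic(err)
	}
	fmt.Print(out) // prints: A1 | Trail Shoe | 8999
}
```

Because the template is configuration rather than code, onboarding a new store's search API does not require a deploy; only the template changes.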
2. Keep Structured Output Schemas Minimal
We were already using Structured Outputs from the OpenAI SDK, so format reliability was not the main issue. The issue was that one shared output schema had become too large as the product grew.
We fixed that by adding client-configurable output modes and defaulting to the smallest possible schema. In many cases, the model only needs to return text_response, which makes completions cheaper and faster. That matters even more here because output tokens cost 8x more than uncached input tokens at the pricing used in this article.
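A minimal sketch of what client-configurable output modes can look like. `text_response` comes from the flow described above; the mode names and the `suggested_queries` field are hypothetical, not AI Findr's actual contract:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OutputMode selects how much structure the model is asked to return.
// The default is the smallest schema the flow can get away with.
type OutputMode string

const (
	ModeMinimal OutputMode = "minimal" // text_response only
	ModeRich    OutputMode = "rich"    // text plus follow-up suggestions
)

// schemaFor builds a JSON Schema (as used with Structured Outputs)
// for the given mode. Fewer required fields means fewer output
// tokens, which matter most at 8x the uncached input price.
func schemaFor(mode OutputMode) map[string]any {
	props := map[string]any{
		"text_response": map[string]any{"type": "string"},
	}
	required := []string{"text_response"}
	if mode == ModeRich {
		props["suggested_queries"] = map[string]any{
			"type":  "array",
			"items": map[string]any{"type": "string"},
		}
		required = append(required, "suggested_queries")
	}
	return map[string]any{
		"type":                 "object",
		"properties":           props,
		"required":             required,
		"additionalProperties": false,
	}
}

func main() {
	b, _ := json.Marshal(schemaFor(ModeMinimal))
	fmt.Println(string(b))
}
```

The key design choice is that broader schemas are opt-in per client, so a new requirement for one tenant no longer inflates completions for every tenant.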
3. Send Only the Field Subset the Agent Needs
Even after post-processing, the LLM still did not need the full payload. Fields like image URLs are expensive in tokens and not useful for reasoning.
So we sent only a whitelisted subset into model context, while the frontend received the full post-processed payload via SSE to hydrate rich product details.
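The whitelisting step can be sketched as a small filter that runs just before products enter model context; the frontend still receives the full post-processed products over SSE. The keys shown are illustrative, since the whitelist itself is per-client configuration:

```go
package main

import "fmt"

// filterForModel keeps only whitelisted keys of each post-processed
// product. Everything else (image URLs, long descriptions) stays out
// of model context but still flows to the frontend via SSE.
func filterForModel(products []map[string]any, whitelist []string) []map[string]any {
	allowed := make(map[string]bool, len(whitelist))
	for _, k := range whitelist {
		allowed[k] = true
	}
	out := make([]map[string]any, 0, len(products))
	for _, p := range products {
		slim := map[string]any{}
		for k, v := range p {
			if allowed[k] {
				slim[k] = v
			}
		}
		out = append(out, slim)
	}
	return out
}

func main() {
	full := []map[string]any{{
		"id": "A1", "title": "Trail Shoe", "price": 89.99,
		"image_url": "https://example.com/a1.jpg", // token-heavy, no reasoning value
	}}
	slim := filterForModel(full, []string{"id", "title", "price"})
	fmt.Println(slim)
}
```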
4. Replace a Heavy UI DSL with a Lightweight Selector Tag
We used to make the model generate full product card structures. That was too expensive for the LLM and too fragile for the UI.
We replaced that with a much simpler selector tag: `<PS>[id1,id2,id3]</PS>`

The model now returns only product IDs, and the frontend hydrates those cards using product data it already received via SSE. That gives the LLM a much smaller contract to produce and gives the UI a much more reliable one to render.
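Parsing the tag is deliberately trivial. A sketch of what the hydration side might do, written in Go for consistency with the other examples (the exact tolerance for whitespace and malformed tags is an assumption):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// psTag matches the selector tag emitted by the model, e.g.
// <PS>[id1,id2,id3]</PS>. Parsing lives in application code, so a
// malformed tag degrades to "no products" instead of a broken UI.
var psTag = regexp.MustCompile(`<PS>\[([^\]]*)\]</PS>`)

// extractProductIDs returns the IDs referenced by the first selector
// tag in the model's text, or nil if no tag is present. The caller
// then hydrates cards from product data already delivered over SSE.
func extractProductIDs(text string) []string {
	m := psTag.FindStringSubmatch(text)
	if m == nil || strings.TrimSpace(m[1]) == "" {
		return nil
	}
	parts := strings.Split(m[1], ",")
	ids := make([]string, 0, len(parts))
	for _, p := range parts {
		if id := strings.TrimSpace(p); id != "" {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	fmt.Println(extractProductIDs("Here are two options <PS>[A1, B2,C3]</PS>")) // prints: [A1 B2 C3]
}
```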
A Simplified View of the Flow
After those four changes, the architecture becomes much simpler: the model only sees a compact contract, and the frontend renders from deterministic product memory.
Business Impact and Trade-Offs
The business impact was straightforward: faster answers, lower cost per conversation, and a more consistent product experience. Once the model stopped carrying oversized payloads and rendering-heavy structures, the UX became more responsive and the unit economics became easier to control.
The trade-off is that more responsibility moves into application logic. You need stable product IDs and clear hydration rules across stores. In practice, that is usually a good trade, because deterministic application logic is easier to debug and scale than pushing more structure through the model.
If you only measure three things, measure input tokens per turn, end-to-end latency (especially median and P95), and cost per conversation.
Before vs After: Same Query
Using the same query in both versions of the flow:
| Metric | Before | After | Delta |
|---|---|---|---|
| Input tokens | 31,970 | 9,705 | -69.6% |
| Output tokens | 2,332 | 395 | -83.1% |
| Total tokens | 34,302 | 10,100 | -70.6% |
| End-to-end latency | 28.16s | 6.76s | -76.0% (4.17x faster) |
This is a single-query reference point, but it clearly shows the direction and magnitude of impact.
What This Looks Like at 300,000 Queries
If this query were representative of 300,000 queries, the operational impact would be:
| Metric | Before | After | Savings |
|---|---|---|---|
| Total token volume | 10.291B | 3.030B | 7.261B fewer tokens |
| Cumulative user wait time | 97.8 days | 23.5 days | 74.3 days less waiting |
And using GPT-5.1 pricing as of March 4, 2026 ($1.25 / 1M input, $0.125 / 1M cached input, $10 / 1M output), the cost impact would be:
| Scenario | Before | After | Savings |
|---|---|---|---|
| No cached input tokens (0%) | $18,984.75 ($0.0633/query) | $4,824.38 ($0.0161/query) | $14,160.38 (-74.6%) |
| 20% cached input tokens | $16,826.78 ($0.0561/query) | $4,169.29 ($0.0139/query) | $12,657.49 (-75.2%) |
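The arithmetic behind these rows can be reproduced directly. This sketch plugs the per-query token counts and the quoted per-1M prices into a small cost function; results match the table to within rounding:

```go
package main

import "fmt"

// Pricing per 1M tokens and query volume, as quoted in this article.
const (
	inputPrice  = 1.25  // $ per 1M uncached input tokens
	cachedPrice = 0.125 // $ per 1M cached input tokens
	outputPrice = 10.0  // $ per 1M output tokens
	queries     = 300_000
)

// cost returns total dollars for a per-query token profile, with a
// fraction of input tokens assumed to be served from the cache.
func cost(inputTok, outputTok, cachedFrac float64) float64 {
	in := inputTok * queries / 1e6  // millions of input tokens
	out := outputTok * queries / 1e6 // millions of output tokens
	return in*(1-cachedFrac)*inputPrice + in*cachedFrac*cachedPrice + out*outputPrice
}

func main() {
	fmt.Printf("before, no cache:   $%.2f\n", cost(31970, 2332, 0))
	fmt.Printf("after,  no cache:   $%.2f\n", cost(9705, 395, 0))
	fmt.Printf("before, 20%% cached: $%.2f\n", cost(31970, 2332, 0.20))
	fmt.Printf("after,  20%% cached: $%.2f\n", cost(9705, 395, 0.20))
}
```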
This is a directional estimate, not a measured production average.
Final Takeaway
The less context an agent has to process, the faster and cheaper it will be. That sounds obvious, but it is easy to lose that discipline as a product grows.
More tenants, more integrations, and more product requirements usually mean more fields, broader outputs, and more rendering logic creeping into the model path. Unless those boundaries stay tight, inefficiency becomes the default.
In practice, the biggest gains often come from better contracts, not bigger models. Optimize what goes around the model, not just the prompt inside it.
Start by auditing one high-traffic tool-output path and measuring token, latency, and cost deltas before expanding.