At AI Findr, we help eCommerce brands turn traditional search experiences into assisted shopping journeys and personal shopper-style conversations. Because AI Findr is a SaaS product serving many different businesses, a core capability of the platform is letting each business connect almost any product-search API to our agent workflow.
That flexibility accelerated adoption, but it also created a recurring problem in some tenant flows. Different stores returned different payload shapes. Many product-search tool responses included far more fields than the model actually needed to reason about the user's intent. And some LLM responses still carried too much UI-oriented structure instead of a minimal rendering contract. As that context piled up, token spend and latency went up with it.
In one production path, the same query dropped from 34,302 to 10,100 total tokens and from 28.16s to 6.76s after we redesigned the workflow around normalized product contracts, minimal output schemas, context whitelisting, and ID-based product hydration.
This post breaks down those architecture decisions and why they improved both unit economics and user experience.
From Flexible Integrations to Production Constraints
As a SaaS product, we are not integrating one clean product catalog. We need to support many different businesses, each with its own search API, payload shape, and data quality quirks. Unless those differences are normalized early, integration flexibility turns into token overhead, latency, and frontend fragility.
Where Latency Was Really Coming From
The bottleneck was not just the model. It was everything we let flow through it.
In some tenant flows, the LLM was getting too much product data, output schemas that were too broad, and response structures that were too close to UI rendering.
The model was doing too much. More context in, more structure out, more latency and cost.
The Four Optimizations
1. Post-Process Raw Search Tool Responses
The first problem was the raw search response itself. Large client payloads can quickly flood an agent’s context window and make it slow, expensive, or unusable.
We solved that with configurable Go-template post-processing per client: clean the raw response, keep only the fields we actually need, and apply the compact TOON format before passing anything downstream.
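As a sketch of that post-processing step, assuming a hypothetical client payload with illustrative field names (`sku`, `title`, `price_cents`, `image_url`), a per-client Go template can strip a raw search response down to the fields the agent actually uses. The TOON encoding applied afterward in production is omitted here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// clientTemplate is a per-client Go template that keeps only the
// fields the agent needs. The field names are illustrative; each
// client configures its own template for its own payload shape.
const clientTemplate = `{{range .results}}{{.sku}} | {{.title}} | {{.price_cents}}
{{end}}`

// postProcess cleans a raw search response with the client's template
// before anything reaches the model, dropping image URLs, internal
// scores, and other fields with no reasoning value.
func postProcess(raw, tmplText string) (string, error) {
	var payload map[string]any
	if err := json.Unmarshal([]byte(raw), &payload); err != nil {
		return "", err
	}
	tmpl, err := template.New("post").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := tmpl.Execute(&b, payload); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	raw := `{"results":[{"sku":"A1","title":"Trail Shoe","price_cents":8999,"image_url":"https://example.com/a1.jpg","internal_score":0.91}]}`
	out, err := postProcess(raw, clientTemplate)
	if err != nil {
		panic(err)
	}
	fmt.Print(out) // prints: A1 | Trail Shoe | 8999
}
```

Because the template is configuration rather than code, onboarding a new store's search API does not require a deploy; only the template changes.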
2. Keep Structured Output Schemas Minimal
We were already using Structured Outputs from the OpenAI SDK, so format reliability was not the main issue. The issue was that one shared output schema had become too large as the product grew.
We fixed that by adding client-configurable output modes and defaulting to the smallest possible schema. In many cases, the model only needs to return text_response, which makes completions cheaper and faster. That matters even more here because output tokens cost 8x more than uncached input tokens at the pricing used in this article.
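A minimal sketch of what client-configurable output modes can look like. `text_response` comes from the flow described above; the mode names and the `suggested_queries` field are hypothetical, not AI Findr's actual contract:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OutputMode selects how much structure the model is asked to return.
// The default is the smallest schema the flow can get away with.
type OutputMode string

const (
	ModeMinimal OutputMode = "minimal" // text_response only
	ModeRich    OutputMode = "rich"    // text plus follow-up suggestions
)

// schemaFor builds a JSON Schema (as used with Structured Outputs)
// for the given mode. Fewer required fields means fewer output
// tokens, which matter most at 8x the uncached input price.
func schemaFor(mode OutputMode) map[string]any {
	props := map[string]any{
		"text_response": map[string]any{"type": "string"},
	}
	required := []string{"text_response"}
	if mode == ModeRich {
		props["suggested_queries"] = map[string]any{
			"type":  "array",
			"items": map[string]any{"type": "string"},
		}
		required = append(required, "suggested_queries")
	}
	return map[string]any{
		"type":                 "object",
		"properties":           props,
		"required":             required,
		"additionalProperties": false,
	}
}

func main() {
	b, _ := json.Marshal(schemaFor(ModeMinimal))
	fmt.Println(string(b))
}
```

The key design choice is that broader schemas are opt-in per client, so a new requirement for one tenant no longer inflates completions for every tenant.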
3. Send Only the Field Subset the Agent Needs
Even after post-processing, the LLM still did not need the full payload. Fields like image URLs are expensive in tokens and not useful for reasoning.
So we sent only a whitelisted subset into model context, while the frontend received the full post-processed payload via SSE to hydrate rich product details.
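The whitelisting step can be sketched as a small filter that runs just before products enter model context; the frontend still receives the full post-processed products over SSE. The keys shown are illustrative, since the whitelist itself is per-client configuration:

```go
package main

import "fmt"

// filterForModel keeps only whitelisted keys of each post-processed
// product. Everything else (image URLs, long descriptions) stays out
// of model context but still flows to the frontend via SSE.
func filterForModel(products []map[string]any, whitelist []string) []map[string]any {
	allowed := make(map[string]bool, len(whitelist))
	for _, k := range whitelist {
		allowed[k] = true
	}
	out := make([]map[string]any, 0, len(products))
	for _, p := range products {
		slim := map[string]any{}
		for k, v := range p {
			if allowed[k] {
				slim[k] = v
			}
		}
		out = append(out, slim)
	}
	return out
}

func main() {
	full := []map[string]any{{
		"id": "A1", "title": "Trail Shoe", "price": 89.99,
		"image_url": "https://example.com/a1.jpg", // token-heavy, no reasoning value
	}}
	slim := filterForModel(full, []string{"id", "title", "price"})
	fmt.Println(slim)
}
```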
4. Replace a Heavy UI DSL with a Lightweight Selector Tag
We used to make the model generate full product card structures. That was too expensive for the LLM and too fragile for the UI.
We replaced that with a much simpler selector tag: `<PS>[id1,id2,id3]</PS>`

The model now returns only product IDs, and the frontend hydrates those cards using product data it already received via SSE. That gives the LLM a much smaller contract to produce and gives the UI a much more reliable one to render.
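Parsing the tag is deliberately trivial. A sketch of what the hydration side might do, written in Go for consistency with the other examples (the exact tolerance for whitespace and malformed tags is an assumption):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// psTag matches the selector tag emitted by the model, e.g.
// <PS>[id1,id2,id3]</PS>. Parsing lives in application code, so a
// malformed tag degrades to "no products" instead of a broken UI.
var psTag = regexp.MustCompile(`<PS>\[([^\]]*)\]</PS>`)

// extractProductIDs returns the IDs referenced by the first selector
// tag in the model's text, or nil if no tag is present. The caller
// then hydrates cards from product data already delivered over SSE.
func extractProductIDs(text string) []string {
	m := psTag.FindStringSubmatch(text)
	if m == nil || strings.TrimSpace(m[1]) == "" {
		return nil
	}
	parts := strings.Split(m[1], ",")
	ids := make([]string, 0, len(parts))
	for _, p := range parts {
		if id := strings.TrimSpace(p); id != "" {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	fmt.Println(extractProductIDs("Here are two options <PS>[A1, B2,C3]</PS>")) // prints: [A1 B2 C3]
}
```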
A Simplified View of the Flow
After those four changes, the architecture becomes much simpler: the model only sees a compact contract, and the frontend renders from deterministic product memory.
Business Impact and Trade-Offs
The business impact was straightforward: faster answers, lower cost per conversation, and a more consistent product experience. Once the model stopped carrying oversized payloads and rendering-heavy structures, the UX became more responsive and the unit economics became easier to control.
The trade-off is that more responsibility moves into application logic. You need stable product IDs and clear hydration rules across stores. In practice, that is usually a good trade, because deterministic application logic is easier to debug and scale than pushing more structure through the model.
If you only measure three things, measure input tokens per turn, end-to-end latency (especially median and P95), and cost per conversation.
Before vs After: Same Query
Using the same query in both versions of the flow:
| Metric | Before | After | Delta |
|---|---|---|---|
| Input tokens | 31,970 | 9,705 | -69.6% |
| Output tokens | 2,332 | 395 | -83.1% |
| Total tokens | 34,302 | 10,100 | -70.6% |
| End-to-end latency | 28.16s | 6.76s | -76.0% (4.17x faster) |
This is a single-query reference point, but it clearly shows the direction and magnitude of impact.
What This Looks Like at 300,000 Queries
If this query were representative of 300,000 queries, the operational impact would be:
| Metric | Before | After | Savings |
|---|---|---|---|
| Total token volume | 10.291B | 3.030B | 7.261B fewer tokens |
| Cumulative user wait time | 97.8 days | 23.5 days | 74.3 days less waiting |
And using GPT-5.1 pricing as of March 4, 2026 ($1.25 / 1M input, $0.125 / 1M cached input, $10 / 1M output), the cost impact would be:
| Scenario | Before | After | Savings |
|---|---|---|---|
| No cached input tokens (0%) | $18,984.75 ($0.0633/query) | $4,824.38 ($0.0161/query) | $14,160.38 (-74.6%) |
| 20% cached input tokens | $16,826.78 ($0.0561/query) | $4,169.29 ($0.0139/query) | $12,657.49 (-75.2%) |
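The arithmetic behind these rows can be reproduced directly. This sketch plugs the per-query token counts and the quoted per-1M prices into a small cost function; results match the table to within rounding:

```go
package main

import "fmt"

// Pricing per 1M tokens and query volume, as quoted in this article.
const (
	inputPrice  = 1.25  // $ per 1M uncached input tokens
	cachedPrice = 0.125 // $ per 1M cached input tokens
	outputPrice = 10.0  // $ per 1M output tokens
	queries     = 300_000
)

// cost returns total dollars for a per-query token profile, with a
// fraction of input tokens assumed to be served from the cache.
func cost(inputTok, outputTok, cachedFrac float64) float64 {
	in := inputTok * queries / 1e6  // millions of input tokens
	out := outputTok * queries / 1e6 // millions of output tokens
	return in*(1-cachedFrac)*inputPrice + in*cachedFrac*cachedPrice + out*outputPrice
}

func main() {
	fmt.Printf("before, no cache:   $%.2f\n", cost(31970, 2332, 0))
	fmt.Printf("after,  no cache:   $%.2f\n", cost(9705, 395, 0))
	fmt.Printf("before, 20%% cached: $%.2f\n", cost(31970, 2332, 0.20))
	fmt.Printf("after,  20%% cached: $%.2f\n", cost(9705, 395, 0.20))
}
```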
This is a directional estimate, not a measured production average.
Final Takeaway
The less context an agent has to process, the faster and cheaper it will be. That sounds obvious, but it is easy to lose that discipline as a product grows.
More tenants, more integrations, and more product requirements usually mean more fields, broader outputs, and more rendering logic creeping into the model path. Unless those boundaries stay tight, inefficiency becomes the default.
In practice, the biggest gains often come from better contracts, not bigger models. Optimize what goes around the model, not just the prompt inside it.
Start by auditing one high-traffic tool-output path and measuring token, latency, and cost deltas before expanding.