AI Evals Aren’t a Tool. They’re a Mindset.

Evals are not a UI, a dashboard, or a platform: they're about data. Understanding what evaluation really means and building a culture of continuous evaluation.
ai evals
data pipelines
product analytics
Published: November 17, 2025

Over the last few years, we’ve been deeply invested in understanding how to evaluate the performance of our AI systems, especially our search and chat experiences. Along the way, we’ve built several prototypes, dashboards, and tools, each one teaching us something new. But despite all the progress, we noticed a pattern: we kept building interfaces before fully understanding what evaluating really means.

At some point, we realized that evaluation isn't about fancy UIs or sophisticated platforms. Evals are not a UI, a dashboard, or a platform: they're about data. Evaluation means collecting, transforming, and interpreting data so that, at any moment, you understand how your systems are performing and how they're improving over time.

What Really Matters in Evaluations

⚙️ Process data at scale.
We generate millions of events every month from our search and chat systems. The first challenge is building reliable data transformation pipelines. In our case, we lean on virtually unbounded cloud databases and serverless functions (lambdas) to collect and process events efficiently, but there are many ways to do this. The key is ensuring the data can flow, be transformed, and be aggregated without friction.
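
As a minimal sketch of that idea (the event shape and field names here are hypothetical, not our actual schema), the core of such a pipeline is just a pure function that rolls raw events up into daily aggregates, regardless of which cloud service runs it:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

# Hypothetical event shape; real search/chat events carry richer payloads.
@dataclass
class SearchEvent:
    timestamp: datetime
    query: str
    latency_ms: float
    num_results: int

def aggregate_daily(events: list[SearchEvent]) -> dict[str, dict[str, float]]:
    """Roll raw events up into per-day aggregates that dashboards
    and evals can consume without touching the raw logs."""
    buckets: dict[str, list[SearchEvent]] = defaultdict(list)
    for e in events:
        buckets[e.timestamp.date().isoformat()].append(e)
    return {
        day: {
            "events": len(evs),
            "avg_latency_ms": sum(e.latency_ms for e in evs) / len(evs),
            "zero_result_rate": sum(e.num_results == 0 for e in evs) / len(evs),
        }
        for day, evs in buckets.items()
    }
```

Keeping the aggregation logic as a plain function like this makes it trivial to drop into a lambda handler, a batch job, or a unit test.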

📏 Define the right metrics.
Decide what you want to measure and why. Sometimes it's simple metrics (like response times or the number of results returned), and sometimes it's more complex ones (like LLM-based verdicts or semantic accuracy scores). These definitions often come more from product and business needs than from engineering, but they determine whether or not you have your product under control.
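
One hedged sketch of what "defining a metric" can look like in practice: each metric is a named function over an evaluation record, and an LLM-based verdict is just another such function wrapped around a judge call. The record fields and the `judge` callable below are illustrative assumptions, not a specific provider's API.

```python
from typing import Callable

# An eval record is assumed to look like:
# {"question": ..., "answer": ..., "latency_ms": ...}
Record = dict

def latency_ms(record: Record) -> float:
    """Simple metric: read a number straight off the record."""
    return record["latency_ms"]

def make_llm_verdict(judge: Callable[[str], str]) -> Callable[[Record], float]:
    """Complex metric: wrap an LLM judge into a 0/1 verdict.
    `judge` is any callable that takes a prompt and returns text."""
    def verdict(record: Record) -> float:
        prompt = (
            "Answer strictly YES or NO: does the answer address the question?\n"
            f"Question: {record['question']}\nAnswer: {record['answer']}"
        )
        return 1.0 if judge(prompt).strip().upper().startswith("YES") else 0.0
    return verdict

# The registry is what product and engineering agree on together.
METRICS = {"latency_ms": latency_ms}  # an LLM verdict is added once a judge is wired in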

📊 Visualize results clearly.
Each metric should be visible and easy to interpret — whether in a table, a chart, or a trend graph. The goal is clarity, not complexity.
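
For instance, once the aggregates exist, a trend chart can be a few lines of pandas and matplotlib. The `daily_metrics.csv` file and its columns are hypothetical stand-ins for whatever your pipeline exports.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the daily aggregates: one row per day,
# with columns such as avg_latency_ms.
df = pd.read_csv("daily_metrics.csv", parse_dates=["day"]).set_index("day")

ax = df[["avg_latency_ms"]].plot(title="Search latency, daily average")
ax.set_ylabel("ms")
plt.tight_layout()
plt.savefig("latency_trend.png")
```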

🧩 Evals are a source of high-quality reference datasets.
Evaluations are an efficient way to label data at scale. A human-curated selection of that labeled data yields high-quality datasets you can use to benchmark new versions of your system before they go live, and to fine-tune or train better models over time, increasing quality and performance while reducing cost.
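
A minimal sketch of that curation step, assuming (hypothetically) that reviewed eval records carry a `human_label` field: keep only the human-reviewed records and write them out as a JSONL benchmark set for the next release.

```python
import json

def build_reference_set(records: list[dict], out_path: str) -> int:
    """Keep only human-reviewed eval records and write them as JSONL.
    Returns the number of curated examples."""
    curated = [r for r in records if r.get("human_label") in {"good", "bad"}]
    with open(out_path, "w") as f:
        for r in curated:
            f.write(json.dumps({
                "question": r["question"],
                "answer": r["answer"],
                "label": r["human_label"],
            }) + "\n")
    return len(curated)
```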

Borrowing the Mindset from Product Analytics

In many ways, building AI evals isn’t so different from tracking product KPIs. The main difference is that much of our data is unstructured — often text — and requires LLMs to interpret. But the principles are the same: focus on data, metrics, and insights, not on reinventing analytics tooling.

That’s why we now rely on existing analytics platforms and standard cloud services to process and visualize events and evaluation results. That way we spend most of our time where it truly matters — defining metrics, building robust pipelines, and making decisions based on real data.

By fostering a culture of continuous evaluation — one built on clear metrics and real datasets — we’re not just measuring quality. We’re creating the foundations to improve it every single day.

And that’s the real goal of AI evals: not building another tool, but building understanding.