Testing in the Era of AI

From ChatGPT experiments to production-ready E2E testing: how Stagehand, Browserbase, and Gemini 2.5 Flash enabled a 70% reduction in test execution time.
Tags: ai, testing, e2e, automation, stagehand

Published: November 24, 2025

The arrival of AI agents has radically transformed how we interact with software. However, there is a massive gap between “playing around” with an agent in a chat interface and orchestrating a robust solution in a production environment.

Recently, I set out to experiment with how these advancements could facilitate the automation of End-to-End (E2E) tests. My goal was simple: reduce the friction of manual testing without sacrificing reliability. What started with simple tests in ChatGPT evolved into a scalable technical architecture based on Stagehand, Browserbase, and Gemini 2.5 Flash.

Below, I detail the technical evolution of this implementation, the obstacles we hit, and how we cut total test execution time by 70%.


Phase 1: The Limitation of “Agent Mode” (ChatGPT)

My first approach was to use ChatGPT’s Agent Mode. The premise is seductive: describe a user flow in natural language and watch the AI navigate, click, and validate results.

The technology works, the results are surprising, and the agent’s capability borders on the magical. For prototyping quick flows, it is an exceptional tool. However, when attempting to translate this into an engineering environment, the limitations became evident:

  • Lack of CI/CD Integration: The execution lives isolated within the chat. There is no API to trigger these tests automatically after a Pull Request.
  • Collaboration and Versioning: Relying on manually copy-pasting prompts eliminates any real automation benefit. Without code, there is no version control (Git), no execution history, and team collaboration becomes chaotic.
  • Human Dependency: In flows requiring authentication or credential management, the agent often halts, waiting for human intervention.
  • Non-deterministic Environment: Since you don’t control the browser or the execution context, reproducing a specific error for troubleshooting is practically impossible.

Conclusion: Chat is ideal for assisting, iterating, and validating. Code is what allows us to scale, automate, and monitor.


Phase 2: Programmatic Architecture with Stagehand

To address the scalability and operational issues, I migrated to Stagehand. This tool abstracts away the complexity of “human navigation” while living inside our codebase (TypeScript).

The brilliance of Stagehand is that it doesn’t try to reinvent the wheel; instead, it introduces intention-based interaction primitives:

  • act() (The Executor): Replaces the traditional click/type. You don’t tell it where to click (selector); you tell it what you want to achieve.
    • Ex: await stagehand.act("Log in with the saved credentials");
  • extract() (The Analyst): Converts unstructured HTML into typed JSON objects.
    • Ex: “Extract the price and stock of this product.”
  • eval() (The Judge): The evolution of assertions. It evaluates logical conditions based on visual and contextual information.
    • Ex: const isError = await stagehand.eval("Is there a wrong password error?");
  • observe() (The Eyes): Allows the script to understand what interactive elements exist in the current state before making a decision.

With these primitives, we shift from “giving instructions to the DOM” to describing behaviors.
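
To make these primitives concrete, here is a minimal sketch of a typed extraction. It assumes Stagehand’s documented extract call with a Zod schema; the exact surface (stagehand.extract vs. page.extract) varies between versions:

import { z } from "zod";

// The AI reads the rendered page and returns an object that is
// validated against the Zod schema at runtime.
const product = await stagehand.page.extract({
  instruction: "Extract the price and stock of this product",
  schema: z.object({
    price: z.number().describe("Price in USD"),
    stock: z.number().describe("Units available"),
  }),
});

// The result is plain data, so ordinary assertions apply
if (product.stock === 0) throw new Error("Product is out of stock");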


The Technical Architecture: Browserbase and Gemini

While Stagehand is versatile and can run locally or in headless mode on GitHub Actions, basic execution often falls short for a demanding production environment. Validating on a standard runner works but lacks the deep observability necessary for a modern QA pipeline.
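
For context, bootstrapping Stagehand takes only a few lines. A minimal local setup might look like the following sketch (based on the documented constructor; option names may differ between versions):

import { Stagehand } from "@browserbasehq/stagehand";

// env: "LOCAL" runs the browser on this machine or a CI runner;
// switching it to "BROWSERBASE" delegates execution to the cloud.
const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();

await stagehand.page.goto("https://example.com");
await stagehand.act("Open the login form");

await stagehand.close();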

To close those gaps, we consolidated the technical architecture around two pieces:

1. Infrastructure and Persistence (Browserbase)

Delegating browser execution to Browserbase solved two critical problems:

  • Context Management (Persistent Sessions): One of the biggest bottlenecks in E2E is the time lost on repetitive logins. Leveraging the Contexts functionality, we authenticate once, save the state (cookies, storage), and reuse it. This lets us run tests in parallel with different roles without repeating authentication in every case (see the sketch after this list).
  • Advanced Observability: Traditional text logs (Error: element not found) are insufficient for understanding visual glitches or intermediate states. Browserbase enables Session Replay and execution across multiple viewports, allowing us to debug failures by seeing exactly what happened on the screen during each run.
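
The context-reuse flow mentioned above is roughly: create a context once, log in inside it, then attach every later session to it. A sketch using the Browserbase Node SDK (method names follow its public docs; treat the details as version-dependent):

import Browserbase from "@browserbasehq/sdk";

const bb = new Browserbase({ apiKey: process.env.BROWSERBASE_API_KEY! });
const projectId = process.env.BROWSERBASE_PROJECT_ID!;

// One-time setup: create a context, then perform the login in a
// session attached to it so cookies and storage get saved.
const context = await bb.contexts.create({ projectId });

// Every subsequent run reuses the saved context: state is restored
// and the test starts already authenticated.
const session = await bb.sessions.create({
  projectId,
  browserSettings: {
    context: { id: context.id, persist: true },
  },
});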

2. Model Optimization (Gemini 2.5 Flash)

Initially, tests were conducted with heavier models like GPT-4. Although accuracy was high, the latency per instruction penalized the total suite time.

Following Stagehand community benchmarks comparing cost, accuracy, and speed, we migrated the inference engine to Gemini 2.5 Flash. The results were immediate:

  • Latency Reduction: Execution time per test dropped from ~30 seconds to ~15 seconds.
  • Cost Efficiency: Because the model is optimized for high-frequency use, operational cost dropped drastically, allowing continuous runs on every commit without straining the budget.
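
Swapping the inference engine was a configuration change rather than a rewrite. A sketch of the switch (the model identifier and option names follow Stagehand’s docs and may differ across versions):

import { Stagehand } from "@browserbasehq/stagehand";

// Point Stagehand's inference at Gemini 2.5 Flash; lower latency
// per instruction compounds across the whole suite.
const stagehand = new Stagehand({
  env: "BROWSERBASE",
  modelName: "google/gemini-2.5-flash",
  modelClientOptions: {
    apiKey: process.env.GOOGLE_API_KEY,
  },
});
await stagehand.init();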

Technical Comparison: Imperative vs. Declarative

The deepest change isn’t just the tools, but the paradigm. Until now, UI testing was written in imperative code, which is very brittle in the face of frontend changes. Stagehand, with its AI integration, allows us to define tests based on intentions.

Traditional Approach (CSS/XPath Selectors): The test must know the exact identifiers in the markup, which makes it fragile: if a developer renames a CSS class, the test breaks.

// If the ID changes or the button moves, the test fails
await page.fill('#username-field', process.env.USER ?? '');
await page.click('.submit-btn-wrapper > button');

// Cryptic "element not found" logs if it fails
await expect(page.locator('.dashboard-header')).toBeVisible();

Modern Approach (Stagehand + AI): More resilient and readable. The AI interprets the interface as a human would, regardless of changes in the underlying code.

// The instruction describes the intention, the AI solves the execution
await stagehand.act("Log in with standard user credentials");
await stagehand.act("Navigate to the configuration panel");

// Semantic validation
const isVisible = await stagehand.eval("Verify that the sales chart is visible");
if (!isVisible) throw new Error("The chart did not load");

Instead of tying ourselves to a selector name, we tie ourselves to the behavior expected by the user.


Use Cases and Conclusion

This architecture is not limited to QA. The ability to navigate the web intelligently opens doors to broader automations:

  • RPA (Robotic Process Automation): Automating tedious administrative tasks, such as monthly invoice downloads or routine dashboard updates.
  • Synthetic Monitoring: Periodic validation of critical flows in production from different geolocations and devices.
  • Resilient Data Scraping: Extracting data from highly dynamic sites where traditional scrapers, based on rigid selectors, frequently fail.

The era of maintaining manual CSS selectors is coming to an end. The era of designing behaviors and letting AI execute has just begun.