Safely Migrating Production from GPT-4o to GPT-4.1: An Evaluation-Driven Approach

How we migrated models in production while minimizing risks through a comprehensive multi-level evaluation system.
Tags: LLM, Evaluation, GPT-4, Production

Authors: Tomás de las Casas, Jose Manuel Martinez, Nayara Rodríguez, Borja García at The Agile Monkeys

Published: July 24, 2025

TL;DR: Our evaluation showed that migrating from OpenAI’s GPT-4o to GPT-4.1 would deliver equal or better performance in semantic understanding and classification quality, without negatively impacting user experience.


Introduction

When OpenAI releases updates to its foundational models, engineering teams face a critical decision: when and how should we upgrade our production systems? While the GPT-4.1 model promises improvements in language understanding, inference speed, latency, and cost, switching models without proper evaluation can introduce regressions in business-critical tasks.

To address this challenge, we designed a multi-layer evaluation protocol that would give us confidence in our migration decision while minimizing risks to our production e-commerce application.


Our Evaluation Strategy

We structured our migration process around two complementary evaluation tracks:

Prompt-Level Evaluations

This track focuses on understanding how different models and prompts perform in isolation. We know exactly what the model should return for each query, making it easier to measure success objectively.

For this evaluation, we created a dataset containing 165 carefully crafted queries with their expected outputs. We then tested multiple prompt versions across both GPT-4o and GPT-4.1 to understand how each model responds to different instructions.
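For illustration, here is a minimal sketch of what this kind of prompt-level harness can look like, assuming a JSONL file of query/expected pairs and the OpenAI Python SDK; the file format, helper names, and exact-match comparison are simplifications for this post, not our production code.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def load_dataset(path: str) -> list[dict]:
    """Load {"query": ..., "expected": ...} pairs from a JSONL file (illustrative format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def prompt_pass_rate(model: str, system_prompt: str, dataset: list[dict]) -> float:
    """Return the fraction of queries whose raw output matches the expected output."""
    passed = 0
    for case in dataset:
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # keep outputs as deterministic as possible for a fair comparison
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": case["query"]},
            ],
        )
        output = (response.choices[0].message.content or "").strip()
        # Exact match here for simplicity; a real harness may normalize or parse JSON first.
        if output == case["expected"]:
            passed += 1
    return passed / len(dataset)


# Example: compare both models on the same prompt version.
# dataset = load_dataset("prompt_eval_dataset.jsonl")  # 165 queries in our case
# for model in ("gpt-4o", "gpt-4.1-2025-04-14"):
#     print(model, prompt_pass_rate(model, open("prompt_v26.txt").read(), dataset))
```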

Application-Level Evaluations

This track analyzes the real-world impact by examining the final set of products returned by our e-commerce application when using different models and prompts.

Since we’re testing the complete system, we needed to create a golden dataset based on our actual product catalog. We ran real-world queries through our application pipeline using multiple prompt versions, then measured and compared the overall accuracy of different model and prompt combinations.
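As a rough sketch of the scoring step, the snippet below compares the product IDs returned by the pipeline against the expected IDs for each golden query; the Jaccard-style overlap and the `run_search_pipeline` callable are assumptions for illustration, since the post does not spell out the exact formula or pipeline interface.

```python
def score_query(returned_ids: set[str], expected_ids: set[str]) -> float:
    """Score one query on a 0-1 scale using set overlap (Jaccard index).

    The exact formula is an assumption for illustration; any overlap-based
    metric (precision, recall, F1) would slot into the same harness.
    """
    if not returned_ids and not expected_ids:
        return 1.0
    return len(returned_ids & expected_ids) / len(returned_ids | expected_ids)


def evaluate_golden_dataset(golden: list[dict], run_search_pipeline) -> list[float]:
    """Run every golden query through the full application and score the results.

    `run_search_pipeline` is a hypothetical callable wrapping the real pipeline
    (prompt + model + product retrieval) and returning a list of product IDs.
    """
    scores = []
    for case in golden:
        returned = set(run_search_pipeline(case["query"]))
        expected = set(case["expected_product_ids"])
        scores.append(score_query(returned, expected))
    return scores
```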


What We Tested

Our evaluation included:

Prompt Versions:

  • v25: Our current production prompt used with GPT-4o
  • v26: A new prompt designed specifically for GPT-4.1, with a significantly smaller input token count achieved by moving tasks that can be handled reliably in code out of the LLM prompt
  • v27: A variation of v26 that reintroduces our domain-specific glossary to help the LLM better understand our e-commerce terminology

Models: GPT-4o and GPT-4.1

Evaluation Methods:

  • Prompt-level: Raw model responses analyzed through a dedicated REST endpoint, with outputs compared against expected classification and translation results (a sketch of driving such an endpoint follows this list)
  • Application-level: Complete system testing using our golden dataset, measuring accuracy on a 0-1 scale by comparing returned product IDs against expected results
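The snippet below shows roughly how such a REST endpoint can be driven for the prompt-level track; the URL, payload shape, and response fields are placeholders rather than our actual internal API.

```python
import requests

# Placeholder URL: the real endpoint, payload fields, and authentication are internal.
EVAL_ENDPOINT = "https://internal.example.com/api/prompt-eval"


def fetch_raw_response(query: str, model: str, prompt_version: str) -> dict:
    """Ask the evaluation endpoint for the raw model output for a single query."""
    payload = {"query": query, "model": model, "prompt_version": prompt_version}
    resp = requests.post(EVAL_ENDPOINT, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()  # e.g. {"classification": ..., "translation": ...}


# Each returned classification/translation can then be compared field by field
# against the expected values in the 165-query dataset.
```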

Results and Discussion

Prompt-level evaluation results

Our isolated prompt testing revealed promising results:

Prompt Version | Model              | Pass Rate | Execution Time
v25            | gpt-4.1-2025-04-14 | 81.82%    | 14.26 seconds
v25            | gpt-4o             | 85.86%    | 15.98 seconds
v26            | gpt-4.1-2025-04-14 | 93.33%    | 15.24 seconds
v26            | gpt-4o             | 91.72%    | 15.24 seconds
v27            | gpt-4.1-2025-04-14 | 89.70%    | 14.74 seconds
v27            | gpt-4o             | 89.29%    | 17.57 seconds

Key findings:

  • GPT-4.1 matched or exceeded GPT-4o performance on the new prompt versions (v26 and v27), though it trailed GPT-4o slightly on the legacy v25 prompt
  • Both v26 and v27 showed improved pass rates compared to the baseline
  • GPT-4.1 consistently achieved higher pass rates than GPT-4o for the new prompt versions

Application-level evaluation results

When we tested the complete system using our golden dataset, the results were even more encouraging. We measured semantic classification accuracy by comparing actual outputs against our expected results.

Prompt Version | Mean Score | Median Score
v25            | 0.807      | 0.933
v26            | 0.873      | 0.965
v27            | 0.873      | 0.965

Key findings:

  • The newer prompt versions (v26 and v27) significantly outperformed our baseline v25 prompt.
  • Importantly, v27 matched v26’s performance while providing better domain-specific understanding, giving us confidence in our prompt optimization approach.
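The mean and median figures in the table above can be reproduced from the per-query scores with the standard library; a minimal sketch, assuming the score lists come from a harness like the one outlined earlier:

```python
from statistics import mean, median


def summarize(scores: list[float]) -> dict[str, float]:
    """Aggregate per-query golden-dataset scores into the mean/median figures reported above."""
    return {"mean": round(mean(scores), 3), "median": round(median(scores), 3)}


# e.g. summarize(evaluate_golden_dataset(golden, pipeline_with_prompt_v27))
```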

Conclusion

Our comprehensive evaluation demonstrated that GPT-4.1 with prompt v27 outperforms our current GPT-4o with v25 setup, both in isolated testing and real-world application scenarios.

The numbers speak for themselves:

  • Application accuracy improved from ~80.7% (v25) to ~87.3% (v27) on our golden dataset
  • In prompt-level testing, GPT-4.1 reached a 93.33% pass rate with v26 and 89.70% with v27, up from 81.82% with the baseline v25 prompt

This evaluation gave us the confidence to proceed with the migration, knowing we’re not just maintaining performance but actually improving it while benefiting from GPT-4.1’s enhanced speed and cost efficiency.


Experimental Section

Raw Evaluation Data

For full transparency and reproducibility, all our evaluation data is available:

Comparison Scores:

Prompt Level Evaluations API Test Responses:


Acknowledgments

Special thanks to the engineering and data teams at The Agile Monkeys for their meticulous work on evaluation design and implementation. Their rigorous approach made it possible to carry out this migration with confidence.