This article summarizes the main themes and insights that sparked our internal debate after listening to Andrej Karpathy’s recent conversation with Dwarkesh Patel. It captures the points that resonated most with our team — what we found inspiring, debatable, or worth exploring further — and how these ideas connect to our day-to-day work building AI-powered products.
The “Cognitive Core” and Externalized Memory
Karpathy’s central idea is to separate a model’s reasoning from its memory. Instead of storing vast knowledge in weights, he suggests keeping a compact “cognitive core” focused on reasoning and problem-solving, while retrieving facts and context from external systems. This underscores the growing importance of information retrieval and echoes our belief that GenAI should be viewed as an enabler for processing and understanding unstructured data—not as a database. The real opportunity lies in building powerful retrieval systems that amplify smarter reasoning models, especially when working with private or proprietary data that isn’t publicly available.
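The separation described above can be sketched in a few lines. This is a toy illustration, not a production architecture: the document store is a plain dictionary with keyword matching standing in for a real vector index, and `fake_llm` is a stub standing in for the compact reasoning model.

```python
# Toy external memory: in production this would be a vector store or
# search index; here it is a plain keyword match over a small corpus.
DOCUMENTS = {
    "pricing": "Enterprise plan costs $99/seat/month.",
    "support": "Support hours are 9am-5pm UTC, Monday to Friday.",
}

def retrieve(query: str) -> list[str]:
    """Stand-in for a real retriever: match corpus keys against the query."""
    return [text for key, text in DOCUMENTS.items() if key in query.lower()]

def answer(query: str, llm) -> str:
    """Facts come from retrieval; the model only reasons over the context."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)

# Stub "cognitive core": any callable mapping prompt -> completion.
def fake_llm(prompt: str) -> str:
    # Echo the first context line, simulating a grounded answer.
    return prompt.splitlines()[1] if "Context:" in prompt else "I don't know."

print(answer("What are the pricing options?", fake_llm))
```

The point of the sketch is the boundary: knowledge lives in `DOCUMENTS` (swappable, auditable, private), while the model is only asked to reason over what was retrieved.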
The Decade of Agents — Not the Year
Karpathy predicts that this will be the decade of agents, not the year. Building robust autonomous systems still requires advances in continual learning, multimodal reasoning, and reliable computer use. Our team agreed: it’s more realistic to focus on assistive agents with bounded scope and autonomy, capable of executing tasks under human supervision, than to chase the mirage of full autonomy.
Why Software Is AI’s Natural Beachhead
As discussed in our session, software development is uniquely suited for AI integration because outcomes are measurable and reversible. Code either compiles or fails, tests can verify correctness, and metrics quantify performance. This quantifiability — as Andrej noted during the discussion — explains why AI-assisted coding has advanced faster than creative or legal applications. Software is the perfect playground for measurable, iterative improvement. What other domains share these measurable and reversible qualities? Use cases where outputs can be deterministically validated, tested, and audited may represent the next wave of near‑term opportunities for applied AI.
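What "deterministically validated" means in practice can be shown with a minimal sketch: a generated candidate implementation is accepted or rejected purely by executing checks against it. The `solution` name and the hardcoded "generated" source are illustrative assumptions.

```python
def validate(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    """Accept a candidate implementation only if every test case passes."""
    namespace: dict = {}
    exec(candidate_src, namespace)  # load the candidate into an isolated namespace
    fn = namespace["solution"]
    return all(fn(*args) == expected for args, expected in tests)

# Pretend this string came from a code-generating model.
generated = "def solution(a, b):\n    return a + b\n"

print(validate(generated, [((1, 2), 3), ((0, 0), 0)]))  # accepted
print(validate("def solution(a, b):\n    return a - b\n", [((1, 2), 3)]))  # rejected
```

The feedback signal is binary and reproducible, which is exactly the property that creative or legal outputs lack.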
Smaller Models, Bigger Leverage
We were struck by Karpathy’s emphasis on the potential of small, specialized models. Many tasks don’t require billion-parameter LLMs; a smaller model combined with retrieval and rules can achieve similar quality at a fraction of the cost. Our team highlighted how this approach can significantly reduce infrastructure costs and enhance performance—particularly in production environments where token usage scales fast and response speed directly shapes perceived product quality. We’ve been applying this principle by fine‑tuning smaller models to handle tasks once reserved for the largest ones. The key is data: start with a powerful model to generate high‑quality examples, curate real usage data, and iteratively refine the smaller model until it meets or surpasses an acceptable quality threshold.
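The three-step loop above (generate with a strong model, curate, refine the small one until it clears a bar) can be sketched as follows. Everything here is a stand-in: `teacher` substitutes for a frontier-model API call, and `train_student` substitutes for a real fine-tuning job.

```python
def teacher(inp: str) -> str:
    """Stand-in for a powerful model producing a gold label (toy task: uppercasing)."""
    return inp.upper()

def build_dataset(inputs: list[str]) -> list[tuple[str, str]]:
    """Step 1: use the powerful model to generate high-quality examples."""
    return [(x, teacher(x)) for x in inputs]

def train_student(dataset: list[tuple[str, str]]):
    """Step 2: 'fine-tune' the small model (here: memorization plus a fallback rule)."""
    memory = dict(dataset)
    return lambda x: memory.get(x, x.upper())

def evaluate(student, held_out: list[str], threshold: float = 0.95) -> bool:
    """Step 3: keep iterating until the student meets the quality bar."""
    correct = sum(student(x) == teacher(x) for x in held_out)
    return correct / len(held_out) >= threshold

inputs = ["hello", "small models", "big leverage"]
student = train_student(build_dataset(inputs))
print(evaluate(student, ["unseen example", "hello"]))
```

In a real pipeline the held-out set would also absorb curated production traffic, so the quality bar tracks what users actually ask for.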
The Value of Continuous Evaluation
Another recurring theme was the importance of evals as a strategic process, not just QA. Teams that continuously transform user data and incidents into automated evaluations will build faster, safer, and more reliable systems. Static, dataset-driven evaluations remain invaluable: they play the role unit tests play in traditional software, a baseline gate to clear before deploying an AI-first application. At the same time, our team is heavily investing in live evaluations that continuously benchmark real-world usage. These systems help us respond to unexpected usage patterns and serve as a rich source of new, high-quality samples to refine and expand our static evaluation datasets over time.
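One concrete pattern for turning incidents into evals is to capture each production failure as a permanent regression case that is replayed on every release. The `EvalCase` and `run_evals` names, and the incident ticket, are hypothetical, not a specific framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # deterministic assertion on the model output
    source: str                   # provenance, e.g. the incident that produced it

def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Replay every captured case and report the pass rate."""
    passed = sum(case.check(model(case.prompt)) for case in cases)
    return passed / len(cases)

# A past incident ("model leaked an internal URL") becomes a permanent check.
cases = [
    EvalCase(
        prompt="Summarize the release notes.",
        check=lambda out: "internal.corp" not in out,
        source="INC-1234",  # hypothetical ticket reference
    ),
]

def model(prompt: str) -> str:
    """Stand-in for the deployed model."""
    return "Release adds dark mode and faster search."

print(f"pass rate: {run_evals(model, cases):.0%}")
```

Because each case records its `source`, the eval suite doubles as a living history of what has gone wrong in production and stays verifiably fixed.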
Where the Debate Remains Open
We discussed several open questions:
- How small can a cognitive core realistically be before quality collapses?
- When does it make sense to fine-tune or build custom models instead of relying on well-prompted general ones?
- How do we design memory layers that combine persistent user context, transient session data, and structured retrieval systems?
Closing Thoughts
Karpathy’s conversation was a refreshing reminder that progress in AI is as much about engineering discipline as about model scale. As a team, we found inspiration in his pragmatism: simplify architectures, externalize memory, prioritize measurable outcomes, and view agents as tools for augmentation, not autonomy. The future of AI is exciting, and discussions like this remind us that we are only at the start of the journey.
These reflections were compiled from an internal discussion by The Agile Monkeys team following Andrej Karpathy’s interview with Dwarkesh Patel (October 2025). The recording and full transcript are available at dwarkesh.com/p/andrej-karpathy.