```mermaid
graph LR
    A["New request"] --> B["Entry point"]
    B --> C["Routing / orchestration"]
    C --> D["Validation & authorization"]
    D --> E["Storage / persistence"]
    E --> F["Downstream processing"]
    F --> G["Notifications / side effects"]
    G --> H["Error handling / retries"]
```
When a mature system lands on your desk with the instruction “just add this one capability,” the most dangerous thing you can do is start coding too early.
In older systems, the estimate is rarely about code volume. It’s about unknowns: hidden assumptions, undocumented flows, infrastructure that lives outside the repo, security controls that only exist in somebody’s head. The gap between what you think the system does and what it actually does is where estimates blow up.
We’ve been using LLMs for a very specific part of that problem: technical due diligence before implementation. The model reads an unfamiliar codebase fast enough that we can answer the question that actually matters: should we integrate with what already exists, extend it, or rebuild the capability from scratch?
The useful shift has been simple: let the LLM produce a map of the codebase, then make humans confirm that map before anyone commits to an estimate.
Start by asking for a flow map
The first thing we want from the model is a map of what exists, not a recommendation for what to build.
A good prompt is closer to technical archaeology than brainstorming:
```text
Read this codebase and map the end-to-end flow for capability X. Cite the
files, routes, handlers, jobs, and config involved. Separate clearly between:

1) behavior directly supported by source code
2) likely behavior inferred from naming or structure
3) important unknowns that require human confirmation
```
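As a sketch of how we assemble that prompt in practice: the helper below walks a repo, pulls files that mention the capability's keywords, and wraps them in the archaeology prompt. Everything here is an assumption to adapt to your stack: the keyword filter, the `.py`-only glob, the size cap, and the prompt wording itself. How the prompt actually gets sent to a model is deliberately left out.

```python
from pathlib import Path

# Template mirroring the "technical archaeology" prompt from the article.
ARCHAEOLOGY_PROMPT = """\
Read this codebase and map the end-to-end flow for capability {capability}.
Cite the files, routes, handlers, jobs, and config involved. Separate clearly between:
1) behavior directly supported by source code
2) likely behavior inferred from naming or structure
3) important unknowns that require human confirmation

{sources}
"""

def collect_sources(repo: Path, keywords: list[str], max_chars: int = 40_000) -> str:
    """Concatenate files whose path or content mentions any keyword.

    Assumptions: keywords are lowercase, only .py files matter, and a crude
    character budget is an acceptable stand-in for real context management.
    """
    chunks, used = [], 0
    for path in sorted(repo.rglob("*.py")):
        text = path.read_text(errors="ignore")
        if any(k in str(path).lower() or k in text.lower() for k in keywords):
            chunk = f"--- {path.relative_to(repo)} ---\n{text}\n"
            if used + len(chunk) > max_chars:
                break
            chunks.append(chunk)
            used += len(chunk)
    return "".join(chunks)

def build_prompt(repo: Path, capability: str, keywords: list[str]) -> str:
    return ARCHAEOLOGY_PROMPT.format(
        capability=capability,
        sources=collect_sources(repo, keywords),
    )
```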
💡 A note on prompts: These are starting points, not recipes. Every codebase has its own idioms, and you’ll need to adjust wording, scope, and follow-up questions to get useful output. If you copy these verbatim and get mediocre results, iterate on them before writing off the approach.
That three-way distinction matters. Without it, the model blends observation and invention into one very confident paragraph. You get a map you can’t trust.
What we want instead is a reviewable pipeline: the kind of flow map shown at the top of this article.
That artifact makes the existing system discussable. Vague “I think it works like this” turns into something the team can actually review. And it immediately exposes where the request assumes something exists that doesn’t, or ignores something that does.
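To keep that artifact machine-checkable, the model's answer can be parsed into the three buckets. This sketch assumes you instruct the model to label its sections `Observed:`, `Inferred:`, and `Unknown:` and to use `- ` bullets; adjust the labels to whatever convention your prompt actually enforces.

```python
# Minimal parser for the three-way map: observed vs inferred vs unknown.
# Section labels and bullet style are assumptions baked into the prompt.

def parse_flow_map(text: str) -> dict[str, list[str]]:
    sections: dict[str, list[str]] = {"observed": [], "inferred": [], "unknown": []}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = stripped.rstrip(":").lower()
        if header in sections and stripped.endswith(":"):
            current = header          # entering a new labeled section
        elif current and stripped.startswith("- "):
            sections[current].append(stripped[2:])
    return sections
```

Once the map is structured, the `unknown` list becomes the agenda for the human review, not an afterthought.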
We’ve had good results cross-checking the same mapping task with more than one model, then comparing outputs. Disagreement between models is informative: if two produce different maps from the same repo, humans should slow down and look closer.
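A minimal way to quantify that disagreement, assuming each model's map has been reduced to the set of files it cites (the file names below are illustrative):

```python
# Cross-check sketch: compare the file citations from two models' maps.
# A low Jaccard overlap is the "slow down and look closer" signal.

def map_disagreement(map_a: set[str], map_b: set[str]) -> dict:
    agreed = map_a & map_b
    union = map_a | map_b
    return {
        "agreed": agreed,
        "only_a": map_a - map_b,
        "only_b": map_b - map_a,
        "jaccard": len(agreed) / len(union) if union else 1.0,
    }
```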
Use the flow map to generate the questions that change the estimate
Once we have a draft map, the next step is structured questioning.
This is where the LLM earns its keep as a question generator for missing work. We ask something like:
```text
Given this observed flow, what security, operational, and scope questions must
be answered before deciding whether to integrate, extend, or rebuild? Group
them by risk area and keep each question specific.
```
The checklist that comes back is rarely publishable as-is, but it’s a strong starting point. We rewrite it in our own words and send it for confirm-or-correct review.
The categories that repeatedly change the estimate are consistent enough to be worth listing:
- Authentication and trust boundaries. Where does this input come from? Where is identity established? What parts of the trust chain live in the application versus external infrastructure?
- Authorization and tenancy. Who can trigger this flow, how is access mapped to the right account or tenant, and what happens when identity is known but permission isn’t?
- Validation and content safety. Are we validating structurally, semantically, or both? Are there hidden rules around formats, payload sizes, or attachment types? Where do malformed inputs go?
- Abuse prevention and backpressure. Where is throttling enforced? What happens under spikes, replayed requests, or noisy neighbors? Are retries idempotent, bounded, and observable?
- Downstream side effects. Which jobs, notifications, webhooks, or derived artifacts depend on this path? What must happen synchronously versus eventually? What constitutes partial success?
This is the point where a supposedly small feature reveals itself as either a straightforward extension, an integration into a partially built capability, or a genuine architectural mismatch that justifies a rebuild.
Without the checklist, teams discover that late. With it, they discover it while decisions are still cheap.
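One way to carry that checklist forward is as a small artifact with a confirm-or-correct status per question. This is a sketch, not a prescribed tool; the area names are just the categories above, and the status values are an assumption.

```python
from dataclasses import dataclass

# Risk areas mirroring the categories listed above.
RISK_AREAS = [
    "auth_trust_boundaries",
    "authorization_tenancy",
    "validation_content_safety",
    "abuse_backpressure",
    "downstream_side_effects",
]

@dataclass
class ChecklistItem:
    area: str
    question: str
    status: str = "open"   # open | confirmed | corrected
    answer: str = ""

def unresolved(items: list[ChecklistItem]) -> dict[str, list[str]]:
    """Group still-open questions by risk area: the review meeting agenda."""
    out: dict[str, list[str]] = {}
    for item in items:
        if item.status == "open":
            out.setdefault(item.area, []).append(item.question)
    return out
```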
LLMs are surprisingly good at spotting scope drift
One of the more valuable patterns we’ve found is using the model as a contradiction engine.
Once the existing flow is mapped, we give the model the observed architecture (supported by code and config) alongside the newly requested behavior (as described by stakeholders). Then we ask it to highlight tension:
Compare the requested change with the current implementation. What assumptions would break if we implement it the proposed way? Which parts look like an extension of the current design, and which parts imply a different architecture?
This works well because scope drift enters projects in harmless-looking language. “Can we make it work through this other path too?” “Can we move this responsibility to another component?” “Can we simplify by bypassing the existing workflow?” Those sentences sound small. A model that has already built a repo-level map can surface the consequences quickly: where trust boundaries shift, where orchestration gets duplicated, where isolation breaks, where failure handling becomes inconsistent, or where the operational burden exceeds anything the original estimate accounted for.
We treat that output as a sharp draft of the conversation humans need to have, not as a conclusion in itself.
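A toy version of that contradiction check can be made mechanical, assuming the flow map has been reduced to an ordered component list and the request names the components it intends to touch. Component names here are hypothetical.

```python
# Scope-drift sketch: anything the request skips relative to the observed flow
# (a bypassed trust boundary?) or adds to it (new architecture?) gets flagged.

def flag_drift(observed_flow: list[str], requested_path: list[str]) -> dict[str, list[str]]:
    observed, requested = set(observed_flow), set(requested_path)
    return {
        "bypassed": [c for c in observed_flow if c not in requested],
        "introduced": [c for c in requested_path if c not in observed],
    }
```

A non-empty `bypassed` list is exactly the kind of "can we simplify by bypassing the existing workflow?" sentence made visible.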
The guardrails that keep this useful
This approach only works if the team is disciplined about how the output is used. Our operating rules:
- Every claim should trace back to source: files, handlers, config, jobs, tests, or logs. If the model can’t point to evidence, the claim stays provisional.
- Use confirm-or-correct review. The goal is helping maintainers validate or falsify the map quickly, not producing a document that looks authoritative on its own.
- Separate observed behavior from desired behavior. Mature systems are full of “it should do this” stories. That’s not the same as “it does this today.”
- Treat infrastructure as a first-class unknown. A repo rarely contains the whole truth. Critical behavior may live in gateways, queues, policies, or provider configuration.
- Pull security and operations questions earlier, not later. The model shouldn’t replace security review. It should make sure that review happens before anyone commits to a plan.
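The first rule, evidence tracing, is mechanical enough to automate as a pre-review pass. A sketch, assuming each claim cites repo-relative file paths (claim text and file names below are illustrative):

```python
from pathlib import Path

# Guardrail sketch: a claim stays provisional unless every file it cites
# actually exists in the repo. A cheap first pass before human review.

def triage_claims(repo: Path, claims: dict[str, list[str]]) -> dict[str, str]:
    """claims maps claim text -> cited file paths; returns claim -> status."""
    result = {}
    for claim, cited in claims.items():
        if cited and all((repo / f).exists() for f in cited):
            result[claim] = "evidence-backed"
        else:
            result[claim] = "provisional"
    return result
```

This doesn't validate that the citation supports the claim, only that it points somewhere real; the confirm-or-correct review still does the actual work.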
This makes the LLM less magical and more useful. It becomes a force multiplier for engineering judgment.
Key takeaway
The best use we’ve found for LLMs in “integrate vs rebuild” decisions is reducing ambiguity before planning begins. If the model can help you produce a reviewable end-to-end flow map, a concrete checklist of hidden security and operational assumptions, and a clear view of where the new request contradicts the current design, you’ve already shortened the riskiest part of the project.
Before your team commits to the next “small” change in a mature system, try letting the LLM read the codebase first. Force it to cite its work. Make humans confirm the map. Then decide whether to integrate, extend, or rebuild.