Evals Don't Give You a Working Product

Evals are the holy grail of AI engineering. Or so we've been told.

Two years. Billions in VC funding. Thousands of blog posts about "production-ready agents." An entire industry built around evaluation frameworks, observability platforms, and benchmarks.

The result?

  • 11% of organizations have agents in production. [Deloitte]

  • 40%+ of agentic AI projects will be cancelled by 2027. [Gartner]

  • 80%+ never reach meaningful production. [RAND]

If evals were the answer, these numbers would be different.

Here's what I've learned after two years of shipping agents: passing evals ≠ working product. You can have a green test suite and a broken product. You can hit 95% on your benchmark and watch your agent choke the moment a real user touches it.

Evals don't get you to production. A working product does.

The Pitch vs. The Reality

Here's what the eval-industrial complex told us:

"Evals are the key to production-ready agents" — Databricks

Here's what actually happens:

You build an agent in a Python script. It works. You run your eval suite. Green lights everywhere. You demo it to stakeholders. They love it. Then you try to ship it.

Everything falls apart.

What Evals Don't Test

Your eval suite said the agent was ready. Here's what it missed:

Your agent isn't a function — it's a process. A single response might take 30 seconds. Or 3 minutes. Or 10 minutes if it's doing research. Traditional servers handle stateless request-response cycles in milliseconds. Your agent thinks, waits, calls tools, thinks again. Try fitting that into a Lambda behind an API gateway that cuts the request off after 29 seconds.
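
To make that concrete, here's a rough sketch of the shape a long-running agent turn wants: the HTTP request returns immediately and the agent keeps working in the background. FastAPI is just one possible server, and run_agent() and the in-memory jobs dict are placeholders for your own agent and job store, not a prescribed design.

```python
# Sketch: decouple the HTTP request from the agent run. run_agent() is a
# stand-in for a turn that may take minutes (tool calls, retries, research).
import asyncio
import uuid

from fastapi import FastAPI, HTTPException

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory for illustration only; this is exactly the state that breaks at scale


async def run_agent(prompt: str) -> str:
    """Placeholder for a long-running agent turn."""
    await asyncio.sleep(60)  # stands in for minutes of thinking and tool use
    return f"answer for: {prompt}"


async def worker(job_id: str, prompt: str) -> None:
    jobs[job_id]["status"] = "running"
    jobs[job_id]["result"] = await run_agent(prompt)
    jobs[job_id]["status"] = "done"


@app.post("/runs")
async def start_run(prompt: str):
    # Return immediately; the agent keeps working after the response is sent.
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    asyncio.create_task(worker(job_id, prompt))
    return {"job_id": job_id}


@app.get("/runs/{job_id}")
async def get_run(job_id: str):
    if job_id not in jobs:
        raise HTTPException(status_code=404)
    return jobs[job_id]
```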

State breaks at scale. Works great with 1 user on 1 container. Add more users? State bleeds across sessions. Add more containers? State disappears entirely. Store it in memory? Gone when the process dies. Store it in a database? Now you're building infrastructure you didn't plan for.
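
One way out is to stop trusting process memory at all and keep session state in a shared store keyed by session id, so any container can pick up the conversation and a restart loses nothing. A minimal sketch, assuming a Redis instance is available; the key layout and message schema are illustrative, not prescriptive.

```python
# Sketch: session state lives in a shared store, not in the process.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def load_session(session_id: str) -> list[dict]:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else []


def append_turn(session_id: str, role: str, content: str) -> None:
    history = load_session(session_id)
    history.append({"role": role, "content": content})
    # Expire idle sessions after 24 hours so the store doesn't grow forever.
    r.set(f"session:{session_id}", json.dumps(history), ex=60 * 60 * 24)
```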

Streaming is harder than it looks. In your notebook, responses just appeared. In production, users stare at a blank screen for 8 seconds wondering if the app crashed. You try SSE. Then WebSockets. Then you realize you need durable streams that survive network hiccups, handle backpressure, and resume gracefully after disconnects.
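
Here's a bare-bones sketch of that pattern with FastAPI and SSE: each event carries an id, a reconnecting client sends Last-Event-ID, and the server skips what was already delivered. generate_tokens() stands in for your agent's streaming output; a real resumable stream would also buffer events server-side instead of regenerating them.

```python
# Sketch: SSE with event ids so a reconnecting client can resume.
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_tokens(prompt: str):
    for token in ["Thinking", " about", " your", " question", "..."]:
        await asyncio.sleep(0.2)  # stands in for model latency
        yield token


@app.get("/stream")
async def stream(prompt: str, request: Request):
    last_id = int(request.headers.get("last-event-id", "-1"))

    async def event_source():
        i = -1
        async for token in generate_tokens(prompt):
            i += 1
            if i <= last_id:
                continue  # skip what the client already received
            yield f"id: {i}\ndata: {token}\n\n"
            if await request.is_disconnected():
                return  # stop burning tokens for a client that's gone

    return StreamingResponse(event_source(), media_type="text/event-stream")
```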

The real world doesn't mock. Your agent calls an external API. In testing, mocks returned clean data every time. In production, the API times out. Returns malformed JSON. Hits rate limits. Requires re-authentication mid-session. Your agent chokes. Your eval suite never saw it coming.
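
A sketch of what a production-grade tool call looks like, using httpx as one example client; the endpoint and expected fields are made up. The point is the shape: a timeout, retries with backoff for transient failures, and validation of the payload before the agent ever sees it.

```python
# Sketch: an external API call hardened for production. URL and fields are
# illustrative; swap in your own tool endpoint and schema.
import time

import httpx


def fetch_weather(city: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            resp = httpx.get(
                "https://api.example.com/weather",
                params={"city": city},
                timeout=10.0,
            )
            resp.raise_for_status()  # turns rate limits and 5xxs into exceptions
            data = resp.json()       # raises ValueError on malformed JSON
            if "temperature" not in data:
                raise ValueError(f"unexpected payload: {data!r}")
            return data
        except (httpx.HTTPError, ValueError) as exc:
            if attempt == retries - 1:
                # Give the agent something it can reason about, not a traceback.
                return {"error": str(exc)}
            time.sleep(2 ** attempt)  # back off before retrying
```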

Agents fail in production because the runtime is inadequate, not because the model isn't smart enough. Evals measure none of it.

We've been obsessing over the brain while ignoring the nervous system.

The Trap: Evals Too Early

Here's the thing that really kills projects: writing evals before you have a working product.

Every hour spent writing evals is an hour not spent learning what your product actually needs. You're locking yourself into test cases for a system that doesn't exist yet.

The agent you're building now? It's not the one that's going to ship. It's going to be the second iteration. Or the fifth. The eval suite you wrote for version one is useless for version three. Worse than useless — it's weight you're dragging around.

The eval-industrial complex sold you on this idea that evals-first is disciplined. It's not.

The right sequence:

  1. Build something that runs
  2. Get it in front of real users (internal users are fine)
  3. Learn what breaks, what matters, what "good" actually looks like
  4. Then write evals to lock in that understanding

You can't evaluate what you can't run.

What Evals Are Actually Good For

I'm not saying evals are useless. They're critical — for model providers shipping foundation models. If you're training GPT-5, you need benchmarks. Even for AI engineers building products on top of those models, evals help with:

  • Catching regressions after you change something
  • Comparing model versions
  • Compliance checkboxes

That's it. They won't help you ship. They won't help you scale. They won't help you handle the thousand edge cases that only appear in production.

What Actually Gets You to Production

The market says: Evals → Observability → Production.

This is backwards. Here's what actually works:

Runtime → Production → (Evals + Observability)

The foundation comes first. Everything else is a support layer.

The foundation:

  • A runtime that handles the weird stuff. Concurrent users. Failure recovery. Long-running stateful processes that survive container restarts. Your agent isn't a microservice — stop treating it like one.

  • State management that doesn't disappear. Sessions that survive crashes. Context that carries across conversations. Memory that doesn't evaporate when Kubernetes decides to reschedule your pod.

  • Storage that lives with the agent. The agent's data — sessions, memory, knowledge — stored where the agent runs. In your cloud. Under your control. Send it to a third-party service and you've lost control of your product's brain.

  • Infrastructure you own. Your environment. Your data. Your competitive advantage.

The support layer (after you're running):

  • Observability for real production behavior — not synthetic test traces.
  • Evals to catch regressions — run them in CI, keep them lean (sketch below).
  • Tracing to debug when things go wrong.

The support layer matters. But without the foundation, you're just testing in a notebook.
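
For the regression evals specifically, lean really does mean lean. Here's one possible sketch with pytest, where run_agent() is a stand-in for however you invoke your agent and the cases come from things that actually broke in production, not from a benchmark.

```python
# Sketch: a small regression suite that runs in CI. run_agent() is a
# placeholder for your agent's entry point; the cases are illustrative.
import pytest

REGRESSION_CASES = [
    # (prompt, substring the answer must contain)
    ("What's our refund window?", "30 days"),
    ("Cancel my subscription", "confirm"),
]


def run_agent(prompt: str) -> str:
    raise NotImplementedError("wire this to your agent's entry point")


@pytest.mark.parametrize("prompt,expected", REGRESSION_CASES)
def test_agent_regressions(prompt: str, expected: str):
    answer = run_agent(prompt)
    assert expected.lower() in answer.lower()
```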

The Questions That Actually Matter

You have a working agent in a Python script. Great. Now answer these:

  • Where will it run?
  • Can it handle 100 concurrent users? 1,000?
  • What happens when a container crashes mid-conversation?
  • Is streaming smooth or do users watch a loading spinner for 10 seconds?
  • Where does the agent's memory live? Who owns it?
  • How do you deploy updates without breaking active sessions?

Evals don't answer any of these questions. The runtime does.

The Path Forward

I built Agno because I got tired of watching good agents die in the gap between "works in a notebook" and "runs in production."

Agno is a runtime for agents. It handles the stuff evals can't test:

  • Concurrent execution — thousands of users, isolated state
  • Persistent storage — sessions survive crashes, memory persists across conversations
  • Streaming that works — SSE out of the box, handles disconnects gracefully
  • Your infrastructure — runs in your cloud, data never leaves your environment
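
A minimal example, based on Agno's quickstart pattern; exact import paths and parameters may differ between versions, so treat this as a sketch rather than a reference.

```python
# Minimal Agno sketch (names follow the project's quickstart; verify against
# the version you install).
from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    markdown=True,
)

# Stream the response instead of blocking until the full answer is ready.
agent.print_response("Summarize today's support tickets.", stream=True)
```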

The eval-industrial complex had its shot. Two years. Billions in funding. The production numbers haven't moved.

Maybe it's time to focus on actually shipping.

Want to build with Agno?

Production means a working product deployed to your cloud — not a green eval suite running on your laptop.