Scaling Agentic Software: Part 1

What is the simplest architecture for running a multi-agent system at scale?

I want to deploy agents as a real service. Multi-user, with RBAC and JWT-based auth. Sessions, memory, and knowledge backed by a database. Horizontally scalable, able to serve thousands of concurrent requests. The kind of product you'd actually ship to users.

Could the answer be: a FastAPI app and a Postgres database?

So I spent some time building one to find out. 14 agents, 11 multi-agent teams, 5 workflows. Hundreds of tools, approvals, evals, schedules. All running in a single FastAPI process against a single PostgreSQL database. It's open source: Demo AgentOS.

I'll walk through the architecture in this post. In the next one we'll dive into what breaks when you push it.

The Bar

"Scale" gets thrown around quite a bit. In this case, scale means breadth. The surface area of a real product. Every concern a CTO would need to address before shipping to users:

Multi-user and multi-tenancy. Every user gets their own sessions, memory, and context. The system isolates every resource an agent touches, across every user, on every request.

Note: Context bleeding is a data breach, not a bug.

Auth and RBAC. JWT verification, role-based access control, scoped permissions. This applies to the API layer, the agents, the tools they call, and the data they can access. Dev and production should have different security postures.
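The shape of that check is worth seeing concretely. Here's a minimal sketch of HS256 JWT verification plus a role gate in plain Python. This is illustrative, not the AgentOS code: a real deployment would use a maintained library and pull keys from a secrets manager, and every name here is hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"dev-only-secret"  # hypothetical; production keys come from a secrets manager


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def sign_jwt(claims: dict) -> str:
    """Mint an HS256 JWT (demo only, so the sketch is self-contained)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def verify_jwt(token: str) -> dict:
    """Check the signature and return the claims, or raise ValueError."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(
        SECRET, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))


def require_role(claims: dict, role: str) -> None:
    """RBAC gate: reject requests whose token lacks the needed role."""
    if role not in claims.get("roles", []):
        raise PermissionError(f"missing role: {role}")
```

In a FastAPI app this pair would live inside a dependency, so every endpoint, tool, and data access path resolves the same verified identity.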

Real persistence. Sessions, memory, and knowledge stored in a database, with regular backups and data access policies. Everything needs to comply with user-data protection laws like GDPR and CCPA.

Serving requests at scale. The system should handle thousands of concurrent requests, with streaming responses held open and background work (memory extraction, summarization, learning) running alongside the primary model call. All of it competes for the same HTTP transports, connection pools, and database connections. The hard part is not serving one request. It is serving the thousandth one without stalling the ninth.

Observability. Tracing every agent run, every tool call, every delegation in a multi-agent team. When something goes wrong at step 7 of a 12-step workflow, you need to see exactly what happened and why.

Governance. Layered authority over what agents can do. Some tools run freely. Some need user approval. Some need admin sign-off. Approval flows, audit trails, and the ability to pause execution mid-run.

Reliability and evals. Agents are testable software. You need smoke tests, tool call validation, LLM-judged accuracy, performance baselines. Without evals, every change is a guess.

If this is the bar, the question is: what's the simplest architecture that clears it?

The Architecture

One FastAPI process. One Postgres database. That's it.

The FastAPI app serves the 14 agents, 11 multi-agent teams, and 5 workflows through REST endpoints. Every request is a POST; every response is a server-sent event stream.

The database does more than you'd think. Postgres stores agent sessions, user memory, knowledge content, learnings, schedules, and eval results. pgvector handles embeddings for the knowledge bases.

The Components

The 30+ components in the AgentOS showcase different agentic patterns.


Some showcase HITL patterns. The Helpdesk agent wraps three tools: one that requires operator confirmation before restarting a service, one that pauses for user input on ticket priority, one that executes outside the agent runtime. The Approvals agent uses Agno's @approval decorator for blocking approval gates and audit-trailed operations. Both agents pause execution mid-run and resume on approval.

Some showcase guardrails. The Helpdesk agent has three pre-hooks: OpenAI moderation, PII detection, prompt injection detection. It also has a post-hook that scans responses for secret patterns (API keys, connection strings, SSNs) and rewrites them before they leave the process. An audit log hook records every run for compliance.
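A post-hook of that shape is mostly a handful of regexes. This is a toy sketch, not the Helpdesk agent's actual patterns; real secret detection covers far more formats:

```python
import re

# Hypothetical patterns for illustration; a real hook would cover
# provider-specific key formats and use entropy checks as a backstop.
SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"postgres(?:ql)?://\S+"), "[REDACTED_DSN]"),
]


def redact_secrets(text: str) -> str:
    """Post-hook: rewrite secret-looking substrings before the response leaves."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

The important design point is where it runs: on the output path, inside the process, so nothing that matches ever reaches the wire.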

Some showcase multi-agent teams. Pal is a personal knowledge agent with five specialists. Dash is a data analyst with an Analyst/Engineer split. Coda is a coding agent with five specialists including a Planner and a Triager. The Research and Investment teams each ship in four modes (coordinate, route, broadcast, tasks) so you can see how the same set of members produces different behavior under different coordination patterns.

Some showcase step-based workflows. Morning Brief gathers calendar, email, and news in parallel and synthesizes a briefing. AI Research runs four parallel researchers and synthesizes their findings. Content Pipeline does parallel research plus a loop that iterates until an editor approves. Support Triage classifies a ticket, routes it to a specialist, and escalates if severity is high.
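The gather-in-parallel-then-synthesize shape is plain asyncio underneath. A sketch with stub fetchers standing in for the real Morning Brief steps (the function names and return values here are invented for illustration):

```python
import asyncio


async def fetch_calendar() -> str:
    await asyncio.sleep(0)  # stands in for a real API call
    return "calendar: 2 meetings"


async def fetch_email() -> str:
    await asyncio.sleep(0)
    return "email: 5 unread"


async def fetch_news() -> str:
    await asyncio.sleep(0)
    return "news: 3 headlines"


async def morning_brief() -> str:
    """Run the three gather steps concurrently, then synthesize the results."""
    calendar, email, news = await asyncio.gather(
        fetch_calendar(), fetch_email(), fetch_news()
    )
    # In the real workflow, synthesis is a model call over the gathered context.
    return " | ".join([calendar, email, news])
```

The loop-until-approved pattern in Content Pipeline is the same idea with a while loop wrapped around a judge step; nothing about it requires a workflow engine.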

Some showcase state management. Taskboard demonstrates session state with agentic state updates. Injector demonstrates dependency injection through RunContext. Compressor demonstrates tool result compression with a cheaper model.

Some showcase scheduling. Morning Brief runs every weekday at 8am ET. AI Research runs every day at 7am UTC. The Scheduler agent lets users create, list, disable, and delete schedules at runtime through natural language.
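Under the hood, a schedule like "weekdays at 8am ET" reduces to a timezone-aware check against the clock. A sketch of just that check, assuming the scheduler ticks once a minute; the real one would also handle missed fires and persist schedules in the database:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def is_due(now_utc: datetime) -> bool:
    """True when a weekdays-at-8:00-ET schedule should fire for this tick."""
    local = now_utc.astimezone(ZoneInfo("America/New_York"))
    return local.weekday() < 5 and (local.hour, local.minute) == (8, 0)
```

Doing the comparison in the schedule's own timezone, not UTC, is what keeps "8am ET" correct across daylight-saving transitions.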

The point is not that you need all of these. The point is that a single FastAPI process can host them without the architecture getting complicated.

Governance as First-Class Infrastructure

Three layers of governance sit on top of every agent.

Pre-hooks run before the model sees the input. Moderation, PII detection, injection detection. If any hook raises, the request is rejected before a single token is generated.

Approval gates pause the run mid-execution. A tool decorated with requires_confirmation=True or @approval streams a RunPaused event to the client with the tool name and arguments. The client shows the user an approve/reject UI. On approval, the run resumes from where it paused. This works because session state is durable, stored in the database.

Post-hooks run on the output. The Helpdesk agent has an output guardrail that scans responses for secret patterns and rewrites them before they leave. Every run is audit-logged through a separate hook.

What's Not Here

No message queue. No worker pool. No separate vector database. No Redis. No microservices. No orchestrator service standing in front of the agents. No separate auth service.

Could you add them? Sure. Are they necessary to clear the bar I defined? Not yet. The point of this exercise is to find out where the simple architecture breaks, so the next decision (what to add) is grounded in actual load, not in speculation.

What's Next

Part 2 is what breaks when you scale this.

I'm going to load test it. Thousands of concurrent requests. Streaming responses held open. Background memory extraction competing with primary runs. Connection pools under pressure. I expect to find a few obvious bottlenecks and a couple of surprising ones.

Links: