What We Learned After a Month of Running an Agent Team

The Short Version

It worked. Then it broke. Then we fixed it — but not the way we expected.

A month ago we wrote about delivering software faster with an agent team — 14 AI agents, one human, 70/30 split. The original setup delivered. Scaffolding ran in parallel, Sprint 0 came together fast, the model proved agent-augmented engineering ships real code. Then the cracks showed.

What we landed on, and what we’ve been running on real client work for the past month, looks like this: 3 human pods + 10–12 AI agents. Each pod is a Team Lead plus two devs (one backend dev, plus a second seat that rotates between FE, QA, or BA depending on the project). The whole thing is pinned down by a shared UI framework, a shared boilerplate, and a deliberately small toolchain.

Outputs at the QA gate are now genuinely better than either a one-human-plus-many-agents setup or a traditional human team would produce. Here’s what got us there.

Where It Broke — Consistency Between Agents

The dominant failure mode was consistency drift between agents. Two agents working in adjacent modules would make compatible-looking-but-incompatible decisions: the same field named differently, the same error case handled three ways, the same auth check repeated with subtle variations.

In a human team, code review and conversation catch this. With a single human supervising 14 agents, by the time you notice, three more PRs have already built on the divergence.

What We Tried First — `AGENTS.md` + `CLAUDE.md`

The first mitigation was a pair of shared rule files at the repo root: AGENTS.md (the emerging universal convention every agent harness reads) and CLAUDE.md (Claude Code’s project-memory file). Together they acted as a single source of truth that every agent loads on every task — naming conventions, auth contract, RBAC patterns, error shape, the works.

It worked. For a while.

The failure mode it exposed: as context grew, agents started deviating from the rules. New modules brought new edge cases. Tool calls returned long outputs that pushed the rule file deeper into the prompt window. By the second week, agents would acknowledge a rule and violate it in the next turn. Classic context-window degradation, dressed up as “creative interpretation.”

Worse, when this happened, the human had to keep steering each agent back individually. The supervisory load climbed instead of dropping.

The Real Hazard — Context Poisoning

The worst failure mode is context poisoning: a wrong decision early in a session that the agent then defends and builds on for the rest of the conversation. Once it’s in the context window, every subsequent answer reinforces it. You don’t just have a bug — you have an agent confidently arguing the bug is correct.

That’s a whole new can of worms. Once we saw it happen, the one-human-supervisor model stopped being viable for anything beyond a prototype. The fix had to be structural, not just procedural.

The Pivot — 3 Human Teams, Not 1

We kept the agent loop and the role-based agents. What changed was the human side. Instead of one person trying to keep 14 agents consistent, the project is now run by 3 small human “team pods”, each owning a slice of the work.

Each pod is three people:

Role	What they do
Team Lead	Non-code work: planning, stakeholder management, agent steering, decision ownership for the pod’s slice
Dev 1	Backend developer (always present)
Dev 2	Rotates by project — frontend, QA, or BA, depending on what the engagement actually needs

The dev composition is deliberately flexible. A backend-heavy integration project might have BE + QA. A user-facing build might have BE + FE. A discovery-heavy engagement might have BE + BA. The Team Lead stays constant; the second seat reshapes around the project.
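
The seat rules above are simple enough to write down as a type. This is a minimal sketch, and the names (`Pod`, `SecondSeat`, `makePod`, the engagement labels) are hypothetical ones chosen for illustration rather than anything from our tooling:

```typescript
// Hypothetical model of a pod; all names here are illustrative only.
type SecondSeat = "FE" | "QA" | "BA";

interface Pod {
  teamLead: string; // stays constant for the engagement
  dev1: "BE";       // the backend seat is always present
  dev2: SecondSeat; // reshapes around the project
}

type Engagement = "integration" | "user-facing" | "discovery";

// Maps the three example engagements from the text to a second seat.
const secondSeatFor: Record<Engagement, SecondSeat> = {
  integration: "QA",
  "user-facing": "FE",
  discovery: "BA",
};

function makePod(teamLead: string, engagement: Engagement): Pod {
  return { teamLead, dev1: "BE", dev2: secondSeatFor[engagement] };
}

const pod = makePod("lead-a", "discovery");
console.log(`${pod.dev1} + ${pod.dev2}`); // BE + BA
```

Because `dev1` can only ever be `"BE"`, a pod without a backend developer is not representable, which is the same "make the wrong thing impossible" idea the rest of this post leans on.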

Across all three pods, that comes to 3 pods × 3 people (9 humans) plus 10 to 12 AI agents. The agent roster is tighter than the one we started with, because we consolidated overlapping roles once human supervision was distributed. The 70/30 effort split from the original post still holds; what changed is that the 30% now spans three Team Leads working in parallel instead of one supervisor trying to be everywhere at once.

The non-obvious lesson: distributing the humans was what fixed the agent consistency drift. Three Team Leads each watching three to four agents in their own scope catches divergence in hours, not days. Each pod owns its slice of the contract; pod-to-pod alignment becomes a structured human conversation instead of one supervisor reading every PR. The agents didn’t change. The supervision pattern did.

UIPKGE and the Nuxt Boilerplate — Standards That Live in Code

Splitting humans into pods fixed supervisory drift. It didn’t fix the deeper problem: AI agents will write the same feature ten different ways unless the tools refuse to let them. AGENTS.md and CLAUDE.md are necessary, but they are not sufficient. The standards have to live in the code the agents are forced to use.

So we built two pieces of infrastructure, made both of them mandatory for humans and AI alike, and put real engineering hours behind them — because this is the single biggest reason the model now produces consistent output.

UIPKGE — Our Vue/Nuxt UI Registry

UIPKGE is our open-source UI registry for Vue and Nuxt, shipped via shadcn-vue. Today it ships 89 accessible primitives + 49 composed blocks, plus templates and a fully configured Nuxt boilerplate. MIT licensed. No npm dependency, no telemetry. The pitch on the site is unusually honest: “Components are the product — own the source, no semver, no upsell.”

Components install straight into the consuming repo with a single command:

npx shadcn-vue@latest add https://uipkge.dev/r/init.json

We started UIPKGE because the AI consistency problem was loudest in the frontend. Now any human or agent looking at the code sees the same primitives, the same naming, the same composition patterns, across every project we run. There is no “another way to do a data table” — there is the UIPKGE way, and it’s already built.

The agent payoff is sharp on its own, and there’s a second multiplier: UIPKGE ships /llms.txt and /docs.md alongside the registry. The agent doesn’t need to guess what’s in the library — it reads the LLM-formatted catalogue and picks the right primitive. Five different agents on five different sprints produce code that looks like it was written by one person, because the building blocks were.

The Nuxt Boilerplate — Every Standard, Pre-Wired

The second piece is our production-grade Nuxt 4 SaaS boilerplate — open-source on GitHub, MIT licensed, with a live demo you can click into without signing up for anything. Every new engagement clones it on day one. What’s already in the box:

Full UIPKGE design system — auth UI, marketing UI, dashboard shell, kanban, data tables, charts (vue-echarts), TanStack Form, TanStack Table, Tiptap editor — all themed light/dark/system via cookie (zero FOUC)
Auth via nuxt-auth-utils with 44 OAuth providers ready to swap (GitHub wired by default; switching to Google or Microsoft is three lines + two env vars) — plus magic-link sign-in, demo mode, and an admin-bootstrap env var
Drizzle ORM + Postgres with versioned migrations — works against Neon, Supabase, Railway, RDS, or local
Polar.sh billing — checkout, customer portal, signature-verified webhooks (Polar is the only writer to the subscriptions table)
Resend email, @nuxtjs/i18n with optional CDN sync, Sentry + PostHog + Axiom observability — each one conditionally registered, never bundled when off
Typed API envelope — every server route returns { ok: true, data } or { ok: false, error: { code, message } }. Structured error codes. RBAC re-read from DB every call — session cookie role is never trusted for authorization
DX: ESLint flat config, Lefthook git hooks, knip (dead-code detection), jscpd (copy-paste detection with a 2.5% threshold), zod-validated env at boot that fails loudly on misconfiguration
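
The RBAC rule in the list above (re-read the role from the database on every call, never trust the session cookie) can be sketched roughly as follows. This is not lifted from the boilerplate: the in-memory `roleStore` stands in for a real Drizzle query, and `Session`, `requireRole`, and the role names are assumptions made for the example.

```typescript
type Role = "admin" | "member";

interface Session {
  userId: string;
  role: Role; // cached in the cookie; never used for authorization
}

// Stand-in for the users table; a real app would query Postgres here.
const roleStore = new Map<string, Role>([
  ["u1", "admin"],
  ["u2", "member"],
]);

async function loadRoleFromDb(userId: string): Promise<Role | undefined> {
  return roleStore.get(userId);
}

// Authorizes against the stored role, ignoring whatever the session claims.
async function requireRole(session: Session, needed: Role): Promise<boolean> {
  const current = await loadRoleFromDb(session.userId);
  return current === needed;
}

async function demo(): Promise<void> {
  // A stale or forged cookie claiming "admin" for u2 is still rejected.
  console.log(await requireRole({ userId: "u2", role: "admin" }, "admin")); // false
  // The stored role wins even when the cookie understates it.
  console.log(await requireRole({ userId: "u1", role: "member" }, "admin")); // true
}
void demo();
```

The point for agent consistency is that there is one authorization helper to call, so an agent has no reason to hand-roll a role check from the session.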

Every external integration is env-gated — clone the repo, set one session secret, and you have a working app in demo mode with no DB, no OAuth app, no API keys. Flip env vars when you’re ready to enable real services.
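
Env-gating of this kind can be thought of as a pure function from environment variables to the set of integrations that get registered. The sketch below is an illustration under assumptions: apart from `NUXT_SESSION_PASSWORD`, the variable names are hypothetical, the 32-character minimum is our reading of the session library's requirement, and the hand-written checks stand in for the zod schema the boilerplate actually uses.

```typescript
type Env = Record<string, string | undefined>;

interface AppConfig {
  sessionPassword: string;
  integrations: string[]; // names of the services that will be registered
}

// Hypothetical variable names, used only for this example.
const optionalIntegrations = [
  { name: "database", key: "DATABASE_URL" },
  { name: "email", key: "RESEND_API_KEY" },
  { name: "errors", key: "SENTRY_DSN" },
];

function loadConfig(env: Env): AppConfig {
  const sessionPassword = env.NUXT_SESSION_PASSWORD;
  // Fail loudly at boot instead of limping along with a bad secret.
  if (!sessionPassword || sessionPassword.length < 32) {
    throw new Error("NUXT_SESSION_PASSWORD must be set (32+ characters)");
  }
  // An integration is registered only when its variable is present.
  const integrations = optionalIntegrations
    .filter((item) => Boolean(env[item.key]))
    .map((item) => item.name);
  return { sessionPassword, integrations };
}

// Demo mode: only the session secret is set, so nothing external is enabled.
const demoMode = loadConfig({
  NUXT_SESSION_PASSWORD: "0123456789abcdef0123456789abcdef",
});
console.log(demoMode.integrations); // []
```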

git clone https://github.com/uday-a/nuxt-boilerplate my-app
cd my-app && npm install
echo "NUXT_SESSION_PASSWORD=$(openssl rand -base64 32)" > .env
npm run dev

A human cloning the repo recognizes the layout in 30 seconds. An agent has a dramatically narrower target to hit — the structure is fixed, the choices are already made, the patterns are demonstrated by working code right next to whatever the agent is about to write. Most “where does this belong?” decisions an agent would otherwise improvise are already answered by the file next door. The typed API envelope alone eliminates an entire category of agent drift — there is exactly one way to return a response.
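
To make the "exactly one way to return a response" point concrete, here is a minimal sketch of the envelope as a TypeScript discriminated union. The `ok` and `fail` helpers, the `getUser` handler, and the `USER_NOT_FOUND` code are hypothetical names for illustration; only the envelope shape itself comes from the description above.

```typescript
// Discriminated union mirroring the envelope described in the text.
type ApiError = { code: string; message: string };
type ApiResponse<T> =
  | { ok: true; data: T }
  | { ok: false; error: ApiError };

// Hypothetical helpers: the only two ways a handler builds a response.
function ok<T>(data: T): ApiResponse<T> {
  return { ok: true, data };
}

function fail(code: string, message: string): ApiResponse<never> {
  return { ok: false, error: { code, message } };
}

// Example handler: the return type forces every branch into the envelope.
function getUser(id: string): ApiResponse<{ id: string; name: string }> {
  if (id !== "u1") return fail("USER_NOT_FOUND", `No user with id ${id}`);
  return ok({ id, name: "Ada" });
}

// Callers narrow on `ok`; reading `data` on the error branch does not compile.
const res = getUser("u2");
console.log(res.ok ? res.data.name : res.error.code); // USER_NOT_FOUND
```

An agent that returns a bare object or throws a string from a handler typed this way gets a compile error rather than a review comment.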

Why the Combination Is What Changed the Output

UIPKGE + the boilerplate together cut the number of moments where an agent could miss the standards by an order of magnitude. The framework constrains what gets built; the boilerplate constrains where it goes and how it wires up. Between them, the variation space the agent has to navigate shrinks down to the actual problem — never the scaffolding around it.

This is what we mean when we say standards have to live in code. A CLAUDE.md file says “do it this way.” The boilerplate makes “this way” the only way that compiles.

The Next.js Paradox

A surprising lesson from this: the more popular the framework, the worse the consistency from agents.

We hit this hard with Next.js. Agents are deeply trained on Next.js and carry an enormous number of conflicting patterns in their training data: App Router vs Pages Router, multiple data-fetching styles, three different ways to do auth, dozens of “best practices” from different eras. Ask an agent to build the same feature five times in Next.js and you’ll get five subtly different answers. The signal-to-noise ratio in its head is bad.

shadcn primitives are the opposite — small surface, recent, more consistent training. The agents deviate far less. Wrapping our UI layer in UIPKGE on top of shadcn was as much a standardization decision as a design-system decision.

This is not an anti-Next.js post. Next.js is excellent. But if your team is leaning hard on agents, framework choice is now also an agent-consistency choice, and the trade-off is real.

One Model, One Harness — Why We Standardized on Claude + OpenCode

The same logic applies to tooling. We standardized on Claude as the primary model and OpenCode as the agent harness. Claude carries the bulk of agent work and is the baseline for evaluating anything new. OpenCode is what our team is fluent in, and the orchestration loops are specific to its harness — so when we experiment with a new model, we plug it into OpenCode rather than learn a new harness on top of a new model.

Sounds boring. That’s the point. A consistent harness + a consistent model + a consistent framework + a consistent boilerplate = a small enough variation space that humans and agents stay aligned. The novelty budget gets spent on the problem, not the toolchain.

That doesn’t mean Claude-only. The team experiments — but the split is heavily weighted toward the default:

Claude — 55% (default)
Codex — 30%
Kimi — 5%
GLM — 4%
DeepSeek — 4%
Minimax — 2%

Claude is the bulk because it’s what we trust for production work. Codex is the second-largest slice because it’s the most useful alternative when we want a different angle on a hard problem. The Kimi / GLM / DeepSeek / Minimax tail is where we experiment with new releases — never on critical-path work, always inside OpenCode so the harness stays the same. The constants are the framework, the boilerplate, and the harness. The model can vary as long as the surrounding system doesn’t.
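
The split and the critical-path rule can be captured in a small policy table. This is a sketch only: the `modelMix` shape and the `allowedModels` helper are hypothetical, the shares are the percentages listed above, and treating Codex as eligible for critical-path work is an assumption of the example.

```typescript
interface ModelShare {
  model: string;
  share: number;         // percent of agent work
  experimental: boolean; // tail models: never on critical-path work
}

const modelMix: ModelShare[] = [
  { model: "Claude", share: 55, experimental: false },
  { model: "Codex", share: 30, experimental: false }, // assumption: not experimental
  { model: "Kimi", share: 5, experimental: true },
  { model: "GLM", share: 4, experimental: true },
  { model: "DeepSeek", share: 4, experimental: true },
  { model: "Minimax", share: 2, experimental: true },
];

// Sanity check: the shares should account for all agent work.
const total = modelMix.reduce((sum, m) => sum + m.share, 0);
console.log(total); // 100

// Models a task may use; critical-path work excludes the experimental tail.
function allowedModels(criticalPath: boolean): string[] {
  return modelMix
    .filter((m) => !(criticalPath && m.experimental))
    .map((m) => m.model);
}

console.log(allowedModels(true)); // [ 'Claude', 'Codex' ]
```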

What the Outputs Look Like Now

The point of all of this — 3 human pods, UIPKGE, the boilerplate, a standardized harness, a known model mix — is the output. Here is where the pivot has actually paid off:

Code reaches QA in a verifiable, tested state. What lands at the QA gate is genuinely ready for it. Pull requests come in with their tests, their migrations, their docs, and they pass the suite. We are no longer triaging “looks done, isn’t done” PRs.
All QA tests are passing on the work we ship. The combination of agents writing the feature, agents writing the regression tests, and humans verifying the contract means the green checkmark actually means green.
Quality is up. Edge cases — auth boundaries, empty states, race conditions, error paths — get handled the first time. We’re not catching them three sprints later.
Quantity is up too. This is the part that was harder to predict. The pod model didn’t just stabilize the agent output; it produced more of it, because each pod operates in parallel without colliding with the others.
No more “the same thing, five different ways.” The agents stopped re-inventing patterns that already exist. UIPKGE plus the boilerplate plus the per-pod Team Lead means the obvious solution is the one that ships, every time.

Compared with the one-human-plus-many-agents setup, the deltas are real on both axes: more code, better code, fewer surprises. None of this is novel as an engineering goal; what’s new is that we got there with three small human pods rather than a larger traditional team, and the agents did the bulk of the work.

What We’re Still Working On

The model is good. It is not finished. The next things we are actively figuring out:

Pod-to-pod alignment at scale. Three pods sharing a single product still need discipline at the seams. Today that’s a daily sync and shared API contracts; the next step is more structured cross-pod review.
Detecting context poisoning earlier. We catch it now because the Team Lead is in the loop. We’d rather catch it inside the agent’s own session: a heuristic, a checkpoint, or an “are you sure?” gate before a bad decision compounds.
Honest, comparable numbers. Velocity, rework rate, and human hours per story are real and tracked. We’ll publish the comparison in the next post once we have a full multi-pod sprint to put alongside the original setup.
What changes at four pods, or five. The model is stable at three. We don’t yet know where the next structural break is.

The Takeaway

A team of AI agents can ship production software. A team of AI agents with one human supervisor can ship it for a while, until consistency drift and context poisoning catch up. A team of AI agents with three human pods, a shared UI framework, a shared boilerplate, and a deliberately boring toolchain can ship it sustainably — and produce more code, better code, and fewer surprises in the process.

That last sentence is the entire post. Everything else is the receipts.

At OLabs, we deliver production software faster with agent-augmented engineering teams — without trading away compliance, quality, or the audit trail. Indian-market expertise, healthcare-grade rigor, three human pods, a stack of standards that travel with us. Talk to us to see what this looks like for your product.