Pipeline: Scout → Spec → Plan → Workers → Review → Lessons

A structured workflow outperforms adding more information, and by a measurable margin. Six stages, clear role separation, and a deliberate cache strategy are what make multi-agent pipelines work in production.

Why process beats context

Giving an agent more context when it gets something wrong is the natural instinct. It is also wrong.

We ran five variants on the same production EventBus refactor (509 TypeScript files, DDD architecture, Cloudflare Workers) to measure what actually drives correctness.

Run	Setup	Cost	Architecture	Review
#1	Raw prompt + code map	$2.84	❌ Wrong	—
#2	Raw prompt, no map	$8.45	✅ Correct	—
#3	Detailed spec + code map	$9.17	✅ Correct	—
#4	Pipeline, no map	$6.63	✅ Correct	Pass, first try
#5	Pipeline + code map	$9.99	✅ Correct	Fail → fix

The pipeline run (#4) was the cheapest correct solution and the only one that passed review on the first attempt. Adding a code map to that same pipeline (#5) cost 51% more and introduced a review failure. The map gave the agent a fast path to the highest-ranked files, which meant it stopped exploring and missed scope. Without the map, it grepped broadly, found the full picture, and built accordingly. See Experiments.

Auto-generated context produces no significant change in success while increasing cost by about 20%, across 138 real-world tasks and three models; human-written boundaries showed only a small, non-significant gain. Same mechanism at a different scale: pre-loaded navigation shortcuts substitute for actual exploration.

The pipeline adds roughly 10× the latency of a raw prompt for simple changes. But trace a full week of work: two raw failures per feature, one pipeline success. The pipeline reaches working code faster. The overhead isn’t in the pipeline. It’s in the failures it prevents.

The pipeline: what each stage does

The six stages map onto a research-to-delivery arc. Not every task needs all of them.

Task	Minimum stages
Bug fix, known cause	Worker → Pre-flight
Single-file change	Spec → Worker → Review
Multi-file feature	Spec → Plan → Workers → Review
Cross-domain refactor	Scout → Spec → Plan → Workers → Review → Lessons

Scout. A lightweight agent (Haiku-class) reads the codebase and writes a structured briefing. It never touches anything. Its job is to compress signal so downstream workers don’t pay full-context prices for orientation. In our architecture, scout output is cached by git commit hash, so if nothing changed, the briefing is free.
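The commit-keyed caching described for the scout can be sketched as below. This is a minimal illustration under stated assumptions, not our actual implementation: ScoutCache, ScoutBriefing, and the runScout callback are hypothetical names, and the commit hash is passed in as a plain string (in practice it could come from something like git rev-parse HEAD).

```typescript
// Hypothetical sketch: cache scout briefings by git commit hash.
// All names are illustrative assumptions, not the pipeline's real API.

interface ScoutBriefing {
  commit: string;
  summary: string;
}

class ScoutCache {
  private readonly briefings = new Map<string, ScoutBriefing>();

  // Returns the cached briefing for this commit, or runs the scout once
  // and stores the result. `runScout` stands in for the real scout agent.
  getOrCreate(commit: string, runScout: (commit: string) => string): ScoutBriefing {
    const cached = this.briefings.get(commit);
    if (cached !== undefined) {
      return cached; // nothing changed since this commit: no scout call
    }
    const briefing: ScoutBriefing = { commit, summary: runScout(commit) };
    this.briefings.set(commit, briefing);
    return briefing;
  }
}
```

Keying on the commit hash means any change to the repository produces a new key, so a stale briefing is never served for new code.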

Spec. A structured specification is written before any implementation begins. In one tech-lead run, a clarifying question during spec review revealed the feature being designed was already configured in the system. What looked like a multi-day implementation reduced to a one-line config change. Without spec review, a raw agent on the same task implemented the entire feature from scratch, duplicating existing work. See Specification for structure.

Plan. The spec is decomposed into atomic, sequenced tasks with explicit acceptance criteria. Ordering matters: dependencies become explicit, so workers don’t block each other and parallelism opportunities are visible upfront.
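One way to make the ordering and the parallelism opportunities visible is to group tasks into dependency-ordered waves. The sketch below assumes a hypothetical PlannedTask shape (id, dependsOn, acceptance); it is an illustration of the idea, not the format our planner emits.

```typescript
// Hypothetical sketch: group planned tasks into dependency-ordered waves.
// Tasks in the same wave have no unmet dependencies and could run in parallel.

interface PlannedTask {
  id: string;
  dependsOn: string[];
  acceptance: string; // explicit acceptance criterion handed to the worker
}

function planWaves(tasks: PlannedTask[]): string[][] {
  const known = new Set(tasks.map((t) => t.id));
  for (const t of tasks) {
    for (const dep of t.dependsOn) {
      if (!known.has(dep)) {
        throw new Error(`Task ${t.id} depends on unknown task ${dep}`);
      }
    }
  }

  const done = new Set<string>();
  const waves: string[][] = [];
  let remaining = [...tasks];

  while (remaining.length > 0) {
    // A task is ready once every dependency has completed in an earlier wave.
    const ready = remaining.filter((t) => t.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("Dependency cycle detected in plan");
    }
    waves.push(ready.map((t) => t.id));
    for (const t of ready) done.add(t.id);
    remaining = remaining.filter((t) => !done.has(t.id));
  }
  return waves;
}

const exampleWaves = planWaves([
  { id: "a", dependsOn: [], acceptance: "types compile" },
  { id: "b", dependsOn: ["a"], acceptance: "publisher tests pass" },
  { id: "c", dependsOn: ["a"], acceptance: "subscriber tests pass" },
  { id: "d", dependsOn: ["b", "c"], acceptance: "integration test passes" },
]);
// exampleWaves → [["a"], ["b", "c"], ["d"]]
```

A cycle or a dangling dependency fails at plan time, before any worker has spent tokens on it.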

Workers. Individual subagents execute tasks in isolation. Context does not bleed between workers: a Task 6 that ran 106 turns did not degrade Task 7 in one pipeline run, because Task 7 started clean. This isolation is the mechanism that makes sequential multi-task work reliable at scale.

Review. A separate agent with no implementation history reads the output and checks it against the spec. Review and implementation sharing context is the most common pipeline mistake. Self-evaluation bias is real. Agents rate their own work highly even when it is broken (Anthropic, 2026). A clean reviewer is an honest reviewer. See Verification for layered pre-flight requirements.

Lessons. After merge, a final agent extracts what changed, what failed, and what patterns should persist. Per-session extraction is cheap enough for a Haiku-class model. Cross-session promotion (deciding which patterns generalize) needs a stronger model; cheap promotion produces rule pollution. This stage closes the loop: execution experience becomes structured knowledge that changes how future workers reason. See Self-Improvement.

This parallels Claude Code’s Coordinator Mode (Research → Synthesis → Implementation → Verification). The difference: CC runs these as in-process phase shifts within a long session; our pipeline runs each as a separate process with a context reset. Both work. Clean processes are simpler to debug.

Orchestrator never reads files

The orchestrator delegates. It does not explore.

If the orchestrator’s messages contain file contents, something is wrong. It should route work: task descriptions, dependencies, acceptance criteria. Nothing else.

In our architecture the orchestrator is a skill running in the parent process; workers are separate agent processes. Pi’s subagent tool is unavailable inside another subagent, so orchestration must happen from the top level. We learned this the hard way: tech-lead couldn’t delegate to scout until we restructured it as a skill.

An orchestrator that reads files defeats itself. It fills context with details that belong to workers. It becomes a sequential bottleneck. Every read is a turn spent on work that should be parallelized.

This rule degrades under pressure. One debugging session: the orchestrator called scout five times (the skill says “NEVER call scout twice”), then started reading files directly. Caught mid-session. The pattern is consistent: when delegation can’t solve the problem, the orchestrator reverts to direct exploration. Sometimes that’s correct. Novel debugging may need human collaboration, not longer chains.

Coordinator patterns: continue vs spawn fresh

The continue-vs-spawn decision is the most consequential call the coordinator makes.

Continue (fork the parent) works when the next task is a continuation of the same artifact: reviewing what the parent just wrote, extracting memory from a completed session, running a second pass on the same output. Fork shares context cheaply and maintains narrative continuity.

Spawn fresh (clean agent) works when the next task is independent: a worker in a parallel batch, a verification pass that needs fresh eyes, any task that should not be influenced by prior context. Clean context is a feature, not a cost.

The failure modes are symmetric. A fork that should have been clean carries invisible priors. The worker “knows” things from the parent’s exploration that bias its approach. A clean agent doing continuation work has to rediscover everything the parent already built.

Model selection per stage is part of the coordinator decision, not an afterthought. A pipeline run with Haiku scouts and Sonnet workers cost $4.77 total and found the root cause. A solo Sonnet session on the same task (no pipeline structure) burned $6.66+ and failed, looping on wrong hypotheses for ~86K tokens before the user intervened. The savings came from the structure, not just the cheaper scouts.

A fork is not a copy. It’s a continuation from a shared prefix. When you fork an agent, the child receives a byte-identical system prompt, tool list, and model assignment. Specialization happens through the user-message directive only. Everything before the fork point is shared state.

A clean agent starts cold. No inherited system prompt, no tool list, no prior turns. It pays more to orient, but it cannot be contaminated.

Task type	Agent type
Review parent’s work	Fork
Extract memory from session	Fork
Independent worker, parallel batch	Clean
Open-ended exploration	Clean
Verification pass	Clean

Contamination is the more dangerous direction. A fork carrying stale priors solves the wrong problem confidently. A clean agent doing continuation work signals its ignorance through questions. Visible. Correctable. When uncertain, default to clean.
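The table and the default-to-clean rule above can be written as a small decision function. This is a sketch under assumptions: the NextTask fields are hypothetical, and undefined is used to model a coordinator that is unsure whether the task continues the parent’s artifact.

```typescript
// Hypothetical sketch of the continue-vs-spawn decision. Names are illustrative.

type AgentKind = "fork" | "clean";

interface NextTask {
  // true: continues the parent's artifact; false: independent;
  // undefined: the coordinator cannot tell.
  continuesParentArtifact: boolean | undefined;
  // true for verification passes and anything that must not inherit priors.
  needsFreshEyes: boolean;
}

function chooseAgentKind(task: NextTask): AgentKind {
  if (task.needsFreshEyes) {
    return "clean"; // contamination is the more dangerous failure mode
  }
  if (task.continuesParentArtifact === true) {
    return "fork"; // shared prefix is cheap and keeps narrative continuity
  }
  return "clean"; // independent or uncertain: default to clean
}
```

For example, memory extraction from a finished session maps to a fork, while a verification pass maps to a clean agent even though it reads the same output.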

Fork cache mechanics

Forking runs on Anthropic’s prompt cache. The child’s first request shares a byte-identical prefix with the parent. Same system prompt, same prior turns. The API matches, returns cache_read, skips reprocessing. On 100K+ tokens of context, the fork’s first call costs ~10% of cold start.

Claude Code’s fork implementation preserves this by design: all children receive identical placeholder tool results across every fork, and only the final user-message directive differs per worker. This is what makes multi-worker fan-outs economically viable. The expensive prefix is paid once by the parent and shared by all children.

The condition is strict. Byte-identical. System prompt, tool definitions, model. All must match exactly. One character of whitespace? Full reprocessing. Different model on a fork? Cache gone. This is why standardizing prompts across pipeline stages isn’t style. It’s economics.
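Because the match is all-or-nothing, it can help to lint a fork’s request against the parent’s before dispatch. The sketch below is an assumption-laden illustration: RequestPrefix is a hypothetical shape, tool definitions are assumed to be already serialized to a string, and exact string equality stands in for the byte-level comparison the cache performs.

```typescript
// Hypothetical pre-dispatch check: does the child's request still share
// the parent's prefix exactly? Names and shapes are illustrative.

interface RequestPrefix {
  model: string;
  systemPrompt: string;
  toolDefinitions: string; // serialized tool list, compared character for character
  priorTurns: string[];    // the child may append turns, but must not alter these
}

// Returns the first field that breaks the shared prefix, or null if the
// child's request begins with everything the parent already paid for.
function firstPrefixMismatch(parent: RequestPrefix, child: RequestPrefix): string | null {
  if (parent.model !== child.model) return "model";
  if (parent.systemPrompt !== child.systemPrompt) return "systemPrompt";
  if (parent.toolDefinitions !== child.toolDefinitions) return "toolDefinitions";
  for (let i = 0; i < parent.priorTurns.length; i++) {
    if (parent.priorTurns[i] !== child.priorTurns[i]) return `priorTurns[${i}]`;
  }
  return null;
}
```

A single trailing space in the child’s system prompt is enough for this check to report "systemPrompt", which is the situation the paragraph above warns about.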

A long-running orchestrator keeps the cache warm. Idle time between dispatches isn’t waste. It’s the shared prefix that makes the next fork cheap.

Subagent economics

A skill runs in the orchestrator’s process, shares its cache, and pays zero spawn overhead. An agent is a separate process: it starts cold, reloads tools, and rebuilds context from scratch.

The cache math is concrete. Fourteen subagents processing 50K tokens each generate 700K cache_create tokens and zero cache_read. The same work done in one long session with context sharing produces an 11:1 read-to-create ratio. The parallel workload costs an order of magnitude more in cache terms than the sequential session, before accounting for spawn overhead.
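The arithmetic behind those figures is short enough to write out. The sketch below only restates the numbers quoted above; cacheReadShare is a hypothetical helper, and it compares token counts rather than prices, since no price table is given here.

```typescript
// Worked arithmetic for the figures quoted above. Helper names are illustrative.

const subagentCount = 14;
const tokensPerSubagent = 50_000;

// Each cold-started subagent writes its own cache entry and reads nothing back.
const parallelCacheCreate = subagentCount * tokensPerSubagent; // 700_000
const parallelCacheRead = 0;

// Share of cache-touched tokens that were served as reads rather than writes.
function cacheReadShare(readTokens: number, createTokens: number): number {
  const total = readTokens + createTokens;
  return total === 0 ? 0 : readTokens / total;
}

const parallelShare = cacheReadShare(parallelCacheRead, parallelCacheCreate); // 0
const sequentialShare = cacheReadShare(11, 1); // 11:1 ratio, roughly 0.917
```

Put differently, the long session serves about eleven of every twelve cached tokens as reads, while the fan-out serves none.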

This is not an argument against subagents. Isolation and parallelism justify the cost for complex tasks. But for lightweight, repeatable operations (search, format, summarize), a skill is almost always the right choice.

The decision:

Task needs isolation, is long-running, or risks runaway token consumption → Agent
Task is lightweight and repeatable → Skill
Task is a natural continuation of the parent → Fork
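The three rules above can be sketched as a single function. The TaskProfile fields are hypothetical, and the precedence (agent first, then fork, with skill as the fallback for lightweight work) is an assumption of this sketch; the rules themselves do not say which wins when several apply.

```typescript
// Hypothetical sketch of the agent / skill / fork decision. Names are illustrative.

type ExecutionMode = "agent" | "skill" | "fork";

interface TaskProfile {
  needsIsolation: boolean;
  longRunning: boolean;
  runawayTokenRisk: boolean;
  continuesParent: boolean;
}

function chooseExecutionMode(task: TaskProfile): ExecutionMode {
  // Assumed precedence: isolation and safety concerns override everything else.
  if (task.needsIsolation || task.longRunning || task.runawayTokenRisk) {
    return "agent";
  }
  if (task.continuesParent) {
    return "fork";
  }
  return "skill"; // lightweight, repeatable work stays in the orchestrator's process
}
```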

Verification is often where the model breaks down, not execution. Augment Code’s single-writer rule for hotspot files and sequential merge strategy reflects this: when verification is the bottleneck, adding more parallel execution makes things worse, not better. Azure’s multi-agent taxonomy names five patterns (Sequential, Concurrent, Group Chat, Handoff, Magentic) and applies the same rule across all of them: use the minimum complexity that solves the problem reliably.

Parallel decomposition has a measured tax

The case for sequential workers sharing a spec, rather than parallel workers coordinating through messages, now has a number on it.

A controlled 2026 experiment on 51 class-generation tasks held the spec constant and varied the execution shape. Single-agent passed at 89%. Two-agent parallel decomposition of the same class (one worker building the list-based half, another the dict-based half) dropped to 58%. That’s a 31-point gap attributable entirely to running the work in parallel on shared internal state. The decomposition is additive: 16 points from coordination overhead (two workers making independent decisions that had to be reconciled) and 11 points from information asymmetry (each worker seeing a subset of the context), which together account for 27 of the 31 points. Running an AST-level conflict detector at 97% precision between the workers moved the score by zero. What recovered the ceiling was restoring the full spec to the merging agent at the end.

The scope of this finding matters. It’s about parallel decomposition of one tightly-coupled artefact with shared internal state. It doesn’t apply to independent subsystems, sequential pipelines, or role-specialised agents (scout, worker, reviewer) that don’t compete for the same state. Our pipeline is structurally safe from this failure mode — workers hand off artefacts instead of coordinating on shared memory. Teams considering parallel workers on the same module should price in the 16-point floor. “Agent teams” as currently marketed by several tools are parallel decomposition with a better name, and the measured cost lands in the same place.

The community has noticed. The common observation on the Claude Code subreddit: agent teams are “expensive subagents with better marketing.” Communication overhead overwhelms the team leader’s context window, idle notifications accumulate, nothing beats simple subagent spawning for current implementations. That matches our architecture: sequential subagents plus a shared spec artefact beats a parallel team coordinating through messages.

The pipeline’s advantage shrinks as models improve

A large trajectory study (North Carolina State, 9,374 agent trajectories, 19 agents) split “what makes an agent succeed” into task factors and agent factors. The successful behavioural pattern — gather context before editing, invest in validation — turned out to be agent-determined, not task-adaptive. Good agents do this regardless of the task. Framework prompts can still influence tactics, but the effect narrows with each generation of base model.

The practical consequence: pipeline structure pays back less as the underlying model gets stronger. Our tech-lead skill’s alpha over a raw Opus 4.7 prompt will be smaller than its alpha over a raw Sonnet 4.6 prompt, and smaller still against whatever ships in six months. That doesn’t invalidate the pipeline — it shifts what to measure. Track the gap between pipeline and raw-prompt outcomes over time. When the delta narrows from +30 points to +10, the pipeline hasn’t broken; the model has internalised the behaviour. What’s left is where structure still buys something: boundaries the model won’t invent for you, reviewers with fresh context, artefacts that persist across sessions.

Recovery layer: Plan → Implement → Test → Score → Fix → Retry

Our pipeline ships an extraction stage at the end of a task and no recovery stage inside one. That gap has a named pattern now, with four independent implementations behind it.

The pattern is the Ralph-loop — fresh context per iteration so failures don’t bleed across attempts, external memory in files plus Git so state survives the reset, one item per loop so each iteration fits the context window, and a machine-verifiable completion criterion so success isn’t a self-report. Four shipped reproductions of the same four constraints landed by May 2026: Huntley’s original blog and ai-assisted-software-development.com, LoopTroop (“councils plan, Ralph loops recover, OpenCode worktrees ship”), ralph-claude-code (recovery layer alone with cost caps and an MCP audit mode), and lopi (Rust plus tokio, with the most architectural detail visible in source). Four sightings is enough that the pattern stops being a candidate extension and starts being a structural category. 🟡

The loop is concrete enough to read in code. One implementation (lopi) runs Plan → Implement → Test → Score → Fix-in-place → Retry, with git reset --hard on every failed attempt so the next iteration starts from a known commit rather than a half-broken working tree. Hard turn limits prevent runaway, a diff-scope check rejects edits outside allowed directories before they hit the working tree, the last error message is injected into the next attempt’s plan as adaptive context, and model routing escalates from a cheap model to a stronger one only after a failure rather than by default. On terminal failure — loop budget exhausted with no passing run — a post-mortem stage distills the failure into a single imperative constraint stored alongside the task for retrieval on the next similar run. 🟡
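The shape of that loop can be sketched in a few dozen lines. This is not lopi’s code: every name below is hypothetical, the plan, implement, and test steps are collapsed into one injected attempt function, resetToBaseline stands in for git reset --hard, and the diff-scope check here inspects the paths an attempt reports rather than intercepting edits before they land.

```typescript
// Hypothetical sketch of an in-task recovery loop. Every name is illustrative.

interface AttemptResult {
  passed: boolean;        // did the test/score step accept this attempt?
  error?: string;         // last error message, if any
  touchedPaths: string[]; // files the attempt wants to edit
}

interface RecoveryDeps {
  attempt(model: string, lastError: string | null): AttemptResult;
  resetToBaseline(): void; // stands in for `git reset --hard`
}

interface RecoveryConfig {
  maxAttempts: number;   // hard limit to prevent runaway loops
  allowedDirs: string[]; // edits must stay under these path prefixes
  cheapModel: string;
  strongModel: string;
}

type RecoveryOutcome =
  | { status: "passed"; attempts: number }
  | { status: "failed"; attempts: number; lastError: string };

function runRecoveryLoop(deps: RecoveryDeps, cfg: RecoveryConfig): RecoveryOutcome {
  let lastError: string | null = null;
  let model = cfg.cheapModel;
  let attempts = 0;

  while (attempts < cfg.maxAttempts) {
    attempts++;
    // The previous error is passed in as adaptive context for the next plan.
    const result = deps.attempt(model, lastError);
    const outOfScope = result.touchedPaths.find(
      (path) => !cfg.allowedDirs.some((dir) => path.startsWith(dir)),
    );
    if (outOfScope === undefined && result.passed) {
      return { status: "passed", attempts };
    }
    lastError = outOfScope !== undefined
      ? `edit outside allowed directories: ${outOfScope}`
      : (result.error ?? "tests failed");
    deps.resetToBaseline(); // next iteration starts from a known commit
    model = cfg.strongModel; // escalate only after a failure
  }
  return { status: "failed", attempts, lastError: lastError ?? "no attempts were allowed" };
}
```

On a "failed" outcome, the lastError field is what a post-mortem stage would distill into a stored constraint; that step is left out of the sketch.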

Recovery sits in a different category from our Lessons stage even though both consume failures. Lessons runs post-merge across sessions and promotes slowly through a stronger model; recovery runs in-task within one session and iterates fast on a still-broken artefact. The two are complementary rather than redundant, which is why one orchestrator can ship both layers plus a quality gate (LESSON_QUALITY_GATE = 0.6) between them so a sub-threshold run never writes a lesson. Whether our pipeline should adopt the recovery layer is an open question, because P#5 (atomic tasks, fail-then-respec) is structurally adjacent and possibly sufficient at our task shape — the pattern works at n=4, but our task profile may already cover the gap by resetting context between attempts rather than retrying inside one. ⚪
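The quality gate between the two layers amounts to a threshold check. In the sketch below only the constant and its value come from the text above; the function name, the 0-to-1 score scale, and treating a score exactly at the threshold as passing are assumptions.

```typescript
// Hypothetical sketch of the gate between recovery and lesson extraction.
// Only the constant's name and value come from the text; the rest is assumed.

const LESSON_QUALITY_GATE = 0.6;

// A run's score is assumed to lie in [0, 1]. Sub-threshold runs never
// write a lesson; a score exactly at the gate is treated as passing.
function shouldWriteLesson(runScore: number): boolean {
  return runScore >= LESSON_QUALITY_GATE;
}
```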

Settled questions

Pre-flight is a permanent fixture. In Experiment 3, review agents failed on first attempt every time. Zero first-attempt pass rate. Both failures were deterministically checkable: missing imports, type errors. Adding tsc --noEmit before the review agent ran eliminated both. Pre-flight is now part of every pipeline run. The remaining scope question: in Experiment 5, tsc passed on code with a runtime logic error. A || operator split across lines by the edit tool, invisible to the compiler. Deterministic pattern checks for common edit-tool artifacts are candidates for additional pre-flight steps; false-positive rate is unmeasured.
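The ordering matters more than the individual checks: deterministic checks run first, and the review agent is only invoked if all of them pass. A minimal sketch, assuming hypothetical PreflightCheck and reviewWithPreflight names; in practice the first check would shell out to tsc --noEmit, which is injected here as a plain function so the sketch stays self-contained.

```typescript
// Hypothetical sketch: run deterministic pre-flight checks before review.
// Names are illustrative; real checks would invoke tools such as `tsc --noEmit`.

interface PreflightCheck {
  name: string;
  run: () => { ok: boolean; output: string };
}

interface GateResult {
  stage: "preflight" | "review"; // which stage produced the verdict
  detail: string;
}

function reviewWithPreflight(checks: PreflightCheck[], review: () => string): GateResult {
  for (const check of checks) {
    const result = check.run();
    if (!result.ok) {
      // Fail fast: the review agent is never called on code that a
      // deterministic check already rejects.
      return { stage: "preflight", detail: `${check.name}: ${result.output}` };
    }
  }
  return { stage: "review", detail: review() };
}
```

Additional pattern checks for edit-tool artifacts would slot in as further PreflightCheck entries once their false-positive rate is known.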

Parallel decomposition of shared-state code is not worth it. Quantified above: 16-point coordination floor, additive with information asymmetry. AST conflict detection doesn’t move it. Spec completeness does. Parallel workers on independent modules remain viable; parallel workers on the same class or shared state are not.

Open questions

Human-in-loop placement. One debugging session: the breakthrough came from the user asking a direct question. Not from delegation. Three wrong hypotheses. User intervened. Fixed. We have no pattern for “agent requests human input at step N” that doesn’t break autonomous flow. Where should the intervention threshold sit? No data.
Pipeline crash recovery. If a worker dies mid-task, its progress is lost. Context reset between workers is a feature, but it means there is no checkpoint within a task. CC’s PreCompact/PostCompact hooks address orchestrator memory loss, not worker crashes. Checkpointing is undesigned. The experiment: write worker state to .pi/pipeline-state/ after each phase, then test recovery from simulated kills. Revised after the Ralph-loop convergence (May 2026): the pattern question is settled, since four independent implementations now agree on fresh-context-per-iteration plus external-memory-in-files plus git-rollback-per-attempt. What’s open for our pipeline is implementation choice, not pattern discovery. lopi is a usable reference architecture; the remaining decision is whether in-task retry buys us more than respec-from-clean-slate at our task shape.

Dotex