# References
This page collects sources, evidence levels, and experiment details for the Meta-Programming documentation.
## Evidence Levels

| Marker | Level | Meaning |
|---|---|---|
| 🟢 | Proven | Our experiment, our data, measured result |
| 🟡 | Trusted source | Anthropic, Microsoft Research, peer-reviewed; we read the primary source |
| 🟠 | Community reports | Widely observed, not independently verified by us |
| 🔴 | Unverified | Heard, not checked |
| ⚪ | Opinion | Our synthesis, reasoned but not proven |
## Our Experiments

| # | Description | Key Finding | Pages |
|---|---|---|---|
| 1 | EventBus refactor A/B (5 variants, 509 TS files, DDD, Cloudflare Workers) | Process beats information: Scout→Spec→Worker→Review pipeline ($8.45) outperformed raw context injection ($2.84 fail, $14.30 pass). Code maps hurt: $9.99 fail vs $6.63 pass without map. | pipeline, principles |
| 2 | Telegram bot without spec (support reply feature) | Two consecutive deterministic failures without a spec. The first corrupted an existing handler; the second missed the intent entirely. | specification, principles |
| 3 | 24-file type refactor | 253 tests, zero regressions, $5.50 API cost. Review required three iterations (both failure types were missing pre-flight checks). The 106-turn task didn't contaminate subsequent tasks due to context reset. | index, verification, pipeline |
| 4 | Spec review with barrel re-export | Agent proposed a barrel re-export during spec review, then recognized it was unnecessary when asked to explain. Spec review is cheaper than code review. | specification, principles |
| 5 | Pipeline variant comparison | Structured pipeline: correct on first attempt, $8.45. Raw prompt: wrong, $2.84. Sequential with context: correct, $14.30. With code map: wrong, $9.99. | pipeline |
| 6 | KB A/B test (generic vs structured) | Generic Sonnet said "give it a code map." The KB-loaded agent flagged the exploration-vs-exploitation paradox, with prior session evidence, before writing code. | index, principles |
| 7 | Edit tool investigation | Persistent error pattern traced to our own extension, not the platform. Post-fix benchmark: 7.1% errors, 1.1% data loss. | index |
| 8 | Model evaluation (thinking levels) | Thinking level acts as a compliance-to-conviction dial. Soft sycophancy identified: the agent says no while providing the implementation anyway. | index |
| 9 | Opus degradation incident (April 2026) | Read:Edit ratio dropped from 6.6 to 2.0, thinking depth fell 67%, costs spiked 80×. Three-day silent degradation with no API-side signal. | verification, principles |
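The signal from experiment 9 can be watched for client-side. A minimal sketch, assuming tool calls are logged locally as a flat list of tool names (the incident showed no API-side warning, so the signal has to come from your own telemetry); the 6.6 baseline is the pre-incident value above, and the 0.5 alert floor is an illustrative choice:

```python
def read_edit_ratio(tool_calls):
    """Compute the Read:Edit ratio from a local tool-call log.

    `tool_calls` is an assumed log shape: a flat list of tool names.
    """
    reads = sum(1 for t in tool_calls if t == "Read")
    edits = sum(1 for t in tool_calls if t == "Edit")
    return reads / edits if edits else float("inf")

def looks_degraded(ratio, healthy_baseline=6.6, floor_fraction=0.5):
    # Flag when the ratio falls below half the healthy baseline.
    return ratio < healthy_baseline * floor_fraction

log = ["Read", "Read", "Edit", "Read", "Edit"]  # 3 reads, 2 edits
ratio = read_edit_ratio(log)                     # 1.5
```

During the incident, the ratio of 2.0 would have tripped this check days before costs made the problem visible.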
## External Sources

### Large-Scale Studies

- LinearB (2026). "The Real Impact of AI on Developer Productivity." 8.1M pull requests, 4,800 teams, 42 countries. AI-generated code: 1.7× more review revisions, 4.6× longer review wait, 32.7% acceptance rate (vs 84.4% human). Developers feel 20% faster; tasks take 19% longer end-to-end. 🟡
  - Referenced in: index, verification, principles
- Stanford Meta-Harness Study (2026). Changing the harness around a fixed LLM produces a 6× performance gap on the same benchmark. Automated harness optimization searched configurations the way gradient descent searches weight space. 🟡
  - Referenced in: index, principles
- ETH Zurich (Feb 2026). 138 real-world tasks across 3 models. Auto-generated context files reduced task success by 3%. Human-written boundaries improved success by 4%. 🟡
  - Referenced in: specification, pipeline, principles
- Sonar (2026). Survey of 1,000+ developers. Only 48% verify AI output before shipping. 🟡
  - Referenced in: index
- Microsoft Copilot Study (2026). 10-month study, 878 pull requests. "The bottleneck moved from typing speed to knowledge, judgment, and ability to articulate tasks." 🟡
  - Referenced in: index
- BSWEN (2026). 133 cycles, 42 development phases, four models in strict isolation. GPT caught Python security issues Claude missed. Claude caught architectural violations GPT normalized. Each model had different blind spots. 🟡
  - Referenced in: verification, principles
- Bamberg/Heidelberg (2026). Systematic analysis of 2,926 repositories across Claude Code, GitHub Copilot, Cursor, Gemini CLI, and Pydantic AI. Tools are converging on identical patterns independently. 🟡
  - Referenced in: index
### Research Papers

- Tsinghua NLAH (March 2026). "Natural-Language Agent Harnesses." Harness behavior externalized as "a portable executable artifact in editable natural language." 🟡
  - Referenced in: index
- Microsoft Research RiSE (March 2026). Lahiri et al. "Intent Formalization" named as a grand challenge for 2026. Intent gap: the semantic distance between what a developer means and what the system does. 🟡
  - Referenced in: index, specification
- Allard et al. ERL (March 2026). Agents with heuristics extracted from prior trajectories outperformed ReAct baselines by +7.8%. "Heuristics provide more transferable abstractions than few-shot prompting." 🟡
  - Referenced in: index
- ExpeL (Andrew Zhao et al., 2023). Experience Learning: three-stage self-improvement (act → reflect → extract). On HotpotQA and ALFWorld, ExpeL agents improve with each batch of trajectories. 🟡
  - Referenced in: self-improvement, landscape
- Chroma MECW (2025). Maximum Effective Context Window. 18 frontier models tested. Universal degradation; effective window as low as 1-2% of advertised maximum in some conditions. 🟡🔴 (specific numbers unverified by us)
  - Referenced in: context-engineering, principles
- Reflexion (NeurIPS 2023). Verbal reflection with episodic memory raised HumanEval from 80% to 91%. 🟡
  - Referenced in: landscape
- ADAS (ICLR 2025). Automated Design of Agentic Systems. The agent designs its own pipeline structure. 🟡
  - Referenced in: landscape
- Gödel Agent (ICLR 2025). Recursive self-modification via confidence-based logic. 🟡
  - Referenced in: landscape
- DKB (January 2026). Deterministic Knowledge Bases. AST graphs beat vector RAG and LLM-generated knowledge graphs for code navigation. 🟡
  - Referenced in: landscape
- AMBIG-SWE (ICLR 2026). Benchmark for ambiguity detection in software engineering tasks. 🟡
  - Referenced in: specification
- ACE Framework. Agentic Context Engineering. Memory scoring: each unit carries a score that updates on use. Quality saturates at ~7 governed memories per entity across 500 adversarial queries. 🟡
  - Referenced in: self-improvement
- Pavlyshyn (Jan 2026). History of constrained natural language in programming: COBOL, SQL, Simula. A 60-year progression. 🟡
  - Referenced in: specification
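The ACE memory-scoring mechanism described above fits in a few lines. This is a hypothetical re-implementation of the reported behavior (scores update on use; only a small top-K governs), not the framework's actual API; the `cap=7` default mirrors the reported saturation point:

```python
class GovernedMemories:
    """Score-carrying memory units for one entity (illustrative sketch)."""

    def __init__(self, cap=7):
        self.cap = cap      # ~7 governed memories per entity (saturation point)
        self.scores = {}    # memory text -> usage score

    def add(self, memory, initial_score=1.0):
        self.scores.setdefault(memory, initial_score)

    def use(self, memory, reward=1.0):
        # "each unit carries a score that updates on use"
        self.scores[memory] = self.scores.get(memory, 0.0) + reward

    def governed(self):
        # Only the highest-scoring `cap` units govern behavior.
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[: self.cap]
```

Past the cap, adding more memories changes nothing the agent sees, which is one way to read the reported quality saturation.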
### Vendor Documentation & Posts

- Anthropic. "Building Agents with Skills." Skills as zero-cost-until-invoked context units. Self-evaluation bias documented explicitly. Claude Code architecture: tiered context loading, compaction, coordinator mode. 🟡
  - Referenced in: index, context-engineering, verification, principles
- DSPy (Stanford, 33K stars). Prompts as learnable parameters. BootstrapFewShot and MIPROv2 search language space automatically. 🟡
- Pydantic (2026). Analyzed 4,668 pull request comments, extracted 150 AGENTS.md rules. Engineering taste compiled into agent instructions. 🟡
  - Referenced in: specification
- Promptfoo (acquired by OpenAI, March 2026, $86M). 350K developers. Trajectory assertions: `tool-used`, `tool-args-match`, `tool-sequence`, `step-count`, `goal-success`. 🟡
  - Referenced in: verification, landscape
- spec-kit (79K stars). GitHub's 5-command SDD workflow: constitution → specify → plan → tasks → implement. 20+ agents. 🟡
  - Referenced in: specification, landscape
- Kiro (Amazon). IDE built around requirements → design → tasks. Specs live in the project root and evolve with the codebase. 🟡
  - Referenced in: specification, landscape
- AGENTS.md. 60,000+ repositories. Linux Foundation / Agentic AI Foundation standard. Cross-tool coordination protocol backed by Anthropic, Google, Microsoft, and Cursor. 🟡
  - Referenced in: index, specification, landscape
- Augment Code. Single-writer rule for hotspot files. Sequential merge strategy. 🟡
  - Referenced in: pipeline
- OpenTelemetry GenAI SIG. Semantic conventions: `gen_ai.chat` for LLM calls, `agent.invoke` for agent steps, `tool.execute` for tool calls. Supported natively by Datadog, Honeycomb, and New Relic. 🟡
  - Referenced in: verification
- Simon Willison (@simonw). Rigorous public reference on agentic engineering. Tests are free and mandatory. Agents follow existing code patterns. 🟡🟠
  - Referenced in: context-engineering, landscape
- Martin Fowler. Spec progression: spec-first → spec-anchored → spec-as-source. Maturity curve mapping. 🟡
  - Referenced in: index
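The trajectory-assertion idea from the Promptfoo entry above is easy to sketch. This is a hypothetical re-implementation of three of the listed assertion kinds (`tool-used`, `tool-sequence`, `step-count`), not Promptfoo's actual config format or API:

```python
def check_trajectory(trajectory, assertions):
    """Evaluate assertions against an agent run.

    `trajectory` is an assumed shape: a list of {"tool": name} steps.
    """
    tools = [step["tool"] for step in trajectory]
    results = []
    for kind, expected in assertions:
        if kind == "tool-used":
            ok = expected in tools
        elif kind == "tool-sequence":
            # Expected tools must appear in order (as a subsequence).
            it = iter(tools)
            ok = all(t in it for t in expected)
        elif kind == "step-count":
            ok = len(trajectory) <= expected
        else:
            raise ValueError(f"unsupported assertion kind: {kind}")
        results.append((kind, ok))
    return results

run = [{"tool": "Read"}, {"tool": "Edit"}, {"tool": "Test"}]
verdicts = check_trajectory(run, [
    ("tool-used", "Edit"),
    ("tool-sequence", ["Read", "Test"]),
    ("step-count", 5),
])  # all three pass for this run
```

The point of the technique is that assertions constrain *how* the agent worked, not just whether the final answer was right.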
## Community Reports

- Amazon deployment (2026). 21,000 agents, 80% weekly usage. 4 Sev-1 incidents in 90 days. One 6-hour outage, ~6.3M lost orders. ~30,000 layoffs concurrent with AI scaling. 🟠
  - Referenced in: index, verification, principles, landscape
- Self-improvement tools (March 2026). Three independent projects (skill-loop, selfwrite, iterate) shipped in the same week without coordination. All focused on instruction improvement, not weight modification. 🟠
  - Referenced in: self-improvement, landscape
- Edit tool failure rate. Agents express edits as text replacements, which break on whitespace drift, formatting changes, and multi-cursor ambiguity. Documented across multiple tools. 🟠
  - Referenced in: landscape
- Spec sizing (900-1600 tokens). Community-reported sweet spot for structured quick-dev specs. Below 900: ambiguity risk. Above 1600: tail instruction degradation. 🟠
  - Referenced in: specification
- Rory Teehan. Structured error logging: what happened, why, what should have happened. 🟡
  - Referenced in: self-improvement
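The edit-tool failure mode reported above is mechanical and easy to reproduce. A minimal sketch (function and variable names are illustrative, not any specific tool's code) of an exact-string edit breaking when a formatter changes indentation between the agent's read and its edit:

```python
def exact_replace(source, old, new):
    """Apply an edit as exact-string replacement, the way many agent
    edit tools do: the target must match character-for-character."""
    if old not in source:
        raise ValueError("edit target not found in file")
    return source.replace(old, new, 1)

agent_saw = "if (ready) {\n    launch();\n}"  # agent read a 4-space indent
on_disk = "if (ready) {\n  launch();\n}"      # formatter re-indented to 2 spaces

# The edit works against the agent's stale view of the file...
updated = exact_replace(agent_saw, "    launch();", "    launch_safely();")

# ...but fails against the real file: whitespace drift alone breaks it.
try:
    exact_replace(on_disk, "    launch();", "    launch_safely();")
    edit_applied = True
except ValueError:
    edit_applied = False
```

This is why the same edit mechanism succeeds in benchmarks with frozen files and fails in live repositories.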
## People to Follow

- Andrej Karpathy (@karpathy). Autoresearch: 700 commits in two days, −11% validation loss. Memory should be tree-structured, not flat. 🟠
- Mario Zechner (@badlogicgames). Built Pi. When agents self-praise, human review becomes the bottleneck. 🟠
- Harrison Chase (@hwchase17). LangChain. Focus: what production agent orchestration actually looks like at scale. 🟠

This reference list is maintained alongside the documentation. Sources carry the same evidence-level markers used in the main text.