References

This page collects sources, evidence levels, and experiment details for the Meta-Programming documentation.

Evidence Levels

| Marker | Level | Meaning |
| --- | --- | --- |
| 🟢 | Proven | Our experiment, our data, measured result |
| 🟡 | Trusted source | Anthropic, Microsoft Research, peer-reviewed; we read the primary source |
| 🟠 | Community reports | Widely observed, not independently verified by us |
| 🔴 | Unverified | Heard, not checked |
| ⚪ | Opinion | Our synthesis, reasoned but not proven |

Our Experiments

| # | Description | Key Finding | Pages |
| --- | --- | --- | --- |
| 1 | EventBus refactor A/B (5 variants, 509 TS files, DDD, Cloudflare Workers) | Process beats information: the Scout → Spec → Worker → Review pipeline ($8.45) outperformed raw context injection ($2.84 fail, $14.30 pass). Code maps hurt: $9.99 fail vs $6.63 pass without a map. | pipeline, principles |
| 2 | Telegram bot without spec (support reply feature) | Two consecutive deterministic failures without a spec: the first corrupted an existing handler, the second missed the intent entirely. | specification, principles |
| 3 | 24-file type refactor | 253 tests, zero regressions, $5.50 API cost. Review required three iterations (both failure types were missing pre-flight checks). The 106-turn task didn't contaminate subsequent tasks thanks to a context reset. | index, verification, pipeline |
| 4 | Spec review with barrel re-export | Agent proposed a barrel re-export during spec review, then recognized it was unnecessary when asked to explain. Spec review is cheaper than code review. | specification, principles |
| 5 | Pipeline variant comparison | Structured pipeline: correct on first attempt, $8.45. Raw prompt: wrong, $2.84. Sequential with context: correct, $14.30. With code map: wrong, $9.99. | pipeline |
| 6 | KB A/B test (generic vs structured) | Generic Sonnet said "give it a code map." The KB-loaded agent flagged the exploration-vs-exploitation paradox, citing prior session evidence, before writing code. | index, principles |
| 7 | Edit tool investigation | Persistent error pattern traced to our own extension, not the platform. Post-fix benchmark: 7.1% errors, 1.1% data loss. | index |
| 8 | Model evaluation (thinking levels) | Thinking level acts as a compliance-to-conviction dial. Soft sycophancy identified: the agent says no while providing the implementation anyway. | index |
| 9 | Opus degradation incident (April 2026) | Read:Edit ratio dropped from 6.6 to 2.0, thinking depth fell 67%, costs spiked 80×. Three days of silent degradation with no API-side signal. | verification, principles |

External Sources

Large-Scale Studies

  1. LinearB (2026). "The Real Impact of AI on Developer Productivity." 8.1M pull requests, 4,800 teams, 42 countries. AI-generated code: 1.7× more review revisions, 4.6× longer review wait, 32.7% acceptance rate (vs 84.4% human). Developers feel 20% faster; tasks take 19% longer end-to-end. 🟡

  2. Stanford Meta-Harness Study (2026). Changing the harness around a fixed LLM produces a 6× performance gap on the same benchmark. Automated harness optimization searched configurations the way gradient descent searches weight space. 🟡

  3. ETH Zurich (Feb 2026). 138 real-world tasks across 3 models. Auto-generated context files reduced task success by 3%; human-written boundaries improved it by 4%. 🟡

  4. Sonar (2026). Survey of 1,000+ developers. Only 48% verify AI output before shipping. 🟡

  5. Microsoft Copilot Study (2026). 10-month study, 878 pull requests. "The bottleneck moved from typing speed to knowledge, judgment, and ability to articulate tasks." 🟡

  6. BSWEN (2026). 133 cycles, 42 development phases, four models in strict isolation. GPT caught Python security issues Claude missed; Claude caught architectural violations GPT normalized. Each model had different blind spots. 🟡

  7. Bamberg/Heidelberg (2026). Systematic analysis of 2,926 repositories across Claude Code, GitHub Copilot, Cursor, Gemini CLI, and Pydantic AI. All are converging on identical patterns independently. 🟡

Research Papers

  1. Tsinghua NLAH (March 2026). "Natural-Language Agent Harnesses." Harness behavior externalized as "a portable executable artifact in editable natural language." 🟡

  2. Microsoft Research RiSE (March 2026). Lahiri et al. "Intent Formalization" named as a grand challenge for 2026. The intent gap: the semantic distance between what a developer means and what the system does. 🟡

  3. Allard et al., ERL (March 2026). Agents with heuristics extracted from prior trajectories outperformed ReAct baselines by +7.8%. "Heuristics provide more transferable abstractions than few-shot prompting." 🟡

  4. ExpeL (Andrew Zhao et al., 2023). Experience Learning: three-stage self-improvement (act → reflect → extract). On HotpotQA and ALFWorld, ExpeL agents improve with each batch of trajectories. 🟡

  5. Chroma MECW (2025). Maximum Effective Context Window. 18 frontier models tested. Universal degradation; the effective window falls as low as 1-2% of the advertised maximum in some conditions. 🟡🔴 (specific numbers unverified by us)

  6. Reflexion (NeurIPS 2023). Verbal reflection with episodic memory raised HumanEval from 80% to 91%. 🟡

  7. ADAS (ICLR 2025). Automated Design of Agentic Systems: the agent designs its own pipeline structure. 🟡

  8. Gödel Agent (ICLR 2025). Recursive self-modification via confidence-based logic. 🟡

  9. DKB (January 2026). Deterministic Knowledge Bases. AST graphs beat vector RAG and LLM-generated knowledge graphs for code navigation. 🟡

  10. AMBIG-SWE (ICLR 2026). Benchmark for ambiguity detection in software engineering tasks. 🟡

  11. ACE Framework. Agentic Context Engineering. Memory scoring: each unit carries a score that updates on use. Quality saturates at ~7 governed memories per entity across 500 adversarial queries. 🟡

  12. Pavlyshyn (Jan 2026). History of constrained natural language in programming: COBOL, SQL, Simula, a 60-year progression. 🟡
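The memory-scoring idea in the ACE entry above (each unit carries a score that updates on use, with quality saturating around 7 governed memories) can be sketched in a few lines. This is our illustration of the general mechanism; the class, method names, and update rule are assumptions, not the framework's actual API.

```python
MAX_MEMORIES = 7  # quality reportedly saturates at ~7 governed memories per entity


class MemoryStore:
    """Illustrative score-governed memory store (not ACE's real implementation)."""

    def __init__(self, cap: int = MAX_MEMORIES):
        self.cap = cap
        self.scores: dict[str, float] = {}  # memory text -> usefulness score

    def add(self, memory: str, score: float = 1.0) -> None:
        self.scores[memory] = score
        self._evict()

    def use(self, memory: str, helpful: bool) -> None:
        # Each unit's score updates on use: reward helpful recalls, decay the rest.
        if memory in self.scores:
            self.scores[memory] += 1.0 if helpful else -0.5

    def _evict(self) -> None:
        # Keep only the top-scoring memories under the cap.
        if len(self.scores) > self.cap:
            ranked = sorted(self.scores, key=self.scores.get, reverse=True)
            for m in ranked[self.cap:]:
                del self.scores[m]

    def recall(self) -> list[str]:
        # Highest-scoring memories first.
        return sorted(self.scores, key=self.scores.get, reverse=True)
```

The cap plus score-based eviction is what distinguishes this from an append-only instruction file: low-value memories are displaced instead of accumulating.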

Vendor Documentation & Posts

  1. Anthropic, "Building Agents with Skills." Skills as zero-cost-until-invoked context units. Self-evaluation bias documented explicitly. Claude Code architecture: tiered context loading, compaction, coordinator mode. 🟡

  2. DSPy (Stanford, 33K stars). Prompts as learnable parameters; BootstrapFewShot and MIPROv2 search language space automatically. 🟡

  3. Pydantic (2026). Analyzed 4,668 pull request comments and extracted 150 AGENTS.md rules: engineering taste compiled into agent instructions. 🟡

  4. Promptfoo (acquired by OpenAI, March 2026, $86M). 350K developers. Trajectory assertions: tool-used, tool-args-match, tool-sequence, step-count, goal-success. 🟡

  5. spec-kit (79K stars). GitHub's 5-command SDD workflow: constitution → specify → plan → tasks → implement. Supports 20+ agents. 🟡

  6. Kiro (Amazon). IDE built around requirements → design → tasks. Specs live in the project root and evolve with the codebase. 🟡

  7. AGENTS.md. 60,000+ repositories. Linux Foundation / Agentic AI Foundation standard. Cross-tool coordination protocol backed by Anthropic, Google, Microsoft, and Cursor. 🟡

  8. Augment Code. Single-writer rule for hotspot files; sequential merge strategy. 🟡

  9. OpenTelemetry GenAI SIG. Semantic conventions: gen_ai.chat for LLM calls, agent.invoke for agent steps, tool.execute for tool calls. Datadog, Honeycomb, and New Relic support them natively. 🟡

  10. Simon Willison (@simonw). Rigorous public reference on agentic engineering. Tests are free and mandatory; agents follow existing code patterns. 🟡🟠

  11. Martin Fowler. Spec progression: spec-first → spec-anchored → spec-as-source. Maturity curve mapping. 🟡
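The trajectory-assertion types named in the Promptfoo entry above (tool-used, tool-sequence, step-count) reduce to simple predicates over an agent's tool-call log. A minimal sketch, assuming a list-of-dicts log format of our own invention; this is not Promptfoo's actual assertion API.

```python
def tool_used(trajectory: list[dict], name: str) -> bool:
    """tool-used: the named tool appears somewhere in the trajectory."""
    return any(step["tool"] == name for step in trajectory)


def tool_sequence(trajectory: list[dict], expected: list[str]) -> bool:
    """tool-sequence: expected tools appear in order (other steps may interleave)."""
    it = iter(step["tool"] for step in trajectory)
    # `name in it` advances the iterator, so order is enforced.
    return all(name in it for name in expected)


def step_count(trajectory: list[dict], max_steps: int) -> bool:
    """step-count: the agent stayed within its step budget."""
    return len(trajectory) <= max_steps
```

The point of such assertions is that they grade the path the agent took, not just the final answer, which is what makes them useful for regression-testing pipelines.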

Community Reports

  1. Amazon deployment (2026). 21,000 agents, 80% weekly usage. 4 Sev-1 incidents in 90 days; 6-hour outage, ~6.3M lost orders. ~30,000 layoffs concurrent with AI scaling. 🟠

  2. Self-improvement tools (March 2026). Three independent projects (skill-loop, selfwrite, iterate) shipped in the same week without coordination. All focused on instruction improvement, not weight modification. 🟠

  3. Edit tool failure rate. Agents express edits as text replacements, which break on whitespace drift, formatting changes, and multi-cursor ambiguity. Documented across multiple tools. 🟠

  4. Spec sizing (900-1600 tokens). Community-reported sweet spot for structured quick-dev specs. Below 900: ambiguity risk. Above 1600: tail instruction degradation. 🟠

  5. Rory Teehan. Structured error logging: what happened, why, and what should have happened. 🟡
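The failure mode in the edit-tool report above is easy to reproduce: an exact-match text replacement silently stops matching once the file drifts in whitespace. A minimal sketch, with file contents and the error behavior invented for illustration:

```python
def apply_edit(source: str, old: str, new: str) -> str:
    """Exact-match replacement, the way many edit tools express edits."""
    if old not in source:
        # This is the fragile step: any whitespace or formatting drift
        # between the agent's snapshot and the file defeats the match.
        raise ValueError("edit target not found")
    return source.replace(old, new, 1)


file_v1 = "def greet():\n    return 'hi'\n"
file_v2 = "def greet():\n\treturn 'hi'\n"  # a reformatter swapped spaces for a tab

edit = ("    return 'hi'", "    return 'hello'")
apply_edit(file_v1, *edit)    # succeeds on the original file
# apply_edit(file_v2, *edit)  # raises ValueError: the target no longer matches
```

The code is semantically identical in both versions; only the indentation character changed, which is exactly why purely textual edit formats are brittle.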

People to Follow

  1. Andrej Karpathy (@karpathy). Autoresearch: 700 commits in two days, −11% validation loss. Memory should be tree-structured, not flat. 🟠
  2. Mario Zechner (@badlogicgames). Built Pi. When agents self-praise, human review becomes the bottleneck. 🟠
  3. Harrison Chase (@hwchase17). LangChain. Focus: what production agent orchestration actually looks like at scale. 🟠

This reference list is maintained alongside the documentation. Sources carry the same evidence-level markers used in the main text.