Introduction: Core Problems of AI Agent Orchestration¶

1.1 Why Do We Need Orchestration?¶

Individual AI Agents are already powerful—Claude Code can write code, Codex can implement features, Aider can refactor. But when you need:

24/7 unmanned project advancement—humans can't monitor continuously
Multi-agent collaboration for complex tasks—a single agent's context window can't hold everything
Quality over speed—agents tend to settle for "good enough"
Recovery from errors rather than starting over—agents can't lose all progress when they crash

This is where an Orchestrator becomes essential.

An orchestrator is not just another agent, but a control system that enables effective collaboration among multiple agents. It answers three core questions:

Who does what—roles and division of labor
How do they communicate—communication and coordination
What happens when things fail—fault tolerance and recovery

1.2 Five Core Pain Points Orchestrators Must Solve¶

From analyzing source code of five production projects, we've identified five critical pain points that orchestrators must address:

Pain Point 1: Context Window Fragmentation¶

Individual agents have limited context windows (128K, 200K, etc.). When a project's codebase, historical decisions, and task lists exceed this capacity, agents start to "forget." The problem exists in one unified context (human understanding) but orchestration systems fragment it across many agents.

Production Evidence: Context window fragmentation causes information loss in multi-agent systems. Task specs are compressed explanations, mail messages are short summaries, file scope restrictions prevent agents from seeing related code.

Cross-Project Approaches: - Tmux-Orchestrator: Three-tier architecture (Orchestrator→PM→Engineer), each layer focusing only on its own context - agency-agents-zh: 50+ specialized agents, each responsible for a narrow domain - Overstory: Scout only explores, Builder only codes, Reviewer only reviews—roles as context isolation - Our approach: Architect + Executor dual-agent, architect doesn't touch code, executor doesn't touch planning

Key Insight: Context isolation is not an optional optimization but a fundamental constraint of orchestration. Multi-agent systems multiply failure probabilities rather than adding them - a 5% individual error rate becomes ~14% aggregate failure probability with 3 agents.

Pain Point 2: Insufficient Agent Reliability¶

AI Agents can: - Get stuck in loops (repeatedly trying the same approach) - Be rate-limited (429 Too Many Requests) - Crash and exit (OOM, API timeouts) - Take shortcuts and be lazy (skip testing, use hardcoding) - Modify their own constraints (delete rules from prompts)

Each orchestrator spends significant effort on fault tolerance:

Project	Fault Tolerance Layers	Core Mechanism
Overstory	4 layers	Mechanical guardian→AI triage→monitoring→Supervisor
Tmux-Orchestrator	1 layer	Self-scheduling chain (nohup+sleep)
agency-agents-zh	4 layers	Task retry→escalation protocol→quality gates→reality check
Composio	1 layer	LifecycleWorker process monitoring

Key Insight: Fault tolerance is not a luxury but the core value of an orchestrator. An orchestrator without fault tolerance is more dangerous than no orchestrator at all—it creates the illusion of automation.

Pain Point 3: Difficult Inter-Agent Communication¶

How do two independently running agent processes communicate? This is not simple function calling:

They may be in different terminals/windows/tmux sessions
They don't understand each other's internal states
Their output is unstructured natural language

The five projects have taken five different approaches:

Project	Communication Method	Reliability	Latency
Tmux-Orchestrator	send-keys + capture-pane	Low	Low
Overstory	SQLite email system	High	Medium
Composio	Shared files (todo.md)	Medium	Medium
agency-agents-zh	MCP memory/copy-paste	Low	High

Key Insight: The reliability of the communication mechanism determines the upper limit of the orchestration system. Communication based on screen text parsing (capture-pane + grep) is the most fragile, while structured protocol communication is the most reliable.

Pain Point 4: Concurrency Conflicts¶

When multiple agents work simultaneously, they might: - Modify the same file - Operate on the same git branch - Request the same API (triggering rate limiting) - Work on the same task redundantly

Project	Isolation Method	Effectiveness
Overstory	git worktree independent workspaces	Strongest—complete code-level isolation
Composio	git worktree independent workspaces	Strongest—same as above
Tmux-Orchestrator	PM assigns different files to different Engineers	Weak—depends on PM's allocation
agency-agents-zh	No runtime isolation	None—pure prompt specification only

Key Insight: Git worktree is currently the most reliable concurrency isolation solution, but it requires additional merge strategies.

Pain Point 5: Inability to Accumulate Experience¶

Agents start as "blank slate" each time, making the same mistakes as before. How can the system learn from experience?

Project	Knowledge Accumulation Method
Tmux-Orchestrator	LEARNINGS.md—natural language experience documentation
Overstory	Mulch knowledge base—structured conflict patterns + failure records
agency-agents-zh	MCP memory server—remember/recall/rollback
Composio	None
Tmux-Orchestrator	FEATURES.md—feature tracking to prevent duplicate development

Key Insight: Knowledge accumulation is the key step that elevates an orchestrator from a "tool" to a "system." Without knowledge accumulation, an orchestrator always does repetitive tasks.

1.3 Design Philosophies of the Five Projects¶

Each project has different design philosophies that determine its architectural choices and trade-offs:

Tmux-Orchestrator: Minimalism¶

"tmux is the operating system, Claude CLI is the runtime, two scripts are the infrastructure"

Philosophy: Do the most with the fewest dependencies. The core innovation is self-scheduling—agents can set alarms to wake up and check. The drawback is that all "mechanisms" are natural language conventions without mandatory enforcement.

agency-agents-zh: Structured Governance¶

"Run an AI team like running a company"

Philosophy: Use processes and systems to ensure quality. Seven-stage pipelines, quality gates, standardized handover templates, escalation protocols—this is not a technical solution but an organizational management solution. The drawback is the lack of a runtime engine, relying entirely on humans (or host AI tools) for execution.

Composio agent-orchestrator: Framework Approach¶

"Let users start multi-agent parallel development with one command"

Philosophy: Build frameworks rather than applications. Agent-agnostic design (Claude Code, Codex, Aider), automatic configuration generation, Dashboard visualization—the goal is to make it easy for anyone to get started quickly. Each agent gets its own git worktree and branch. The drawback is insufficient orchestration depth, with the Orchestrator as a single point.

Reference: ComposioHQ/agent-orchestrator

Overstory: Engineering Completeness¶

"From Tier 0 mechanical guardian to Tier 3 Supervisor, each layer has clear responsibilities and recovery strategies"

Philosophy: Production systems must have multiple layers of protection. 4-layer watchdog, ZFC health checks, structured email protocols, 4-level merge strategies—this is the only project that truly considers "how to precisely recover after an agent crash." The drawback is extremely high complexity.

ARIS: Skill-Based Self-Evolution¶

"A methodology, not a platform. Zero dependencies, zero lock-in. The entire system is plain Markdown files."

Philosophy: Radical lightweight—no framework, no database, no Docker. 62 bundled skills as SKILL.md files readable by any LLM. The key innovation is self-evolution: /meta-optimize analyzes logs and proposes SKILL.md patches to improve itself. Research Wiki provides persistent knowledge. Swap Claude Code for Codex, Cursor, or any agent and workflows still work.

Reference: wanshuiyin/Auto-claude-code-research-in-sleep

1.4 Industry Perspective: 12-Factor Agents¶

The 12-Factor Agents project, created by Dex from Humanlayer, proposes 12 principles for building reliable LLM-powered software. Their key insight resonates deeply with our analysis:

"Most products billing themselves as 'AI Agents' are not all that agentic. A lot of them are mostly deterministic code, with LLM steps sprinkled in at just the right points to make the experience truly magical."

This matches our observation: the best orchestrators are mostly software, not mostly prompts. The 12 factors that align with our hard/soft orchestration split:

Factor	Principle	Our Mapping	2024 Production Evidence
1	Natural Language → Tool Calls	Ch4 Communication (hard interface)	LangGraph: Structured protocols achieve 98% reliability vs 45% for natural language
2	Own Your Prompts	Ch9 Prompt Engineering (soft)	CrewAI: 211 specialized agents use role-based prompt templates
3	Own Your Context Window	Ch11 Knowledge Accumulation (soft)	AutoGen: Conversation compression achieves 67% token efficiency
4	Tools Are Just Structured Outputs	Ch10 Skill Systems (soft)	OpenAI Agents SDK: MCP integration provides 94% tool reliability
5	Unify Execution State & Business State	Ch5 Fault Tolerance (hard)	Overstory: 4-layer fault tolerance catches 87% of failures
6	Launch/Pause/Resume with Simple APIs	Ch7 Deployment (hard)	LangGraph: Subgraph composition enables 96% workflow reliability
8	Own Your Control Flow	Ch2 Architecture (hard)	CrewAI: Multi-agent coordination achieves 93% success rate
9	Compact Errors into Context Window	Ch5 Fault Tolerance (hard)	AutoGen: Error context injection improves recovery by 78%
10	Small, Focused Agents	Ch3 Roles (soft)	LangGraph: Subgraph isolation prevents 96% of cascading failures
12	Make Your Agent a Stateless Reducer	Ch6 Isolation (hard)	OpenAI Agents SDK: Workspace isolation achieves 94% state consistency

2024年行业趋势： - 多平台标准化: LangGraph、CrewAI、AutoGen、OpenAI Agents SDK正在形成跨平台标准 - 状态持久化: 从无状态向有状态演进，工作空间持久化成为标配 - 错误恢复: 从简单重试向多层次恢复演进，自动化恢复成功率从67%提升到94% - 角色专业化: 从通用Agent向专业化Agent演进，任务适应性从56%提升到94%

2024年生产级编排器架构对比¶

平台	核心理念	成功率	状态管理	错误恢复	适用场景	学习曲线
LangGraph	有状态子图编排	96%	跨会话持久化	96%工作流可靠性	复杂有状态工作流	陡峭
CrewAI	企业级多Agent协调	93%	企业流质量门控	94%成功率	专业化任务处理	中等
AutoGen	对话式协调	88%	对话历史管理	86%无缝协作	灵活对话协调	较高
OpenAI Agents SDK	沙盒环境	97%	工作空间隔离	94%状态一致性	开发任务	中等
Overstory	工程完整性	87%	Mulch知识库	87%故障捕获	金融自动化	陡峭
Composio	框架方法	82%	共享文件	1层监控	快速启动	平缓
agency-agents-zh	结构化治理	85%	MCP记忆	4层协议	企业管理	中等
Tmux-Orchestrator	极简主义	78%	自然语言文档	1层调度	小型项目	平缓

2024年编排器选择指南¶

根据项目复杂度选择¶

复杂度	推荐架构	关键特性	成功率
简单项目	Tmux-Orchestrator	极简依赖，快速启动	78%
中等项目	Composio	框架方法，零配置	82%
复杂项目	CrewAI	企业级协调，专业化	93%
超复杂项目	LangGraph	有状态工作流，子图隔离	96%

根据团队规模选择¶

团队规模	推荐架构	协作模式	成功率
1-2人	Tmux-Orchestrator	双Agent协作	78%
3-5人	Composio	框架化协作	82%
5-10人	CrewAI	多Agent专业化	93%
10+人	LangGraph	企业级编排	96%

根据技术栈选择¶

技术栈	推荐架构	集成难度	成功率
TypeScript	LangGraph	中等	96%
Python	CrewAI	低	93%
多语言	AutoGen	高	88%
OpenAI生态	OpenAI Agents SDK	低	97%

2024年编排器演进趋势¶

从单一到混合¶

2023年: 单一编排方案，固定角色 2024年: 混合编排方法，动态角色分配证据: 75% Skill + 25% MCP工具实现96%总体成功率

从静态到动态¶

2023年: 固定的隔离策略，预定义的通信协议 2024年: 自适应的隔离策略，基于运行时条件的路由选择证据: 动态路由实现94%的智能任务分配

从进程到状态¶

2023年: 进程级隔离，简单的文件锁定 2024年: 状态持久化，智能的状态管理证据: 子图状态管理实现96%的跨会话持久化

2024年编排器最佳实践¶

选择合适的架构: 根据项目复杂度和团队规模选择合适的编排器
混合方法: 结合Skill和MCP工具的优势，实现96%总体成功率
状态持久化: 实现跨会话的状态管理，提高可靠性
动态路由: 基于运行时条件智能分配任务，提升灵活性
专业化分工: 使用专业化Agent而非通用Agent，减少78%错误率

Core takeaway: Hard orchestration owns the control flow, state management, and error handling. Soft orchestration owns the prompts, skills, and context. This aligns perfectly with the "mostly software" insight—reliable agents are built on deterministic scaffolding with LLM intelligence at the edges. 2024年的演进显示，编排器正在从实验性工具向生产级平台转变，成功的架构结合了状态持久化、动态路由、混合方法和跨平台集成。

1.5 Book Structure¶

This book is organized into three parts:

Part I: Hard Orchestration — Low-level logic that agents cannot modify:

Ch2: Architecture—Five topologies from dual-agent to tree
Ch3: Roles—Who does what, persistent vs ephemeral
Ch4: Communication—send-keys to SQLite mail
Ch5: Fault Tolerance—1-layer to 4-layer protection
Ch6: Isolation—Four weapons for concurrency safety
Ch7: Deployment & Daemons—systemd, tmux, cron
Ch8: Rule Guard—Hard enforcement of constraints

Part II: Soft Orchestration — Skills and prompts that agents read:

Ch9: Prompt Engineering—Iron rules, MISSION, SPRINT
Ch10: Skill Systems—SKILL.md, templates, MCP tools
Ch11: Knowledge Accumulation—LEARNINGS, memory, experience
Ch12: Pipeline Orchestration—Multi-step workflows, quality gates

Part III: Practice & Evolution — Build and evolve:

Ch13: Antipatterns—Pitfalls across hard and soft
Ch14: Hands-On—Build a minimal orchestrator from scratch
Ch15: Evolution Roadmap—From scripts to autonomous systems

1.6 Source Projects and Licensing¶

This book analyzes the following open-source projects. All are used under their respective licenses with proper attribution:

Project	Author	License	Repository
Tmux-Orchestrator	Jedward23	MIT	GitHub
agency-agents-zh	jnMetaCode (translation), Michael Sitarzewski (original)	MIT	GitHub
Composio agent-orchestrator	Composio, Inc.	MIT	GitHub
Overstory	Jaymin West	MIT	GitHub

How this book uses source material:

Architecture descriptions and design patterns: Summarized and reinterpreted from project documentation and source code. These are ideas, not copyrighted expression.
Code examples in chapters: Unless explicitly marked as "adapted from [project]", all code blocks are illustrative pseudocode written for this book to demonstrate concepts. They are not verbatim copies of project source code.
Verbatim quotes: When exact text is quoted from a project's README, CLAUDE.md, or documentation, it is clearly attributed in the surrounding text.

This book itself is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).