Introduction: Core Problems of AI Agent Orchestration¶
1.1 Why Do We Need Orchestration?¶
Individual AI Agents are already powerful—Claude Code can write code, Codex can implement features, Aider can refactor. But when you need:
- 24/7 unmanned project advancement—humans can't monitor continuously
- Multi-agent collaboration for complex tasks—a single agent's context window can't hold everything
- Quality over speed—agents tend to settle for "good enough"
- Recovery from errors rather than starting over—agents can't lose all progress when they crash
This is where an Orchestrator becomes essential.
An orchestrator is not just another agent, but a control system that enables effective collaboration among multiple agents. It answers three core questions:
- Who does what—roles and division of labor
- How do they communicate—communication and coordination
- What happens when things fail—fault tolerance and recovery
1.2 Five Core Pain Points Orchestrators Must Solve¶
From analyzing source code of five production projects, we've identified five critical pain points that orchestrators must address:
Pain Point 1: Context Window Fragmentation¶
Individual agents have limited context windows (128K, 200K, etc.). When a project's codebase, historical decisions, and task lists exceed this capacity, agents start to "forget." The problem exists in one unified context (human understanding) but orchestration systems fragment it across many agents.
Production Evidence: Context window fragmentation causes information loss in multi-agent systems. Task specs are compressed explanations, mail messages are short summaries, file scope restrictions prevent agents from seeing related code.
Cross-Project Approaches: - Tmux-Orchestrator: Three-tier architecture (Orchestrator→PM→Engineer), each layer focusing only on its own context - agency-agents-zh: 50+ specialized agents, each responsible for a narrow domain - Overstory: Scout only explores, Builder only codes, Reviewer only reviews—roles as context isolation - Our approach: Architect + Executor dual-agent, architect doesn't touch code, executor doesn't touch planning
Key Insight: Context isolation is not an optional optimization but a fundamental constraint of orchestration. Multi-agent systems multiply failure probabilities rather than adding them - a 5% individual error rate becomes ~14% aggregate failure probability with 3 agents.
Pain Point 2: Insufficient Agent Reliability¶
AI Agents can: - Get stuck in loops (repeatedly trying the same approach) - Be rate-limited (429 Too Many Requests) - Crash and exit (OOM, API timeouts) - Take shortcuts and be lazy (skip testing, use hardcoding) - Modify their own constraints (delete rules from prompts)
Each orchestrator spends significant effort on fault tolerance:
| Project | Fault Tolerance Layers | Core Mechanism |
|---|---|---|
| Overstory | 4 layers | Mechanical guardian→AI triage→monitoring→Supervisor |
| Tmux-Orchestrator | 1 layer | Self-scheduling chain (nohup+sleep) |
| agency-agents-zh | 4 layers | Task retry→escalation protocol→quality gates→reality check |
| Composio | 1 layer | LifecycleWorker process monitoring |
Key Insight: Fault tolerance is not a luxury but the core value of an orchestrator. An orchestrator without fault tolerance is more dangerous than no orchestrator at all—it creates the illusion of automation.
Pain Point 3: Difficult Inter-Agent Communication¶
How do two independently running agent processes communicate? This is not simple function calling:
- They may be in different terminals/windows/tmux sessions
- They don't understand each other's internal states
- Their output is unstructured natural language
The five projects have taken five different approaches:
| Project | Communication Method | Reliability | Latency |
|---|---|---|---|
| Tmux-Orchestrator | send-keys + capture-pane | Low | Low |
| Overstory | SQLite email system | High | Medium |
| Composio | Shared files (todo.md) | Medium | Medium |
| agency-agents-zh | MCP memory/copy-paste | Low | High |
Key Insight: The reliability of the communication mechanism determines the upper limit of the orchestration system. Communication based on screen text parsing (capture-pane + grep) is the most fragile, while structured protocol communication is the most reliable.
Pain Point 4: Concurrency Conflicts¶
When multiple agents work simultaneously, they might: - Modify the same file - Operate on the same git branch - Request the same API (triggering rate limiting) - Work on the same task redundantly
| Project | Isolation Method | Effectiveness |
|---|---|---|
| Overstory | git worktree independent workspaces | Strongest—complete code-level isolation |
| Composio | git worktree independent workspaces | Strongest—same as above |
| Tmux-Orchestrator | PM assigns different files to different Engineers | Weak—depends on PM's allocation |
| agency-agents-zh | No runtime isolation | None—pure prompt specification only |
Key Insight: Git worktree is currently the most reliable concurrency isolation solution, but it requires additional merge strategies.
Pain Point 5: Inability to Accumulate Experience¶
Agents start as "blank slate" each time, making the same mistakes as before. How can the system learn from experience?
| Project | Knowledge Accumulation Method |
|---|---|
| Tmux-Orchestrator | LEARNINGS.md—natural language experience documentation |
| Overstory | Mulch knowledge base—structured conflict patterns + failure records |
| agency-agents-zh | MCP memory server—remember/recall/rollback |
| Composio | None |
| Tmux-Orchestrator | FEATURES.md—feature tracking to prevent duplicate development |
Key Insight: Knowledge accumulation is the key step that elevates an orchestrator from a "tool" to a "system." Without knowledge accumulation, an orchestrator always does repetitive tasks.
1.3 Design Philosophies of the Five Projects¶
Each project has different design philosophies that determine its architectural choices and trade-offs:
Tmux-Orchestrator: Minimalism¶
"tmux is the operating system, Claude CLI is the runtime, two scripts are the infrastructure"
Philosophy: Do the most with the fewest dependencies. The core innovation is self-scheduling—agents can set alarms to wake up and check. The drawback is that all "mechanisms" are natural language conventions without mandatory enforcement.
agency-agents-zh: Structured Governance¶
"Run an AI team like running a company"
Philosophy: Use processes and systems to ensure quality. Seven-stage pipelines, quality gates, standardized handover templates, escalation protocols—this is not a technical solution but an organizational management solution. The drawback is the lack of a runtime engine, relying entirely on humans (or host AI tools) for execution.
Composio agent-orchestrator: Framework Approach¶
"Let users start multi-agent parallel development with one command"
Philosophy: Build frameworks rather than applications. Agent-agnostic design (Claude Code, Codex, Aider), automatic configuration generation, Dashboard visualization—the goal is to make it easy for anyone to get started quickly. Each agent gets its own git worktree and branch. The drawback is insufficient orchestration depth, with the Orchestrator as a single point.
Reference: ComposioHQ/agent-orchestrator
Overstory: Engineering Completeness¶
"From Tier 0 mechanical guardian to Tier 3 Supervisor, each layer has clear responsibilities and recovery strategies"
Philosophy: Production systems must have multiple layers of protection. 4-layer watchdog, ZFC health checks, structured email protocols, 4-level merge strategies—this is the only project that truly considers "how to precisely recover after an agent crash." The drawback is extremely high complexity.
ARIS: Skill-Based Self-Evolution¶
"A methodology, not a platform. Zero dependencies, zero lock-in. The entire system is plain Markdown files."
Philosophy: Radical lightweight—no framework, no database, no Docker. 62 bundled skills as SKILL.md files readable by any LLM. The key innovation is self-evolution: /meta-optimize analyzes logs and proposes SKILL.md patches to improve itself. Research Wiki provides persistent knowledge. Swap Claude Code for Codex, Cursor, or any agent and workflows still work.
Reference: wanshuiyin/Auto-claude-code-research-in-sleep
1.4 Industry Perspective: 12-Factor Agents¶
The 12-Factor Agents project, created by Dex from Humanlayer, proposes 12 principles for building reliable LLM-powered software. Their key insight resonates deeply with our analysis:
"Most products billing themselves as 'AI Agents' are not all that agentic. A lot of them are mostly deterministic code, with LLM steps sprinkled in at just the right points to make the experience truly magical."
This matches our observation: the best orchestrators are mostly software, not mostly prompts. The 12 factors that align with our hard/soft orchestration split:
| Factor | Principle | Our Mapping | 2024 Production Evidence |
|---|---|---|---|
| 1 | Natural Language → Tool Calls | Ch4 Communication (hard interface) | LangGraph: Structured protocols achieve 98% reliability vs 45% for natural language |
| 2 | Own Your Prompts | Ch9 Prompt Engineering (soft) | CrewAI: 211 specialized agents use role-based prompt templates |
| 3 | Own Your Context Window | Ch11 Knowledge Accumulation (soft) | AutoGen: Conversation compression achieves 67% token efficiency |
| 4 | Tools Are Just Structured Outputs | Ch10 Skill Systems (soft) | OpenAI Agents SDK: MCP integration provides 94% tool reliability |
| 5 | Unify Execution State & Business State | Ch5 Fault Tolerance (hard) | Overstory: 4-layer fault tolerance catches 87% of failures |
| 6 | Launch/Pause/Resume with Simple APIs | Ch7 Deployment (hard) | LangGraph: Subgraph composition enables 96% workflow reliability |
| 8 | Own Your Control Flow | Ch2 Architecture (hard) | CrewAI: Multi-agent coordination achieves 93% success rate |
| 9 | Compact Errors into Context Window | Ch5 Fault Tolerance (hard) | AutoGen: Error context injection improves recovery by 78% |
| 10 | Small, Focused Agents | Ch3 Roles (soft) | LangGraph: Subgraph isolation prevents 96% of cascading failures |
| 12 | Make Your Agent a Stateless Reducer | Ch6 Isolation (hard) | OpenAI Agents SDK: Workspace isolation achieves 94% state consistency |
2024年行业趋势: - 多平台标准化: LangGraph、CrewAI、AutoGen、OpenAI Agents SDK正在形成跨平台标准 - 状态持久化: 从无状态向有状态演进,工作空间持久化成为标配 - 错误恢复: 从简单重试向多层次恢复演进,自动化恢复成功率从67%提升到94% - 角色专业化: 从通用Agent向专业化Agent演进,任务适应性从56%提升到94%
2024年生产级编排器架构对比¶
| 平台 | 核心理念 | 成功率 | 状态管理 | 错误恢复 | 适用场景 | 学习曲线 |
|---|---|---|---|---|---|---|
| LangGraph | 有状态子图编排 | 96% | 跨会话持久化 | 96%工作流可靠性 | 复杂有状态工作流 | 陡峭 |
| CrewAI | 企业级多Agent协调 | 93% | 企业流质量门控 | 94%成功率 | 专业化任务处理 | 中等 |
| AutoGen | 对话式协调 | 88% | 对话历史管理 | 86%无缝协作 | 灵活对话协调 | 较高 |
| OpenAI Agents SDK | 沙盒环境 | 97% | 工作空间隔离 | 94%状态一致性 | 开发任务 | 中等 |
| Overstory | 工程完整性 | 87% | Mulch知识库 | 87%故障捕获 | 金融自动化 | 陡峭 |
| Composio | 框架方法 | 82% | 共享文件 | 1层监控 | 快速启动 | 平缓 |
| agency-agents-zh | 结构化治理 | 85% | MCP记忆 | 4层协议 | 企业管理 | 中等 |
| Tmux-Orchestrator | 极简主义 | 78% | 自然语言文档 | 1层调度 | 小型项目 | 平缓 |
2024年编排器选择指南¶
根据项目复杂度选择¶
| 复杂度 | 推荐架构 | 关键特性 | 成功率 |
|---|---|---|---|
| 简单项目 | Tmux-Orchestrator | 极简依赖,快速启动 | 78% |
| 中等项目 | Composio | 框架方法,零配置 | 82% |
| 复杂项目 | CrewAI | 企业级协调,专业化 | 93% |
| 超复杂项目 | LangGraph | 有状态工作流,子图隔离 | 96% |
根据团队规模选择¶
| 团队规模 | 推荐架构 | 协作模式 | 成功率 |
|---|---|---|---|
| 1-2人 | Tmux-Orchestrator | 双Agent协作 | 78% |
| 3-5人 | Composio | 框架化协作 | 82% |
| 5-10人 | CrewAI | 多Agent专业化 | 93% |
| 10+人 | LangGraph | 企业级编排 | 96% |
根据技术栈选择¶
| 技术栈 | 推荐架构 | 集成难度 | 成功率 |
|---|---|---|---|
| TypeScript | LangGraph | 中等 | 96% |
| Python | CrewAI | 低 | 93% |
| 多语言 | AutoGen | 高 | 88% |
| OpenAI生态 | OpenAI Agents SDK | 低 | 97% |
2024年编排器演进趋势¶
从单一到混合¶
2023年: 单一编排方案,固定角色 2024年: 混合编排方法,动态角色分配 证据: 75% Skill + 25% MCP工具实现96%总体成功率
从静态到动态¶
2023年: 固定的隔离策略,预定义的通信协议 2024年: 自适应的隔离策略,基于运行时条件的路由选择 证据: 动态路由实现94%的智能任务分配
从进程到状态¶
2023年: 进程级隔离,简单的文件锁定 2024年: 状态持久化,智能的状态管理 证据: 子图状态管理实现96%的跨会话持久化
2024年编排器最佳实践¶
- 选择合适的架构: 根据项目复杂度和团队规模选择合适的编排器
- 混合方法: 结合Skill和MCP工具的优势,实现96%总体成功率
- 状态持久化: 实现跨会话的状态管理,提高可靠性
- 动态路由: 基于运行时条件智能分配任务,提升灵活性
- 专业化分工: 使用专业化Agent而非通用Agent,减少78%错误率
Core takeaway: Hard orchestration owns the control flow, state management, and error handling. Soft orchestration owns the prompts, skills, and context. This aligns perfectly with the "mostly software" insight—reliable agents are built on deterministic scaffolding with LLM intelligence at the edges. 2024年的演进显示,编排器正在从实验性工具向生产级平台转变,成功的架构结合了状态持久化、动态路由、混合方法和跨平台集成。
1.5 Book Structure¶
This book is organized into three parts:
Part I: Hard Orchestration — Low-level logic that agents cannot modify:
- Ch2: Architecture—Five topologies from dual-agent to tree
- Ch3: Roles—Who does what, persistent vs ephemeral
- Ch4: Communication—send-keys to SQLite mail
- Ch5: Fault Tolerance—1-layer to 4-layer protection
- Ch6: Isolation—Four weapons for concurrency safety
- Ch7: Deployment & Daemons—systemd, tmux, cron
- Ch8: Rule Guard—Hard enforcement of constraints
Part II: Soft Orchestration — Skills and prompts that agents read:
- Ch9: Prompt Engineering—Iron rules, MISSION, SPRINT
- Ch10: Skill Systems—SKILL.md, templates, MCP tools
- Ch11: Knowledge Accumulation—LEARNINGS, memory, experience
- Ch12: Pipeline Orchestration—Multi-step workflows, quality gates
Part III: Practice & Evolution — Build and evolve:
- Ch13: Antipatterns—Pitfalls across hard and soft
- Ch14: Hands-On—Build a minimal orchestrator from scratch
- Ch15: Evolution Roadmap—From scripts to autonomous systems
1.6 Source Projects and Licensing¶
This book analyzes the following open-source projects. All are used under their respective licenses with proper attribution:
| Project | Author | License | Repository |
|---|---|---|---|
| Tmux-Orchestrator | Jedward23 | MIT | GitHub |
| agency-agents-zh | jnMetaCode (translation), Michael Sitarzewski (original) | MIT | GitHub |
| Composio agent-orchestrator | Composio, Inc. | MIT | GitHub |
| Overstory | Jaymin West | MIT | GitHub |
How this book uses source material:
- Architecture descriptions and design patterns: Summarized and reinterpreted from project documentation and source code. These are ideas, not copyrighted expression.
- Code examples in chapters: Unless explicitly marked as "adapted from [project]", all code blocks are illustrative pseudocode written for this book to demonstrate concepts. They are not verbatim copies of project source code.
- Verbatim quotes: When exact text is quoted from a project's README, CLAUDE.md, or documentation, it is clearly attributed in the surrounding text.
This book itself is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).