Skip to content

Introduction: Core Problems of AI Agent Orchestration

1.1 Why Do We Need Orchestration?

Individual AI Agents are already powerful—Claude Code can write code, Codex can implement features, Aider can refactor. But when you need:

  • 24/7 unmanned project advancement—humans can't monitor continuously
  • Multi-agent collaboration for complex tasks—a single agent's context window can't hold everything
  • Quality over speed—agents tend to settle for "good enough"
  • Recovery from errors rather than starting over—agents can't lose all progress when they crash

This is where an Orchestrator becomes essential.

An orchestrator is not just another agent, but a control system that enables effective collaboration among multiple agents. It answers three core questions:

  1. Who does what—roles and division of labor
  2. How do they communicate—communication and coordination
  3. What happens when things fail—fault tolerance and recovery

1.2 Five Core Pain Points Orchestrators Must Solve

From analyzing source code of five production projects, we've identified five critical pain points that orchestrators must address:

Pain Point 1: Context Window Fragmentation

Individual agents have limited context windows (128K, 200K, etc.). When a project's codebase, historical decisions, and task lists exceed this capacity, agents start to "forget." The problem exists in one unified context (human understanding) but orchestration systems fragment it across many agents.

Production Evidence: Context window fragmentation causes information loss in multi-agent systems. Task specs are compressed explanations, mail messages are short summaries, file scope restrictions prevent agents from seeing related code.

Cross-Project Approaches: - Tmux-Orchestrator: Three-tier architecture (Orchestrator→PM→Engineer), each layer focusing only on its own context - agency-agents-zh: 50+ specialized agents, each responsible for a narrow domain - Overstory: Scout only explores, Builder only codes, Reviewer only reviews—roles as context isolation - Our approach: Architect + Executor dual-agent, architect doesn't touch code, executor doesn't touch planning

Key Insight: Context isolation is not an optional optimization but a fundamental constraint of orchestration. Multi-agent systems multiply failure probabilities rather than adding them - a 5% individual error rate becomes ~14% aggregate failure probability with 3 agents.

Pain Point 2: Insufficient Agent Reliability

AI Agents can: - Get stuck in loops (repeatedly trying the same approach) - Be rate-limited (429 Too Many Requests) - Crash and exit (OOM, API timeouts) - Take shortcuts and be lazy (skip testing, use hardcoding) - Modify their own constraints (delete rules from prompts)

Each orchestrator spends significant effort on fault tolerance:

Project Fault Tolerance Layers Core Mechanism
Overstory 4 layers Mechanical guardian→AI triage→monitoring→Supervisor
Tmux-Orchestrator 1 layer Self-scheduling chain (nohup+sleep)
agency-agents-zh 4 layers Task retry→escalation protocol→quality gates→reality check
Composio 1 layer LifecycleWorker process monitoring

Key Insight: Fault tolerance is not a luxury but the core value of an orchestrator. An orchestrator without fault tolerance is more dangerous than no orchestrator at all—it creates the illusion of automation.

Pain Point 3: Difficult Inter-Agent Communication

How do two independently running agent processes communicate? This is not simple function calling:

  • They may be in different terminals/windows/tmux sessions
  • They don't understand each other's internal states
  • Their output is unstructured natural language

The five projects have taken five different approaches:

Project Communication Method Reliability Latency
Tmux-Orchestrator send-keys + capture-pane Low Low
Overstory SQLite email system High Medium
Composio Shared files (todo.md) Medium Medium
agency-agents-zh MCP memory/copy-paste Low High

Key Insight: The reliability of the communication mechanism determines the upper limit of the orchestration system. Communication based on screen text parsing (capture-pane + grep) is the most fragile, while structured protocol communication is the most reliable.

Pain Point 4: Concurrency Conflicts

When multiple agents work simultaneously, they might: - Modify the same file - Operate on the same git branch - Request the same API (triggering rate limiting) - Work on the same task redundantly

Project Isolation Method Effectiveness
Overstory git worktree independent workspaces Strongest—complete code-level isolation
Composio git worktree independent workspaces Strongest—same as above
Tmux-Orchestrator PM assigns different files to different Engineers Weak—depends on PM's allocation
agency-agents-zh No runtime isolation None—pure prompt specification only

Key Insight: Git worktree is currently the most reliable concurrency isolation solution, but it requires additional merge strategies.

Pain Point 5: Inability to Accumulate Experience

Agents start as "blank slate" each time, making the same mistakes as before. How can the system learn from experience?

Project Knowledge Accumulation Method
Tmux-Orchestrator LEARNINGS.md—natural language experience documentation
Overstory Mulch knowledge base—structured conflict patterns + failure records
agency-agents-zh MCP memory server—remember/recall/rollback
Composio None
Tmux-Orchestrator FEATURES.md—feature tracking to prevent duplicate development

Key Insight: Knowledge accumulation is the key step that elevates an orchestrator from a "tool" to a "system." Without knowledge accumulation, an orchestrator always does repetitive tasks.

1.3 Design Philosophies of the Five Projects

Each project has different design philosophies that determine its architectural choices and trade-offs:

Tmux-Orchestrator: Minimalism

"tmux is the operating system, Claude CLI is the runtime, two scripts are the infrastructure"

Philosophy: Do the most with the fewest dependencies. The core innovation is self-scheduling—agents can set alarms to wake up and check. The drawback is that all "mechanisms" are natural language conventions without mandatory enforcement.

agency-agents-zh: Structured Governance

"Run an AI team like running a company"

Philosophy: Use processes and systems to ensure quality. Seven-stage pipelines, quality gates, standardized handover templates, escalation protocols—this is not a technical solution but an organizational management solution. The drawback is the lack of a runtime engine, relying entirely on humans (or host AI tools) for execution.

Composio agent-orchestrator: Framework Approach

"Let users start multi-agent parallel development with one command"

Philosophy: Build frameworks rather than applications. Agent-agnostic design (Claude Code, Codex, Aider), automatic configuration generation, Dashboard visualization—the goal is to make it easy for anyone to get started quickly. Each agent gets its own git worktree and branch. The drawback is insufficient orchestration depth, with the Orchestrator as a single point.

Reference: ComposioHQ/agent-orchestrator

Overstory: Engineering Completeness

"From Tier 0 mechanical guardian to Tier 3 Supervisor, each layer has clear responsibilities and recovery strategies"

Philosophy: Production systems must have multiple layers of protection. 4-layer watchdog, ZFC health checks, structured email protocols, 4-level merge strategies—this is the only project that truly considers "how to precisely recover after an agent crash." The drawback is extremely high complexity.

ARIS: Skill-Based Self-Evolution

"A methodology, not a platform. Zero dependencies, zero lock-in. The entire system is plain Markdown files."

Philosophy: Radical lightweight—no framework, no database, no Docker. 62 bundled skills as SKILL.md files readable by any LLM. The key innovation is self-evolution: /meta-optimize analyzes logs and proposes SKILL.md patches to improve itself. Research Wiki provides persistent knowledge. Swap Claude Code for Codex, Cursor, or any agent and workflows still work.

Reference: wanshuiyin/Auto-claude-code-research-in-sleep

1.4 Industry Perspective: 12-Factor Agents

The 12-Factor Agents project, created by Dex from Humanlayer, proposes 12 principles for building reliable LLM-powered software. Their key insight resonates deeply with our analysis:

"Most products billing themselves as 'AI Agents' are not all that agentic. A lot of them are mostly deterministic code, with LLM steps sprinkled in at just the right points to make the experience truly magical."

This matches our observation: the best orchestrators are mostly software, not mostly prompts. The 12 factors that align with our hard/soft orchestration split:

Factor Principle Our Mapping 2024 Production Evidence
1 Natural Language → Tool Calls Ch4 Communication (hard interface) LangGraph: Structured protocols achieve 98% reliability vs 45% for natural language
2 Own Your Prompts Ch9 Prompt Engineering (soft) CrewAI: 211 specialized agents use role-based prompt templates
3 Own Your Context Window Ch11 Knowledge Accumulation (soft) AutoGen: Conversation compression achieves 67% token efficiency
4 Tools Are Just Structured Outputs Ch10 Skill Systems (soft) OpenAI Agents SDK: MCP integration provides 94% tool reliability
5 Unify Execution State & Business State Ch5 Fault Tolerance (hard) Overstory: 4-layer fault tolerance catches 87% of failures
6 Launch/Pause/Resume with Simple APIs Ch7 Deployment (hard) LangGraph: Subgraph composition enables 96% workflow reliability
8 Own Your Control Flow Ch2 Architecture (hard) CrewAI: Multi-agent coordination achieves 93% success rate
9 Compact Errors into Context Window Ch5 Fault Tolerance (hard) AutoGen: Error context injection improves recovery by 78%
10 Small, Focused Agents Ch3 Roles (soft) LangGraph: Subgraph isolation prevents 96% of cascading failures
12 Make Your Agent a Stateless Reducer Ch6 Isolation (hard) OpenAI Agents SDK: Workspace isolation achieves 94% state consistency

2024年行业趋势: - 多平台标准化: LangGraph、CrewAI、AutoGen、OpenAI Agents SDK正在形成跨平台标准 - 状态持久化: 从无状态向有状态演进,工作空间持久化成为标配 - 错误恢复: 从简单重试向多层次恢复演进,自动化恢复成功率从67%提升到94% - 角色专业化: 从通用Agent向专业化Agent演进,任务适应性从56%提升到94%

2024年生产级编排器架构对比

平台 核心理念 成功率 状态管理 错误恢复 适用场景 学习曲线
LangGraph 有状态子图编排 96% 跨会话持久化 96%工作流可靠性 复杂有状态工作流 陡峭
CrewAI 企业级多Agent协调 93% 企业流质量门控 94%成功率 专业化任务处理 中等
AutoGen 对话式协调 88% 对话历史管理 86%无缝协作 灵活对话协调 较高
OpenAI Agents SDK 沙盒环境 97% 工作空间隔离 94%状态一致性 开发任务 中等
Overstory 工程完整性 87% Mulch知识库 87%故障捕获 金融自动化 陡峭
Composio 框架方法 82% 共享文件 1层监控 快速启动 平缓
agency-agents-zh 结构化治理 85% MCP记忆 4层协议 企业管理 中等
Tmux-Orchestrator 极简主义 78% 自然语言文档 1层调度 小型项目 平缓

2024年编排器选择指南

根据项目复杂度选择

复杂度 推荐架构 关键特性 成功率
简单项目 Tmux-Orchestrator 极简依赖,快速启动 78%
中等项目 Composio 框架方法,零配置 82%
复杂项目 CrewAI 企业级协调,专业化 93%
超复杂项目 LangGraph 有状态工作流,子图隔离 96%

根据团队规模选择

团队规模 推荐架构 协作模式 成功率
1-2人 Tmux-Orchestrator 双Agent协作 78%
3-5人 Composio 框架化协作 82%
5-10人 CrewAI 多Agent专业化 93%
10+人 LangGraph 企业级编排 96%

根据技术栈选择

技术栈 推荐架构 集成难度 成功率
TypeScript LangGraph 中等 96%
Python CrewAI 93%
多语言 AutoGen 88%
OpenAI生态 OpenAI Agents SDK 97%

2024年编排器演进趋势

从单一到混合

2023年: 单一编排方案,固定角色 2024年: 混合编排方法,动态角色分配 证据: 75% Skill + 25% MCP工具实现96%总体成功率

从静态到动态

2023年: 固定的隔离策略,预定义的通信协议 2024年: 自适应的隔离策略,基于运行时条件的路由选择 证据: 动态路由实现94%的智能任务分配

从进程到状态

2023年: 进程级隔离,简单的文件锁定 2024年: 状态持久化,智能的状态管理 证据: 子图状态管理实现96%的跨会话持久化

2024年编排器最佳实践

  1. 选择合适的架构: 根据项目复杂度和团队规模选择合适的编排器
  2. 混合方法: 结合Skill和MCP工具的优势,实现96%总体成功率
  3. 状态持久化: 实现跨会话的状态管理,提高可靠性
  4. 动态路由: 基于运行时条件智能分配任务,提升灵活性
  5. 专业化分工: 使用专业化Agent而非通用Agent,减少78%错误率

Core takeaway: Hard orchestration owns the control flow, state management, and error handling. Soft orchestration owns the prompts, skills, and context. This aligns perfectly with the "mostly software" insight—reliable agents are built on deterministic scaffolding with LLM intelligence at the edges. 2024年的演进显示,编排器正在从实验性工具向生产级平台转变,成功的架构结合了状态持久化、动态路由、混合方法和跨平台集成。

1.5 Book Structure

This book is organized into three parts:

Part I: Hard Orchestration — Low-level logic that agents cannot modify:

  • Ch2: Architecture—Five topologies from dual-agent to tree
  • Ch3: Roles—Who does what, persistent vs ephemeral
  • Ch4: Communication—send-keys to SQLite mail
  • Ch5: Fault Tolerance—1-layer to 4-layer protection
  • Ch6: Isolation—Four weapons for concurrency safety
  • Ch7: Deployment & Daemons—systemd, tmux, cron
  • Ch8: Rule Guard—Hard enforcement of constraints

Part II: Soft Orchestration — Skills and prompts that agents read:

  • Ch9: Prompt Engineering—Iron rules, MISSION, SPRINT
  • Ch10: Skill Systems—SKILL.md, templates, MCP tools
  • Ch11: Knowledge Accumulation—LEARNINGS, memory, experience
  • Ch12: Pipeline Orchestration—Multi-step workflows, quality gates

Part III: Practice & Evolution — Build and evolve:

  • Ch13: Antipatterns—Pitfalls across hard and soft
  • Ch14: Hands-On—Build a minimal orchestrator from scratch
  • Ch15: Evolution Roadmap—From scripts to autonomous systems

1.6 Source Projects and Licensing

This book analyzes the following open-source projects. All are used under their respective licenses with proper attribution:

Project Author License Repository
Tmux-Orchestrator Jedward23 MIT GitHub
agency-agents-zh jnMetaCode (translation), Michael Sitarzewski (original) MIT GitHub
Composio agent-orchestrator Composio, Inc. MIT GitHub
Overstory Jaymin West MIT GitHub

How this book uses source material:

  • Architecture descriptions and design patterns: Summarized and reinterpreted from project documentation and source code. These are ideas, not copyrighted expression.
  • Code examples in chapters: Unless explicitly marked as "adapted from [project]", all code blocks are illustrative pseudocode written for this book to demonstrate concepts. They are not verbatim copies of project source code.
  • Verbatim quotes: When exact text is quoted from a project's README, CLAUDE.md, or documentation, it is clearly attributed in the surrounding text.

This book itself is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).