General
What is SET?
SET is an orchestration framework that transforms a product specification into fully implemented, tested, and merged code — autonomously.
You write a detailed markdown spec (data model, pages, design tokens, auth flows, seed data). SET decomposes it into independent changes, dispatches parallel Claude Code agents into isolated git worktrees, runs deterministic quality gates, and merges the results.
Who is SET for?
Development teams and technical leaders who already use Claude Code and want to scale beyond single-agent, single-task workflows. If you're an architect managing a backlog of well-specified features, or a CTO evaluating how AI agents can own the full implementation-to-merge cycle, SET is the layer that makes that possible.
It assumes you can write a good spec — it handles everything after that.
Is this production-ready?
SET was built with itself over 79 days. These are the numbers:
The MiniShop benchmark delivers 6/6 changes merged, zero human intervention, in 1h 45m. CraftBrew (15 changes, 150+ files, 28 DB models) completed fully autonomously in ~6h.
How Is SET Different From...
...just using Claude Code?
Claude Code in 2026 is remarkably capable: native worktrees, Agent Teams (experimental, 3-5 agents), 26 hook events with 4 handler types, auto-memory, Plan mode, subagents with worktree isolation, Agent SDK, and MCP. SET is built on top of all these primitives.
| | Claude Code (alone) | SET |
|---|---|---|
| Scope | One task or one team session | Full spec → decomposed into N parallel changes |
| Planning | Plan mode (freeform, ephemeral, read-only) | OpenSpec: persistent versioned artifacts with traceability |
| Quality | Hooks can run checks (DIY wiring) | Structured gate pipeline: build → test → E2E → review |
| Merging | Manual git merge, no enforcement | Automated merge queue + conflict resolution + post-merge verification |
| Recovery | Session dies, restart manually | Sentinel detects crash in 30s, graduated escalation |
| Memory | Auto-memory (flat file at startup) | Semantic memory graph with topic recall at 4 lifecycle points |
| State | Session-scoped, lost on restart | Atomic JSON, resumable across restarts |
| Coordination | Agent Teams: one session, shared task list | Cross-session, cross-machine orchestration with sentinel |
Claude Code gives you excellent building blocks. SET gives you the assembled machine. You could build this yourself with the Agent SDK and hooks — SET is the battle-tested implementation.
What SET doesn't have: Claude Code's 101+ plugin marketplace, Agent SDK for custom development, deep IDE integration (VS Code, JetBrains).
...Claude Code Agent Teams?
Agent Teams = parallelism within one session. A lead assigns subtasks to teammates (3-5 recommended); they share context via task list + mailbox. Important: teammates share the working directory by default — two teammates editing the same file leads to overwrites. Worktree isolation available via subagent config but not automatic. Still experimental: no session resumption, task status lag, one team per session, lead is fixed.
SET = parallelism across sessions, machines, and time. A planner decomposes a full spec into a dependency DAG, dispatches each to its own long-running agent, manages quality gates and the merge pipeline. Good for shipping an entire product.
They're complementary. SET can use Agent Teams inside each worktree while managing cross-change orchestration externally.
What Agent Teams does better: Zero-setup parallelism. No framework installation needed — one environment variable starts a team. For a quick parallel task within a single feature, Agent Teams is faster to reach for.
...Cursor's parallel agents?
Cursor 3 (April 2026) has two parallelism modes: local worktree agents (up to 8, via git worktree add) and cloud Background Agents (no cap, credit-bound, each in an isolated AWS Ubuntu VM). You can close your laptop with cloud agents.
What Cursor lacks vs SET:
- No spec decomposition — agents launched from ad-hoc prompts, no dependency ordering
- ~30% PR merge rate — Cursor's own published stat: ~30% of generated PRs pass CI and merge without intervention. SET MiniShop: 100%.
- No inter-agent coordination — multiple agents have no awareness of each other
- No supervision — no sentinel, no crash recovery, no stall detection
What Cursor does better: Cloud execution (agents work while you sleep), local worktree agents with zero framework setup, polished IDE, multi-model support, CI auto-fix cookbook. Cost caveat: cloud agents ~$5-15 per PR, users report $2000+ in two days with heavy use.
...Devin?
Devin is an autonomous AI engineer — takes a task, works in a sandboxed VM, creates PRs. Can run multiple concurrent sessions (each in its own VM), but sessions are independent with no coordination between them.
| | Devin | SET |
|---|---|---|
| Execution | Cloud VM sandbox | Local worktrees |
| Parallelism | Independent sessions (no coordination) | Coordinated parallel via orchestrator + merge queue |
| Testing | Runs tests if they exist (ad-hoc) | Structured gate pipeline (build → test → E2E → review) |
| Integrations | Excellent Slack/Jira/GitHub | CLI + web dashboard + MCP |
| Merge | Opens PR, relies on CI | Integration gates enforced before merge |
What Devin does better: Slack integration is best-in-class — assign a task from Slack, get a PR back. Cloud VM means zero local setup. The UI for watching agent work is polished. For simple, independent tasks (migrations, CRUD, test writing), Devin is smoother than setting up SET orchestration.
What SET does better: Multi-change coordination, spec traceability, pre-merge quality gates, sentinel supervision, persistent memory, and deterministic merge ordering.
...Kiro (Amazon)?
Kiro (GA Nov 2025) is the closest philosophical match: a spec-driven IDE with formal EARS requirements, design docs, and task lists. Built on VS Code, powered by Bedrock. Supports Claude, DeepSeek, Qwen, MiniMax via auto-router.
Kiro's genuine innovations:
- EARS spec notation — formal SHALL statements with Requirements-First or Design-First entry points. Specs stay synced with code.
- Property-Based Testing — extracts testable properties from specs, generates hundreds of random inputs, shrinks to find minimal failing cases. Auto-fixes. Genuinely novel.
- 10 hook trigger types — File Create/Save/Delete, Prompt Submit, Agent Stop, Pre/Post Tool Use, Pre/Post Task Execution, Manual.
- Autonomous Agent (preview) — background agent with 3 sub-agents (planner, writer, verifier). Up to 10 concurrent tasks across repos. Opens PRs, never merges. Learns from code review.
- Multi-model — Claude, DeepSeek, Qwen, auto-router. Pricing: Free→$20→$40→$200/mo.
The differences:
- Kiro's Autonomous Agent handles 10 tasks but opens PRs without merging. SET manages the full merge pipeline with gates.
- Kiro has PBT (random test generation from spec properties). SET has deterministic gates (exit codes).
- Kiro is multi-model. SET is Claude-only.
What Kiro does better: PBT is genuinely novel, Autonomous Agent handles 10 concurrent tasks, multi-model support, lower barrier to entry, 10 hook trigger types. What SET does better: Spec decomposition into DAGs, coordinated merge with gates, sentinel supervision, semantic memory, design integration.
...Augment Intent?
Augment Intent (public beta, Feb 2026, macOS only) is architecturally the most similar tool to SET:
- Living Specifications — self-maintaining spec docs that auto-update as agents work. Changes propagate to active agents.
- Coordinator/Specialist/Verifier — 6 specialist personas (Investigate, Implement, Verify, Critique, Debug, Code Review).
- Git worktree isolation — each task creates a "Space" with its own branch and worktree.
- Multi-model — runs Claude Code, Codex, OpenCode. Mix models per task.
- No agent cap — "Run as many agents as the task needs."
What Augment does better: Multi-model mixing (Opus for planning, Sonnet for coding), living specs that auto-update, specialist agent personas, polished desktop UX.
What SET does better: Deterministic gates (exit codes, not agent judgment), proven production track record (200+ runs), Linux support, web dashboard, design integration, persistent semantic memory, full merge pipeline. Augment is macOS-only beta; SET is battle-tested in production.
...Roo Code, Aider, Cline, Windsurf?
All excellent single-agent tools — each with genuine strengths SET lacks:
- Roo Code — Configurable modes (Architect/Code/Debug/custom), sequential delegation ("Boomerang" pattern). Model-agnostic. Better at: easy custom mode creation, any-LLM support, open source community.
- Aider — CLI pair programmer. Better at: best-in-class git integration (auto-commit with meaningful messages, full undo), any-model support, repo map (tree-sitter), cost-efficient token usage, edit format innovation.
- Cline — VS Code extension. Better at: best-in-class MCP marketplace, full transparency (every tool call visible), any-model support, granular approval workflow.
- Windsurf — AI IDE (acquired by OpenAI ~$3B). Cascade engine had strong within-session context tracking. Current status post-acquisition uncertain.
SET is not a better version of these tools — it's a different category. These are the developer's hands. SET is the sprint board, CI pipeline, and release manager.
...Copilot Coding Agent, OpenHands, Composio?
- GitHub Copilot Coding Agent — Assign a GitHub Issue, Copilot creates a branch, codes, runs CI, self-reviews, opens a PR. Cloud-hosted. Better at: zero-setup for GitHub users, largest distribution, GitHub-native workflow. Lacks: no multi-agent coordination (each agent independent), no spec decomposition, no pre-merge gate pipeline.
- OpenHands — Strongest open-source single-agent runtime. Docker-sandboxed, multi-model, strong SWE-bench results (50%+). Provides agent execution, not orchestration workflow. No parallel coordination, no gates, no merge pipeline.
- Composio — A tool-integration platform (250+ API integrations for agents), not an agent orchestrator. It provides middleware for CrewAI, LangGraph, etc. to call external tools. Different category from SET.
- GPT-Engineer / Lovable — App builders for non-developers. Prompt to MVP. Different category entirely.
Capability matrix: SET vs. the landscape
| Tool | Parallel Agents | Isolation | Specs | Gates | Merge Pipeline | Supervisor | Cloud | Any LLM |
|---|---|---|---|---|---|---|---|---|
| SET | ✓ | Worktrees | OpenSpec | 9 gates | ✓ | Sentinel | — | Claude |
| Augment Intent | Coordinator | Spaces | Living specs | Verifier agent | N/A | — | — | ✓ |
| Claude Code | Experimental | Subagents | — | Hooks (DIY) | — | — | — | Claude |
| Cursor | 8 local + cloud | WT + VMs | — | ~30% merge | — | — | ✓ | Multi |
| Devin | Independent | Sandbox VM | — | Ad-hoc tests | — | — | ✓ | Proprietary |
| Kiro | 10 tasks (preview) | — | EARS + PBT | PBT + hooks | Opens PRs | — | — | Auto-router |
| Copilot Agent | Independent | Cloud VM | — | CI + self-review | — | — | ✓ | GPT/Claude |
| Roo Code | Modes | — | — | — | — | — | — | ✓ |
| Aider | — | — | — | — | — | — | — | ✓ |
| Cline | — | — | — | — | — | — | — | ✓ |
| OpenHands | — | Docker | — | — | — | — | ✓ | ✓ |
SET's unique position: the combination of structured specs + deterministic gates + merge pipeline + sentinel. Other tools excel where SET doesn't: cloud execution (Cursor, Devin, Copilot), model flexibility (Aider, Kiro, Cline), living specs (Augment Intent), PBT (Kiro), IDE integration (Kiro, Cursor).
OpenSpec
What is OpenSpec and why not just use a prompt?
OpenSpec is a structured, artifact-driven methodology. Instead of a conversation, work is expressed as a sequence of structured documents that serve as contracts between planner, implementer, and verifier:
- Proposal — Why we're doing this (problem, impact)
- Specs — What exactly must be built (WHEN/THEN acceptance criteria)
- Design — How we'll build it (decisions, tradeoffs)
- Tasks — Implementation checklist ([REQ: requirement-name] traceability)
Why not just a prompt?
- Prompts drift. Agents interpret, improvise, skip. Specs have explicit IN SCOPE / OUT OF SCOPE.
- Prompts can't be verified. How do you check "build a webshop"? OpenSpec checks every requirement against tasks against code.
- Prompts don't compose. 5 parallel agents need divided scope. Delta specs assign specific requirements to specific changes.
- Prompts leave no record. OpenSpec archives the full decision chain for future reference.
How is this different from Claude Code's Plan mode?
Plan mode is a thinking step. OpenSpec is a workflow system.
| | Plan Mode | OpenSpec |
|---|---|---|
| Output | Freeform text | Structured artifacts (proposal, specs, design, tasks) |
| Persistence | Disappears after session | Committed to repo, archived after completion |
| Traceability | None | Every task traces to a requirement |
| Verification | None | Automated: completeness, correctness, coherence |
| Scope | Trust | Explicit IN SCOPE / OUT OF SCOPE |
| Multi-agent | Not designed for it | Delta specs assign scoped work to each agent |
Plan mode helps a single agent think. OpenSpec gives a system of agents structured contracts to work against and verify.
What are delta specs?
When a change is created (e.g., add-user-auth), its spec files are delta specifications — the incremental requirements this change introduces, using ADDED / MODIFIED / REMOVED markers.
After merge, delta specs sync into main specs — the single source of truth. This means:
- Each change only describes what it changes, not the entire system
- Multiple changes can touch the same capability without conflicting
- Main specs evolve incrementally as changes merge
- Full history preserved in archived changes
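For illustration, a delta spec for a hypothetical add-user-auth change might look like the following (capability names, requirement text, and the exact header conventions are illustrative, not taken from a real SET run):

```markdown
## ADDED Requirements

### Requirement: user-login
Users SHALL be able to authenticate with email and password.

#### Scenario: successful login
- WHEN a registered user submits valid credentials
- THEN a session is created and the user is redirected to the dashboard

## MODIFIED Requirements

### Requirement: page-header
The header SHALL show a login or logout control depending on session state.
```

After merge, only these deltas fold into the corresponding main specs; the archived change keeps the full document.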
What does the artifact workflow look like?
Each artifact depends on the previous. The schema enforces ordering — you can't create tasks before design, because design decisions inform task structure.
Fast-track: /opsx:ff generates all artifacts in one pass.
Orchestration
How does parallel execution actually work?
- Decompose — Planner reads your spec, creates a dependency DAG of independent changes
- Dispatch — For each ready change: create worktree, generate context, bootstrap env, start Ralph Loop
- Monitor — Every 15 seconds: check progress, detect stalls, track budgets
- Verify — Agent reports "done" → run gate pipeline (build → test → E2E → review)
- Merge — Sequential merge queue with conflict resolution and post-merge verification
- Sync — After each merge, all running worktrees pull main immediately
- Replan — After all changes merge, check for uncovered requirements, generate new changes
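The cycle above can be sketched as a ready-set loop over the dependency DAG. This is a simplified illustration, not SET's actual dispatcher: all function names are hypothetical, and the real system runs agents asynchronously with budgets, stall detection, and a sentinel around this core.

```python
def orchestrate(changes, deps, run_agent, run_gates, merge):
    """Dispatch changes whose prerequisites have merged; merge once gates pass.

    changes: list of change names
    deps:    dict mapping each change to its set of prerequisite changes
    """
    merged, running = set(), set()
    while len(merged) < len(changes):
        # Dispatch: every change whose prerequisites are all merged is ready.
        for change in changes:
            if change not in merged | running and deps[change] <= merged:
                run_agent(change)              # agent works in its own worktree
                running.add(change)
        # Verify + merge: sequential merge queue in deterministic order.
        for change in sorted(running):
            if run_gates(change):              # build -> test -> E2E -> review
                merge(change)
                merged.add(change)
        running -= merged
    return merged
```

With stub callbacks, dependency order is respected: a schema change merges before the auth and cart changes that depend on it.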
Why git worktrees?
True filesystem isolation without the overhead of cloning:
- Each agent has its own working directory — no file conflicts during parallel development
- Each agent has its own branch — clean, independent git history
- Worktrees share the same .git directory — no disk waste from full clones
- Independent dep installs, test runs, and builds — no interference
This is fundamentally different from agents "coordinating" via messages in a shared workspace — that approach breaks down when agents edit the same files simultaneously.
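The isolation model is plain git. A self-contained demo of the per-change worktree lifecycle (directory and branch names here are illustrative, not SET's actual naming scheme):

```shell
# Fresh demo repo; in SET this would be your existing project checkout.
git init -q demo-repo && cd demo-repo
git -c user.email=set@example.com -c user.name=set commit -q --allow-empty -m "init"

# One worktree + branch per change; all share the same .git object store.
git worktree add ../wt-add-user-auth -b change/add-user-auth
git worktree add ../wt-add-cart -b change/add-cart
git worktree list    # main checkout plus two isolated agent directories

# After a change merges, its worktree and branch are cleaned up.
git worktree remove ../wt-add-user-auth
git branch -d change/add-user-auth
```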
What happens when agents conflict?
Multi-layer conflict resolution:
- Preventive — Dependency DAG orders cross-cutting changes sequentially. Profile-defined cross-cutting files are serialized.
- Generated files — Lockfiles and build artifacts are auto-resolved, then regenerated (pnpm install).
- Real conflicts — Source code conflicts cause a merge-blocked status. Sentinel investigates, redispatches, or escalates.
- Post-merge sync — All running worktrees pull main immediately after every merge.
In practice: CraftBrew (15 changes) had 4 conflicts — all auto-resolved. MiniShop (6 changes): zero conflicts.
What is the sentinel?
An AI supervisor that watches orchestration and handles what goes wrong. Separate agent from the orchestrator — supervisor/subordinate pattern.
| Event | Sentinel Action |
|---|---|
| Agent crash | Diagnose from logs, restart or escalate |
| Agent stall (>120s) | Investigate cause, attempt recovery |
| Periodic checkpoint | Auto-approve (routine) or escalate (unusual) |
| Orchestration complete | Generate summary report |
| Budget overrun | Pause agent, escalate |
Cost: typically 5-10 LLM calls per entire run. Saves hours of wasted compute by catching crashes that would otherwise silently waste an overnight run.
Quality & Verification
What are integration gates?
Deterministic quality checks before merging. Exit codes, not LLM judgment.
| Gate | What | How |
|---|---|---|
| build | Types check, code compiles | tsc --noEmit, next build |
| test | Unit/integration tests | vitest run, pytest |
| e2e | Browser tests | playwright test |
| scope_check | Files match scope | Changed files validated against declared scope |
| test_files | Tests present | Test files exist for implemented code |
| review | Code quality, security | Claude review — no CRITICAL findings |
| rules | Custom compliance | Profile-defined rules (naming, patterns) |
| spec_verify | Requirements addressed | All REQ-IDs have tasks |
| smoke | Post-merge sanity | Custom command (runs after merge) |
If a gate fails, the agent receives the error and retries. Self-healing. No human needed.
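The retry loop reduces to running commands and trusting exit codes. A minimal sketch, where the gate names mirror the table but the commands, retry policy, and callback are illustrative rather than SET's internals:

```python
import subprocess

GATES = [
    ("build", ["tsc", "--noEmit"]),      # illustrative commands per the table
    ("test",  ["vitest", "run"]),
    ("e2e",   ["playwright", "test"]),
]

def run_gates(gates, max_retries=3, on_fail=None):
    """Run each gate; on failure, hand the raw error to the agent and retry."""
    for name, cmd in gates:
        for _attempt in range(max_retries):
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode == 0:         # the exit code decides, not judgment
                break
            if on_fail:
                on_fail(name, proc.stderr)   # the agent sees the error and fixes
        else:
            return False                     # gate still failing after retries
    return True
```

The for/else makes the policy explicit: a gate that never reaches exit code 0 within its retry budget blocks the merge.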
Why not just trust the LLM's judgment?
Because LLMs hallucinate confidence. "Looks good to me" from a code review is not the same as vitest run returning exit code 0.
MiniShop's 5 gate retries — all self-healed from real bugs an LLM review would have missed:
- Missing test file → test gate caught it
- Jest config import error → build gate caught it
- Playwright auth test failures ×3 → agent fixed to match actual behavior
- Post-merge type mismatch → agent synced main
- Cart test race condition → agent added waitForSelector
An LLM review would have said "looks good" for at least 3 of these.
How do you measure output quality across runs?
Structural convergence. Run the same spec twice independently and measure similarity: MiniShop scores 83/100 structural convergence across independent runs (measured by set-compare). Remaining divergence is stylistic (variable naming, CSS order), not structural. The spec + template system produces deterministic architecture even with non-deterministic LLMs.
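One way to approximate such a score is the Jaccard similarity of the two runs' file trees. This is an illustrative metric only; set-compare's actual scoring is its own.

```python
def structural_convergence(files_a, files_b):
    """Jaccard similarity of two runs' file sets, scaled to a 0-100 score."""
    a, b = set(files_a), set(files_b)
    if not a and not b:
        return 100.0                     # two empty runs are trivially identical
    return 100.0 * len(a & b) / len(a | b)
```

Identical trees score 100; two runs sharing half the union of their paths score 50.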
Memory & Learning
How does persistent memory work?
Hook-driven memory (shodh-memory) captures and injects context automatically. Agents don't need to save explicitly.
| Hook | When | What |
|---|---|---|
| Warmstart | Session start | Loads relevant memories as context |
| Pre-tool | Before each tool call | Injects topic-based recall |
| Post-tool | After Read/Bash | Surfaces past experience |
| Save | Session end | Extracts new insights from conversation |
Key finding: zero voluntary saves across 15+ sessions. Agents don't save on their own — the hook infrastructure is essential.
Why does memory matter for orchestration?
Without memory, every agent rediscovers conventions, repeats mistakes, wastes tokens.
Learnings from failed runs convert to rules, enforced in the next run. The system improves with every orchestration.
Architecture & Extensibility
What is the plugin system?
Three layers separate concerns:
- Layer 1 — Core (lib/set_orch/): Abstract orchestration. Dispatcher, monitor, merger, gates. No project-specific logic.
- Layer 2 — Modules (modules/): Project-type knowledge. modules/web/ knows Next.js, Playwright, Prisma.
- Layer 3 — External: Your own plugins via pip install + entry_points. A set-project-fintech plugin could add IDOR scanning, PCI compliance.
Each module implements the ProjectType ABC: test detection, forbidden patterns, verification rules, custom gates, merge strategies, planning rules. New project types don't touch core — they extend it.
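In sketch form, assuming the ABC looks roughly like this (the real ProjectType interface in lib/set_orch/ will differ in names and detail):

```python
from abc import ABC, abstractmethod

class ProjectType(ABC):
    """Core-side contract that Layer 2 modules and Layer 3 plugins implement."""

    @abstractmethod
    def test_command(self) -> list[str]:
        """Command the test gate runs for this project type."""

    @abstractmethod
    def forbidden_patterns(self) -> list[str]:
        """Patterns the review/rules gates flag."""

    def custom_gates(self) -> list[str]:
        return []                        # optional gates beyond the core pipeline

class WebProject(ProjectType):
    """What a module like modules/web might register via entry_points."""

    def test_command(self):
        return ["vitest", "run"]

    def forbidden_patterns(self):
        return ["console.log(", "debugger"]

    def custom_gates(self):
        return ["e2e"]                   # web projects add a browser-test gate
```

Core only ever talks to the ABC, so a fintech plugin could subclass it and add an IDOR-scanning gate without touching lib/set_orch/.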
Can I use this without Claude Code?
No. SET is built specifically for Claude Code: worktrees, hooks, MCP, skills, subagents. This is by design — SET doesn't abstract over LLMs. It leverages Claude's strengths fully: 200K+ context, native tool use, code understanding.
Depth beats breadth. Abstracting to a lowest-common-denominator API would sacrifice these capabilities for theoretical portability.
Can this run on-premise?
The infrastructure is designed for it. SET is self-hosted — no SaaS dependency. The orchestration engine, gates, and state management have no cloud dependency. Only the LLM endpoint needs configuration.
When on-premise Claude models become available for regulated industries (banks, defense, government), SET's architecture works unchanged.
How does design system integration work?
- Export from design tool → design-system.md (tokens) + design-brief.md (visual specs)
- Dispatcher scope-matches relevant pages to each change → per-change design.md
- Agent receives exact hex colors, font names, component layouts
- Review gate checks design compliance — token mismatches flagged
Eliminates the "shadcn defaults everywhere" problem. Agents implement your brand, not a generic component library.
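A token export can be small. An illustrative design-system.md, with values invented for the example:

```markdown
## Colors
- brand-primary: #1B4D3E
- brand-accent: #E8A87C
- surface: #FAF7F2

## Typography
- heading: "Fraunces", serif
- body: "Inter", sans-serif

## Spacing
- scale: 4 / 8 / 12 / 16 / 24 / 32 px
```

The review gate can then flag any hardcoded color that is not one of these hex values.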
What SET Doesn't Do (Yet)
Honest gaps where competitors are ahead
| Gap | Who does it better | Notes |
|---|---|---|
| Cloud execution | Cursor BGA, Devin, Copilot | SET requires a running local machine. Cloud agents work while you sleep. |
| Model flexibility | Aider, Roo Code, Cline | SET is Claude-only. No GPT, Gemini, or local model support. |
| IDE integration | Kiro, Cursor, Windsurf | SET is CLI + web dashboard. No VS Code/JetBrains plugin. |
| Zero-setup | Copilot, Cline, Cursor | SET requires pip install, project init, config. Others are install-and-go. |
| Issue tracker → PR | Copilot Coding Agent | SET works from specs, not from Jira/Linear/GitHub Issues. |
| Slack trigger | Devin | Can't trigger SET from Slack. |
| File-event hooks | Kiro | SET hooks are at orchestration level, not IDE file-save events. |
| MCP marketplace | Cline | SET has a custom MCP server, not a marketplace for third-party tools. |
| Quick prototyping | Lovable, Cursor, Claude Code | SET's spec-driven workflow adds upfront overhead. For a quick prototype, Claude Code alone is faster. |
| Spec writing | — | The spec is a bottleneck: orchestration quality is bounded by spec quality. Writing a good spec takes effort. |
These are conscious trade-offs, not oversights. SET optimizes for orchestration depth over integration breadth. The overhead cost is real — SET is not for quick prototypes. It's for when you already know what to build and want deterministic, reproducible implementation.
Practical
What does a spec need to contain?
Your spec is the single most important input. Required:
- Project overview — What, who, tech stack
- Data model — Entities, fields, relationships
- Page layouts — Sections, columns, components
- Component behavior — Click, hover, state changes
- Auth & roles — Permissions, protected routes
- Seed data — Realistic initial data
- Design tokens — Brand colors (hex), fonts, spacing
- E2E test expectations — Critical flows
Each requirement needs a REQ-ID (REQ-AUTH-01) and at least one WHEN/THEN scenario.
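A minimal requirement in that shape (ID and wording illustrative):

```markdown
### REQ-AUTH-01: Email/password login

- WHEN a registered user submits valid credentials on the login page
- THEN a session is created and the user is redirected to the dashboard

- WHEN a user submits an invalid password
- THEN an error message is shown and no session is created
```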
How long does an orchestration take?
| Project | Changes | Wall time | Tokens | Interventions |
|---|---|---|---|---|
| Micro-Web (simple) | 3-4 | ~45m | ~1M | 0 |
| MiniShop (e-commerce) | 6 | 1h 45m | 2.7M | 0 |
| CraftBrew (complex) | 15 | ~6h | ~11M | 0 |
Token scaling is super-linear (4x tokens for 2.5x changes) because later changes require more context from merged code.
What are the self-healing capabilities?
Gate-level:
- Test failure → agent reads error, fixes, reruns
- Build error → agent reads type error, fixes it
- E2E failure → agent sees Playwright trace, updates selectors
- Type mismatch → agent syncs main, resolves
Sentinel-level:
- Agent crash → detected in 30s, auto-restart
- Agent stall → watchdog escalates: warn → restart → redispatch → fail
- Orphaned worktree → cleaned up on restart
How do I get started?
```bash
# Install SET
pip install -e .
pip install -e modules/web

# Initialize a project
set-project init --name my-app --project-type web --template nextjs

# Write your spec (docs/spec.md)

# Start orchestration
curl -X POST http://localhost:7400/api/my-app/sentinel/start \
  -H 'Content-Type: application/json' \
  -d '{"spec":"docs/spec.md"}'
```
Or step-by-step:
- /opsx:explore → Think through the problem
- /opsx:ff change → Generate all artifacts
- /opsx:apply → Implement
- /opsx:verify → Check
- /opsx:archive → Done
The Big Picture
What problem does SET actually solve?
The gap between "AI can write code" and "AI can ship software."
Writing code is 20% of the work. The other 80%: decomposing requirements, coordinating parallel work, handling conflicts, running quality checks, managing merge order, recovering from failures, learning from mistakes.
SET automates the 80%.
Why specs instead of prompts?
"Build a webshop" produces a different webshop every time.
"Build a webshop with these 28 data models, these 12 pages, these design tokens, these auth rules, these seed data records, and these E2E scenarios" produces the same webshop every time.
The spec is the determinism layer. MiniShop: 83/100 structural convergence score across independent runs (measured by set-compare). Without specs, convergence approaches 0%.
How is this different from just running CI/CD?
CI/CD validates code after someone creates a PR. SET manages the entire pipeline before the PR exists:
| CI/CD | SET |
|---|---|
| PR created → tests → review → merge | Spec → decompose → dispatch → gates → merge → replan |
CI/CD assumes someone creates the PR. SET creates the PRs, validates them, merges them, and identifies what's still missing.
Why not abstract over multiple LLMs?
Because depth beats breadth. SET leverages Claude-specific capabilities: 200K+ context, native tool use, worktree support, hooks, MCP. Abstracting to a lowest-common-denominator API would sacrifice these for theoretical portability.
SET bets on Claude getting better — and compounds that bet.
What's the competitive moat?
The combination. No other tool provides all six:
- Structured Specs — Traceable requirements with WHEN/THEN scenarios, not prompts
- Parallel Agents — Isolated worktrees, dedicated agents, across machines
- Quality Gates — Deterministic: exit codes, not vibes. Build, test, E2E, review.
- Merge Pipeline — Automated conflict resolution, post-merge verification
- Sentinel Supervision — Crash recovery in 30s, stall detection, budget tracking
- Persistent Memory — Cross-session learning, convention compliance, continuous improvement
Most tools have 1-2 of these. Closest competitors have 2-3. The value is in the integration — the six capabilities reinforce each other. Structured specs enable meaningful gates. Gates enable autonomous merging. Memory enables learning. The sentinel enables unattended operation.