An experiment in structured AI development.
AI agents can write code. Shipping software — planning, coordinating, testing, merging — is a different problem. SET applies the practices that have always worked to Claude Code agents working in parallel. Specs instead of prompts. Gates instead of hope. Structure instead of improvisation.
It works well enough that I use it every day. It's rough around some edges. I'm curious what you'll make of it.
set-project init --project-type web
Linux (primary) · macOS (Apple Silicon) · Python 3.11+ · Claude Code
None of these ideas are new. Specs, planning, quality gates, supervision, learning from failures — development teams have always done this. The experiment is applying them to autonomous agents.
Output quality depends on input quality. 90% of agent failures trace back to underspecification.
Structured artifacts: proposal → design → spec → tasks → code. Acceptance criteria (WHEN/THEN), requirement IDs (REQ-xxx), end-to-end traceability. Agents implement against the spec, not their imagination.
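As a hedged sketch of the shape involved, a single requirement could be represented as data: the REQ-ID, its WHEN/THEN acceptance criterion, and the task IDs that trace back to it. All names here (`Requirement`, `REQ-101`, `T-7`) are illustrative, not SET's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    """One spec requirement an agent implements against (illustrative shape only)."""
    req_id: str                  # e.g. "REQ-101"
    when: str                    # trigger condition
    then: str                    # observable outcome
    task_ids: list[str] = field(default_factory=list)  # tasks tracing to this requirement

req = Requirement(
    req_id="REQ-101",
    when="the user submits the checkout form with an empty cart",
    then="the API responds 400 and no order row is created",
    task_ids=["T-7", "T-8"],
)

# End-to-end traceability: every task carries the REQ-ID it satisfies.
assert all(t.startswith("T-") for t in req.task_ids)
```

The point is that "implement against the spec" is mechanical when each piece of work names the requirement it satisfies.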
Figma Make → set-design-sync → per-change design.md with scope-matched tokens. Each agent gets only the colors, fonts, layouts for its pages. Not "make it look nice" — exact hex values, exact spacing. Part of the built-in web project type — the battle-tested default.
Full 30-page spec with Figma design. Or 3 requirements for your existing codebase. Or a single task description. The pipeline scales from a sentence to a specification. /set:write-spec generates structured specs interactively.
The principle hasn't changed: output quality depends on input quality. Writing a good spec takes effort. But agents working from a detailed spec produce dramatically better results than agents working from a conversation.
The better the spec, the better the result. That's the trade-off SET makes — upfront structure for reliable output.
One big task fails. Many small ones succeed. Max 6 requirements per change — above that, the failure rate spikes.
Spec → requirements → dependency graph → phased execution. Changes that don't depend on each other run in parallel. Changes that do, wait. The planner does this automatically — no manual sprint planning.
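The phase planning described above amounts to topological layering: every change whose dependencies are already done joins the next parallel phase, and dependents wait. A minimal sketch of the idea, not the planner's actual code:

```python
def plan_phases(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group changes into phases; changes within one phase can run in parallel."""
    done: set[str] = set()
    phases: list[list[str]] = []
    remaining = dict(deps)
    while remaining:
        # A change is ready once all of its dependencies are done.
        ready = sorted(c for c, d in remaining.items() if d <= done)
        if not ready:  # nothing is unblocked: there must be a dependency cycle
            raise ValueError(f"dependency cycle among {sorted(remaining)}")
        phases.append(ready)
        done.update(ready)
        for c in ready:
            del remaining[c]
    return phases

# auth has no deps; cart and profile need auth; checkout needs cart.
deps = {"auth": set(), "cart": {"auth"}, "profile": {"auth"}, "checkout": {"cart"}}
print(plan_phases(deps))  # → [['auth'], ['cart', 'profile'], ['checkout']]
```

Here `cart` and `profile` land in the same phase because neither depends on the other, which is exactly the "run in parallel" case.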
Multiple Claude agents in isolated git worktrees. Real branches, real merges. No containers, no VMs — just git. Even with a single change, the worktree provides isolation — your main branch stays clean until gates pass. With multiple changes, they run in parallel without interference.
S/M/L sizing per change. Token budgets (S: 2M, M: 5M, L: 10M). Model selection per complexity. These thresholds come from 100+ production runs — not guesswork.
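As a sketch, the sizing rule can be pictured as a map from requirement count to size class and token budget. The S/M/L budgets match the figures above; the count thresholds are assumptions for illustration, not SET's tuned values.

```python
# Budgets from the text above; thresholds below are illustrative assumptions.
BUDGETS = {"S": 2_000_000, "M": 5_000_000, "L": 10_000_000}

def pick_size(requirement_count: int) -> str:
    """Map a change's requirement count to a size class (hypothetical cutoffs)."""
    if requirement_count <= 2:
        return "S"
    if requirement_count <= 4:
        return "M"
    return "L"  # changes are capped at 6 requirements upstream

size = pick_size(5)
print(size, BUDGETS[size])  # → L 10000000
```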
Agents aren't handed a prompt and wished good luck. They get structured artifacts, project-type conventions, and iterative loops with progress tracking.
Iterative agent development cycle: proposal → design → spec → tasks → code. Not a single-shot prompt — multiple iterations with stall detection, done criteria, and context pruning between turns. Single-shot gets you 70% done. The loop gets you to merge.
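A minimal sketch of such a loop, with a stand-in `run_turn` instead of a real agent call: stop when done criteria are met, when progress flatlines for several turns (stall), or when the token budget runs out. Everything here is illustrative.

```python
def run_change(run_turn, is_done, budget_tokens: int, stall_limit: int = 3) -> str:
    """Drive an agent turn by turn until done, stalled, or over budget."""
    spent, no_progress, last_progress = 0, 0, -1
    while spent < budget_tokens:
        progress, tokens = run_turn()  # (tasks completed so far, tokens this turn)
        spent += tokens
        if is_done(progress):
            return "done"
        # Count consecutive turns that finished no new tasks.
        no_progress = no_progress + 1 if progress <= last_progress else 0
        last_progress = max(last_progress, progress)
        if no_progress >= stall_limit:
            return "stalled"
    return "over_budget"

# demo: three turns, each finishing one more task for 100 tokens
turns = iter([(1, 100), (2, 100), (3, 100)])
result = run_change(lambda: next(turns), lambda p: p >= 3, budget_tokens=10_000)
print(result)  # → done
```

The stall counter is the single-shot fix: a one-shot prompt that plateaus at 70% simply ends, while the loop notices the plateau and can escalate instead.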
The built-in web project type: Next.js, Playwright, Prisma templates. Agents work into existing structure, not from scratch. Convention enforcement, route groups, colocation rules. Build your own for any stack — fintech, healthcare, CLI, API.
Every agent gets: scoped proposal, task list with REQ-IDs, design.md with exact tokens, MCP tools for memory and team sync. Plus: token budget awareness, progress-based trend detection, and auto-pause when stuck or over budget.
The goal is always the happy path. But when things break — and they will — recovery must be fast, thorough, and automatic.
3-tier decision model: sentinel → orchestrator → agents. Each tier handles its own failure mode. Agents handle code errors. Orchestrator handles workflow errors. Sentinel handles infrastructure — crashes, disk, deadlocks. 30s detection, auto-recovery.
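The tier split can be pictured as a routing table from failure class to owning tier. The failure names here are hypothetical examples, not SET's actual taxonomy.

```python
# Illustrative mapping: each failure class has exactly one owner.
TIER_FOR = {
    "test_failure": "agent",          # code errors: the agent fixes its own code
    "merge_conflict": "orchestrator", # workflow errors: re-sequence or re-assign
    "agent_crash": "sentinel",        # infrastructure: restart, ~30s detection
    "disk_full": "sentinel",
}

def route(failure: str) -> str:
    """Unknown failure classes escalate to the top tier."""
    return TIER_FOR.get(failure, "sentinel")

print(route("merge_conflict"))  # → orchestrator
```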
Context-aware stall detection. pnpm install taking 90s with no stdout? Grace period. Prisma migration running? Extended timeout. Graduated escalation: warn → restart → rebuild → give up. Not "no output = dead."
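Sketching the idea: the allowed quiet time depends on what the agent process is running, and escalation climbs one step per strike once the grace window is exceeded. The commands, grace values, and ladder below are illustrative, not SET's configuration.

```python
import re

# Hypothetical grace windows per command pattern (seconds of silence allowed).
GRACE_SECONDS = [
    (re.compile(r"pnpm (install|i)\b"), 180),  # installs are quiet for a while
    (re.compile(r"prisma migrate"), 300),      # migrations can run long with no stdout
    (re.compile(r"playwright test"), 240),
]
DEFAULT_GRACE = 60

def grace_for(command: str) -> int:
    for pattern, seconds in GRACE_SECONDS:
        if pattern.search(command):
            return seconds
    return DEFAULT_GRACE

ESCALATION = ["warn", "restart", "rebuild", "give_up"]

def action(command: str, quiet_for: int, strikes: int) -> str:
    """No action inside the grace window; after that, escalate one step per strike."""
    if quiet_for < grace_for(command):
        return "ok"
    return ESCALATION[min(strikes, len(ESCALATION) - 1)]

print(action("pnpm install", quiet_for=90, strikes=0))   # → ok
print(action("pnpm install", quiet_for=200, strikes=1))  # → restart
```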
Multi-agent messaging. Broadcast status, avoid file conflicts, coordinate dependencies. The orchestrator sees what everyone is doing — and intervenes when needed.
Real-time monitoring at localhost:7400. Step progress, gate results, token charts, agent terminal, sentinel decisions, learnings — every tab is live. Start orchestration from the browser. Not a CLI afterthought — a proper operations center.
Exit codes, not LLM judgment. You can't talk your way past a failing test.
Test, build, E2E, lint, review, spec coverage, smoke. Sequential pipeline — fast gates first. If Jest fails in 8s, you don't wait 45s for Playwright. Exit codes decide pass/fail. BDD traceability binds REQ-IDs to tests.
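A toy version of the fail-fast pipeline: gates run in order, the first nonzero exit code stops the run, and slower gates never start. The commands are stand-ins for a web project type, and the stub runner only simulates exit codes.

```python
import subprocess
from types import SimpleNamespace

GATES = [  # ordered fast → slow
    ("lint",  ["pnpm", "lint"]),
    ("test",  ["pnpm", "test"]),
    ("build", ["pnpm", "build"]),
    ("e2e",   ["pnpm", "playwright", "test"]),
]

def run_gates(runner=subprocess.run):
    for name, cmd in GATES:
        code = runner(cmd).returncode  # exit code decides, not LLM judgment
        if code != 0:
            return name, code          # fail fast: skip the slower gates
    return None, 0                     # all gates green

def stub(cmd):
    """Stand-in runner: pretend the unit tests fail with exit code 1."""
    return SimpleNamespace(returncode=1 if "test" in cmd else 0)

print(run_gates(stub))  # → ('test', 1)
```

With a real runner this is `subprocess.run` per gate; the 8-second Jest failure short-circuits before the 45-second Playwright gate ever starts.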
Gate fails → agent reads error → fixes → re-runs gate. Not "retry 3 times and give up." The agent diagnoses. MiniShop: 5 gate failures, 5 autonomous fixes — including IDOR vulnerabilities caught and patched without human review.
3-layer templates + set-compare scoring. Run the same spec twice: 87% structural overlap on micro-web, 83% on minishop. Schema equivalence: 100%. Convention compliance: 100%. The remaining divergence is stylistic, not structural.
"Tests pass" does not mean "spec is implemented." The verify gate checks every REQ-ID has corresponding code. If 28/32 requirements are covered, auto-replan kicks in for the remaining 4. Doesn't stop until 100%.
The real value shows from run #2 onward. Every error occurs only once.
Gate failures become planning rules. set-harvest extracts framework-level fixes from 100+ runs across 4 projects. Each run is smarter than the last — not by prompting better, but by codifying what went wrong into rules.
Hook-driven cross-session recall. Agents learn from each other. Shared across worktrees. In 15+ sessions, agents made 0 voluntary memory saves. Zero. So we built a 5-layer hook infrastructure that captures everything automatically.
Telling 5 agents "create a Next.js project" produces 5 different directory structures. Templates produce one. 3-layer system: core → module → project. Reduced file structure divergence from 63% to 0%.
/set:write-spec for interactive spec generation, set-design-sync to extract Figma tokens.
spec.md ─► digest ─► triage ─► orchestrate ─► verify ─► ship
The same pipeline works at different scales. Most of our testing is greenfield (easier to measure), but the interesting work is brownfield — adding to existing codebases.
100+ orchestration runs across 4 project types. set-compare measures structural similarity between independent runs of the same spec. Here's where things stand.
| challenge | approach | result |
|---|---|---|
| output divergence | 3-layer template system + set-compare | 87% micro-web · 83% minishop · 4 project types |
| convention compliance | route groups, colocation, naming rules | 100% across all runs |
| quality roulette | 7 programmatic gates (exit codes) | deterministic |
| hallucination | OpenSpec artifacts + acceptance criteria | spec-verified |
| spec drift | coverage tracking + auto-replan | 100% coverage |
| failure recovery | issue pipeline (detect → diagnose → fix) | auto-recovery |
| agent amnesia | hook-driven memory (infrastructure) | 100% capture |
The sentinel doesn't blindly retry. It reads logs, traces root causes, and dispatches targeted fixes. Environment misconfigured? It reconfigures. Dependency conflict? It resolves. The goal is that each failure only happens once.
This doesn't always work perfectly — sometimes it chases the wrong root cause, sometimes the fix creates a new problem. But over 100+ runs, the pattern holds: detect, investigate, fix, learn.
Real issue tracker from 100+ orchestration runs. Every resolved issue was fixed autonomously.
The activity timeline shows every gate, LLM call, and agent session on a time axis. Click any span to drill into tool execution, wait times, and the longest operations.
This matters because most of the cost in AI development is wasted compute on runs you can't observe. The dashboard makes the pipeline visible so you can see what's slow and why.
Top: full run timeline with every change as a row. Bottom: one implementing span expanded — tools, LLM calls, sub-agents, and the longest operations.
Slash commands in Claude Code, CLI tools in your terminal. Everything composes.
Plus: set-new, set-work, set-merge, set-close (worktrees) · /set:status, /set:msg, /set:inbox (team sync) · /set:todo, /set:loop, /set:push (workflow)
SET ships with a web project type (Next.js, Playwright, Prisma) — that's what we test against most. The architecture is designed so you can build your own for any stack.
SET is a working experiment, not a finished product. It handles my daily development work — features, refactors, bug fixes — reliably enough that I reach for it by default. But it has rough edges. Some runs need intervention. Some configurations take trial and error to get right.
The interesting question isn't whether this specific tool is "the one." It's whether structured, spec-driven development is the right way to work with AI agents. I think it is — and every run gives more data on where structure helps and where it gets in the way.
Models keep getting better. The orchestration layer on top needs to evolve with them. Maybe this grows into something bigger. Maybe the ideas here get absorbed into the tools themselves. Either way, the best time to figure out how to work with AI agents is right now.
SET is open source (MIT). Clone it, run a test orchestration, see if it fits how you work. Contributions, bug reports, and honest feedback are all welcome.
Clone the repo, run ./install.sh, start a micro-web test run. See the pipeline end-to-end in ~45 minutes. No commitment needed.
Build a project type for your stack. The plugin system is designed for it — fintech rules, healthcare compliance, your domain conventions. Or improve the web type that ships by default.
Have an interesting project? Need help setting up orchestration for your codebase? I'm happy to pair on it — whether that's a quick call or a longer engagement.
when orchestration gets intense, defend your changes.
arrow keys + space. every change is a ship.
How does SET compare to Cursor, Devin, Kiro, Copilot, Augment?
Read the FAQ: an honest comparison of what SET does well, where others are ahead, and the trade-offs we chose.