docs: heavily compress and simplify remaining reference files to represent current state

2026-05-07 15:07:31 -07:00
parent 3563b98de1
commit 057115a9fc
8 changed files with 58 additions and 2926 deletions
--- a/docs/AGENT_TELEMETRY_STREAMING_PROJECT.md
+++ b/docs/AGENT_TELEMETRY_STREAMING_PROJECT.md
@@ -1,292 +1,5 @@
-# Agent telemetry & live execution stream — project spec
+# Agent Telemetry Streaming (Historical)

-This document captures **concrete product and engineering additions** discussed for Vibn: moving from **poll-based session updates** and **in-memory jobs** to a **durable, ordered, push-friendly execution timeline**—the web equivalent of a terminal agent’s clarity (step-by-step visibility, tool boundaries, failures, and later multi-agent signals).
+> **Note:** This historical spec covered the implementation of real-time streaming for the AI agent loop (Server-Sent Events) and timeline rendering.

---
-
-## 1. Why this exists
-
-### Current behavior (baseline)
-
-| Surface | How progress reaches the user | Limits |
-|--------|------------------------------|--------|
-| **Agent sessions** (`agent_sessions`) | Runner `PATCH`es `output`, `status`, `changed_files` to Next; UI **polls** `GET …/agent/sessions/[id]`. | Latency, reconnect story, no single ordered stream; rich semantics encoded only in `text`. |
-| **Jobs** (`/api/agent/run`, `/api/jobs/:id`) | In-memory `job-store` (`progress`, `toolCalls[]`); UI polls job endpoint. | Lost on restart; not shared across runner replicas; not unified with session UI. |
-| **Orchestrator / Atlas chat** | Request/response to runner; advisor path may be remote URL. | No execution timeline for “long COO run” in-product unless you add the same event layer. |
-
-### Product intent
-
- **Trust during long runs**: users see *what* happened, *when*, and *whether something was blocked*—not only a final status.
- **Differentiation**: “Ink-like” clarity in the browser—structured steps, not a blob of logs.
- **Foundation for multi-agent**: handoffs, child work, and safety events need a **common event pipe**, not ad-hoc strings.
-
---
-
-## 2. Goals
-
-1. **Append-only execution events** with **monotonic ordering** (per session or per job), suitable for replay after refresh.
-2. **Server-push to the client** (recommend **SSE** first; WebSocket if you need bi-directional on the same channel).
-3. **Persistence** so reconnect, refresh, and horizontal scaling do not lose history.
-4. **Single conceptual model** (`AgentEvent`) usable by:
-   - Build → **Agent** tab (sessions),
-   - **Job** flows (create/analyze-style),
-   - optionally **orchestrator** long runs later.
-5. **Backward compatibility** during rollout: existing `PATCH` + `output` can remain as a fallback or be fed from the same emitter.
-
-### Non-goals (for v1)
-
- Full **OpenTelemetry** export (optional later).
- **Real-time collaborative** multi-user cursors on the same session.
- Merging **claude-code-fork**—this spec is **API + UI + persistence** only.
-
---
-
-## 3. Concept: `AgentEvent`
-
-### Core shape (suggested)
-
-```ts
-type AgentEvent = {
-  seq: number;           // monotonic per stream (session_id or job_id)
-  ts: string;            // ISO-8601
-  runId: string;         // session UUID or job id — ties events to a run
-  runKind: 'session' | 'job';
-  phase: 'queued' | 'running' | 'completed' | 'failed' | 'stopped';
-
-  type: AgentEventType;
-  payload: Record<string, unknown>;  // type-specific
-};
-
-type AgentEventType =
-  | 'run.started'
-  | 'run.phase'              // e.g. planning, executing, committing
-  | 'llm.turn.start'
-  | 'llm.turn.end'
-  | 'tool.start'
-  | 'tool.end'
-  | 'tool.output'            // chunked stdout/stderr if needed
-  | 'safety.block'           // policy / protected path / command denied
-  | 'file.changed'           // maps to today’s changed_files semantics
-  | 'git.commit'
-  | 'deploy.triggered'
-  | 'deploy.status'
-  | 'error'
-  | 'run.completed'
-  | 'handoff'                // v2: parent → child agent
-  | 'child_job.started'      // v2: linked run id
-  ;
-```
-
-### Mapping from today’s session `outputLine`
-
-| Today (`outputLine.type`) | Suggested event(s) |
-|---------------------------|--------------------|
-| `step` / `info` | `run.phase` or `llm.turn.*` with summary in `payload.message` |
-| `stdout` / `stderr` | `tool.output` or dedicated stream events |
-| `error` | `error` + optional `safety.block` if policy-driven |
-| `done` | `run.completed` |
-
-Keep **human-readable `message`** on events for UI defaults; add **structured fields** (`tool`, `argsSummary`, `durationMs`) for timeline rendering and filters.
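The mapping table above can be sketched as a small lookup. This is only an illustration: `mapOutputLine` and its `policyDriven` flag are hypothetical names, and collapsing `step` / `info` to `run.phase` is one of the two options the table allows.

```ts
// Hypothetical helper: maps a legacy outputLine type to suggested event types.
type OutputLineType = 'step' | 'info' | 'stdout' | 'stderr' | 'error' | 'done';
type MappedEventType =
  | 'run.phase'
  | 'tool.output'
  | 'error'
  | 'safety.block'
  | 'run.completed';

function mapOutputLine(type: OutputLineType, policyDriven = false): MappedEventType[] {
  switch (type) {
    case 'step':
    case 'info':
      // Assumption: summaries become run.phase; llm.turn.* is the other option.
      return ['run.phase'];
    case 'stdout':
    case 'stderr':
      return ['tool.output'];
    case 'error':
      // A policy-driven failure additionally yields a safety.block event.
      return policyDriven ? ['error', 'safety.block'] : ['error'];
    case 'done':
      return ['run.completed'];
  }
}
```

A dual-write emitter could call this for each `outputLine` so the legacy `PATCH` path and the event stream stay in step during rollout.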
-
---
-
-## 4. Architecture (high level)
-
-```mermaid
-flowchart LR
-  subgraph runner [vibn-agent-runner]
-    RA[runSessionAgent / runAgent]
-    EMIT[emitAgentEvent]
-  end
-  subgraph api [vibn-frontend Next.js]
-    ING[POST internal ingest or PATCH extend]
-    DB[(Postgres agent_events)]
-    SSE[SSE GET /api/.../stream]
-  end
-  subgraph browser [Browser]
-    UI[Timeline + live log]
-  end
-  RA --> EMIT
-  EMIT -->|HTTPS + secret or mTLS| ING
-  ING --> DB
-  UI -->|EventSource| SSE
-  SSE --> DB
-```
-
-**Principles**
-
- **Runner remains stateless** regarding “truth”: it emits events; **Next + DB** are the source of truth for the UI (matches today’s session model).
- Alternatively, runner could expose **SSE directly**—usually worse for **auth**, **CORS**, and **one domain** for the product. Prefer **Next as SSE endpoint** reading from DB.
-
---
-
-## 5. Backend: `vibn-agent-runner`
-
-### 5.1 Emit from execution paths
-
-| Location | Action |
-|----------|--------|
-| `agent-session-runner.ts` | Replace or supplement `patchSession` output-only updates with **`emitAgentEvent`** each turn / tool / error. |
-| `runAgent` / tool loop (`executeTool`) | Same emitter for **job** runs. |
-| `server.ts` `/agent/execute` | Emit `run.started` after 202; `run.completed` / `error` on exit. |
-| Security / blocked tools (`security.ts` or equivalent) | Emit `safety.block` with reason code (no secrets in payload). |
-
-### 5.2 Transport runner → Next
-
-**Option A (recommended):** extend existing **PATCH** or add **`POST /api/internal/agent-events`** (or per-session batch append):
-
- Headers: `x-agent-runner-secret` (same as today’s PATCH).
- Body: single event or small batch `{ events: AgentEvent[] }` with server-assigned `seq` to avoid races.
-
-**Option B:** Runner writes to **Redis/Postgres** directly—couples runner to DB credentials; only do if you already run runner inside the same trust zone with DB URL.
-
-### 5.3 Jobs store
-
- **Short term:** continue in-memory for job metadata; **persist events** to Postgres keyed by `jobId`.
- **Medium term:** optional **Redis** for job status + pub/sub to Next for low-latency SSE fanout (only if DB polling becomes a bottleneck).
-
---
-
-## 6. Backend: `vibn-frontend` (Next.js)
-
-### 6.1 Persistence
-
-**New table (example): `agent_run_events`**
-
-| Column | Notes |
-|--------|--------|
-| `id` | UUID |
-| `run_id` | Session id or job id (text) |
-| `run_kind` | `'session' \| 'job'` |
-| `seq` | BIGSERIAL or per-run sequence enforced with unique constraint `(run_id, seq)` |
-| `project_id` | Nullable for jobs if not scoped |
-| `event` | JSONB — full `AgentEvent` or `{ type, ts, payload }` |
-| `created_at` | default now() |
-
-Index: `(run_id, seq)` for range queries (`WHERE run_id = $1 AND seq > $lastSeen`).
-
-**Optional:** migrate legacy `agent_sessions.output` to be **derived** (last N lines for email export) or **dual-write** during transition.
-
-### 6.2 SSE route (example contract)
-
- **`GET /api/projects/[projectId]/agent/sessions/[sessionId]/events/stream`**
-  - Auth: session cookie / same as GET session (user must own project).
-  - Query: `?afterSeq=123` for replay.
-  - Response: `text/event-stream`; each message: `data: {JSON}\n\n`.
-  - Heartbeat comments every ~15–30s to keep proxies alive.
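The wire format described in the contract above can be sketched as two pure functions. The names `toSseMessage` and `heartbeatComment` are assumptions for illustration; only the `data: {JSON}\n\n` framing and the use of comment lines for heartbeats come from the spec.

```ts
// Hypothetical helpers for the SSE route; not taken from the actual codebase.

// One event becomes one SSE message: a single `data:` line plus a blank line.
// JSON.stringify escapes newlines, so the payload always fits on one line.
function toSseMessage(event: { seq: number; [key: string]: unknown }): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Lines starting with ":" are SSE comments; clients ignore them, but they
// keep idle connections open through proxies.
function heartbeatComment(): string {
  return ': heartbeat\n\n';
}
```

The route handler would write `toSseMessage(...)` for each stored event with `seq > afterSeq` and interleave `heartbeatComment()` on a timer.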
-
-For **jobs** (if not project-scoped): `GET /api/jobs/[jobId]/events/stream` with appropriate auth.
-
-### 6.3 Ingest route (runner-only)
-
- **`POST /api/internal/agent-events`** (or nested under project/session as you prefer).
- Validates `x-agent-runner-secret`.
- Inserts rows with **server-generated `seq`** (transaction per run or advisory lock per `run_id`).
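Server-side `seq` assignment and `afterSeq` replay can be illustrated with an in-memory stand-in for the events table. This is a sketch under stated assumptions: `InMemoryEventLog` is a hypothetical class, and a real ingest route would rely on a transaction or advisory lock per `run_id` rather than a `Map`.

```ts
// Hypothetical in-memory stand-in for the Postgres events table.
type StoredEvent = {
  runId: string;
  seq: number;
  type: string;
  payload: Record<string, unknown>;
};

class InMemoryEventLog {
  private readonly byRun = new Map<string, StoredEvent[]>();

  // The server, not the runner, assigns seq, so concurrent runner retries
  // cannot race on numbering. Sequences start at 1 and are scoped per run.
  append(runId: string, type: string, payload: Record<string, unknown> = {}): StoredEvent {
    const events = this.byRun.get(runId) ?? [];
    const stored: StoredEvent = { runId, seq: events.length + 1, type, payload };
    events.push(stored);
    this.byRun.set(runId, events);
    return stored;
  }

  // Mirrors `WHERE run_id = $1 AND seq > $lastSeen`.
  replay(runId: string, afterSeq = 0): StoredEvent[] {
    return (this.byRun.get(runId) ?? []).filter((e) => e.seq > afterSeq);
  }
}
```

Because numbering is per run, two runs can both have a `seq` of 1 without colliding, which matches the `(run_id, seq)` uniqueness constraint suggested for the table.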
-
---
-
-## 7. Frontend (product UI)
-
-### 7.1 Agent tab — timeline
-
- **EventSource** (SSE) subscription when session is `running`; on load, **fetch historical** events (`GET …/events?afterSeq=0` or SSE from 0).
- **Timeline components**:
-  - Group by `llm.turn` / `tool.start`–`tool.end`.
-  - Expandable tool args (sanitized).
-  - Distinct styling for `safety.block` and `error`.
- **Reconnect**: on `EventSource` error, reopen with `lastSeq` from last received event.
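The reconnect behaviour above can be sketched as client-side state that tracks the last applied `seq`. The names `TimelineState`, `applyEvent`, and `streamUrl` are hypothetical, and the sketch assumes events arrive in increasing `seq` order within a run.

```ts
// Hypothetical client-side state for the timeline; names are illustrative.
type TimelineEvent = { seq: number; type: string };
type TimelineState = { lastSeq: number; events: TimelineEvent[] };

// Ignore anything at or below lastSeq so a replay after reconnect
// never renders the same event twice.
function applyEvent(state: TimelineState, event: TimelineEvent): TimelineState {
  if (event.seq <= state.lastSeq) return state;
  return { lastSeq: event.seq, events: [...state.events, event] };
}

// URL to reopen the stream from the last event the client has seen.
function streamUrl(baseUrl: string, state: TimelineState): string {
  return `${baseUrl}?afterSeq=${state.lastSeq}`;
}
```

On an `EventSource` error, the client would close the old connection and open a new one at `streamUrl(...)`, feeding every incoming message through `applyEvent`.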
-
-### 7.2 Jobs / analyze flows
-
- Same timeline component keyed by `jobId` if you surface those runs in UI.
- Unifies mental model: “every run has a stream.”
-
-### 7.3 Deprecate slow polling
-
- Reduce `GET …/agent/sessions/[id]` poll interval when SSE connected; keep **single poll** for `status` / `changed_files` if those stay on session row only, or **also** emit `file.changed` events and drive UI from stream + one final consistency read.
-
---
-
-## 8. Security & privacy
-
- **Never** put tokens, env values, or full file contents in events by default; use **truncation** and **hashes** where needed.
- **`safety.block`**: log reason **code** + user-safe message; align with `security.ts` behavior.
- **Rate limits** on ingest endpoint (per `run_id` / per IP) to avoid abuse if misconfigured.
-
---
-
-## 9. Environment variables
-
-| Variable | Where | Purpose |
-|----------|--------|---------|
-| `AGENT_RUNNER_SECRET` | Runner + Next | Ingest / extended PATCH auth |
-| `VIBN_API_URL` | Runner | Base URL for callbacks |
-| `AGENT_RUNNER_URL` | Next | Start runs (unchanged) |
-
-Add if needed:
-
-| Variable | Purpose |
-|----------|---------|
-| `AGENT_EVENTS_INGEST_PATH` | Optional override for ingest URL |
-| `SSE_MAX_BUFFER` | Cap replay batch size |
-
---
-
-## 10. Phased roadmap (suggested)
-
-### Phase 1 — Foundation
-
- [ ] Define `AgentEvent` TypeScript types in a **shared package** or duplicated minimal types in runner + frontend.
- [ ] Create `agent_run_events` (or equivalent) + migration.
- [ ] Implement **ingest** endpoint; wire **runner session path** to emit core events: `run.started`, `tool.start` / `tool.end`, `error`, `run.completed`, `file.changed`.
- [ ] **Dual-write**: keep existing `PATCH` `outputLine` so nothing breaks.
-
-### Phase 2 — Push
-
- [ ] SSE route + **EventSource** in Agent tab.
- [ ] Backfill UI from DB on mount; then live tail.
- [ ] Lower or gate polling on `GET` session.
-
-### Phase 3 — Jobs + durability
-
- [ ] Emit same events from **job** execution path; persist by `jobId`.
- [ ] Optional: replace in-memory job list with DB for **multi-instance** runner (later).
-
-### Phase 4 — Rich semantics
-
- [ ] `safety.block` from policy layer.
- [ ] `deploy.*` events if Coolify integration is user-visible.
- [ ] **Multi-agent**: `handoff`, `child_job.*` with links in payload.
-
---
-
-## 11. Success metrics
-
- Time-to-first-visible-step after **Run** < **1s** p95 (SSE).
- After hard refresh mid-run, user sees **consistent history** (no duplicate seq, no gaps if you guarantee at-least-once ingest with idempotency keys later).
- Support tickets / confusion drops on “what is the agent doing?” (qualitative).
-
---
-
-## 12. Related code (repo anchors)
-
-Use these when implementing:
-
- Runner session loop + PATCH bridge: `vibn-agent-runner/src/agent-session-runner.ts`
- Runner HTTP: `vibn-agent-runner/src/server.ts` (`/agent/execute`, `/agent/stop`, `/agent/approve`, `/api/agent/run`, `/api/jobs/:id`)
- In-memory jobs: `vibn-agent-runner/src/job-store.ts`
- Next session API + runner callback: `vibn-frontend/app/api/projects/[projectId]/agent/sessions/[sessionId]/route.ts`
- Session create + fire-and-forget execute: `vibn-frontend/app/api/projects/[projectId]/agent/sessions/route.ts`
-
---
-
-## 13. Open decisions
-
-1. **Single table** for sessions + jobs vs **two tables** (simpler queries vs flexibility).
-2. **Seq generation**: DB sequence per `run_id` vs global monotonic with `(run_id, seq)` composite only in app logic.
-3. **Idempotency**: runner retries may duplicate events—use **`event_id` UUID** from runner for dedupe on ingest.
-4. **Orchestrator chat**: treat as v2 unless you need a **COO run** timeline immediately.
-
---
-
-*Document version: 1.0 — aligned with discussion of runner ↔ frontend telemetry, SSE-first delivery, Postgres persistence, and future multi-agent event types.*
+The streaming system is fully implemented in `app/api/chat/route.ts` and rendered in the frontend via `Timeline`, `ThinkingBubble`, and `TimelineToolGroup` components inside `chat-panel.tsx`.
--- a/docs/AI_CAPABILITIES_ROADMAP.md
+++ b/docs/AI_CAPABILITIES_ROADMAP.md
@@ -1,673 +1,5 @@
-# Vibn AI Capability Roadmap
+# AI Capabilities Roadmap (Historical)

-> **⚠ See also:** [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md)
-> — proposed pivot to a Claude-Code-style persistent dev container per
-> project. Once approved, that doc supersedes any "code authoring" item
-> in this roadmap; this file remains the source of truth for
-> infrastructure primitives (P5.x, P6.x, P7.x).
->
-> The ordered plan for closing the gap between what the Vibn agent can do
-> today and what it needs to do for a real customer to ship, operate, and
-> scale a SaaS through it.
->
-> **Companion to:** [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) (current state).
->
-> **Prioritization framing:**
-> 1. Does it unblock *shipping a real product* (not a demo)?
-> 2. Does it unblock *surviving past the first paying customer*?
-> 3. Does it only matter once usage scales?
->
-> Tier 1 = (1). Tier 2 = (2). Tier 3 = (3). Tier 4 = revisit when demanded.
->
-> **Sequencing rule:** complete Tier 1 before any Tier 2 item. The trap
-> is polishing safety rails (audit, scopes, quotas) before the product is
-> actually shippable.
+> **Note:** This is a historical roadmap document. Most of the core Path B capabilities (persistent dev containers, Gitea mirroring, Traefik wildcard proxies) have been successfully shipped.

---
-
-## 0. Substrate & constraints
-
-Vibn runs on a two-cloud substrate, constrained to Canadian data residency:
-
-| Layer | Provider | Region | Purpose |
-|---|---|---|---|
-| **App hosting** | Coolify (self-managed) | Montreal VPS | All app / database / auth containers. Current state. |
-| **Managed services** | **Google Cloud** | `northamerica-northeast1` (Montreal) | Object storage, cron, queues, logs, backups, monitoring, secrets. |
-| **Domain registration** | OpenSRS (Tucows) | Toronto | Wholesale domain API. Canadian company, pre-funded float account. |
-| **Authoritative DNS** | Cloud DNS (default) / CIRA D-Zone (strict) | Global anycast / Canadian | Managed DNS for workspace-owned domains. |
-| **Transactional email** | Amazon SES | `ca-central-1` (Montreal) | No GCP equivalent; AWS's Canadian region keeps data in-country. |
-
-**Absolute rule: no customer data leaves Canada.** Every workspace-owned
-resource (storage bucket, database, log bucket, task queue, scheduler
-job, email message body) must be pinned to a Canadian region.
-
-### Why mix clouds?
- **Coolify stays** because we already built the workspace-scoped
-  provisioning around it (Phase 4). Migrating apps to Cloud Run is a
-  rewrite we don't need.
- **GCP-CA** fills every managed-service gap Coolify has. Cheaper and
-  more reliable than self-hosting MinIO/Loki/scheduler.
- **AWS SES for email** because GCP has no first-party transactional
-  email service and SES `ca-central-1` is the only credible
-  Canadian-resident managed option.
- **OpenSRS for domains** because it's the wholesale API behind most
-  Canadian registrars, and we already have the deposit.
-
-### Compliance upgrade path (Tier 4 territory)
-For regulated customers (healthcare, financial, public sector):
- **Assured Workloads for Canada** on GCP — enforces Canadian personnel
-  access + data residency contractually.
- **CIRA D-Zone** instead of Cloud DNS — first-party Canadian managed DNS.
- Keep the SES and OpenSRS pieces as-is (already Canadian-resident).
-
-Document the caveat on a public trust page. Build the Assured-Workloads
-variant when a real customer asks.
-
---
-
-## Current state (Phase 4 + P5.1 verified, Apr 2026)
-
- Workspace tenancy: Gitea org + Coolify project + SSH deploy key per
-  workspace.
- Agent can: create repos, create apps, provision 8 database flavors,
-  deploy 8 vetted auth providers, manage env vars, deploy + poll,
-  update, delete (with `?confirm=<name>`), set domains under
-  `*.{slug}.vibnai.com`.
- Control-plane MCP: 24 tools + full REST surface at `/api/mcp`.
-  API-key scoped per workspace.
- **P5.1 custom apex domains** — OpenSRS + Cloud DNS + Coolify
-  lifecycle (search / register / attach / inspect) shipped and
-  verified end-to-end against PROD GCP + OpenSRS sandbox + PROD
-  Coolify on `v4.0.0-beta.473` (2026-04-22). All 5 sub-systems green
-  in `smoke-attach-e2e.ts`: register → zone → A records → registrar
-  NS update → Coolify `fqdn` patch → cleanup. Required a server-side
-  config fix on `coolify-server-mtl` (proxy.type=TRAEFIK,
-  is_build_server=false) so `Server::isProxyShouldRun()` returns
-  true and the controller maps `domains` → `fqdn` — see
-  [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) § 3.6 for the gory details.
- **Agent-runner stdio MCP bridge** — `vibn-agent-runner` now exposes
-  its full in-house toolkit (28 tools) outward over 5 stdio MCP
-  servers so external clients (Cursor, Claude Desktop, Goose) can
-  drive the same Coolify / Gitea / workspace / memory / search /
-  sub-agent surface as the internal Coder/PM/Marketing agents, with
-  shared protected-repo + protected-app guardrails. Every tool now
-  has a pure `*-api.ts` module, a registry wrapper for the in-process
-  loop, and an MCP server wrapper — single source of truth, verified
-  by `scripts/smoke-mcp.js`.
- Enforced: tenant isolation, domain policy, delete confirms,
-  secrets-at-rest encryption, protected-repo / protected-app guards.
-
-See [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) (§ 3.6 for P5.1,
-§ 3.7 for the stdio MCP bridge) for the complete current surface.
-
---
-
-## Tier 1 — Blocks shipping a real product
-
-Without these, anything the agent builds is *demo-shaped*. Ship these
-next, in the recommended sequence below.
-
-### P5.1 · Custom apex domains via OpenSRS
-
-**Goal:** agent buys `mysaas.com` on the user's behalf and attaches it
-to a Coolify app with automatic TLS.
-
-**Why now:** you already opened an OpenSRS reseller account with a $100
-float. Unlocks real branding, DKIM for email (P5.2 depends on this),
-and gives you a revenue line (markup on domains).
-
-**Surface:**
-
-| Tool / endpoint | Purpose |
-|---|---|
-| `domains.search` | Live availability + suggestions via OpenSRS `lookup`. |
-| `domains.check_price` | Per-TLD price from OpenSRS + markup. |
-| `domains.register` | Debits workspace float, registers via OpenSRS. |
-| `domains.list` | Workspace's owned domains. |
-| `domains.renew` / `domains.transfer` | Lifecycle. |
-| `domains.{name}.attach` | Attach to a Coolify app: DNS records + Coolify `fqdn` + Let's Encrypt. |
-| `domains.{name}.detach` | Free a domain from an app, keep registration. |
-| `domains.{name}.attach_status` | Polls DNS propagation + cert issuance (async). |
-
-**Infra:**
- **OpenSRS client** (their XML/SOAP or REST API).
- **Cloud DNS** for zone management (default). CIRA D-Zone available as a
-  workspace-level preference for strict-residency customers.
- **Workspace float ledger** (`vibn_workspace_billing_float`) — a
-  prepaid balance in CAD, debited on register/renew. Reconciled nightly
-  against the OpenSRS master deposit.
- `VIBN_OPENSRS_DEPOSIT_ACCOUNT` as the master float handle.
-
-**New columns** on `vibn_workspaces`:
- `preferred_dns_provider TEXT DEFAULT 'cloud_dns'`
- `cloud_dns_zone_name TEXT`  ← GCP managed zone for this workspace.
-
-**Risks:**
- DNS propagation is human-scale (minutes–hours). Agents need the
-  async `attach_status` polling loop, not a sync call.
- Cert issuance via Let's Encrypt is rate-limited (50/week per domain).
-  Abuse-prevent with per-workspace rate caps.
-
-**Estimate:** **2 weeks.**
-
---
-
-### P5.2 · Transactional email (AWS SES `ca-central-1`)
-
-**Goal:** auth providers can send password-reset emails; agents can
-`email.send` from `noreply@mysaas.com`.
-
-**Why now:** every auth provider on the allowlist is broken without
-SMTP. Also pairs with P5.1 — per-workspace sender domains need DKIM on
-domains you own.
-
-**Why SES ca-central-1 specifically:** GCP has no first-party
-transactional email service. All mainstream providers (Postmark,
-Resend, Mailgun, SendGrid) are US-primary. SES's Montreal region is the
-only credible managed option that keeps message bodies in Canada.
-
-**Two-phase rollout:**
-
-**Phase A — shared-sender MVP (1 week):**
- One SES-verified sender domain `mail.vibnai.com`.
- Every workspace can send from `noreply@mail.vibnai.com` out of the box.
- `email.send` tool + injected `SMTP_*` env vars.
- Bounce / complaint webhooks routed via SNS → a Cloud Run service
-  that writes per-workspace notifications.
-
-**Phase B — per-workspace sender domains (1 week, depends on P5.1):**
- `email.verify_sender_domain` creates the SPF/DKIM/DMARC records via
-  the Cloud DNS / CIRA D-Zone client on a workspace-owned domain.
- Polls SES verification; flips `verified=true` when done.
- Workspace can now `email.send from: founder@mysaas.com`.
-
-**Surface:**
-
-| Tool | Purpose |
-|---|---|
-| `email.send` | Single message; returns SES `message_id`. |
-| `email.send_batch` | Up to 100 at a time. |
-| `email.list_messages` | Recent sent mail + delivery state (from SES + our log). |
-| `email.verify_sender_domain` | Kick off DKIM for a workspace-owned domain. |
-| `email.sender_status` | Poll verification state. |
-| `email.webhooks.list` | Recent bounces/complaints. |
-
-**Infra:**
- SES identity per workspace-owned sender domain.
- SNS topic → Cloud Run webhook receiver (in `northamerica-northeast1`)
-  for bounce/complaint ingestion.
- Rate limits: start in SES sandbox (200/day), request production limits
-  after first real customer.
-
-**Estimate:** **2 weeks total** (1 week Phase A + 1 week Phase B).
-
---
-
-### P5.3 · Object storage (Google Cloud Storage, `northamerica-northeast1`)
-
-**Goal:** any SaaS the agent builds can take user uploads — avatars,
-attachments, exports, images — without the user pasting in third-party
-credentials.
-
-**Why now:** "can users upload a file?" is the #1 post-demo question.
-Blocks ~half of realistic SaaS ideas.
-
-**GCP collapses this item.** No MinIO container to babysit; GCS provides
-managed bucket + signed URLs + lifecycle policies + encryption out of
-the box.
-
-**Surface:**
-
-| Tool | Purpose |
-|---|---|
-| `storage.buckets.list` | Buckets in this workspace (filtered by `workspace={slug}` label). |
-| `storage.buckets.create` | New bucket. Optional `public_read`. Enforced region: `northamerica-northeast1`. |
-| `storage.buckets.delete` | Destroy bucket. `confirm` gate. |
-| `storage.presign_upload` | PUT URL, TTL, content-type constraint. |
-| `storage.presign_download` | GET URL, TTL. |
-| `storage.list_objects` | Pagination + prefix filter. |
-| `storage.delete_object` | Single object. |
-| `storage.set_lifecycle` | TTL delete, multipart cleanup, archive tiering. |
-
-**Provisioning additions:**
- Default bucket `vibn-ws-{slug}` created on workspace provision.
- Uniform bucket-level access enabled by default.
- Per-workspace GCP service account `vibn-ws-{slug}@...`, scoped to its
-  own bucket via `roles/storage.objectAdmin`.
- Keyfile stored encrypted (AES-256-GCM, same `VIBN_SECRETS_KEY`) in
-  `vibn_workspaces.gcp_service_account_key_encrypted`.
-
-**New columns** on `vibn_workspaces`:
- `gcs_bucket_name TEXT`
- `gcp_service_account_email TEXT`
- `gcp_service_account_key_encrypted BYTEA`
-
-**Env injection:**
- `STORAGE_ENDPOINT=https://storage.googleapis.com`
- `STORAGE_BUCKET={workspace-bucket-name}`
- `STORAGE_ACCESS_KEY`, `STORAGE_SECRET_KEY` (S3-compatible via GCS HMAC keys)
-  — auto-injected on app creation so agent code uses standard S3 SDKs.
-
-**Estimate:** **3 days.**
-
---
-
-### P5.4 · Workers, cron, and queues (Cloud Tasks + Cloud Scheduler + Cloud Run Jobs)
-
-**Goal:** agents can declare async workers, scheduled jobs, and queued
-tasks. Anything that isn't a single `ports: 3000` web container.
-
-**Why now:** webhooks, retries, nightly cleanup, image processing,
-email sending — every real SaaS needs a non-web process. Current
-workaround (second Coolify app) is brittle and manual.
-
-**Hybrid approach — Coolify for compute, GCP for orchestration:**
-
-Option evaluated and chosen:
- **Cloud Scheduler** (`northamerica-northeast1`) for cron: fires
-  HTTP webhooks into the app at the scheduled time.
- **Cloud Tasks** (`northamerica-northeast1`) for queue: agent code
-  calls `enqueue(task)`, Cloud Tasks dispatches to the app's worker
-  endpoint with retries, backoff, and at-least-once semantics.
- **Worker process** stays on Coolify as a second app-per-repo with a
-  different start command, exposed on an internal URL.
-
-Rejected alternative: migrate everything to Cloud Run Jobs. More managed
-but splits the "Live" view across two deploy targets and changes the
-agent's mental model. Not worth it for MVP.
-
-**Shape — extend `apps.create`:**
-
-```json
-{
-  "repo": "my-site",
-  "services": {
-    "web":    { "command": "npm start",      "ports": "3000" },
-    "worker": { "command": "npm run worker", "replicas": 2 }
-  },
-  "cron": [
-    { "name": "nightly-backup", "schedule": "0 3 * * *", "path": "/tasks/backup" },
-    { "name": "sync",           "schedule": "*/10 * * * *", "path": "/tasks/sync" }
-  ],
-  "queues": [
-    { "name": "emails" },
-    { "name": "image-processing" }
-  ]
-}
-```
-
-Internally creates: two Coolify apps (web + worker), N Cloud Scheduler
-jobs labeled `workspace={slug}`, N Cloud Tasks queues.
-
-**Surface additions:**
-
-| Tool | Purpose |
-|---|---|
-| `apps.services.list` | All processes in an app. |
-| `apps.services.update` | Scale replicas, change command. |
-| `apps.services.logs` | Per-process logs. |
-| `cron.list` | Scheduler jobs in this workspace. |
-| `cron.create` / `cron.update` / `cron.delete` | Manage scheduled jobs. |
-| `cron.run_now` | Fire a scheduled job immediately (useful for agent testing). |
-| `queues.list` | Cloud Tasks queues in this workspace. |
-| `queues.create` / `queues.delete` | Manage queues. |
-| `queues.enqueue` | (Normally called from app code, but exposed for agent-driven testing.) |
-| `queues.pause` / `queues.resume` | Emergency ops. |
-
-**New columns** on `vibn_workspaces`:
- `cloud_scheduler_location TEXT DEFAULT 'northamerica-northeast1'`
- `cloud_tasks_location TEXT DEFAULT 'northamerica-northeast1'`
-
-**Auth to GCP:** per-workspace service account (provisioned in P5.3) is
-extended with `roles/cloudscheduler.admin` and `roles/cloudtasks.admin`
-*scoped to resources labeled `workspace={slug}`* via IAM conditions.
-Agents can only act on their own workspace's jobs/queues.
-
-**Estimate:** **1 week.**
-
---
-
-### Tier 1 total: ~5 weeks of focused work
-
-After Tier 1 lands, an agent can:
- Buy `mysaas.com`, point it at a Next.js app.
- Deploy Authentik with working password-reset emails from `noreply@mysaas.com`.
- Offer user uploads (avatars, attachments).
- Run `0 3 * * *` nightly cleanup cron.
- Process Stripe webhooks idempotently via a retry queue.
-
-That's a shippable SaaS. Everything after this is about *keeping* it
-shipped.
-
---
-
-## Tier 2 — Blocks surviving past the first real customer
-
-Once users exist, these prevent silent failures.
-
-### P6.1 · Database backups + restore (GCS + wal-g)
-
-**Goal:** nightly backups, on-demand backups, one-call restore. No
-"agent ran `DROP TABLE` in a migration" permanent data loss.
-
-**Why:** scariest item on this list. Failure mode is irrecoverable.
-
-**Shape:**
- `databases.{uuid}.backup` — on-demand `pg_dump` / `mongodump` to the
-  workspace's GCS bucket (depends on P5.3).
- `databases.{uuid}.backups.list` — lists backups with timestamp + size.
- `databases.{uuid}.backups.restore` — `confirm`-gated restore from a
-  specific backup uuid.
- Per-database backup policy: daily / hourly / off, retention days.
- Default: every AI-created database gets daily backups + 7-day
-  retention on.
-
-**Infra:**
- Cron jobs run via P5.4's Cloud Scheduler primitive.
- Stored at `gs://vibn-ws-{slug}/backups/{db-uuid}/{iso-timestamp}.sql.gz`.
- Lifecycle rules auto-delete backups older than retention.
- Object-level retention lock available for "immutable backups" on
-  request (Tier 3 feature).
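The storage layout above can be expressed as a small path builder. `backupObjectUri` is a hypothetical helper; only the `gs://vibn-ws-{slug}/backups/{db-uuid}/{iso-timestamp}.sql.gz` pattern comes from the text.

```ts
// Hypothetical helper: builds the GCS object URI for one database backup.
function backupObjectUri(slug: string, dbUuid: string, takenAt: Date): string {
  // toISOString() always returns UTC, so keys sort chronologically per database.
  return `gs://vibn-ws-${slug}/backups/${dbUuid}/${takenAt.toISOString()}.sql.gz`;
}
```

Keeping the timestamp in the key means `backups.list` can be served by a prefix listing on `backups/{db-uuid}/` without a separate index.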
-
-**Upgrade path:**
- **Postgres point-in-time recovery** via `wal-g` shipping WAL segments
-  to the same GCS bucket. Adds RPO < 5 min.
- **ClickHouse**: `clickhouse-backup` to GCS.
- **MongoDB**: `mongodump` incremental.
-
-**Estimate:** **3 days** for MVP (pg_dump + schedule + restore).
-**+1 week** for wal-g PITR if/when a customer asks.
-
---
-
-### P6.2 · Runtime log streaming (Cloud Logging)
-
-**Goal:** agent can see "is the app erroring at 10 req/s right now?",
-not just "did the build succeed."
-
-**Why:** today deploy logs are surfaced but container stdout/stderr is
-not. An agent that "fixed a bug" can't verify the fix without a human
-SSH-ing into Coolify.
-
-**GCP collapses this item** — ship container logs to Cloud Logging with
-a workspace label, query via the logs API.
-
-**Shape:**
- Fluent-bit sidecar (or Coolify label) ships container stdout/stderr
-  to Cloud Logging in `northamerica-northeast1` with labels
-  `workspace={slug}`, `app={app-uuid}`, `service={web|worker|...}`.
- Per-workspace log bucket for retention isolation.
-
-**Surface:**
-
-| Tool | Purpose |
-|---|---|
-| `apps.logs` | Last N lines across replicas. Filter by timestamp, severity. |
-| `apps.logs.tail` | SSE stream of new log lines. |
-| `apps.logs.search` | Thin wrapper on Cloud Logging's query API — grep, severity filter, time window. |
-| `apps.services.logs` | Same, scoped to a single service. |
-
-**Retention:** default 30 days in the workspace log bucket; exportable
-to the workspace's GCS bucket on request for long-term storage.
-
-**Estimate:** **3 days** (fluent-bit config + thin API wrapper).
-
---
-
-### P6.3 · Scoped API keys
-
-**Goal:** invite a CI bot or teammate without giving root on the
-workspace.
-
-**Why:** solo-builder flow survives without it. Breaks the moment a
-second principal enters.
-
-**Shape:**
- Keys gain `scopes: string[]` and optional `expires_at`.
- Scope tokens: `apps:read`, `apps:write`, `apps:delete`,
-  `databases:*`, `auth:*`, `domains:read`, `domains:write`,
-  `storage:*`, `email:send`, `cron:*`, `queues:*`, `deploy:*`.
- Per-scope rate limits optional (Tier 3; API shape supports it from
-  day one).
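A scope check covering both exact tokens and `resource:*` wildcards might look like the following. `hasScope` is a hypothetical name, and the sketch assumes scope tokens always have the `resource:action` shape listed above.

```ts
// Hypothetical scope check for the principal resolver.
function hasScope(granted: string[], required: string): boolean {
  // "databases:read" is satisfied by "databases:read" or by "databases:*".
  const resource = required.split(':')[0];
  return granted.includes(required) || granted.includes(`${resource}:*`);
}
```

Each MCP/REST handler would declare its required token (for example `apps:delete`) and reject the request when `hasScope` returns `false`.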
-
-**Surface changes:**
-
-| Tool | Change |
-|---|---|
-| `keys.create` | Accepts `scopes`, `expires_at`. |
-| `keys.list` | Returns scopes per key. |
-| `keys.rotate` | Mints new token, preserves scope set. |
-
-Every MCP/REST handler gets a scope requirement checked in the
-principal resolver.
-
-**Estimate:** **1 week.**
-
---
-
-### Tier 2 total: ~2 weeks
-
-After Tier 2 lands, a SaaS shipped on Vibn can survive without you
-dropping into a psql REPL at 3am.
-
---
-
-## Tier 3 — Matters once usage scales
-
-Don't build these until at least one real customer is hitting them.
-Building them pre-market is the classic infra-overinvestment trap.
-
-### P7.1 · Per-workspace quotas + cost caps
-Max apps, max dbs, max GCS GB, max egress, max SES messages/month, max
-OpenSRS spend/month. Per-plan configurable. Hallucinating agents can't
-OOM the cluster or burn your SES reputation.
-
-### P7.2 · Audit log
-Append-only per-workspace log of (principal, action, params, timestamp,
-result). Cloud Logging with a dedicated `audit-logs` log-bucket, 400-day
-retention. Read API for the settings panel. Needed for any
-SOC-2-adjacent buyer.
-
-### P7.3 · Preview-per-PR environments
-Open a PR → `pr-42.mark.vibnai.com` deploys automatically with a
-throw-away database. Teardown on PR close/merge. Unblocks multi-agent
-flows.
-
-### P7.4 · Atomic multi-resource operations (`stacks`)
-`POST /stacks` takes a full app + db + auth + domain + cron spec;
-creates atomically, rolls back on failure. Agent ergonomics win once
-demo flow is routine.
-
-### P7.5 · Billing integration
-Stripe subscriptions for Vibn itself (workspace billing), plus
-per-workspace float top-ups, plus reconciliation to the OpenSRS master
-deposit and GCP / SES cost allocation. Only needed when you charge
-real dollars.
-
-### P7.6 · Assured Workloads for Canada
-GCP policy-enforced Canadian residency + Canadian personnel access.
-For regulated customers (healthcare, financial, public sector). Priced
-accordingly; ship only when a real customer needs it.
-
-### P7.7 · CIRA D-Zone as a workspace DNS option
-Swap Cloud DNS → CIRA D-Zone for a workspace with strict residency
-requirements. API-compatible wrapper so nothing agent-facing changes.
-
---
-
-## Tier 4 — Revisit when demanded
-
-Items to explicitly *not* build until a concrete customer asks.
-
- **Multi-region** — single-region Canada is fine for B2B SaaS makers
-  (our early market).
- **Cloud Run migration** — would rewrite most of Coolify-based
-  capabilities. Revisit if/when Coolify becomes a bottleneck.
- **Managed search / vector DB as first-class types** — agents can
-  deploy Meilisearch / Typesense / pgvector-Postgres as regular services.
- **mTLS / custom CAs / BYO-cert upload** — enterprise creep.
- **MCP protocol polish** (streaming, resources, prompts, per-tool
-  schemas) — current JSON-over-HTTP works. Revisit on real friction.
- **Per-app basic auth, IP allowlists, WAF** — Traefik middleware
-  manually until someone asks.
-
---
-
-## Roadmap at a glance
-
-| Phase | Items | Est. | Unblocks |
-|---|---|---|---|
-| **P5 — Real SaaS primitives** | Domains, email, storage, workers/cron/queues | ~5 wk | Shipping a real product |
-| **P6 — Keep-it-running** | Backups, runtime logs, scoped keys | ~2 wk | First real customer survives |
-| **P7 — Scale** | Quotas, audit, previews, stacks, billing, Assured Workloads, D-Zone | demand-driven | Platform grows past 1st cohort |
-| **P8+** | Tier 4 items | never, unless pulled by customer | — |
-
-**Total to "agent ships a SaaS a founder would pay $29/mo for":**
-P5 + P6 = **~7 weeks** (was ~11 before GCP-CA; ~40% compression from
-managed-service leverage).
-
---
-
-## Dependency graph
-
-```
-P5.1 Domains ──┬──→ P5.2 Email Phase B (per-domain DKIM)
-               ├──→ P7.7 CIRA D-Zone swap
-               └──→ (future: customer-owned sub-domain routing)
-
-P5.3 Storage ──┬──→ P6.1 Database backups (backups need a bucket)
-               └──→ P7.2 Audit log export
-
-P5.4 Workers/cron/queues ──┬──→ P6.1 Database backups (run via scheduler)
-                           └──→ most real SaaS patterns
-
-P6.2 Runtime logs — independent, can land anytime
-P6.3 Scoped keys — independent, can land anytime
-P7.6 Assured Workloads — wraps everything; build once demanded
-```
-
-**Parallelizable (three people):**
- Track A: P5.1 → P5.2
- Track B: P5.3 → P6.1
- Track C: P5.4 → P6.2
-
-Track C finishes earliest; use that slack to land P6.3.
-
---
-
-## Per-workspace GCP provisioning (shared across P5.3, P5.4, P6.1, P6.2)
-
-`ensureWorkspaceProvisioned()` gains a GCP-CA block that runs once per
-workspace, idempotently. All resources are created in
-`northamerica-northeast1`.
-
-| Resource | Name pattern | Notes |
-|---|---|---|
-| GCS bucket | `vibn-ws-{slug}` | Uniform bucket-level access. Lifecycle policies off by default. |
-| Cloud DNS managed zone | `vibn-ws-{slug}-zone` | Created per workspace-owned domain in P5.1, not on workspace provision. |
-| Cloud Logging log bucket | `vibn-ws-{slug}-logs` | 30-day retention default. |
-| Cloud Tasks location | `northamerica-northeast1` | Queues created per-app in P5.4, not here. |
-| GCP service account | `vibn-ws-{slug}@{project}.iam` | Single SA per workspace, narrow roles. |
-| Service account key | stored encrypted in `vibn_workspaces` | AES-256-GCM, same `VIBN_SECRETS_KEY`. |
-
-**New columns** on `vibn_workspaces` (cumulative across P5.1-P6.2):
-
-```sql
-- P5.1
-preferred_dns_provider TEXT DEFAULT 'cloud_dns',
-cloud_dns_zone_name   TEXT,
-
-- P5.3
-gcs_bucket_name                   TEXT,
-gcp_service_account_email         TEXT,
-gcp_service_account_key_encrypted BYTEA,
-
-- P5.4
-cloud_scheduler_location TEXT DEFAULT 'northamerica-northeast1',
-cloud_tasks_location     TEXT DEFAULT 'northamerica-northeast1',
-
-- P6.2
-cloud_logging_bucket_name TEXT
-```
-
-Three migration steps, one per phase. All guarded by the existing
-admin-gated `POST /api/admin/migrate` endpoint.
-
---
-
-## Non-goals (stated explicitly so they don't creep in)
-
- **A general-purpose PaaS.** Vibn is an agent-driven SaaS builder, not
-  a Heroku / Fly clone. Every capability must answer "what does an agent
-  need to build a SaaS?" — not "what does a dev need to deploy a
-  container?"
- **Support for non-allowlisted auth providers, databases, services.**
-  The curated surface is the feature. "Any Coolify service" would blow
-  up the tenant-safety model and dilute agent decision-making.
- **A consumer-facing OpenSRS UI.** OpenSRS is plumbing for the agent.
-  Humans should never see an OpenSRS checkout screen — only
-  `domains.register { name: "mysaas.com" }` from the agent.
- **Multi-cloud abstraction layer.** One Coolify cluster + GCP-CA +
-  SES-CA + OpenSRS is the contract. If customers want to bring their
-  own, that's Tier 4.
- **Anything that moves customer data out of Canada.** Even for
-  performance. If a managed service only has US regions, we self-host
-  in Canada or we don't offer it.
-
---
-
-## Recommended execution order (opinionated)
-
-Given dependencies and quick-wins-first philosophy:
-
-**Week 1:**
- P5.3 Storage (GCS wrap, 3 days) → proves the GCP-CA provisioning pattern.
- P5.4 Workers/cron/queues (starts in parallel; depends on P5.3 only for
-  the service account).
-
-**Week 2:**
- P5.4 completes.
- P5.1 Domains starts (OpenSRS client + Cloud DNS wrapper).
-
-**Week 3:**
- P5.1 completes.
- P5.2 Email Phase A (shared-sender MVP) starts.
-
-**Week 4:**
- P5.2 Phase A completes.
- P5.2 Phase B (per-domain DKIM) starts, now that P5.1 is available.
-
-**Week 5:**
- P5.2 Phase B completes. **P5 / Tier 1 done.**
- P6.1 Database backups starts (3 days).
- P6.2 Runtime logs starts in parallel (3 days).
-
-**Week 6:**
- P6.3 Scoped keys (1 week).
-
-**Week 7:**
- Slack week — hardening, docs (`AI_CAPABILITIES.md` refresh), first
-  real customer onboarding.
-
-**End state at week 7:** agent can take a founder from "I have an idea"
-to "I have `mysaas.com` live, with auth, with user uploads, with email,
-with backups, with visible error logs, and a CI bot can deploy it
-without root access."
-
-That's the Vibn product.
-
---
-
-## How to use this doc
-
- When someone proposes a feature, find its tier. If it's Tier 3 or 4
-  and we're still shipping Tier 1, say no.
- Before starting a Tier 1 item, re-read its section and make sure
-  prerequisites shipped. Email-per-domain before domains is wasted code.
- [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) is the canonical
-  reference of *what exists today*. This doc is the canonical reference
-  of *what comes next*. When an item ships, move it from here to that
-  doc and delete its section here.
- When a user request implies Canadian residency (they say "PIPEDA",
-  "healthcare", "public sector", or "our data can't leave Canada"), pin
-  the answer to this doc's §0 Substrate & constraints. Don't improvise.
+Current pending capabilities/roadmap items are tracked in `BETA_LAUNCH_PLAN.md`.
--- a/docs/AI_HARNESS_GAPS.md
+++ b/docs/AI_HARNESS_GAPS.md
@@ -1,227 +1,8 @@
-# AI Harness Gaps — Proposal
+# AI Harness Stability & Middleware (Shipped)

-> Four gaps in the Vibn AI experience that are **structural, not promptable**.
-> Each one is responsible for a specific failure pattern visible in real
-> production chat transcripts. None of them are scoped in
-> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
-> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
-> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
-> agent-execution / telemetry-streaming designs.
->
-> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
->
-> **Why these four:** they share a common shape — the model is doing what
-> the prompt told it to, and still producing a bad outcome. The fix lives
-> in the *harness around the model*, not in instructions to the model.
+> **Note:** These middleware stability mechanisms have been shipped.

---
-
-## TL;DR
-
-| # | Gap | Failure pattern in prod | Fix size |
-|---|---|---|---|
-| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
-| 2 | Browser-driver tool for the AI | "Should be live in 10s" — AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
-| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
-| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |
-
-Total: ~15 hr of work. None require new infra.
-
---
-
-## Gap 1 — Tool-error recovery middleware (highest ROI)
-
-**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
-`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
-On each attempt it responded by *creating a new service with a new name*,
-not by calling `apps_unstick`. The prompt explicitly tells it not to do
-this and tells it the recovery sequence. The model still did it.
-
-**Why prompt rules fail here:** the model treats the system prompt as
-soft guidance buried in a 30k-token document, while the tool result is
-concrete and 200ms-fresh. When tool reality contradicts prompt rules,
-tool reality wins.
-
-**Proposed fix:** middleware in `executeMcpTool` that pattern-matches
-known-recoverable errors and **injects a synthetic system message** into
-the conversation before the next round. The model can't ignore an
-injected instruction the way it can ignore a static prompt rule.
-
-```ts
-// In app/api/chat/route.ts, around the executeMcpTool call:
-const errorRecovery = detectKnownError(result);
-if (errorRecovery) {
-  messages.push({
-    role: "system",
-    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
-  });
-}
-```
-
-**Initial recovery rules** (high-confidence, low-false-positive):
-
-| Error signature | Diagnosis | Fix | Antipattern |
-|---|---|---|---|
-| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
-| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
-| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |
-
-**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
-patterns + the injection in the chat route. Each rule is ~10 lines.
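
The registry mentioned above could look roughly like the sketch below. It assumes the tool result is available as a plain string; the `RecoveryRule` shape and the regular expressions are illustrative guesses derived from the table of initial rules, not the contents of the real `lib/ai/error-recovery.ts`.

```typescript
// Hypothetical registry shape; field names mirror the table columns.
interface RecoveryRule {
  signature: RegExp;
  diagnosis: string;
  fix: string;
  antipattern: string;
}

const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/,
    diagnosis: "Orphan container blocking new boot",
    fix: "apps_unstick { uuid } then apps_deploy { uuid }",
    antipattern: "delete and recreate with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/,
    diagnosis: "Image not on the host yet",
    fix: "apps_repair { uuid }",
    antipattern: "retry deploy without addressing the cause",
  },
  {
    signature: /port .+ is already allocated/,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

// Returns the first rule whose signature matches the tool output, or null
// when the output is not a known-recoverable error.
function detectKnownError(toolResult: string): RecoveryRule | null {
  for (const rule of RECOVERY_RULES) {
    if (rule.signature.test(toolResult)) return rule;
  }
  return null;
}
```

Keeping each rule as data rather than branching logic is what makes the "~10 lines per rule" estimate plausible: adding a rule means appending one object.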
-
-**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).
-
---
-
-## Gap 2 — Browser-driver tool for the AI
-
-**Failure observed:** in the same Twenty thread, the AI said *"It's
-fully deployed, healthy, and I've verified it's returning a 200 OK
-status"* — but the user saw "Unable to Reach Back-end" on the actual
-page. The AI checked Coolify's status reporting, not the rendered app.
-Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
-on the very first load for the DNS to propagate"* — the AI hedged
-because it couldn't load the URL itself.
-
-**Why this matters for beta:** every "I deployed it" claim is unverified
-unless the AI can open the URL. Sentry (planned in P2.3) catches
-errors *after a user hits them*. A browser tool catches errors
-*before any user hits them*.
-
-**Proposed fix:** add a `browser.*` MCP tool surface backed by a
-headless Chromium running on the Coolify host (or in the vibn-dev
-container). Initial tools:
-
-| Tool | Purpose |
-|---|---|
-| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
-| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
-| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
-| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |
-
-**Implementation:** Playwright already has an MCP server (`@modelcontextprotocol/server-playwright`).
-Wire it as a Coolify service, expose via the same per-workspace MCP
-token Vibn already issues.
-
-**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
-add tool definitions, ~1 hr to wire prompt instructions ("after any
-deploy or `dev_server.start`, call `browser.navigate` to confirm").
-
-**Slot into:** Phase 2 (Stability & visibility) — pairs with the
-runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).
-
---
-
-## Gap 3 — Live UI state attached to chat messages
-
-**Failure observed:** in the Dr Dave thread, user typed *"are you able
-to give me a preview url?"* The AI didn't know which port the
-Next.js dev server would bind to, what was already running, or
-whether the user was looking at the chat or another tab. It
-guessed and re-discovered everything from scratch.
-
-In the Twenty thread, *"can you see the different sections?"* — user
-meant Plan tab sections (Vision/Tasks/Decisions/Ideas). AI listed
-metadata. No way to know.
-
-**Why prompt rules can't fix this:** the AI literally lacks the
-information.
-
-**Proposed fix:** the chat panel sends a small `uiContext` object
-alongside every user message. Inject into the system prompt as a
-dynamic block (same shape as `activeBlock`):
-
-```ts
-{
-  currentRoute: "/mark-account/project/abc/hosting",
-  currentTab: "hosting",
-  visibleResources: [
-    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
-    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
-  ],
-  lastUserActions: [
-    { at: "2m ago", action: "opened twenty-crm logs" },
-    { at: "5m ago", action: "switched to Hosting tab" },
-  ],
-}
-```
-
-System-prompt block becomes:
-
-> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
-> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
-> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
-> When the user says "this" / "it" / "the URL" — assume they mean
-> something visible in the current viewport unless they name something else.
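
One way to turn the `uiContext` object into the prompt block quoted above is sketched here. The `UiContext` interface simply mirrors the example payload; `renderUiContextBlock` is a hypothetical name, and the exact wording of the generated block is an assumption.

```typescript
// Hypothetical types mirroring the example uiContext payload.
interface VisibleResource {
  kind: string;
  uuid: string;
  name: string;
}

interface UserAction {
  at: string; // relative time as sent by the chat panel, e.g. "2m ago"
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: VisibleResource[];
  lastUserActions: UserAction[];
}

// Builds the dynamic system-prompt block from the panel's UI state.
function renderUiContextBlock(ctx: UiContext): string {
  const resources = ctx.visibleResources.map((r) => r.name).join(", ") || "none";
  const actions = ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ") || "none";
  return [
    `The user is currently looking at the ${ctx.currentTab} tab (route: ${ctx.currentRoute}).`,
    `Visible resources: ${resources}.`,
    `Recent actions: ${actions}.`,
    'When the user says "this" / "it" / "the URL", assume they mean something visible',
    "in the current viewport unless they name something else.",
  ].join("\n");
}
```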
-
-**Effort:** ~3 hr. ~1 hr to wire the chat panel's
-`uiContext` collection (existing route + tab state, last 5 actions
-from a small ring buffer in the panel), ~1 hr to plumb through the
-chat API, ~1 hr to add the prompt block.
-
-**Slot into:** Phase 3 (UX surfaces) — pairs with 3.2 (structured
-errors in chat) and 3.3 (empty-state nudges).
-
---
-
-## Gap 4 — Diff preview / accept-changes gate
-
-**Failure observed:** none yet, but the surface is exposed today —
-`fs_edit` writes directly to `/workspace` in the dev container. For
-ephemeral exploration this is correct (sub-second iteration is the
-whole Path B point). For changes destined to ship, the user has no
-review surface; they only see what changed after the AI summarizes.
-
-**Why this matters for beta:** the moment a paying user wants to
-"see what the AI changed before it goes live," there's nothing to
-show them. Cursor's whole UX is built on diffs the user accepts.
-
-**Proposed fix:** two-mode `fs_edit` / `fs_write`:
-
-1. **Direct mode (default for dev container):** write immediately. Current
-   behavior. Fine for "make the button blue" iteration.
-2. **Staged mode (default when `ship` is the next likely action):**
-   write to a shadow path, surface a diff in the chat UI, gate the
-   real write on a one-click "Accept" button.
-
-The model decides which mode based on context — or simpler: stage when
-the file is in a "protected" set (e.g. `prisma/schema.prisma`,
-`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
-direct otherwise.
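
The simpler "protected set" rule could be sketched as follows. `chooseEditMode` is a hypothetical helper; treating any path segment named `prod` or `migrations` as protected, and stripping a leading `/workspace/`, are assumptions about how "anything in `prod/` or `migrations/`" would be interpreted.

```typescript
type EditMode = "direct" | "staged";

// Exact relative paths that always require review (taken from the examples above).
const PROTECTED_FILES = new Set(["prisma/schema.prisma", "Dockerfile", "package.json"]);
// Directory names that mark everything beneath them as protected.
const PROTECTED_DIRS = ["prod", "migrations"];

// Decides whether an edit is written immediately or staged behind an Accept click.
function chooseEditMode(path: string): EditMode {
  // Normalise to a path relative to the workspace root (assumed layout).
  const relative = path.replace(/^\/workspace\//, "").replace(/^\.\//, "");
  if (PROTECTED_FILES.has(relative)) return "staged";
  // Only directory segments count, so a file literally named "prod" is not protected.
  const directories = relative.split("/").slice(0, -1);
  return directories.some((dir) => PROTECTED_DIRS.indexOf(dir) !== -1) ? "staged" : "direct";
}
```

A static rule like this is deterministic and easy to audit, which is the main reason to prefer it over letting the model pick the mode from context.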
-
-**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
-~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
-~1 hr prompt + tool changes.
-
-**Slot into:** Phase 4 (Onboarding & safety) — pairs with 4.5 (auth
-hardening) and 4.6 (compute quotas) as part of "what a stranger
-needs day 1."
-
---
-
-## Suggested sequencing
-
-If we ship in priority order:
-
-1. **Gap 1 first** — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
-2. **Gap 2 second** — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
-3. **Gap 3 third** — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
-4. **Gap 4 last** — only matters once we have paying users editing prod-bound code. Pre-beta optional.
-
-Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**
-
---
-
-## How this changes BETA_LAUNCH_PLAN.md
-
-Two new tasks slot in:
-
- **P2.8** Tool-error recovery middleware (Gap 1) — block on nothing, ship before P2.4.
- **P2.9** Browser-driver MCP tool (Gap 2) — block on nothing.
-
-One new task in P3:
-
- **P3.7** UI-state injection into chat (Gap 3) — block on nothing.
-
-Gap 4 stays out of beta scope unless eval reveals real damage from
-unstaged edits.
+- The chat loop (`app/api/chat/route.ts`) acts as a harness that intercepts tool errors and automatically suggests recovery paths for known failures (e.g., port conflicts, container collisions).
+- The maximum tool execution loop is capped (`MAX_TOOL_ROUNDS=30`) to prevent runaway AI loops.
+- `fs_edit` uses line-number replacements alongside strict `oldString` matching to avoid Aider-style search-and-replace failures.
+- Sentry and Coolify deployment webhooks automatically pipe deployment/build failures back to the user/AI.
--- a/docs/AI_PATH_B_EXECUTION_PLAN.md
+++ b/docs/AI_PATH_B_EXECUTION_PLAN.md
@@ -1,288 +1,12 @@
-# Path B Execution Plan — Persistent Dev Container Architecture
+# AI Path B (Shipped)

-> The plan to replace Vibn's current "API-wrap-every-Coolify-action" agent
-> surface with a Claude-Code-style architecture: one persistent dev
-> container per Vibn project, ~10 composable tools, sub-15-second
-> iteration, and Coolify only touched at "ship it" time.
->
-> **Companion to:** [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) (current
-> state) and [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md)
-> (everything else).
->
-> **Status:** week 1 shipped (2026-04-28). Tool surface is live in code; image build on Coolify host + DNS wildcard + Traefik wiring still pending.
->
-> **Why this exists:** today's AI loop is *3–7 min to first preview, 2–4
-> min per iteration*, because every change goes through a Coolify nixpacks
-> build. That UX cannot host the marketplace / SaaS / iterative-build
-> stories Vibn is selling. Path B fixes the floor.
+> **Note:** This document outlines the architecture for "Path B", which shifted the AI's execution context from Cloud Run to persistent per-project Docker containers hosted on the Coolify server. This architecture fully shipped in May 2026.

---
+## Architecture
+- Every project has a persistent Gitea repository.
+- Every project gets a single `vibn-dev` container provisioned as a Coolify service (`ensureDevContainer`).
+- The AI runs its tools (like `shell_exec` and `fs_*`) *inside* this container using `docker exec` via the Coolify API.
+- Dev servers (like `npm run dev`) bind to `0.0.0.0:3000` and are exposed to the internet via Traefik wildcard subdomains (`*.preview.vibnai.com`).
+- When the user is ready, the code is committed to Gitea and deployed to production via `apps_deploy`.

-## 1. The user experience this unlocks
-
-Reference scenario: a non-technical founder chats *"build me a
-two-sided marketplace for handmade ceramics."*
-
-| Phase | Path A (today) | Path B (target) |
-|---|---|---|
-| Discovery & OSS pick | OK | OK |
-| Fork an OSS base (e.g. Sharetribe, 800 files) | ~15 min of single-file commits, 800 webhook fires | `git clone` in 8s |
-| First live preview | 3–7 min (Coolify build) | ~30s (Vite HMR in dev container) |
-| Each iteration | 2–4 min (rebuild) | 3–15s (HMR / process restart) |
-| User makes 10 small decisions | ~40 min of staring at spinners | ~3 min of conversation |
-| "Ship it" → real domain | already 3 min | 3 min (unchanged — this is the only Coolify build) |
-| Total time to live, polished marketplace | 30–60 min, often abandoned | ~20 min, mostly the user thinking |
-
-The asymmetry is structural, not optimisable inside Path A.
-
---
-
-## 2. Architecture overview
-
-```
-┌──────────────────────────┐     ┌────────────────────────────────┐
-│  vibnai.com chat (user)  │ ←→  │  /api/mcp                       │
-└──────────────────────────┘     │   ├ shell.exec                  │
-                                 │   ├ fs.read / fs.edit / fs.glob │
-                                 │   ├ dev_server.start            │
-                                 │   ├ ship                        │
-                                 │   └ apps.* / databases.* / ...  │
-                                 └────────────┬───────────────────┘
-                                              │
-                                              ▼ (workspace-scoped)
-                          ┌────────────────────────────────────┐
-                          │  Per-Vibn-project Coolify project  │
-                          │   ├ vibn-dev   ← dev container     │
-                          │   ├ web         ← prod app         │
-                          │   ├ db                              │
-                          │   └ ...                             │
-                          └────────────────────────────────────┘
-```
-
-### Per-project dev container — the only new piece
-
-For every active Vibn project, we run **one long-lived Coolify
-service named `vibn-dev`** inside that project's dedicated Coolify
-project (Stage 2/3 of per-project isolation already shipped).
-
-| Property | Value |
-|---|---|
-| **Image** | `ghcr.io/vibnai/vibn-dev:latest` (we build & maintain) |
-| **Base** | Ubuntu 24.04 |
-| **Pre-installed** | Node 20, bun, pnpm, Python 3.12 + uv, Go 1.23, Rust, git, gh, `tea` (Gitea CLI), ripgrep, fd, jq, curl, tar, openvscode-server |
-| **Default `cwd`** | `/workspace` (persistent volume containing the Gitea working tree) |
-| **Persistent volumes** | `/workspace` (git tree), `/cache/{npm,pip,go,cargo}` (package caches) |
-| **Resource floor** | 512 MB / 0.25 CPU when idle |
-| **Resource ceiling** | 4 GB / 2 CPU during builds (configurable per workspace plan) |
-| **Idle suspend** | After 30 min no `shell.exec` activity |
-| **Re-wake** | Any `shell.exec` / `fs.*` / `dev_server.*` call |
-| **Ports** | 3000–9999 reserved for the AI's dev server, exposed at `https://preview-{ws}-{project}.vibnai.com` via Traefik wildcard |
-| **Tenancy** | Inherits per-project Coolify isolation — workspace can never reach into another's dev container |
-
-### Why this shape (and not e2b / Cloud Run / VM-per-task)
-
- We already have Coolify, per-project Coolify projects, and Coolify
-  exec primitives. Adding one service per project is zero new infra.
- Persistence (workspace state, package cache, git working tree)
-  matters more than per-task isolation for our user. Founders return
-  to projects across sessions.
- Tenant safety is already solved at the Coolify-project layer.
- Cost stays bounded: one container per *active* project, idle-suspended.
- Upgrade path to e2b / Firecracker exists later if needed (replace the
-  executor, keep the tool surface).
-
---
-
-## 3. Tool surface
-
-### New tools (the AI's primary working set)
-
-| Tool | Signature | Purpose |
-|---|---|---|
-| `shell.exec` | `{ cmd, cwd?, timeoutSec?, env? }` | Run any shell command in the dev container. Streams stdout/stderr back. Capped 15 min. |
-| `fs.read` | `{ path, ref? }` | Read a file (or directory listing) from `/workspace`. |
-| `fs.write` | `{ path, content }` | Create/overwrite a file. |
-| `fs.edit` | `{ path, oldString, newString, replaceAll? }` | Aider-style search/replace. Fails if `oldString` not found / not unique. |
-| `fs.glob` | `{ pattern, cwd? }` | List files matching a pattern (e.g. `**/*.tsx`). |
-| `fs.grep` | `{ pattern, glob?, contextLines? }` | ripgrep-backed code search. |
-| `fs.delete` | `{ path }` | Delete a file or directory. |
-| `dev_server.start` | `{ cmd, port, name? }` | Start a long-running process (e.g. `npm run dev`). Returns a public preview URL. |
-| `dev_server.stop` | `{ id }` | Kill a dev server. |
-| `dev_server.list` | — | What's running, on what URL. |
-| `ship` | `{ projectId, commitMsg, deploy? }` | `git add . && git commit && git push` to Gitea, then trigger Coolify deploy of the prod app. The "graduate to production" tool. |
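
The `fs.edit` contract in the table (fail when `oldString` is missing or not unique, unless `replaceAll` is set) can be illustrated with a small in-memory sketch. `applyEdit` and `EditResult` are hypothetical names; the real tool operates on files inside the dev container, which is not modelled here.

```typescript
// Hypothetical result type: success carries the new content and a replacement count.
type EditResult =
  | { ok: true; content: string; replacements: number }
  | { ok: false; reason: "not_found" | "ambiguous"; matches: number };

// Counts non-overlapping occurrences of needle in haystack.
function countOccurrences(haystack: string, needle: string): number {
  if (needle.length === 0) return 0;
  let count = 0;
  let index = haystack.indexOf(needle);
  while (index !== -1) {
    count += 1;
    index = haystack.indexOf(needle, index + needle.length);
  }
  return count;
}

function applyEdit(
  content: string,
  oldString: string,
  newString: string,
  replaceAll: boolean = false,
): EditResult {
  const matches = countOccurrences(content, oldString);
  if (matches === 0) return { ok: false, reason: "not_found", matches };
  if (matches > 1 && !replaceAll) return { ok: false, reason: "ambiguous", matches };
  // split/join performs a literal replacement, so "$" in newString is not special.
  const updated = content.split(oldString).join(newString);
  return { ok: true, content: updated, replacements: matches };
}
```

Refusing ambiguous matches forces the caller to supply enough surrounding context to pin down a single location, which is the property that makes search/replace edits safe to apply unattended.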
-
-### Kept (orchestration — these are correctly modeled as APIs)
-
- `apps.*` — Coolify app CRUD, logs, domains, env vars, etc.
- `databases.*`, `auth.*`, `domains.*`, `storage.*` — infrastructure primitives.
- `projects_get`, `projects_list`, `workspace_describe` — context.
- `github_search`, `github_file`, `http_fetch` — external lookup.
-
-### Deprecated (kept for back-compat, banner in docs)
-
- `gitea_file_read`, `gitea_file_write`, `gitea_file_delete`,
-  `gitea_branches_list`, `gitea_branch_create`,
-  `gitea_repo_create`, `gitea_repo_get`, `gitea_repos_list` — the
-  AI uses `shell.exec` (`git`/`tea` CLI) and `fs.*` instead.
- `apps.exec` — kept (it's still useful for prod-container debugging),
-  but deprecated for *dev-time* code work.
-
-**Net change:** 53 tools → ~30 tools, but the new ones compose to do
-everything the old ones did and more.
-
---
-
-## 4. The system prompt rewrite
-
-The AI's prompt today says *"call gitea_file_write to push code."* It
-becomes:
-
-> You have a real Linux dev environment for this project at `/workspace`.
-> Use `shell.exec` to run any command (npm, git, tea, python, anything).
-> Use `fs.edit` for surgical changes, `fs.write` for new files.
->
-> Standard loop:
-> 1. `shell.exec { cmd: "git status" }` to see what's there.
-> 2. Edit / create files via `fs.edit` / `fs.write`.
-> 3. `shell.exec { cmd: "npm test" }` (or relevant test runner).
-> 4. `dev_server.start` to give the user a live preview URL.
-> 5. When the user says "ship it", call `ship` — that pushes and
->    triggers the production Coolify deploy.
->
-> NEVER call `apps_create` to deploy code that hasn't been tested via
-> `shell.exec` first. The dev container is your safety net.
-
---
-
-## 5. Week-by-week execution
-
-### Week 1 — Foundations (dev container + shell) — **SHIPPED 2026-04-28**
-
-**Goal:** AI can clone a repo, install deps, run a script.
-
- [x] `vibn-dev/Dockerfile` (Ubuntu 24.04 + git + ripgrep + python3 + mise lazy toolchains). `setup-on-coolify.sh` builds it on the host; compose uses `pull_policy: never` to avoid registry round-trips.
- [x] `lib/dev-container.ts`: ensure / exec / suspend / resume helpers. Backed by `fs_project_dev_containers` (auto-created).
- [x] `devcontainer.{ensure,status,suspend}` MCP tools.
- [x] `shell.exec` + `fs.{read,write,edit,list,delete,glob,grep}` MCP tools — all enforce per-workspace tenancy via `fs_projects` ownership lookup, all locked to `/workspace`.
- [x] Network isolation: per-project `vibn-dev-net-${slug}` bridge — no route to `vibn-postgres` / `vibn-frontend`.
- [x] Kill switch: `/api/admin/path-b/{disable,enable}` flips a feature flag in <10s.
- [x] `vibn-tools.ts`: 11 new Gemini tool defs, smoke test passes (63 tools accepted).
- [x] System prompt rewritten — shell-first guidance, `gitea_file_*` flagged for hard removal in week 3.
-
-**Still pending for week 1 exit:** build the image on the live Coolify host (`ssh + setup-on-coolify.sh`), end-to-end verify `devcontainer.ensure → shell.exec ls` against a real project once the frontend deploy lands.
-
-### Week 2 — Preview URLs + iteration — **PARTIALLY SHIPPED 2026-04-28**
-
-**Goal:** AI starts a dev server, user clicks a preview URL, sees their app.
-
- [ ] DNS: `*.preview.vibnai.com → coolify-host-ip` in OpenSRS. **Manual step, not yet done.**
- [ ] Traefik wildcard cert via DNS-01 against OpenSRS. **Config staged in `vibn-dev/PREVIEWS.md`, not yet applied to live Traefik.**
- [x] `dev_server.{start,stop,list,logs}` MCP tools. Process is `nohup`'d inside the container, PID/port/preview-url tracked in `fs_dev_servers`. Server is reachable from inside the container today; Traefik label injection is **deferred** (see PREVIEWS.md for the recommended pre-allocated-port-range approach).
- [x] `fs.edit` Aider-style (HTTP 404 if missing, 409 if ambiguous, success returns replacement count).
- [x] Per-container CPU/RAM caps: 1 vCPU / 1 GiB by default. Tier scaling via env var.
- [x] System prompt rewritten with shell-first recipe.
-
-**Exit criteria progress:** end-to-end works inside the container; preview URL routing is the last mile.
-
-### Week 3 — Ship-it path + cleanup — **PARTIALLY SHIPPED 2026-04-28**
-
-**Goal:** the dev container's working tree graduates to production.
-
- [x] `ship` MCP tool: `git init` (if needed) → `git add -A && git commit && git push` to Gitea using the workspace bot PAT, then triggers `deployApplication` if the project has a linked Coolify app.
- [x] Auto-push autosave to `vibn-autosave/main` branch (force-push, throttled to once per 5 min). Endpoint: `POST /api/admin/path-b/autosave { projectId | sweep:true }`.
- [x] Idle-suspend sweep: `POST /api/admin/path-b/idle-sweep[?minutes=30]`. Wire to a 5-min cron once we trust the suspend path.
- [ ] Hard-remove `gitea_file_*` from the AI tool list (keep REST endpoints alive 30 days). **Deferred to next week so we can A/B the new tools first.**
- [ ] Update `AI_CAPABILITIES.md`. **Deferred — will rewrite once eval data is in.**
-
-**Exit criteria progress:** ship loop is functionally complete. Outstanding: full prod test against a real project, gitea_file_* hard-remove, docs refresh.
-
-### Week 4 — Eval, polish, IDE drop-in
-
-**Goal:** measure that this actually delivers the promised UX, ship the optional graduation path.
-
- [ ] **Eval harness:** 10 reference prompts (TODO app, marketplace, blog with auth, kanban, image-uploader, AI chatbot, simple e-commerce, dashboard, REST API + DB, static site). Measure: time-to-first-preview, time-to-shipped, AI tool-call count, success rate. Compare to a baseline run on Path A.
- [ ] **Theia drop-in:** expose openvscode-server (already in the image) at `https://ide-{ws}-{project}.vibnai.com`. Optional toggle in chat UI: "Open IDE." Lets a user-becoming-developer drop into the same `/workspace` the AI's been editing.
- [ ] **Bug fixes** found during eval.
- [ ] **Docs:** update Vibn's user-facing pages to reflect the new "describe → live preview in seconds → iterate → ship" flow.
-
-**Exit criteria:** eval shows ≥3× speedup on time-to-first-preview vs.
-Path A, ≥80% success rate on the 10 reference prompts.
-
---
-
-## 6. OSS we will lean on (not reinvent)
-
-| Need | OSS choice | Notes |
-|---|---|---|
-| Dev container image base | Ubuntu 24.04 + toolchains | We bake & maintain. ~1 GB. |
-| In-browser IDE (week 4 graduation path) | `openvscode-server` (`gitpod-io/openvscode-server`, MIT) | Pre-installed in the image. Optional toggle. |
-| Edit format | **Aider's search/replace block format** (`Aider-AI/aider`, Apache 2.0) | Borrow the format + error semantics. |
-| Process supervision inside the container | `tini` (already standard) + a tiny in-house supervisor for `dev_server.*` | No need for full systemd. |
-| Code search inside the container | `ripgrep` (`BurntSushi/ripgrep`, MIT) | Pre-installed. `fs.grep` is a thin wrapper. |
-| Git inside the container | `git` + `tea` (Gitea CLI, MIT) | `tea` lets the AI do PR ops without us building gitea_pr_* tools. |
-| Reference for end-to-end agent loops | `All-Hands-AI/OpenHands` (MIT) | Read their runtime + tool design. Don't import their code. |
-| Reference for fast iteration UX | `bolt.new` (`stackblitz/bolt.new`) | UX north star, not a code source. |
-
---
-
-## 7. Risks & open questions
-
-| Risk | Mitigation |
-|---|---|
-| **Dev containers eat money.** 100 active projects × 24/7 = ~$50/mo wasted. | Idle-suspend after 30 min. Resume in <5s. Per-plan caps. Auto-delete suspended-and-untouched volumes after 30 days. |
-| **`shell.exec` is the universal escape hatch — security?** AI inside a single workspace's container can do anything that container can do. | (a) Per-project Coolify isolation. (b) **Network policy: dev containers have NO route to internal Vibn services (vibn-postgres, vibn-frontend, Coolify control plane). Implemented via Docker network rules in week 1, not deferred.** (c) Audit log on every `shell.exec` call. (d) Per-container CPU/RAM caps absorb fork-bomb / coin-mining attempts. |
-| **Preview URL leaks.** `https://preview-mark-ceramic-market.vibnai.com` is publicly resolvable. | Default: random suffix in subdomain (shortened here as `preview-mark-ceramic-market-7a3f.vibnai.com`) carrying ~64 bits of unguessability, i.e. 16 hex characters rather than the 4 shown. Optional Vibn-session-cookie auth as a paid-tier feature later. |
-| **Hot reload through Traefik.** WebSocket / HMR can be finicky over a reverse proxy. | **Spike on week 1, day 1**: bring up a Vite dev server inside vibn-dev, expose via Traefik, edit a file, verify HMR fires. Failure here is the biggest "things look fine until you actually test" risk; de-risk early. |
-| **Image size / pull time on first project.** ~1 GB pull adds 30–60s to first dev container spin-up. | (a) Pre-pull image on every Coolify host on deploy. (b) **Keep base image small (~500 MB: OS + git + ripgrep + supervisord + IDE server). Lazy-install language toolchains via `mise` on first project use.** Prevents the image from bloating to 4 GB six months from now. |
-| **Dependency cache poisoning.** Cached `node_modules` from project A bleeds into project B. | Caches are per-project (volume `vibn-dev-cache-{projectId}`). Never share. Take the slower-first-install hit; add a Verdaccio mirror later only if it bothers anyone. |
-| **AI keeps calling `gitea_file_*` instead of `shell.exec`.** | **Hard removal from AI's tool list in week 3, not soft deprecation.** Keep REST endpoints alive for a 30-day grace period for any external MCP client. After 30 days, return 410 Gone. The AI has no muscle memory; no graceful migration needed. |
-| **What if the user has no Vibn project yet?** | First chat creates a project + provisions its Coolify project + spins up `vibn-dev` lazily. ~10s overhead, one-time. Stream progress to the chat ("creating workspace... installing tools..."). Same UX bolt.new uses while WebContainers boot. |
-| **Coolify host disk dies → users lose unshipped `/workspace` work.** | **Auto-push to Gitea `vibn-autosave/main` branch every 5 min of activity, plus before idle-suspend.** Treat Gitea as canonical, container disk as ephemeral. Built in week 1, day 2 (not optional). |
-| **Path B turns out to be wrong; we need to revert.** | **Kill-switch admin endpoint (`POST /api/admin/path-b/disable`) flips a feature flag — all new chat sessions go back to Path A; existing dev containers drain.** ~10-min revert window. Built week 1. |
-
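One way to reach the ~64 bits of unguessability named in the preview-URL row is to append 8 random bytes, hex-encoded. A minimal sketch, assuming Node's built-in `crypto` module; `previewHost` and its parameters are hypothetical names for illustration, not Vibn's actual code.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: builds a preview hostname with a random suffix.
// 8 random bytes = 64 bits, rendered as 16 lowercase hex characters.
function previewHost(workspace: string, project: string, baseDomain = "vibnai.com"): string {
  const suffix = randomBytes(8).toString("hex");
  return `preview-${workspace}-${project}-${suffix}.${baseDomain}`;
}

console.log(previewHost("mark", "ceramic-market"));
```

A single DNS label is limited to 63 characters, so a real implementation would likely need to truncate long workspace or project names before appending the suffix.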
---
-
-## 8. Success metrics
-
-We're not done until **all four** are true on the eval harness:
-
-| Metric | Target | Today (Path A) |
-|---|---|---|
-| Time-to-first-preview (10 reference prompts, p50) | ≤ 60 s | ~5 min |
-| Iteration loop (small edit → user sees change) p50 | ≤ 15 s | ~3 min |
-| Tool calls per "build me X" task (median) | ≤ 30 | ~80 |
-| End-to-end success rate (live deployable result) | ≥ 80% | ~50% |
-
---
-
-## 9. What this changes about the existing roadmap
-
- **Tier 1.5 ("Code authoring capability") is collapsed into this doc.** C1–C9 mostly disappear (replaced by `shell.exec` + `fs.edit`); C10 ("persistent agent dev workspace") **is** Path B.
- **Tier 1 P5.1–P5.4 are unchanged.** Domains, email, storage, workers — still the right next infra primitives. Path B doesn't replace them; it makes the AI capable enough to actually use them.
- **Tier 2 P6.x** (backups, runtime logs, scoped keys) — unchanged.
- **`gitea_*` tools shipped 2026-04-28** are now legacy. Mark deprecated in week 3. Remove in a future cleanup once telemetry confirms zero usage.
-
---
-
-## 10. Decision needed before week 1 starts
-
-1. **Approve Path B as the primary architecture for code authoring.** (If no, this doc dies here.)
-2. **Approve the dev-container-as-Coolify-service implementation choice.** Alternatives: separate dev-host, e2b self-host, Cloud Run jobs. Picked Coolify-service for zero new infra; flag if you want to revisit.
-3. **Approve the deprecation of `gitea_file_*` tools.** They were shipped today; deprecating them within 3 weeks is fine if the path forward is clearer, embarrassing if we keep them around as half-working alternates.
-4. **Approve the resource cap defaults** (free: 1 GB / 0.5 CPU, paid: 4 GB / 2 CPU). Or set different numbers.
-
-Once those four are decided, week 1 starts.
-
---
-
-## How to use this doc
-
- This is the *architectural* execution plan. The detailed task list
-  goes into the agent's TodoWrite per-week, not into this file.
- When an item ships, **move it from "planned" to "shipped"** in
-  [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) and link the commit/PR.
- When a risk in §7 turns out to be real, document the mitigation
-  outcome inline so future readers see what actually happened.
- This doc supersedes the proposed Tier 1.5 in
-  [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md). Add a
-  one-line pointer there once approved.
+*(Refer to `lib/ai/vibn-tools.ts` and `app/api/mcp/route.ts` for the live implementation).*
--- a/docs/PROJECT_PAGE_ARCHITECTURE.md
+++ b/docs/PROJECT_PAGE_ARCHITECTURE.md
@@ -1,275 +1,11 @@
-# Project Page Architecture — Product / Infrastructure / Hosting
+# Project Page Architecture

-> The plan to collapse the 16-page sidebar mess at
-> `/[workspace]/project/[projectId]/*` into 3 founder-friendly
-> sections, and to make `/project/<id>` actually reflect what the AI
-> is doing in the dev container instead of stale Gitea/prod-Coolify
-> data.
->
-> **Companion to:** [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md)
-> (Path B is the engine; this doc is the dashboard for it).
->
-> **Status:** week 1 doc + home-page redesign in flight (2026-04-28).
+> **Note:** The UI was heavily refactored. The primary surfaces for a project are now:

---
+1. **The Plan Tab (`/plan`):** Contains the project's vision/objective document, tasks, decisions, and raw ideas. The AI acts as a scribe here.
+2. **The Product Tab (`/product`):** Lists the live codebases (Gitea) and running images (Docker containers).
+3. **The Infrastructure Tab (`/infrastructure`):** Lists the underlying resources (PostgreSQL databases, Redis, etc.) managed by Coolify.
+4. **The Hosting Tab (`/hosting`):** Lists live runtime environments, logs, and preview URLs.
+5. **The Chat Panel:** Available on all project surfaces as a slide-out, used to orchestrate work.

-## 1. Why this exists
-
-Today the project page (`/[workspace]/project/[projectId]`) shows two
-tiles — Code + Infrastructure — and links to a sidebar with 16
-sub-routes (`build`, `run`, `infrastructure`, `deployment`,
-`overview`, `insights`, `analytics`, `prd`, `tasks`, `settings`,
-`assist`, `design`, `growth`, `grow`, `mvp-setup`, `code` — the last
-of which doesn't exist as a route, so the home tile is a dead link).
-
-Two structural problems:
-
-1. **The sidebar grew without an anchor concept.** Founders have no
-   mental model of what the 16 pages map to; they just see a list
-   and click around hoping for the right one. Half the pages are
-   placeholders ("Coming soon"); the rest overlap.
-2. **None of the data sources have been updated for Path B.** The
-   Code tile reads the Gitea repo (production master branch), but the
-   AI now writes to the dev container's `/workspace`, often without
-   pushing for hours. The Infrastructure tile reads production
-   Coolify apps; new `dev_server.start` previews don't show up
-   anywhere. So when AI does great work in chat, the project page
-   doesn't update — the user has to tab back to chat to see anything.
-
---
-
-## 2. The framing
-
-Three sections, founder-friendly names, every project on Vibn maps
-cleanly into all three:
-
-| Section | What it is | Founder asks… |
-|---|---|---|
-| **Product** | Custom code, design, content built for THIS vision | *"What did I build?"* |
-| **Infrastructure** | Reusable, swappable third-party services (auth, db, email, payments…) | *"What do I depend on?"* |
-| **Hosting** | Where the product runs and how people reach it (Coolify, domain, observability, cost) | *"Where does it live?"* |
-
-### The boundary rule
-
-> **Custom code = Product. Third-party service = Infrastructure.**
-> Runtime + reachability = Hosting.
-
-Concrete edge cases:
-
- A custom `/api/upload` endpoint that calls S3 → endpoint is
-  **Product**, S3 bucket + credentials are **Infrastructure**.
- Custom job that sends a welcome email → job is **Product**, the
-  job runner (Sidekiq/BullMQ) and email service (Resend) are
-  **Infrastructure**.
- Webhook handler that processes Stripe events → handler is
-  **Product**, Stripe is **Infrastructure**.
- Coolify scheduled task that runs your code → your code is
-  **Product**, Coolify itself is **Hosting**.
-
---
-
-## 3. Charters
-
-### Product
-
-Everything custom-built for this specific vision. The unique IP that
-wouldn't exist without this product.
-
-**Includes:**
- Frontend web app
- Marketing site
- Custom backend code & APIs
- Custom business logic
- Custom jobs / runners (the code, not the runner)
- Brand, copy, design system
- The repository itself
- Customer base — the actual users you've earned
-
-**Rule:** if you wrote it for this product, it's Product. If it's
-`node_modules` or a third-party SDK, it's not.
-
-### Infrastructure
-
-The reusable, swappable services your product depends on. The
-annoying multi-vendor world where you have to pick a provider.
-
-**Includes:**
- Auth provider (Clerk, Pocketbase, Authentik, Google OAuth, …)
- Database (Postgres, MySQL, MongoDB, Redis, …)
- File storage (S3, R2, MinIO)
- Email (Resend, SendGrid, SES)
- Payments (Stripe, Paddle, Lemon Squeezy)
- Analytics (Plausible, PostHog, GA)
- Search (Algolia, Meili, Typesense)
- LLM provider (OpenAI, Anthropic, Gemini, Vertex)
- Queues, maps, SMS, push notifications, …
- Secrets and API keys that wire all of the above
-
-**Rule:** if you could swap the vendor without changing your product
-code, it's Infrastructure.
-
-### Hosting
-
-Where the product physically runs and how people reach it.
-
-**Includes:**
- Container runtime (Coolify in our case)
- Domain + DNS + SSL
- CDN / edge
- Observability (logs, errors, uptime)
- Backups
- Monthly cost
-
-**Rule:** it's about *runtime and reachability,* not about what the
-software does.
-
---
-
-## 4. Future sections (deferred)
-
-Add as separate top-level cards once they become real concerns:
-
- **Models** — for AI-heavy products: which LLMs, which embedding
-  model, prompt versions, eval scores, cost-per-call.
- **Analytics** — when there are real users worth measuring.
- **Marketing** — campaigns, blog, SEO, social, when there's a
-  growth motion.
- **Compliance** — Terms, Privacy, GDPR, SOC2, when shipping to
-  paying customers.
- **Support** — helpdesk, chat, status page, when there are
-  customers complaining.
- **Team** — when the project has more than one collaborator.
-
-Same charter template each time. Same rule: code = Product,
-swappable = Infrastructure, runs/reachable = Hosting, otherwise it
-needs its own section.
-
---
-
-## 5. Mapping today → tomorrow
-
-| Today's page | Where it goes | Notes |
-|---|---|---|
-| `(home)/page.tsx` | New `(home)/page.tsx` (3-card grid) | Full redesign |
-| `code` (404) | `product/` (new) | Stub the route, point home tile at it |
-| `build` | Subroute under `product/files` (later) | Heavy (1,626 lines); preserve the file tree component |
-| `run` | `hosting/` | Production runtime |
-| `infrastructure` | `hosting/` | Same data, different name |
-| `deployment` | `hosting/deploys` (later) | Deploy history is Hosting |
-| `overview` | Subroute under `product/` or merged into home | Decide once we see how home feels |
-| `prd` | Subroute under `product/` (vision) | Or its own "Define" section if we add one |
-| `tasks` | Subroute under `product/` (roadmap) | Or its own section later |
-| `assist` | `product/` (it's emails/chat your product sends) | These ARE product features |
-| `design` | `product/design` | Custom for this vision |
-| `growth`, `grow`, `analytics`, `insights`, `mvp-setup` | Defer, probably absorbed into a future "Analytics" or "Marketing" section | Many are placeholders today |
-| `settings` | Top-right gear (lives outside the 3 sections) | Project-level meta |
-
-**Net:** 16 routes → 3 sections (+ settings). 8+ pages get rationalized
-into nothing because they were duplicating their neighbors.
-
---
-
-## 6. Phased delivery
-
-### Phase 1 — Tab navigation + section stubs (this session)
-
-The three sections are TABS at the project level, not a card-grid
-landing page. A founder lands on the project URL and is immediately
-inside Product (the default tab); flipping to Infrastructure or
-Hosting is one click and stays in the same view. No
-intermediate "click a tile to drill in" step.
-
-URL shape:
-
-```
-/[workspace]/project/[id]                 → 308 redirect to /product
-/[workspace]/project/[id]/product         → Product tab
-/[workspace]/project/[id]/infrastructure  → Infrastructure tab
-/[workspace]/project/[id]/hosting         → Hosting tab
-```
-
-A shared layout at the project root renders:
-
- Project header (name, vision, stage pill, settings gear)
- Tab bar (Product · Infrastructure · Hosting) — active tab
-  highlighted; each tab carries a tiny status dot (green/amber/grey)
- Slot for the active tab's page
-
-The current `(home)/page.tsx` (the two-tile landing) is replaced by
-the redirect.
-
-**Don't kill anything in `(workspace)/`.** Existing 16 routes stay
-alive while we migrate. Sidebar still works for them.
-
-### Phase 2 — Wire data sources
-
- **Product card** reads from the dev container's `/workspace`:
-  - File count + recent edits via `fs.list` against the project's
-    dev container
-  - User count from the project's auth provider (Pocketbase /
-    Clerk / etc.)
-  - Frontend URL from `dev_server.list` or production `apps_list`
- **Infrastructure card** reads from Coolify databases, env vars,
-  and known integrations:
-  - Database type + size
-  - Auth provider name
-  - Wired services (any env var matching `STRIPE_*`, `RESEND_*`,
-    etc.)
- **Hosting card** reads from Coolify apps + domains + container metrics:
-  - Production URL, SSL status, last deploy
-  - Monthly cost (Coolify resource usage × pricing)
-  - Recent error count (from logs)
-
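The "wired services" detection in the Infrastructure bullet above could look roughly like this. It is a sketch under assumptions: `KNOWN_PREFIXES` and `detectWiredServices` are hypothetical names, and only the `STRIPE_*` and `RESEND_*` patterns are named in the plan.

```typescript
// Hypothetical prefix table; only STRIPE_ and RESEND_ are named in the plan.
const KNOWN_PREFIXES: Record<string, string> = {
  STRIPE_: "Stripe",
  RESEND_: "Resend",
};

// Return the sorted, de-duplicated list of services implied by env var names.
function detectWiredServices(envKeys: string[]): string[] {
  const found = new Set<string>();
  for (const key of envKeys) {
    for (const prefix of Object.keys(KNOWN_PREFIXES)) {
      if (key.startsWith(prefix)) {
        found.add(KNOWN_PREFIXES[prefix]);
      }
    }
  }
  return Array.from(found).sort();
}
```

Only variable names are inspected, never values, so the card can be rendered without reading any secret.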
-### Phase 3 — Section detail pages
-
-Build each of `/product`, `/infrastructure`, `/hosting` as a real,
-useful surface. Each page can have internal subnav for the bits
-listed in its charter (e.g., Product has Frontend, Backend, Jobs,
-Brand, Customers; Infrastructure has Auth, DB, Storage, Email,
-Payments, …).
-
-### Phase 4 — Migration / deletion
-
-Once the new structure is proven, redirect the legacy routes:
-
- `code` → `product`
- `build` → `product/files`
- `run` → `hosting`
- `infrastructure` → `hosting`
- `deployment` → `hosting/deploys`
- `prd`, `tasks`, `assist` → `product/...`
- `growth`, `grow`, `analytics`, `insights`, `mvp-setup` → soft-delete
-  with a tombstone redirect to `product` or to a future section page.
-
---
-
-## 7. Open questions
-
- **Where do the chat threads live?** They're a per-project
-  conversation surface today (right rail in the chat panel). I'd
-  argue they're not a section — they're *across* sections, like the
-  AI is. Keep as the persistent right rail.
- **Settings is technically project-level meta**, not one of the
-  three sections. Where does it surface? Gear icon in the page
-  header, opens settings as a side sheet or as a separate route.
-  Decide when we get there.
- **Mobile layout** — three cards stack vertically; no special
-  layout needed. The section detail pages need a layout pass when
-  we get to phase 3.
-
---
-
-## 8. Success criteria
-
-You should be able to look at `/project/<id>` after AI activity in
-chat and immediately see:
-
-1. *"What did the AI just build?"* → Product card shows the updated
-   file count + recent diffs.
-2. *"What's it depending on?"* → Infrastructure card shows the new
-   Postgres, the new Stripe key, etc.
-3. *"Is it live?"* → Hosting card shows the dev preview URL or the
-   production URL with status.
-
-If any of those three answers requires going back to the chat or
-checking another page, the redesign hasn't worked.
+*(Refer to `vibn-frontend/app/[workspace]/project/[projectId]` for the UI implementation).*
--- a/docs/SENTRY_AS_PRODUCT.md
+++ b/docs/SENTRY_AS_PRODUCT.md
@@ -1,258 +1,9 @@
-# Sentry-as-Product — Proposal
+# Sentry as a Product (Shipped)

-> Today's Sentry wiring catches errors in **the Vibn platform**.
-> The bigger opportunity is wiring Sentry into **every project Vibn
-> ships**, then feeding those errors back into the user's AI chat.
-> Difference between "an AI that codes" and "an AI that owns the
-> product."
+> **Note:** This spec was implemented in May 2026.

-## TL;DR
-
-Today, when a Vibn user's deployed app crashes for real users:
-
-```
-real user → site 500s → user closes tab, never tells founder
-                    → founder finds out hours/days later (or never)
-                    → AI in Vibn chat has zero idea anything is wrong
-```
-
-The fix is to make every Vibn project ship with Sentry pre-wired,
-then expose the error feed to the AI as a tool. Total effort:
-**~8 hours**, in 4 stages, each independently shippable.
-
-| Stage | Capability | Effort | Unlocks |
-|---|---|---|---|
-| 1 | Auto-provision a Sentry project per Vibn project on first deploy | ~3 hr | Real-user errors captured at all |
-| 2 | Bake Sentry into every scaffold template | ~2 hr | Capture works without user setup |
-| 3 | Add `project_recent_errors` MCP tool for the AI | ~2 hr | AI can answer "is anything broken?" |
-| 4 | Auto-surface unresolved errors at chat-turn start | ~1 hr | AI proactively offers fixes |
-
-Total: **~8 hr**, no new infra (we already have Sentry org access,
-Coolify env API, scaffold templates, MCP tool registry).
-
---
-
-## Why this is the right next investment
-
-### The current loop is broken at the seam between user and platform
-
-Vibn's value proposition is "the AI is your technical co-founder."
-That promise breaks the moment the AI's last commit causes a real
-user error and the AI doesn't know about it. The current loop:
-
-```
-1. User describes feature in chat
-2. AI ships code
-3. AI says "deployed, give it a try"
-4. (silence)
-5. Real users hit edge cases → 500s → bounce
-6. Founder eventually notices via support ticket / analytics dip
-7. Founder pastes error back to AI
-8. AI fixes
-```
-
-Steps 4–6 are dead air for the founder, **and the AI cannot help
-during them.** This is the gap that separates Vibn from "any IDE
-with an LLM."
-
-### What it looks like with this proposal shipped
-
-```
-1. User describes feature in chat
-2. AI ships code
-3. AI says "deployed, give it a try"
-4. Real users hit edge cases → 500s → Sentry captures
-5. (Founder opens Vibn chat 3 hrs later for unrelated reason)
-6. AI: "Hey — checkout has 500'd for 3 users in the last hour
-        because `customer.email` is undefined on
-        app/checkout/route.ts:47. Want me to fix it?"
-7. AI fixes, deploys, marks issue resolved in Sentry
-```
-
-The AI becomes the on-call engineer. This is what "technical
-co-founder" actually means and we are 8 hours away from it.
-
-### Why now (not Phase 4)
-
- The Sentry wiring we just shipped for vibn-frontend gave us:
-  - A working Sentry org (`vibnai`)
-  - An auth token with project-management scope
-  - Verified knowledge that the build args / source maps flow works
-  - A working `withSentryConfig` recipe in `vibn-frontend/next.config.ts`
- All of those are reusable for stage 1 and 2 of this proposal.
- Doing this **before** the beta means user projects start emitting
-  error data on day one, so by the time we're debugging real beta
-  user pain, we have a month of history to reason about.
- Doing it after the beta means we'd have to retroactively
-  instrument projects that have already been deployed for weeks.
-
---
-
-## Stage 1 — Auto-provision a Sentry project per Vibn project (~3 hr)
-
-**Goal:** when a user creates a Vibn project, the platform creates a
-matching Sentry project under the `vibnai` org and stashes the DSN
-and auth token in Coolify env vars on the user's app.
-
-**What gets built:**
-
-1. **A `provisionSentryProject(projectId, name)` helper** in
-   `vibn-frontend/lib/integrations/sentry.ts`. Calls Sentry's
-   `POST /api/0/teams/vibnai/{team}/projects/` with the project
-   slug, returns the DSN.
-2. **Hook into project-create flow** — on first successful deploy,
-   call the helper and write the resulting DSN + auth token into
-   Coolify env vars (`NEXT_PUBLIC_SENTRY_DSN`,
-   `SENTRY_AUTH_TOKEN`) for that app via the same Coolify API we
-   used today.
-3. **Idempotency** — if the Sentry project already exists, fetch
-   its DSN instead of creating a duplicate. Same project name
-   convention every time: `vibn-{workspace}-{projectSlug}`.
-4. **Storage** — store `sentryProjectSlug` and `sentryAuthTokenId`
-   on the Postgres `projects` row so we can look them up later
-   without re-walking the Sentry org.
-
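The read-before-write idempotency in step 3 can be sketched as below. The `SentryClient` interface, `sentrySlug`, and `ensureSentryProject` are hypothetical names; a real client would wrap Sentry's REST API, which is deliberately left out here so the sketch stays self-contained.

```typescript
// Hypothetical minimal client; a real one would wrap Sentry's REST API.
interface SentryClient {
  getProjectDsn(slug: string): Promise<string | null>;
  createProject(slug: string): Promise<string>; // resolves to the new DSN
}

// Naming convention from step 3: vibn-{workspace}-{projectSlug}.
function sentrySlug(workspace: string, projectSlug: string): string {
  return `vibn-${workspace}-${projectSlug}`;
}

// Read before write: reuse the existing project's DSN, create only if absent.
async function ensureSentryProject(
  client: SentryClient,
  workspace: string,
  projectSlug: string,
): Promise<{ slug: string; dsn: string; created: boolean }> {
  const slug = sentrySlug(workspace, projectSlug);
  const existing = await client.getProjectDsn(slug);
  const dsn = existing ?? (await client.createProject(slug));
  return { slug, dsn, created: existing === null };
}
```

Because the slug is derived deterministically from workspace and project, calling the helper on every deploy costs one lookup and never creates a duplicate.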
-**Risk:** Sentry's API rate-limits team-project creation. We stay
-under the limit by reading before writing, so the only API cost on
-subsequent deploys is one GET.
-
-**Definition of done:** create a fresh Vibn project → check Sentry
-org → see a project named `vibn-{ws}-{slug}` → check Coolify env on
-that app → see DSN populated.
-
---
-
-## Stage 2 — Bake Sentry into every scaffold template (~2 hr)
-
-**Goal:** every Next.js / Vite / etc. starter template Vibn ships
-already has Sentry wired up. User does nothing.
-
-**What gets built:**
-
-1. **For each scaffold template in `vibn-frontend/lib/scaffold/`**,
-   add the same files we shipped today:
-   - `instrumentation.ts`
-   - `instrumentation-client.ts`
-   - `app/global-error.tsx` (Next.js) / equivalent boundary (Vite)
-   - `next.config.ts` wrapped with `withSentryConfig` (Next.js)
-   - `vite.config.ts` with `sentryVitePlugin` (Vite)
-   - `Dockerfile` ARG declarations for `NEXT_PUBLIC_SENTRY_DSN` +
-     `SENTRY_AUTH_TOKEN`
-2. **Add `@sentry/nextjs` (or `@sentry/react` + `@sentry/vite-plugin`)
-   to each template's `package.json` `dependencies`.**
-3. **Document in template README** that Sentry is pre-wired and the
-   user doesn't need to do anything.
-
-**Risk:** Sentry's wrapper sometimes interacts badly with custom
-build configs (e.g. monorepos, custom webpack rules). Mitigation:
-the `errorHandler` we set today (`console.warn` instead of throw)
-ensures source map upload failures don't break builds.
-
-**Definition of done:** scaffold a fresh Next.js project from Vibn
-templates → deploy → throw a test error → see it in Sentry,
-de-minified.
-
---
-
-## Stage 3 — Expose error feed to the AI as MCP tools (~2 hr)
-
-**Goal:** the AI can ask Sentry "what's broken in project X?" and
-get a real answer.
-
-**What gets built:**
-
-Three new MCP tools in `vibn-frontend/lib/ai/vibn-tools.ts`:
-
-1. **`project_recent_errors { projectId, since?, limit? }`**
-   - Returns: `[{ id, title, count, lastSeen, culprit, level }]`
-   - Default `since`: 24h. Default `limit`: 10.
-   - Filters to unresolved issues only.
-   - Implementation: read `sentryProjectSlug` off the project row,
-     call Sentry's `GET /api/0/projects/{org}/{slug}/issues/`.
-
-2. **`project_error_detail { projectId, issueId }`**
-   - Returns: `{ stacktrace, breadcrumbs, request, user, replay_url }`
-   - Implementation: Sentry's `GET /api/0/issues/{id}/events/latest/`.
-
-3. **`project_error_resolve { projectId, issueId }`**
-   - Side-effect: marks the issue resolved in Sentry.
-   - Used by the AI after it ships a fix and confirms via tests.
-   - Implementation: Sentry's `PUT /api/0/issues/{id}/` with
-     `status: "resolved"`.
-
-**Auth:** token storage is per-project (from Stage 1's `projects`
-row). Each project's AI sees only its own project's errors. No
-cross-project leakage.
-
-**Definition of done:** in a Vibn chat for a project with known
-errors, ask the AI "any errors lately?" → AI calls
-`project_recent_errors` → shows real list.
-
---
-
-## Stage 4 — Auto-surface unresolved errors at chat-turn start (~1 hr)
-
-**Goal:** the AI doesn't wait to be asked. When the user opens a
-chat and there are unresolved errors, the AI mentions them on the
-first turn.
-
-**What gets built:**
-
-In `vibn-frontend/app/api/chat/route.ts`, at the start of each chat
-turn (before calling the model):
-
-1. Call the same `project_recent_errors` logic Stage 3 exposed.
-2. If `count > 0`, prepend a synthetic system message:
-
-```
-[PROJECT HEALTH]
-{N} unresolved Sentry issues in the last 24 hours:
- {title} (×{count}, last seen {time}) — {culprit}
- ...
-
-If the user's first message is unrelated to these, you may still
-proactively mention them: "Quick FYI before we get into that —
-{X} has been failing for users."
-
-If their message IS about a broken thing, prefer the matching
-Sentry issue's stack trace over guessing.
-```
-
-3. Fire this at most once per N chat turns (configurable; the default
-   is once, when the session opens) so we don't spam every turn.
-
-**Risk:** false alarms (Sentry issue from yesterday's deploy that
-no one cares about anymore) make the AI annoying. Mitigation:
-tighten the `since` window to the last 6h, and only surface issues
-with `count >= 2` (one-off errors don't count).
-
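The mitigation above (a 6-hour window, `count >= 2`) combined with the synthetic system message might be sketched like this. The `IssueSummary` shape and both function names are hypothetical; the fields loosely mirror those listed for `project_recent_errors`, and the header wording is adapted to the tightened 6-hour window.

```typescript
// Hypothetical issue shape, loosely mirroring the fields of `project_recent_errors`.
interface IssueSummary {
  title: string;
  count: number;
  lastSeen: Date;
  culprit: string;
}

const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Keep only issues seen within the window that occurred at least `minCount` times.
function selectSurfacedIssues(
  issues: IssueSummary[],
  now: Date,
  windowMs: number = SIX_HOURS_MS,
  minCount: number = 2,
): IssueSummary[] {
  return issues.filter(
    (issue) => issue.count >= minCount && now.getTime() - issue.lastSeen.getTime() <= windowMs,
  );
}

// Build the synthetic system message, or null when nothing is worth surfacing.
function buildHealthMessage(issues: IssueSummary[], now: Date): string | null {
  const surfaced = selectSurfacedIssues(issues, now);
  if (surfaced.length === 0) return null;
  const lines = surfaced.map(
    (issue) =>
      `- ${issue.title} (x${issue.count}, last seen ${issue.lastSeen.toISOString()}): ${issue.culprit}`,
  );
  const header = `${surfaced.length} unresolved Sentry issues in the last 6 hours:`;
  return ["[PROJECT HEALTH]", header, ...lines].join("\n");
}
```

Returning `null` when nothing passes the filter lets the chat route skip the system message entirely, so quiet projects add no prompt overhead.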
-**Definition of done:** intentionally break a deployed user
-project, open chat, type "what's up?" → AI's first response
-mentions the issue, with file path.
-
---
-
-## Out of scope for this proposal
-
- **User-owned Sentry orgs.** Some users will eventually want their
-  own Sentry account, not the shared `vibnai` org. Ship-later;
-  doesn't block the loop. Easy retrofit because storage is already
-  per-project.
- **Performance / Tracing data.** Sentry also captures spans /
-  traces. Useful for "this endpoint is slow" but not the urgent
-  product loop. Ship-later.
- **Front-end UI for errors in Vibn.** A "Health" tab showing the
-  Sentry feed in the Vibn UI is nice but not required for the AI
-  loop to work. Ship-later.
-
---
-
-## Recommendation
-
-Add a **Phase 2.9 (Sentry-as-product loop)** to `BETA_LAUNCH_PLAN.md`
-covering Stages 1–4 as a single bundle. Estimate: **8 hr engineering**.
-
-This is the second-highest-leverage item still ahead of beta,
-behind only the deploy-failed webhook (which is 30 min). Every
-hour spent here directly upgrades the value of every other beta
-test session that follows it.
+## Architecture
+- Sentry is automatically provisioned for every new project (`lib/integrations/sentry.ts`).
+- Environment variables (`NEXT_PUBLIC_SENTRY_DSN` and `SENTRY_AUTH_TOKEN`) are injected into the Coolify app.
+- The AI has access to the `project_recent_errors`, `project_error_detail`, and `project_error_resolve` MCP tools, so it can read and diagnose exceptions from the Sentry API and mark them resolved once a fix ships.
+- If unhandled exceptions are firing, the AI is prompted at the start of a conversation to address them (`app/api/chat/route.ts`).