docs: heavily compress and simplify remaining reference files to represent current state

2026-05-07 15:07:31 -07:00
parent 3563b98de1
commit 057115a9fc
8 changed files with 58 additions and 2926 deletions


@@ -1,292 +1,5 @@
# Agent telemetry & live execution stream — project spec
# Agent Telemetry Streaming (Historical)
This document captures **concrete product and engineering additions** discussed for Vibn: moving from **poll-based session updates** and **in-memory jobs** to a **durable, ordered, push-friendly execution timeline**—the web equivalent of a terminal agent's clarity (step-by-step visibility, tool boundaries, failures, and later multi-agent signals).
> **Note:** This historical spec covered the implementation of real-time streaming for the AI agent loop (Server-Sent Events) and timeline rendering.
---
## 1. Why this exists
### Current behavior (baseline)
| Surface | How progress reaches the user | Limits |
|--------|------------------------------|--------|
| **Agent sessions** (`agent_sessions`) | Runner `PATCH`es `output`, `status`, `changed_files` to Next; UI **polls** `GET …/agent/sessions/[id]`. | Latency, reconnect story, no single ordered stream; rich semantics encoded only in `text`. |
| **Jobs** (`/api/agent/run`, `/api/jobs/:id`) | In-memory `job-store` (`progress`, `toolCalls[]`); UI polls job endpoint. | Lost on restart; not shared across runner replicas; not unified with session UI. |
| **Orchestrator / Atlas chat** | Request/response to runner; advisor path may be remote URL. | No execution timeline for “long COO run” in-product unless you add the same event layer. |
### Product intent
- **Trust during long runs**: users see *what* happened, *when*, and *whether something was blocked*—not only a final status.
- **Differentiation**: “Ink-like” clarity in the browser—structured steps, not a blob of logs.
- **Foundation for multi-agent**: handoffs, child work, and safety events need a **common event pipe**, not ad-hoc strings.
---
## 2. Goals
1. **Append-only execution events** with **monotonic ordering** (per session or per job), suitable for replay after refresh.
2. **Server-push to the client** (recommend **SSE** first; WebSocket if you need bi-directional on the same channel).
3. **Persistence** so reconnect, refresh, and horizontal scaling do not lose history.
4. **Single conceptual model** (`AgentEvent`) usable by:
- Build → **Agent** tab (sessions),
- **Job** flows (create/analyze-style),
- optionally **orchestrator** long runs later.
5. **Backward compatibility** during rollout: existing `PATCH` + `output` can remain as a fallback or be fed from the same emitter.
### Non-goals (for v1)
- Full **OpenTelemetry** export (optional later).
- **Real-time collaborative** multi-user cursors on the same session.
- Merging **claude-code-fork**—this spec is **API + UI + persistence** only.
---
## 3. Concept: `AgentEvent`
### Core shape (suggested)
```ts
type AgentEvent = {
  seq: number; // monotonic per stream (session_id or job_id)
  ts: string; // ISO-8601
  runId: string; // session UUID or job id — ties events to a run
  runKind: 'session' | 'job';
  phase: 'queued' | 'running' | 'completed' | 'failed' | 'stopped';
  type: AgentEventType;
  payload: Record<string, unknown>; // type-specific
};
type AgentEventType =
  | 'run.started'
  | 'run.phase' // e.g. planning, executing, committing
  | 'llm.turn.start'
  | 'llm.turn.end'
  | 'tool.start'
  | 'tool.end'
  | 'tool.output' // chunked stdout/stderr if needed
  | 'safety.block' // policy / protected path / command denied
  | 'file.changed' // maps to today's changed_files semantics
  | 'git.commit'
  | 'deploy.triggered'
  | 'deploy.status'
  | 'error'
  | 'run.completed'
  | 'handoff' // v2: parent → child agent
  | 'child_job.started' // v2: linked run id
  ;
```
### Mapping from today's session `outputLine`
| Today (`outputLine.type`) | Suggested event(s) |
|---------------------------|--------------------|
| `step` / `info` | `run.phase` or `llm.turn.*` with summary in `payload.message` |
| `stdout` / `stderr` | `tool.output` or dedicated stream events |
| `error` | `error` + optional `safety.block` if policy-driven |
| `done` | `run.completed` |
Keep **human-readable `message`** on events for UI defaults; add **structured fields** (`tool`, `argsSummary`, `durationMs`) for timeline rendering and filters.
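The mapping table above could be expressed as a small pure function. This is a minimal sketch, not the runner's actual code: the `OutputLine` shape (a `type` plus a `text` field) and the `stream` payload field are assumptions made for illustration, and `step` / `info` are mapped to `run.phase` only, leaving the `llm.turn.*` alternative aside.

```ts
// Hypothetical shape of a legacy session output line; only `type` and `text`
// are assumed here.
type OutputLine = {
  type: 'step' | 'info' | 'stdout' | 'stderr' | 'error' | 'done';
  text: string;
};

// Subset of AgentEvent fields needed to show the mapping.
type MappedEvent = {
  type: 'run.phase' | 'tool.output' | 'error' | 'run.completed';
  payload: { message: string; stream?: 'stdout' | 'stderr' };
};

// Translate one legacy output line into one event, following the table above.
function mapOutputLine(line: OutputLine): MappedEvent {
  switch (line.type) {
    case 'step':
    case 'info':
      return { type: 'run.phase', payload: { message: line.text } };
    case 'stdout':
    case 'stderr':
      // Keep the original stream name so the UI can style stderr differently.
      return { type: 'tool.output', payload: { message: line.text, stream: line.type } };
    case 'error':
      return { type: 'error', payload: { message: line.text } };
    case 'done':
      return { type: 'run.completed', payload: { message: line.text } };
  }
}
```

Keeping the mapping in one function would let the dual-write phase feed both the legacy `output` column and the event stream from the same input.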
---
## 4. Architecture (high level)
```mermaid
flowchart LR
  subgraph runner [vibn-agent-runner]
    RA[runSessionAgent / runAgent]
    EMIT[emitAgentEvent]
  end
  subgraph api [vibn-frontend Next.js]
    ING[POST internal ingest or PATCH extend]
    DB[(Postgres agent_events)]
    SSE[SSE GET /api/.../stream]
  end
  subgraph browser [Browser]
    UI[Timeline + live log]
  end
  RA --> EMIT
  EMIT -->|HTTPS + secret or mTLS| ING
  ING --> DB
  UI -->|EventSource| SSE
  SSE --> DB
```
**Principles**
- **Runner remains stateless** regarding “truth”: it emits events; **Next + DB** are the source of truth for the UI (matches today's session model).
- Alternatively, runner could expose **SSE directly**—usually worse for **auth**, **CORS**, and **one domain** for the product. Prefer **Next as SSE endpoint** reading from DB.
---
## 5. Backend: `vibn-agent-runner`
### 5.1 Emit from execution paths
| Location | Action |
|----------|--------|
| `agent-session-runner.ts` | Replace or supplement `patchSession` output-only updates with **`emitAgentEvent`** each turn / tool / error. |
| `runAgent` / tool loop (`executeTool`) | Same emitter for **job** runs. |
| `server.ts` `/agent/execute` | Emit `run.started` after 202; `run.completed` / `error` on exit. |
| Security / blocked tools (`security.ts` or equivalent) | Emit `safety.block` with reason code (no secrets in payload). |
### 5.2 Transport runner → Next
**Option A (recommended):** extend existing **PATCH** or add **`POST /api/internal/agent-events`** (or per-session batch append):
- Headers: `x-agent-runner-secret` (same as today's PATCH).
- Body: single event or small batch `{ events: AgentEvent[] }` with server-assigned `seq` to avoid races.
**Option B:** Runner writes to **Redis/Postgres** directly. This couples the runner to DB credentials; only do it if the runner already runs inside the same trust zone as the database and holds the DB URL.
### 5.3 Jobs store
- **Short term:** continue in-memory for job metadata; **persist events** to Postgres keyed by `jobId`.
- **Medium term:** optional **Redis** for job status + pub/sub to Next for low-latency SSE fanout (only if DB polling becomes a bottleneck).
---
## 6. Backend: `vibn-frontend` (Next.js)
### 6.1 Persistence
**New table (example): `agent_run_events`**
| Column | Notes |
|--------|--------|
| `id` | UUID |
| `run_id` | Session id or job id (text) |
| `run_kind` | `'session' \| 'job'` |
| `seq` | BIGSERIAL or per-run sequence enforced with unique constraint `(run_id, seq)` |
| `project_id` | Nullable for jobs if not scoped |
| `event` | JSONB — full `AgentEvent` or `{ type, ts, payload }` |
| `created_at` | default now() |
Index: `(run_id, seq)` for range queries (`WHERE run_id = $1 AND seq > $lastSeen`).
**Optional:** migrate legacy `agent_sessions.output` to be **derived** (last N lines for email export) or **dual-write** during transition.
### 6.2 SSE route (example contract)
- **`GET /api/projects/[projectId]/agent/sessions/[sessionId]/events/stream`**
- Auth: session cookie / same as GET session (user must own project).
- Query: `?afterSeq=123` for replay.
- Response: `text/event-stream`; each message: `data: {JSON}\n\n`.
- Heartbeat comments every ~15–30s to keep proxies alive.
For **jobs** (if not project-scoped): `GET /api/jobs/[jobId]/events/stream` with appropriate auth.
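The wire format in the contract above can be sketched with three pure helpers. This is illustrative only: the `id:` line is an assumed addition (the contract itself only specifies `data: {JSON}\n\n`), and the function names are hypothetical.

```ts
// Minimal event shape for this sketch; the real AgentEvent carries more fields.
type StreamedEvent = { seq: number; type: string; payload: Record<string, unknown> };

// One SSE message: an optional `id:` line (assumption, lets the client track
// the last seq) plus the `data:` line from the contract, ended by a blank line.
// JSON.stringify escapes newlines, so the JSON always fits on one data line.
function formatSseMessage(event: StreamedEvent): string {
  return `id: ${event.seq}\ndata: ${JSON.stringify(event)}\n\n`;
}

// SSE comment lines start with ':' and are ignored by EventSource clients,
// which makes them suitable as keep-alive heartbeats.
function formatHeartbeat(): string {
  return ': heartbeat\n\n';
}

// Parse `?afterSeq=` defensively: anything missing or malformed replays from 0.
function parseAfterSeq(raw: string | null): number {
  if (raw === null || !/^\d+$/.test(raw)) return 0;
  const parsed = Number(raw);
  return Number.isSafeInteger(parsed) ? parsed : 0;
}
```

A route handler would write `formatSseMessage` output for each row with `seq > afterSeq`, then interleave `formatHeartbeat()` while the run is idle.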
### 6.3 Ingest route (runner-only)
- **`POST /api/internal/agent-events`** (or nested under project/session as you prefer).
- Validates `x-agent-runner-secret`.
- Inserts rows with **server-generated `seq`** (transaction per run or advisory lock per `run_id`).
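The ingest behavior above (server-assigned `seq` per run) can be illustrated with an in-memory stand-in for the table. This is a sketch under stated assumptions: the class and its methods are hypothetical, `eventId` is a runner-supplied dedupe key that the spec lists only as an open decision, and a real implementation would do the same work inside a transaction or under an advisory lock per `run_id`.

```ts
type IncomingEvent = { eventId: string; type: string; payload: Record<string, unknown> };
type StoredEvent = IncomingEvent & { runId: string; seq: number };

// In-memory stand-in for the events table, keyed by run id.
class InMemoryEventLog {
  private rows = new Map<string, StoredEvent[]>();

  // Append a batch; the server, not the runner, assigns seq (1, 2, 3, ... per run).
  append(runId: string, batch: IncomingEvent[]): StoredEvent[] {
    const existing = this.rows.get(runId) ?? [];
    const seen = new Set(existing.map((e) => e.eventId));
    const inserted: StoredEvent[] = [];
    for (const event of batch) {
      if (seen.has(event.eventId)) continue; // runner retry: skip the duplicate
      seen.add(event.eventId);
      const stored: StoredEvent = { ...event, runId, seq: existing.length + 1 };
      existing.push(stored);
      inserted.push(stored);
    }
    this.rows.set(runId, existing);
    return inserted;
  }

  // Range query used for replay: everything after the client's last seen seq.
  after(runId: string, afterSeq: number): StoredEvent[] {
    return (this.rows.get(runId) ?? []).filter((e) => e.seq > afterSeq);
  }
}
```

Because duplicates are dropped before a `seq` is assigned, a retried batch does not create gaps in the per-run sequence.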
---
## 7. Frontend (product UI)
### 7.1 Agent tab — timeline
- **EventSource** (SSE) subscription when session is `running`; on load, **fetch historical** events (`GET …/events?afterSeq=0` or SSE from 0).
- **Timeline components**:
- Group by `llm.turn` / `tool.start` → `tool.end`.
- Expandable tool args (sanitized).
- Distinct styling for `safety.block` and `error`.
- **Reconnect**: on `EventSource` error, reopen with `lastSeq` from last received event.
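The grouping and reconnect bullets above might look like the following on the client. This is a hedged sketch: it assumes tool calls do not overlap within a run, that the tool name lives at `payload.tool`, and the helper names are hypothetical.

```ts
type TimelineEvent = {
  seq: number;
  type: string;
  payload: { tool?: string; message?: string };
};

type ToolGroup = { tool: string; startSeq: number; endSeq: number | null; output: string[] };

// Pair each tool.start with the next tool.end, collecting tool.output between them.
// A group whose endSeq is null is still running (or the run was cut short).
function groupToolCalls(events: TimelineEvent[]): ToolGroup[] {
  const groups: ToolGroup[] = [];
  let open: ToolGroup | null = null;
  for (const event of [...events].sort((a, b) => a.seq - b.seq)) {
    if (event.type === 'tool.start') {
      open = { tool: event.payload.tool ?? 'unknown', startSeq: event.seq, endSeq: null, output: [] };
      groups.push(open);
    } else if (event.type === 'tool.output' && open) {
      open.output.push(event.payload.message ?? '');
    } else if (event.type === 'tool.end' && open) {
      open.endSeq = event.seq;
      open = null;
    }
  }
  return groups;
}

// Highest seq seen so far; pass it as `afterSeq` when reopening the stream.
function lastSeq(events: TimelineEvent[], previous = 0): number {
  return events.reduce((max, e) => Math.max(max, e.seq), previous);
}
```

On an `EventSource` error the client would reopen the stream with `?afterSeq=` set to `lastSeq(received)`, so replay resumes after the last event it rendered.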
### 7.2 Jobs / analyze flows
- Same timeline component keyed by `jobId` if you surface those runs in UI.
- Unifies mental model: “every run has a stream.”
### 7.3 Deprecate slow polling
- Poll `GET …/agent/sessions/[id]` less often while SSE is connected; keep a **single poll** for `status` / `changed_files` if those stay on the session row only, or **also** emit `file.changed` events and drive the UI from the stream plus one final consistency read.
---
## 8. Security & privacy
- **Never** put tokens, env values, or full file contents in events by default; use **truncation** and **hashes** where needed.
- **`safety.block`**: log reason **code** + user-safe message; align with `security.ts` behavior.
- **Rate limits** on ingest endpoint (per `run_id` / per IP) to avoid abuse if misconfigured.
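The truncation rule above could be applied to payloads before they are emitted. The following is a minimal sketch, not the behavior of `security.ts`: the key pattern, the 200-character cap, and the function name are all assumptions, and hashing is left out to keep the example dependency-free.

```ts
// Assumed list of key names that should never reach the event stream.
const SENSITIVE_KEY = /token|secret|password|authorization|api[_-]?key/i;
// Assumed cap on string values (e.g. file contents or long command output).
const MAX_STRING_LENGTH = 200;

// Redact sensitive keys and truncate long strings, recursing into nested objects.
// Arrays are passed through unchanged in this sketch.
function sanitizePayload(payload: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(payload)) {
    if (SENSITIVE_KEY.test(key)) {
      clean[key] = '[redacted]';
    } else if (typeof value === 'string' && value.length > MAX_STRING_LENGTH) {
      const dropped = value.length - MAX_STRING_LENGTH;
      clean[key] = `${value.slice(0, MAX_STRING_LENGTH)}... [truncated ${dropped} chars]`;
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      clean[key] = sanitizePayload(value as Record<string, unknown>);
    } else {
      clean[key] = value;
    }
  }
  return clean;
}
```

Running this in the emitter rather than at ingest keeps raw secrets from ever leaving the runner.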
---
## 9. Environment variables
| Variable | Where | Purpose |
|----------|--------|---------|
| `AGENT_RUNNER_SECRET` | Runner + Next | Ingest / extended PATCH auth |
| `VIBN_API_URL` | Runner | Base URL for callbacks |
| `AGENT_RUNNER_URL` | Next | Start runs (unchanged) |
Add if needed:
| Variable | Purpose |
|----------|---------|
| `AGENT_EVENTS_INGEST_PATH` | Optional override for ingest URL |
| `SSE_MAX_BUFFER` | Cap replay batch size |
---
## 10. Phased roadmap (suggested)
### Phase 1 — Foundation
- [ ] Define `AgentEvent` TypeScript types in a **shared package** or duplicated minimal types in runner + frontend.
- [ ] Create `agent_run_events` (or equivalent) + migration.
- [ ] Implement **ingest** endpoint; wire **runner session path** to emit core events: `run.started`, `tool.start` / `tool.end`, `error`, `run.completed`, `file.changed`.
- [ ] **Dual-write**: keep existing `PATCH` `outputLine` so nothing breaks.
### Phase 2 — Push
- [ ] SSE route + **EventSource** in Agent tab.
- [ ] Backfill UI from DB on mount; then live tail.
- [ ] Lower or gate polling on `GET` session.
### Phase 3 — Jobs + durability
- [ ] Emit same events from **job** execution path; persist by `jobId`.
- [ ] Optional: replace in-memory job list with DB for **multi-instance** runner (later).
### Phase 4 — Rich semantics
- [ ] `safety.block` from policy layer.
- [ ] `deploy.*` events if Coolify integration is user-visible.
- [ ] **Multi-agent**: `handoff`, `child_job.*` with links in payload.
---
## 11. Success metrics
- Time-to-first-visible-step after **Run** < **1s** p95 (SSE).
- After hard refresh mid-run, user sees **consistent history** (no duplicate seq, no gaps if you guarantee at-least-once ingest with idempotency keys later).
- Support tickets / confusion drops on “what is the agent doing?” (qualitative).
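The "consistent history" metric above can be checked mechanically in a test or a debug panel. This sketch assumes `seq` starts at 1 for each run, which the spec does not state; the function name is hypothetical.

```ts
// Report duplicate and missing seq values for one run's replayed history.
// Assumes (for illustration) that seq starts at 1 and increases by 1.
function checkSeqs(seqs: number[]): { duplicates: number[]; gaps: number[] } {
  const sorted = [...seqs].sort((a, b) => a - b);
  const duplicates: number[] = [];
  const gaps: number[] = [];
  let expected = 1;
  let previous: number | undefined;
  for (const seq of sorted) {
    if (seq === previous) {
      // Record each duplicated value once, even if it appears three times.
      if (duplicates[duplicates.length - 1] !== seq) duplicates.push(seq);
      continue;
    }
    for (let missing = expected; missing < seq; missing++) gaps.push(missing);
    expected = seq + 1;
    previous = seq;
  }
  return { duplicates, gaps };
}
```

An empty result for both arrays after a hard refresh mid-run is the condition the metric describes.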
---
## 12. Related code (repo anchors)
Use these when implementing:
- Runner session loop + PATCH bridge: `vibn-agent-runner/src/agent-session-runner.ts`
- Runner HTTP: `vibn-agent-runner/src/server.ts` (`/agent/execute`, `/agent/stop`, `/agent/approve`, `/api/agent/run`, `/api/jobs/:id`)
- In-memory jobs: `vibn-agent-runner/src/job-store.ts`
- Next session API + runner callback: `vibn-frontend/app/api/projects/[projectId]/agent/sessions/[sessionId]/route.ts`
- Session create + fire-and-forget execute: `vibn-frontend/app/api/projects/[projectId]/agent/sessions/route.ts`
---
## 13. Open decisions
1. **Single table** for sessions + jobs vs **two tables** (simpler queries vs flexibility).
2. **Seq generation**: DB sequence per `run_id` vs global monotonic with `(run_id, seq)` composite only in app logic.
3. **Idempotency**: runner retries may duplicate events—use **`event_id` UUID** from runner for dedupe on ingest.
4. **Orchestrator chat**: treat as v2 unless you need a **COO run** timeline immediately.
---
*Document version: 1.0 — aligned with discussion of runner ↔ frontend telemetry, SSE-first delivery, Postgres persistence, and future multi-agent event types.*
The streaming system is fully implemented in `app/api/chat/route.ts` and rendered in the frontend via `Timeline`, `ThinkingBubble`, and `TimelineToolGroup` components inside `chat-panel.tsx`.


@@ -1,673 +1,5 @@
# Vibn AI Capability Roadmap
# AI Capabilities Roadmap (Historical)
> **⚠ See also:** [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md)
> — proposed pivot to a Claude-Code-style persistent dev container per
> project. Once approved, that doc supersedes any "code authoring" item
> in this roadmap; this file remains the source of truth for
> infrastructure primitives (P5.x, P6.x, P7.x).
>
> The ordered plan for closing the gap between what the Vibn agent can do
> today and what it needs to do for a real customer to ship, operate, and
> scale a SaaS through it.
>
> **Companion to:** [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) (current state).
>
> **Prioritization framing:**
> 1. Does it unblock *shipping a real product* (not a demo)?
> 2. Does it unblock *surviving past the first paying customer*?
> 3. Does it only matter once usage scales?
>
> Tier 1 = (1). Tier 2 = (2). Tier 3 = (3). Tier 4 = revisit when demanded.
>
> **Sequencing rule:** complete Tier 1 before any Tier 2 item. The trap
> is polishing safety rails (audit, scopes, quotas) before the product is
> actually shippable.
> **Note:** This is a historical roadmap document. Most of the core Path B capabilities (persistent dev containers, Gitea mirroring, Traefik wildcard proxies) have been successfully shipped.
---
## 0. Substrate & constraints
Vibn runs on a two-cloud substrate, constrained to Canadian data residency:
| Layer | Provider | Region | Purpose |
|---|---|---|---|
| **App hosting** | Coolify (self-managed) | Montreal VPS | All app / database / auth containers. Current state. |
| **Managed services** | **Google Cloud** | `northamerica-northeast1` (Montreal) | Object storage, cron, queues, logs, backups, monitoring, secrets. |
| **Domain registration** | OpenSRS (Tucows) | Toronto | Wholesale domain API. Canadian company, pre-funded float account. |
| **Authoritative DNS** | Cloud DNS (default) / CIRA D-Zone (strict) | Global anycast / Canadian | Managed DNS for workspace-owned domains. |
| **Transactional email** | Amazon SES | `ca-central-1` (Montreal) | No GCP equivalent; AWS's Canadian region keeps data in-country. |
**Absolute rule: no customer data leaves Canada.** Every workspace-owned
resource (storage bucket, database, log bucket, task queue, scheduler
job, email message body) must be pinned to a Canadian region.
### Why mix clouds?
- **Coolify stays** because we already built the workspace-scoped
provisioning around it (Phase 4). Migrating apps to Cloud Run is a
rewrite we don't need.
- **GCP-CA** fills every managed-service gap Coolify has. Cheaper and
more reliable than self-hosting MinIO/Loki/scheduler.
- **AWS SES for email** because GCP has no first-party transactional
email service and SES `ca-central-1` is the only credible
Canadian-resident managed option.
- **OpenSRS for domains** because it's the wholesale API behind most
Canadian registrars, and we already have the deposit.
### Compliance upgrade path (Tier 4 territory)
For regulated customers (healthcare, financial, public sector):
- **Assured Workloads for Canada** on GCP — enforces Canadian personnel
access + data residency contractually.
- **CIRA D-Zone** instead of Cloud DNS — first-party Canadian managed DNS.
- Keep the SES and OpenSRS pieces as-is (already Canadian-resident).
Document the caveat on a public trust page. Build the Assured-Workloads
variant when a real customer asks.
---
## Current state (Phase 4 + P5.1 verified, Apr 2026)
- Workspace tenancy: Gitea org + Coolify project + SSH deploy key per
workspace.
- Agent can: create repos, create apps, provision 8 database flavors,
deploy 8 vetted auth providers, manage env vars, deploy + poll,
update, delete (with `?confirm=<name>`), set domains under
`*.{slug}.vibnai.com`.
- Control-plane MCP: 24 tools + full REST surface at `/api/mcp`.
API-key scoped per workspace.
- **P5.1 custom apex domains** — OpenSRS + Cloud DNS + Coolify
lifecycle (search / register / attach / inspect) shipped and
verified end-to-end against PROD GCP + OpenSRS sandbox + PROD
Coolify on `v4.0.0-beta.473` (2026-04-22). All 5 sub-systems green
in `smoke-attach-e2e.ts`: register → zone → A records → registrar
NS update → Coolify `fqdn` patch → cleanup. Required a server-side
config fix on `coolify-server-mtl` (proxy.type=TRAEFIK,
is_build_server=false) so `Server::isProxyShouldRun()` returns
true and the controller maps `domains` → `fqdn` — see
[`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) § 3.6 for the gory details.
- **Agent-runner stdio MCP bridge** — `vibn-agent-runner` now exposes
its full in-house toolkit (28 tools) outward over 5 stdio MCP
servers so external clients (Cursor, Claude Desktop, Goose) can
drive the same Coolify / Gitea / workspace / memory / search /
sub-agent surface as the internal Coder/PM/Marketing agents, with
shared protected-repo + protected-app guardrails. Every tool now
has a pure `*-api.ts` module, a registry wrapper for the in-process
loop, and an MCP server wrapper — single source of truth, verified
by `scripts/smoke-mcp.js`.
- Enforced: tenant isolation, domain policy, delete confirms,
secrets-at-rest encryption, protected-repo / protected-app guards.
See [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) (§ 3.6 for P5.1,
§ 3.7 for the stdio MCP bridge) for the complete current surface.
---
## Tier 1 — Blocks shipping a real product
Without these, anything the agent builds is *demo-shaped*. Ship these
next, in the recommended sequence below.
### P5.1 · Custom apex domains via OpenSRS
**Goal:** agent buys `mysaas.com` on the user's behalf and attaches it
to a Coolify app with automatic TLS.
**Why now:** you already opened an OpenSRS reseller account with a $100
float. Unlocks real branding, DKIM for email (P5.2 depends on this),
and gives you a revenue line (markup on domains).
**Surface:**
| Tool / endpoint | Purpose |
|---|---|
| `domains.search` | Live availability + suggestions via OpenSRS `lookup`. |
| `domains.check_price` | Per-TLD price from OpenSRS + markup. |
| `domains.register` | Debits workspace float, registers via OpenSRS. |
| `domains.list` | Workspace's owned domains. |
| `domains.renew` / `domains.transfer` | Lifecycle. |
| `domains.{name}.attach` | Attach to a Coolify app: DNS records + Coolify `fqdn` + Let's Encrypt. |
| `domains.{name}.detach` | Free a domain from an app, keep registration. |
| `domains.{name}.attach_status` | Polls DNS propagation + cert issuance (async). |
**Infra:**
- **OpenSRS client** (their XML/SOAP or REST API).
- **Cloud DNS** for zone management (default). CIRA D-Zone available as a
workspace-level preference for strict-residency customers.
- **Workspace float ledger** (`vibn_workspace_billing_float`) — a
prepaid balance in CAD, debited on register/renew. Reconciled nightly
against the OpenSRS master deposit.
- `VIBN_OPENSRS_DEPOSIT_ACCOUNT` as the master float handle.
**New columns** on `vibn_workspaces`:
- `preferred_dns_provider TEXT DEFAULT 'cloud_dns'`
- `cloud_dns_zone_name TEXT` ← GCP managed zone for this workspace.
**Risks:**
- DNS propagation is human-scale (minutes to hours). Agents need the
async `attach_status` polling loop, not a sync call.
- Cert issuance via Let's Encrypt is rate-limited (50/week per domain).
Abuse-prevent with per-workspace rate caps.
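The async `attach_status` polling loop mentioned in the risks above might be shaped like this. It is a sketch only: the `AttachStatus` fields, the backoff constants, and both function names are hypothetical, and the status fetcher and sleep function are injected so nothing here depends on a real client.

```ts
// Hypothetical status shape; the real attach_status response is not specified here.
type AttachStatus = { dns: 'pending' | 'propagated'; cert: 'pending' | 'issued' | 'failed' };

// Exponential backoff capped at 5 minutes: 5s, 10s, 20s, ...
function backoffDelayMs(attempt: number, baseMs = 5_000, capMs = 300_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Poll until DNS has propagated and the cert is issued, the cert fails,
// or the attempt budget runs out; the last observed status is returned.
async function waitForAttach(
  fetchStatus: () => Promise<AttachStatus>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 20,
): Promise<AttachStatus> {
  let status = await fetchStatus();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const done = status.dns === 'propagated' && status.cert === 'issued';
    if (done || status.cert === 'failed') return status;
    await sleep(backoffDelayMs(attempt));
    status = await fetchStatus();
  }
  return status;
}
```

Backing off keeps a slow propagation from turning into hundreds of status calls per domain.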
**Estimate:** **2 weeks.**
---
### P5.2 · Transactional email (AWS SES `ca-central-1`)
**Goal:** auth providers can send password-reset emails; agents can
`email.send` from `noreply@mysaas.com`.
**Why now:** every auth provider on the allowlist is broken without
SMTP. Also pairs with P5.1 — per-workspace sender domains need DKIM on
domains you own.
**Why SES ca-central-1 specifically:** GCP has no first-party
transactional email service. All mainstream providers (Postmark,
Resend, Mailgun, SendGrid) are US-primary. SES's Montreal region is the
only credible managed option that keeps message bodies in Canada.
**Two-phase rollout:**
**Phase A — shared-sender MVP (1 week):**
- One SES-verified sender domain `mail.vibnai.com`.
- Every workspace can send from `noreply@mail.vibnai.com` out of the box.
- `email.send` tool + injected `SMTP_*` env vars.
- Bounce / complaint webhooks routed via SNS → a Cloud Run service
that writes per-workspace notifications.
**Phase B — per-workspace sender domains (1 week, depends on P5.1):**
- `email.verify_sender_domain` creates the SPF/DKIM/DMARC records via
the Cloud DNS / CIRA D-Zone client on a workspace-owned domain.
- Polls SES verification; flips `verified=true` when done.
- Workspace can now `email.send from: founder@mysaas.com`.
**Surface:**
| Tool | Purpose |
|---|---|
| `email.send` | Single message; returns SES `message_id`. |
| `email.send_batch` | Up to 100 at a time. |
| `email.list_messages` | Recent sent mail + delivery state (from SES + our log). |
| `email.verify_sender_domain` | Kick off DKIM for a workspace-owned domain. |
| `email.sender_status` | Poll verification state. |
| `email.webhooks.list` | Recent bounces/complaints. |
**Infra:**
- SES identity per workspace-owned sender domain.
- SNS topic → Cloud Run webhook receiver (in `northamerica-northeast1`)
for bounce/complaint ingestion.
- Rate limits: start in SES sandbox (200/day), request production limits
after first real customer.
**Estimate:** **2 weeks total** (1 week Phase A + 1 week Phase B).
---
### P5.3 · Object storage (Google Cloud Storage, `northamerica-northeast1`)
**Goal:** any SaaS the agent builds can take user uploads — avatars,
attachments, exports, images — without the user pasting in third-party
credentials.
**Why now:** "can users upload a file?" is the #1 post-demo question.
Blocks ~half of realistic SaaS ideas.
**GCP collapses this item.** No MinIO container to babysit; GCS provides
managed bucket + signed URLs + lifecycle policies + encryption out of
the box.
**Surface:**
| Tool | Purpose |
|---|---|
| `storage.buckets.list` | Buckets in this workspace (filtered by `workspace={slug}` label). |
| `storage.buckets.create` | New bucket. Optional `public_read`. Enforced region: `northamerica-northeast1`. |
| `storage.buckets.delete` | Destroy bucket. `confirm` gate. |
| `storage.presign_upload` | PUT URL, TTL, content-type constraint. |
| `storage.presign_download` | GET URL, TTL. |
| `storage.list_objects` | Pagination + prefix filter. |
| `storage.delete_object` | Single object. |
| `storage.set_lifecycle` | TTL delete, multipart cleanup, archive tiering. |
**Provisioning additions:**
- Default bucket `vibn-ws-{slug}` created on workspace provision.
- Uniform bucket-level access enabled by default.
- Per-workspace GCP service account `vibn-ws-{slug}@...`, scoped to its
own bucket via `roles/storage.objectAdmin`.
- Keyfile stored encrypted (AES-256-GCM, same `VIBN_SECRETS_KEY`) in
`vibn_workspaces.gcp_service_account_key_encrypted`.
**New columns** on `vibn_workspaces`:
- `gcs_bucket_name TEXT`
- `gcp_service_account_email TEXT`
- `gcp_service_account_key_encrypted BYTEA`
**Env injection:**
- `STORAGE_ENDPOINT=https://storage.googleapis.com`
- `STORAGE_BUCKET={workspace-bucket-name}`
- `STORAGE_ACCESS_KEY`, `STORAGE_SECRET_KEY` (S3-compatible via GCS HMAC keys)
— auto-injected on app creation so agent code uses standard S3 SDKs.
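The injected variables above can be assembled by a small helper at app-creation time. This is a sketch with hypothetical function and type names; only the variable names, the endpoint, and the `vibn-ws-{slug}` bucket convention come from this document.

```ts
// HMAC credentials for the workspace's service account (values come from GCS).
type StorageCredentials = { bucketName: string; accessKey: string; secretKey: string };

// Default bucket naming convention from the provisioning notes.
function defaultBucketName(slug: string): string {
  return `vibn-ws-${slug}`;
}

// Env vars injected into a new app so standard S3 SDKs work unchanged.
function storageEnv(creds: StorageCredentials): Record<string, string> {
  return {
    STORAGE_ENDPOINT: 'https://storage.googleapis.com',
    STORAGE_BUCKET: creds.bucketName,
    STORAGE_ACCESS_KEY: creds.accessKey,
    STORAGE_SECRET_KEY: creds.secretKey,
  };
}
```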
**Estimate:** **3 days.**
---
### P5.4 · Workers, cron, and queues (Cloud Tasks + Cloud Scheduler + Cloud Run Jobs)
**Goal:** agents can declare async workers, scheduled jobs, and queued
tasks. Anything that isn't a single `ports: 3000` web container.
**Why now:** webhooks, retries, nightly cleanup, image processing,
email sending — every real SaaS needs a non-web process. Current
workaround (second Coolify app) is brittle and manual.
**Hybrid approach — Coolify for compute, GCP for orchestration:**
Option evaluated and chosen:
- **Cloud Scheduler** (`northamerica-northeast1`) for cron: fires
HTTP webhooks into the app at the scheduled time.
- **Cloud Tasks** (`northamerica-northeast1`) for queue: agent code
calls `enqueue(task)`, Cloud Tasks dispatches to the app's worker
endpoint with retries, backoff, and at-least-once semantics.
- **Worker process** stays on Coolify as a second app-per-repo with a
different start command, exposed on an internal URL.
Rejected alternative: migrate everything to Cloud Run Jobs. More managed
but splits the "Live" view across two deploy targets and changes the
agent's mental model. Not worth it for MVP.
**Shape — extend `apps.create`:**
```json
{
  "repo": "my-site",
  "services": {
    "web": { "command": "npm start", "ports": "3000" },
    "worker": { "command": "npm run worker", "replicas": 2 }
  },
  "cron": [
    { "name": "nightly-backup", "schedule": "0 3 * * *", "path": "/tasks/backup" },
    { "name": "sync", "schedule": "*/10 * * * *", "path": "/tasks/sync" }
  ],
  "queues": [
    { "name": "emails" },
    { "name": "image-processing" }
  ]
}
```
Internally creates: two Coolify apps (web + worker), N Cloud Scheduler
jobs labeled `workspace={slug}`, N Cloud Tasks queues.
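The fan-out described above (one spec in, several resources out) can be sketched as a pure planning step. The type names, the function name, and the `{repo}-{service}` app naming are assumptions for illustration; only the spec fields and the `workspace={slug}` label come from this document.

```ts
type AppSpec = {
  repo: string;
  services: Record<string, { command: string; ports?: string; replicas?: number }>;
  cron?: { name: string; schedule: string; path: string }[];
  queues?: { name: string }[];
};

type ProvisionPlan = {
  coolifyApps: string[]; // one Coolify app per declared service
  schedulerJobs: string[]; // one Cloud Scheduler job per cron entry
  taskQueues: string[]; // one Cloud Tasks queue per queue entry
  labels: { workspace: string };
};

// Compute what would be created, without calling any provider API.
function planProvisioning(workspaceSlug: string, spec: AppSpec): ProvisionPlan {
  return {
    coolifyApps: Object.keys(spec.services).map((service) => `${spec.repo}-${service}`),
    schedulerJobs: (spec.cron ?? []).map((job) => job.name),
    taskQueues: (spec.queues ?? []).map((queue) => queue.name),
    labels: { workspace: workspaceSlug },
  };
}
```

Separating the plan from its execution would also give the agent a dry-run view before anything is provisioned.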
**Surface additions:**
| Tool | Purpose |
|---|---|
| `apps.services.list` | All processes in an app. |
| `apps.services.update` | Scale replicas, change command. |
| `apps.services.logs` | Per-process logs. |
| `cron.list` | Scheduler jobs in this workspace. |
| `cron.create` / `cron.update` / `cron.delete` | Manage scheduled jobs. |
| `cron.run_now` | Fire a scheduled job immediately (useful for agent testing). |
| `queues.list` | Cloud Tasks queues in this workspace. |
| `queues.create` / `queues.delete` | Manage queues. |
| `queues.enqueue` | (Normally called from app code, but exposed for agent-driven testing.) |
| `queues.pause` / `queues.resume` | Emergency ops. |
**New columns** on `vibn_workspaces`:
- `cloud_scheduler_location TEXT DEFAULT 'northamerica-northeast1'`
- `cloud_tasks_location TEXT DEFAULT 'northamerica-northeast1'`
**Auth to GCP:** per-workspace service account (provisioned in P5.3) is
extended with `roles/cloudscheduler.admin` and `roles/cloudtasks.admin`
*scoped to resources labeled `workspace={slug}`* via IAM conditions.
Agents can only act on their own workspace's jobs/queues.
**Estimate:** **1 week.**
---
### Tier 1 total: ~5 weeks of focused work
After Tier 1 lands, an agent can:
- Buy `mysaas.com`, point it at a Next.js app.
- Deploy Authentik with working password-reset emails from `noreply@mysaas.com`.
- Offer user uploads (avatars, attachments).
- Run `0 3 * * *` nightly cleanup cron.
- Process Stripe webhooks idempotently via a retry queue.
That's a shippable SaaS. Everything after this is about *keeping* it
shipped.
---
## Tier 2 — Blocks surviving past the first real customer
Once users exist, these prevent silent failures.
### P6.1 · Database backups + restore (GCS + wal-g)
**Goal:** nightly backups, on-demand backups, one-call restore. No
"agent ran `DROP TABLE` in a migration" permanent data loss.
**Why:** scariest item on this list. Failure mode is irrecoverable.
**Shape:**
- `databases.{uuid}.backup` — on-demand `pg_dump` / `mongodump` to the
workspace's GCS bucket (depends on P5.3).
- `databases.{uuid}.backups.list` — lists backups with timestamp + size.
- `databases.{uuid}.backups.restore` — `confirm`-gated restore from a
specific backup uuid.
- Per-database backup policy: daily / hourly / off, retention days.
- Default: every AI-created database gets daily backups + 7-day
retention on.
**Infra:**
- Cron jobs run via P5.4's Cloud Scheduler primitive.
- Stored at `gs://vibn-ws-{slug}/backups/{db-uuid}/{iso-timestamp}.sql.gz`.
- Lifecycle rules auto-delete backups older than retention.
- Object-level retention lock available for "immutable backups" on
request (Tier 3 feature).
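The storage layout and retention rule above can be made concrete with two helpers. This is a sketch with hypothetical function names; in practice the deletion itself would be done by bucket lifecycle rules, and `isExpired` only illustrates the rule those rules encode.

```ts
// Build the object path from the layout above:
// gs://vibn-ws-{slug}/backups/{db-uuid}/{iso-timestamp}.sql.gz
function backupObjectPath(slug: string, dbUuid: string, takenAt: Date): string {
  return `gs://vibn-ws-${slug}/backups/${dbUuid}/${takenAt.toISOString()}.sql.gz`;
}

// True when a backup is older than the retention window (in whole days).
function isExpired(takenAt: Date, retentionDays: number, now: Date): boolean {
  const ageMs = now.getTime() - takenAt.getTime();
  return ageMs > retentionDays * 24 * 60 * 60 * 1000;
}
```

Using ISO-8601 timestamps in the object name keeps a plain lexicographic listing in chronological order.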
**Upgrade path:**
- **Postgres point-in-time recovery** via `wal-g` shipping WAL segments
to the same GCS bucket. Adds RPO < 5 min.
- **ClickHouse**: `clickhouse-backup` to GCS.
- **MongoDB**: `mongodump` incremental.
**Estimate:** **3 days** for MVP (pg_dump + schedule + restore).
**+1 week** for wal-g PITR if/when a customer asks.
---
### P6.2 · Runtime log streaming (Cloud Logging)
**Goal:** agent can see "is the app erroring at 10 req/s right now?",
not just "did the build succeed."
**Why:** today deploy logs are surfaced but container stdout/stderr is
not. An agent that "fixed a bug" can't verify the fix without a human
SSH-ing into Coolify.
**GCP collapses this item** — ship container logs to Cloud Logging with
a workspace label, query via the logs API.
**Shape:**
- Fluent-bit sidecar (or Coolify label) ships container stdout/stderr
to Cloud Logging in `northamerica-northeast1` with labels
`workspace={slug}`, `app={app-uuid}`, `service={web|worker|...}`.
- Per-workspace log bucket for retention isolation.
**Surface:**
| Tool | Purpose |
|---|---|
| `apps.logs` | Last N lines across replicas. Filter by timestamp, severity. |
| `apps.logs.tail` | SSE stream of new log lines. |
| `apps.logs.search` | Thin wrapper on Cloud Logging's query API — grep, severity filter, time window. |
| `apps.services.logs` | Same, scoped to a single service. |
**Retention:** default 30 days in the workspace log bucket; exportable
to the workspace's GCS bucket on request for long-term storage.
**Estimate:** **3 days** (fluent-bit config + thin API wrapper).
---
### P6.3 · Scoped API keys
**Goal:** invite a CI bot or teammate without giving root on the
workspace.
**Why:** solo-builder flow survives without it. Breaks the moment a
second principal enters.
**Shape:**
- Keys gain `scopes: string[]` and optional `expires_at`.
- Scope tokens: `apps:read`, `apps:write`, `apps:delete`,
`databases:*`, `auth:*`, `domains:read`, `domains:write`,
`storage:*`, `email:send`, `cron:*`, `queues:*`, `deploy:*`.
- Per-scope rate limits optional (Tier 3; API shape supports it from
day one).
**Surface changes:**
| Tool | Change |
|---|---|
| `keys.create` | Accepts `scopes`, `expires_at`. |
| `keys.list` | Returns scopes per key. |
| `keys.rotate` | Mints new token, preserves scope set. |
Every MCP/REST handler gets a scope requirement checked in the
principal resolver.
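The scope check in the principal resolver might reduce to something like the following. It is a sketch under assumptions: the wildcard semantics (`databases:*` grants every `databases:` action) and both function names are illustrative, not the shipped resolver.

```ts
type ApiKey = { scopes: string[]; expires_at?: string };

// Does one granted scope cover the required `resource:action` pair?
// `resource:*` is treated as covering every action on that resource.
function scopeAllows(granted: string, required: string): boolean {
  const [grantedResource, grantedAction] = granted.split(':');
  const [requiredResource, requiredAction] = required.split(':');
  if (grantedResource !== requiredResource) return false;
  return grantedAction === '*' || grantedAction === requiredAction;
}

// A key is authorized when it has not expired and any of its scopes matches.
function isAuthorized(key: ApiKey, required: string, now: Date): boolean {
  if (key.expires_at !== undefined && new Date(key.expires_at).getTime() <= now.getTime()) {
    return false;
  }
  return key.scopes.some((scope) => scopeAllows(scope, required));
}
```

Each MCP/REST handler would then declare a single required scope string and let the resolver reject the request before any handler logic runs.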
**Estimate:** **1 week.**
---
### Tier 2 total: ~2 weeks
After Tier 2 lands, a SaaS shipped on Vibn can survive without you
dropping into a psql REPL at 3am.
---
## Tier 3 — Matters once usage scales
Don't build these until at least one real customer is hitting them.
Building them pre-market is the classic infra-overinvestment trap.
### P7.1 · Per-workspace quotas + cost caps
Max apps, max dbs, max GCS GB, max egress, max SES messages/month, max
OpenSRS spend/month. Per-plan configurable. Hallucinating agents can't
OOM the cluster or burn your SES reputation.
### P7.2 · Audit log
Append-only per-workspace log of (principal, action, params, timestamp,
result). Cloud Logging with a dedicated `audit-logs` log-bucket, 400-day
retention. Read API for the settings panel. Needed for any
SOC-2-adjacent buyer.
### P7.3 · Preview-per-PR environments
Open a PR → `pr-42.mark.vibnai.com` deploys automatically with a
throw-away database. Teardown on PR close/merge. Unblocks multi-agent
flows.
### P7.4 · Atomic multi-resource operations (`stacks`)
`POST /stacks` takes a full app + db + auth + domain + cron spec;
creates atomically, rolls back on failure. Agent ergonomics win once
demo flow is routine.
### P7.5 · Billing integration
Stripe subscriptions for Vibn itself (workspace billing), plus
per-workspace float top-ups, plus reconciliation to the OpenSRS master
deposit and GCP / SES cost allocation. Only needed when you charge
real dollars.
### P7.6 · Assured Workloads for Canada
GCP policy-enforced Canadian residency + Canadian personnel access.
For regulated customers (healthcare, financial, public sector). Priced
accordingly; ship only when a real customer needs it.
### P7.7 · CIRA D-Zone as a workspace DNS option
Swap Cloud DNS → CIRA D-Zone for a workspace with strict residency
requirements. API-compatible wrapper so nothing agent-facing changes.
---
## Tier 4 — Revisit when demanded
Items to explicitly *not* build until a concrete customer asks.
- **Multi-region** — single-region Canada is fine for B2B SaaS makers
(our early market).
- **Cloud Run migration** — would mean rewriting most of the Coolify-based
capabilities. Revisit if/when Coolify becomes a bottleneck.
- **Managed search / vector DB as first-class types** — agents can
deploy Meilisearch / Typesense / pgvector-Postgres as regular services.
- **mTLS / custom CAs / BYO-cert upload** — enterprise creep.
- **MCP protocol polish** (streaming, resources, prompts, per-tool
schemas) — current JSON-over-HTTP works. Revisit on real friction.
- **Per-app basic auth, IP allowlists, WAF** — Traefik middleware
manually until someone asks.
---
## Roadmap at a glance
| Phase | Items | Est. | Unblocks |
|---|---|---|---|
| **P5 — Real SaaS primitives** | Domains, email, storage, workers/cron/queues | ~5 wk | Shipping a real product |
| **P6 — Keep-it-running** | Backups, runtime logs, scoped keys | ~2 wk | First real customer survives |
| **P7 — Scale** | Quotas, audit, previews, stacks, billing, Assured Workloads, D-Zone | demand-driven | Platform grows past 1st cohort |
| **P8+** | Tier 4 items | never, unless pulled by customer | — |
**Total to "agent ships a SaaS a founder would pay $29/mo for":**
P5 + P6 = **~7 weeks** (was ~11 before GCP-CA; ~40% compression from
managed-service leverage).
---
## Dependency graph
```
P5.1 Domains ──┬──→ P5.2 Email Phase B (per-domain DKIM)
├──→ P7.7 CIRA D-Zone swap
└──→ (future: customer-owned sub-domain routing)
P5.3 Storage ──┬──→ P6.1 Database backups (backups need a bucket)
└──→ P7.2 Audit log export
P5.4 Workers/cron/queues ──┬──→ P6.1 Database backups (run via scheduler)
└──→ most real SaaS patterns
P6.2 Runtime logs — independent, can land anytime
P6.3 Scoped keys — independent, can land anytime
P7.6 Assured Workloads — wraps everything; build once demanded
```
**Parallelizable (three people):**
- Track A: P5.1 → P5.2
- Track B: P5.3 → P6.1
- Track C: P5.4 → P6.2
Track C finishes earliest; use that slack to land P6.3.
---
## Per-workspace GCP provisioning (shared across P5.3, P5.4, P6.1, P6.2)
`ensureWorkspaceProvisioned()` gains a GCP-CA block that runs once per
workspace, idempotently. All resources are created in
`northamerica-northeast1`.
| Resource | Name pattern | Notes |
|---|---|---|
| GCS bucket | `vibn-ws-{slug}` | Uniform bucket-level access. Lifecycle policies off by default. |
| Cloud DNS managed zone | `vibn-ws-{slug}-zone` | Created per workspace-owned domain in P5.1, not on workspace provision. |
| Cloud Logging log bucket | `vibn-ws-{slug}-logs` | 30-day retention default. |
| Cloud Tasks location | `northamerica-northeast1` | Queues created per-app in P5.4, not here. |
| GCP service account | `vibn-ws-{slug}@{project}.iam.gserviceaccount.com` | Single SA per workspace, narrow roles. |
| Service account key | stored encrypted in `vibn_workspaces` | AES-256-GCM, same `VIBN_SECRETS_KEY`. |
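Because provisioning must be idempotent, it helps if every resource name is a pure function of the workspace slug. The sketch below derives the names from the patterns in the table, assuming slugs are lowercase alphanumerics separated by single hyphens and assuming the standard `iam.gserviceaccount.com` suffix for service account emails; the `workspaceGcpNames` helper and its return shape are hypothetical.

```typescript
// Region stated in this section; all per-workspace resources live here.
const GCP_REGION = "northamerica-northeast1";

// Hypothetical return shape for illustration only.
interface WorkspaceGcpNames {
  bucket: string;
  logBucket: string;
  serviceAccountEmail: string;
  region: string;
}

// Derives deterministic resource names from the workspace slug, so repeated
// provisioning runs always target the same resources.
function workspaceGcpNames(slug: string, gcpProject: string): WorkspaceGcpNames {
  // Assumption: slugs are lowercase letters and digits separated by single hyphens.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(`invalid workspace slug: ${slug}`);
  }
  return {
    bucket: `vibn-ws-${slug}`,
    logBucket: `vibn-ws-${slug}-logs`,
    serviceAccountEmail: `vibn-ws-${slug}@${gcpProject}.iam.gserviceaccount.com`,
    region: GCP_REGION,
  };
}
```

With deterministic names, each provisioning step can be written as "create if missing", which is what makes a second run a no-op.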
**New columns** on `vibn_workspaces` (cumulative across P5.1-P6.2):
```sql
-- P5.1
preferred_dns_provider TEXT DEFAULT 'cloud_dns',
cloud_dns_zone_name TEXT,
-- P5.3
gcs_bucket_name TEXT,
gcp_service_account_email TEXT,
gcp_service_account_key_encrypted BYTEA,
-- P5.4
cloud_scheduler_location TEXT DEFAULT 'northamerica-northeast1',
cloud_tasks_location TEXT DEFAULT 'northamerica-northeast1',
-- P6.2
cloud_logging_bucket_name TEXT
```
Four migration steps, one per item above (P5.1, P5.3, P5.4, P6.2). All
guarded by the existing admin-gated `POST /api/admin/migrate` endpoint.
---
## Non-goals (stated explicitly so they don't creep in)
- **A general-purpose PaaS.** Vibn is an agent-driven SaaS builder, not
a Heroku / Fly clone. Every capability must answer "what does an agent
need to build a SaaS?" — not "what does a dev need to deploy a
container?"
- **Support for non-allowlisted auth providers, databases, services.**
The curated surface is the feature. "Any Coolify service" would blow
up the tenant-safety model and dilute agent decision-making.
- **A consumer-facing OpenSRS UI.** OpenSRS is plumbing for the agent.
Humans should never see an OpenSRS checkout screen — only
`domains.register { name: "mysaas.com" }` from the agent.
- **Multi-cloud abstraction layer.** One Coolify cluster + GCP-CA +
SES-CA + OpenSRS is the contract. If customers want to bring their
own, that's Tier 4.
- **Anything that moves customer data out of Canada.** Even for
performance. If a managed service only has US regions, we self-host
in Canada or we don't offer it.
---
## Recommended execution order (opinionated)
Given dependencies and quick-wins-first philosophy:
**Week 1:**
- P5.3 Storage (GCS wrap, 3 days) → proves the GCP-CA provisioning pattern.
- P5.4 Workers/cron/queues (starts in parallel; depends on P5.3 only for
the service account).
**Week 2:**
- P5.4 completes.
- P5.1 Domains starts (OpenSRS client + Cloud DNS wrapper).
**Week 3:**
- P5.1 completes.
- P5.2 Email Phase A (shared-sender MVP) starts.
**Week 4:**
- P5.2 Phase A completes.
- P5.2 Phase B (per-domain DKIM) starts, now that P5.1 is available.
**Week 5:**
- P5.2 Phase B completes. **P5 / Tier 1 done.**
- P6.1 Database backups starts (3 days).
- P6.2 Runtime logs starts in parallel (3 days).
**Week 6:**
- P6.3 Scoped keys (1 week).
**Week 7:**
- Slack week — hardening, docs (`AI_CAPABILITIES.md` refresh), first
real customer onboarding.
**End state at week 7:** agent can take a founder from "I have an idea"
to "I have `mysaas.com` live, with auth, with user uploads, with email,
with backups, with visible error logs, and a CI bot can deploy it
without root access."
That's the Vibn product.
---
## How to use this doc
- When someone proposes a feature, find its tier. If it's Tier 3 or 4
and we're still shipping Tier 1, say no.
- Before starting a Tier 1 item, re-read its section and make sure
prerequisites shipped. Email-per-domain before domains is wasted code.
- [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) is the canonical
reference of *what exists today*. This doc is the canonical reference
of *what comes next*. When an item ships, move it from here to that
doc and delete its section here.
- When a user request implies Canadian residency (they say "PIPEDA",
"healthcare", "public sector", or "our data can't leave Canada"), pin
the answer to this doc's §0 Substrate & constraints. Don't improvise.
Current pending capabilities/roadmap items are tracked in `BETA_LAUNCH_PLAN.md`.


@@ -1,227 +1,8 @@
# AI Harness Gaps — Proposal
# AI Harness Stability & Middleware (Shipped)
> Four gaps in the Vibn AI experience that are **structural, not promptable**.
> Each one is responsible for a specific failure pattern visible in real
> production chat transcripts. None of them are scoped in
> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
> agent-execution / telemetry-streaming designs.
>
> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
>
> **Why these four:** they share a common shape — the model is doing what
> the prompt told it to, and still producing a bad outcome. The fix lives
> in the *harness around the model*, not in instructions to the model.
> **Note:** These middleware stability mechanisms have been shipped.
---
## TL;DR
| # | Gap | Failure pattern in prod | Fix size |
|---|---|---|---|
| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
| 2 | Browser-driver tool for the AI | "Should be live in 10s" — AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |
Total: ~15 hr of work. None require new infra.
---
## Gap 1 — Tool-error recovery middleware (highest ROI)
**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
On each attempt it responded by *creating a new service with a new name*,
not by calling `apps_unstick`. The prompt explicitly tells it not to do
this and tells it the recovery sequence. The model still did it.
**Why prompt rules fail here:** the model treats the system prompt as
soft guidance against a 30k-token document; the tool result is concrete
and 200ms-fresh. When tool reality contradicts prompt rules, tool
reality wins.
**Proposed fix:** middleware in `executeMcpTool` that pattern-matches
known-recoverable errors and **injects a synthetic system message** into
the conversation before the next round. The model can't ignore an
injected instruction the way it can ignore a static prompt rule.
```ts
// In app/api/chat/route.ts, around the executeMcpTool call:
const errorRecovery = detectKnownError(result);
if (errorRecovery) {
  messages.push({
    role: "system",
    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
  });
}
```
**Initial recovery rules** (high-confidence, low-false-positive):
| Error signature | Diagnosis | Fix | Antipattern |
|---|---|---|---|
| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |
**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
patterns + the injection in the chat route. Each rule is ~10 lines.
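The pattern registry could look roughly like this. It is a minimal sketch, assuming tool results arrive as plain strings; the `RecoveryRule` shape and the regular expressions are illustrative guesses derived from the table of initial rules, not the contents of the real `lib/ai/error-recovery.ts`.

```typescript
// Hypothetical rule shape: one entry per known-recoverable error signature.
interface RecoveryRule {
  pattern: RegExp;
  diagnosis: string;
  fix: string;
  antipattern: string;
}

type Recovery = Omit<RecoveryRule, "pattern">;

// Patterns deliberately avoid the `g` flag so RegExp.test() stays stateless.
const RECOVERY_RULES: RecoveryRule[] = [
  {
    pattern: /Conflict\. The container name .* is already in use/i,
    diagnosis: "Orphan container blocking new boot",
    fix: "apps_unstick { uuid } then apps_deploy { uuid }",
    antipattern: "delete and recreate with a new name",
  },
  {
    pattern: /pull access denied|manifest unknown/i,
    diagnosis: "Image not on the host yet",
    fix: "apps_repair { uuid }",
    antipattern: "retry deploy without addressing the cause",
  },
  {
    pattern: /port .* is already allocated/i,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

// Returns the first matching recovery, or null when the result is not a known error.
function detectKnownError(result: string): Recovery | null {
  for (const { pattern, ...recovery } of RECOVERY_RULES) {
    if (pattern.test(result)) return recovery;
  }
  return null;
}
```

Keeping the rules as data means a new failure pattern is one more array entry, with no change to the chat route.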
**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).
---
## Gap 2 — Browser-driver tool for the AI
**Failure observed:** in the same Twenty thread, the AI said *"It's
fully deployed, healthy, and I've verified it's returning a 200 OK
status"* — but the user saw "Unable to Reach Back-end" on the actual
page. The AI checked Coolify's status reporting, not the rendered app.
Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
on the very first load for the DNS to propagate"* — the AI hedged
because it couldn't load the URL itself.
**Why this matters for beta:** every "I deployed it" claim is unverified
unless the AI can open the URL. Sentry (planned in P2.3) catches
errors *after a user hits them*. A browser tool catches errors
*before any user hits them*.
**Proposed fix:** add a `browser.*` MCP tool surface backed by a
headless Chromium running on the Coolify host (or in the vibn-dev
container). Initial tools:
| Tool | Purpose |
|---|---|
| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |
**Implementation:** Playwright already has an MCP server (`@playwright/mcp`).
Wire it as a Coolify service, expose via the same per-workspace MCP
token Vibn already issues.
**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
add tool definitions, ~1 hr to wire prompt instructions ("after any
deploy or `dev_server.start`, call `browser.navigate` to confirm").
**Slot into:** Phase 2 (Stability & visibility) — pairs with the
runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).
---
## Gap 3 — Live UI state attached to chat messages
**Failure observed:** in the Dr Dave thread, user typed *"are you able
to give me a preview url?"* The AI didn't know which port the
Next.js dev server would bind to, what was already running, or
whether the user was looking at the chat or another tab. It
guessed and re-discovered everything from scratch.
In the Twenty thread, *"can you see the different sections?"* — user
meant Plan tab sections (Vision/Tasks/Decisions/Ideas). AI listed
metadata. No way to know.
**Why prompt rules can't fix this:** the AI literally lacks the
information.
**Proposed fix:** the chat panel sends a small `uiContext` object
alongside every user message. Inject into the system prompt as a
dynamic block (same shape as `activeBlock`):
```ts
{
  currentRoute: "/mark-account/project/abc/hosting",
  currentTab: "hosting",
  visibleResources: [
    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
  ],
  lastUserActions: [
    { at: "2m ago", action: "opened twenty-crm logs" },
    { at: "5m ago", action: "switched to Hosting tab" },
  ],
}
```
System-prompt block becomes:
> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
> When the user says "this" / "it" / "the URL" — assume they mean
> something visible in the current viewport unless they name something else.
**Effort:** ~3 hr. ~1 hr to wire the chat panel's
`uiContext` collection (existing route + tab state, last 5 actions
from a small ring buffer in the panel), ~1 hr to plumb through the
chat API, ~1 hr to add the prompt block.
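The two pieces named in the estimate, a small ring buffer of recent actions and the prompt block, might be sketched as follows. The `ActionRing` class and `renderUiContextBlock` function are hypothetical names, and the field layout simply mirrors the `uiContext` example in this section.

```typescript
interface UiAction {
  at: string;
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: { kind: string; uuid: string; name: string }[];
  lastUserActions: UiAction[];
}

// Keeps only the most recent `capacity` actions, newest first.
class ActionRing {
  private items: UiAction[] = [];
  private capacity: number;

  constructor(capacity: number = 5) {
    this.capacity = capacity;
  }

  push(action: UiAction): void {
    this.items.unshift(action);
    if (this.items.length > this.capacity) {
      this.items.length = this.capacity; // drop the oldest entries
    }
  }

  snapshot(): UiAction[] {
    return [...this.items];
  }
}

// Renders the dynamic system-prompt block from the context sent with a message.
function renderUiContextBlock(ctx: UiContext): string {
  const resources = ctx.visibleResources.map((r) => r.name).join(", ") || "none";
  const actions =
    ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ") || "none";
  return [
    `The user is currently looking at the ${ctx.currentTab} tab (route: ${ctx.currentRoute}).`,
    `Visible resources: ${resources}.`,
    `Recent actions: ${actions}.`,
  ].join("\n");
}
```

Capping the buffer keeps the injected block short and bounded, so it never competes with the rest of the system prompt for tokens.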
**Slot into:** Phase 3 (UX surfaces) — pairs with 3.2 (structured
errors in chat) and 3.3 (empty-state nudges).
---
## Gap 4 — Diff preview / accept-changes gate
**Failure observed:** none yet, but the surface is exposed today —
`fs_edit` writes directly to `/workspace` in the dev container. For
ephemeral exploration this is correct (sub-second iteration is the
whole Path B point). For changes destined to ship, the user has no
review surface; they only see what changed after the AI summarizes.
**Why this matters for beta:** the moment a paying user wants to
"see what the AI changed before it goes live," there's nothing to
show them. Cursor's whole UX is built on diffs the user accepts.
**Proposed fix:** two-mode `fs_edit` / `fs_write`:
1. **Direct mode (default for dev container):** write immediately. Current
behavior. Fine for "make the button blue" iteration.
2. **Staged mode (default when `ship` is the next likely action):**
write to a shadow path, surface a diff in the chat UI, gate the
real write on a one-click "Accept" button.
The model decides which mode based on context — or simpler: stage when
the file is in a "protected" set (e.g. `prisma/schema.prisma`,
`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
direct otherwise.
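The simpler protected-set rule could be sketched like this. It assumes protected files are matched by exact path relative to `/workspace` and that `prod/` or `migrations/` counts at any depth; both are interpretations of the sentence above, and `editModeFor` is a hypothetical helper name.

```typescript
type EditMode = "staged" | "direct";

// Files that always go through the diff-and-accept gate (from the examples above).
const PROTECTED_FILES = new Set<string>([
  "prisma/schema.prisma",
  "Dockerfile",
  "package.json",
]);

// Directory names that mark everything beneath them as protected.
const PROTECTED_DIRS = new Set<string>(["prod", "migrations"]);

// Decides whether an edit to `path` is written directly or staged for review.
function editModeFor(path: string): EditMode {
  // Normalise "/workspace/foo" and "./foo" to "foo".
  const relative = path.replace(/^\/workspace\//, "").replace(/^\.\//, "");
  if (PROTECTED_FILES.has(relative)) return "staged";
  // Only directory segments count, so a file named "prod.ts" stays direct.
  const dirs = relative.split("/").slice(0, -1);
  return dirs.some((dir) => PROTECTED_DIRS.has(dir)) ? "staged" : "direct";
}
```

A static rule like this is predictable for the user: the same file always behaves the same way, regardless of what the model infers about intent.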
**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
~1 hr prompt + tool changes.
**Slot into:** Phase 4 (Onboarding & safety) — pairs with 4.5 (auth
hardening) and 4.6 (compute quotas) as part of "what a stranger
needs day 1."
---
## Suggested sequencing
If we ship in priority order:
1. **Gap 1 first** — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
2. **Gap 2 second** — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
3. **Gap 3 third** — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
4. **Gap 4 last** — only matters once we have paying users editing prod-bound code. Pre-beta optional.
Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**
---
## How this changes BETA_LAUNCH_PLAN.md
Two new tasks slot in:
- **P2.8** Tool-error recovery middleware (Gap 1) — block on nothing, ship before P2.4.
- **P2.9** Browser-driver MCP tool (Gap 2) — block on nothing.
One new task in P3:
- **P3.7** UI-state injection into chat (Gap 3) — block on nothing.
Gap 4 stays out of beta scope unless eval reveals real damage from
unstaged edits.
- The chat loop (`app/api/chat/route.ts`) acts as a robust harness that intercepts tool errors and automatically suggests recovery paths (e.g., port conflicts, container collisions).
- The maximum tool execution loop is capped (`MAX_TOOL_ROUNDS=30`) to prevent runaway AI loops.
- `fs_edit` uses line-number replacements alongside strict `oldString` matching to avoid Aider-style search-and-replace failures.
- Sentry and Coolify deployment webhooks automatically pipe deployment/build failures back to the user/AI.
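The round cap from the list above can be sketched as a bounded loop. This is an illustration only, assuming the model turn and tool executor are injected as callbacks; `ModelTurn`, `runToolLoop`, and the synchronous signatures are hypothetical simplifications of the real chat route.

```typescript
const MAX_TOOL_ROUNDS = 30;

// Hypothetical, simplified view of one model response.
interface ModelTurn {
  toolCalls: string[];
}

interface LoopOutcome {
  rounds: number; // number of rounds in which tools were executed
  stoppedByCap: boolean;
}

// Alternates model turns and tool execution until the model stops asking for
// tools or the cap is reached, whichever comes first.
function runToolLoop(
  nextTurn: (round: number) => ModelTurn,
  execTool: (name: string) => void,
  maxRounds: number = MAX_TOOL_ROUNDS,
): LoopOutcome {
  for (let round = 0; round < maxRounds; round++) {
    const turn = nextTurn(round);
    if (turn.toolCalls.length === 0) {
      return { rounds: round, stoppedByCap: false };
    }
    for (const name of turn.toolCalls) {
      execTool(name);
    }
  }
  return { rounds: maxRounds, stoppedByCap: true };
}
```

Returning `stoppedByCap` lets the caller tell the user that the run was cut short rather than silently ending the conversation.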


@@ -1,288 +1,12 @@
# Path B Execution Plan — Persistent Dev Container Architecture
# AI Path B (Shipped)
> The plan to replace Vibn's current "API-wrap-every-Coolify-action" agent
> surface with a Claude-Code-style architecture: one persistent dev
> container per Vibn project, ~10 composable tools, sub-15-second
> iteration, and Coolify only touched at "ship it" time.
>
> **Companion to:** [`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) (current
> state) and [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md)
> (everything else).
>
> **Status:** week 1 shipped (2026-04-28). Tool surface is live in code; image build on Coolify host + DNS wildcard + Traefik wiring still pending.
>
> **Why this exists:** today's AI loop is *37 min to first preview, 24
> min per iteration*, because every change goes through a Coolify nixpacks
> build. That UX cannot host the marketplace / SaaS / iterative-build
> stories Vibn is selling. Path B fixes the floor.
> **Note:** This document outlines the architecture for "Path B", which shifted the AI's execution context from Cloud Run to persistent per-project Docker containers hosted on the Coolify server. This architecture fully shipped in May 2026.
---
## Architecture
- Every project has a persistent Gitea repository.
- Every project gets a single `vibn-dev` container provisioned as a Coolify service (`ensureDevContainer`).
- The AI runs its tools (like `shell_exec` and `fs_*`) *inside* this container using `docker exec` via the Coolify API.
- Dev servers (like `npm run dev`) bind to `0.0.0.0:3000` and are exposed to the internet via Traefik wildcard subdomains (`*.preview.vibnai.com`).
- When the user is ready, the code is committed to Gitea and deployed to production via `apps_deploy`.
## 1. The user experience this unlocks
Reference scenario: a non-technical founder chats *"build me a
two-sided marketplace for handmade ceramics."*
| Phase | Path A (today) | Path B (target) |
|---|---|---|
| Discovery & OSS pick | OK | OK |
| Fork an OSS base (e.g. Sharetribe, 800 files) | ~15 min of single-file commits, 800 webhook fires | `git clone` in 8s |
| First live preview | 37 min (Coolify build) | ~30s (Vite HMR in dev container) |
| Each iteration | 24 min (rebuild) | 3–15s (HMR / process restart) |
| User makes 10 small decisions | ~40 min of staring at spinners | ~3 min of conversation |
| "Ship it" → real domain | already 3 min | 3 min (unchanged — this is the only Coolify build) |
| Total time to live, polished marketplace | 30–60 min, often abandoned | ~20 min, mostly the user thinking |
The asymmetry is structural, not optimisable inside Path A.
---
## 2. Architecture overview
```
┌──────────────────────────┐    ┌────────────────────────────────┐
│  vibnai.com chat (user)  │ ←→ │ /api/mcp                       │
└──────────────────────────┘    │  ├ shell.exec                  │
                                │  ├ fs.read / fs.edit / fs.glob │
                                │  ├ dev_server.start            │
                                │  ├ ship                        │
                                │  └ apps.* / databases.* / ...  │
                                └────────────┬───────────────────┘
                                             ▼ (workspace-scoped)
                                ┌────────────────────────────────────┐
                                │  Per-Vibn-project Coolify project  │
                                │  ├ vibn-dev   ← dev container      │
                                │  ├ web        ← prod app           │
                                │  ├ db                              │
                                │  └ ...                             │
                                └────────────────────────────────────┘
```
### Per-project dev container — the only new piece
For every active Vibn project, we run **one long-lived Coolify
service named `vibn-dev`** inside that project's dedicated Coolify
project (Stage 2/3 of per-project isolation already shipped).
| Property | Value |
|---|---|
| **Image** | `ghcr.io/vibnai/vibn-dev:latest` (we build & maintain) |
| **Base** | Ubuntu 24.04 |
| **Pre-installed** | Node 20, bun, pnpm, Python 3.12 + uv, Go 1.23, Rust, git, gh, `tea` (Gitea CLI), ripgrep, fd, jq, curl, tar, openvscode-server |
| **Default `cwd`** | `/workspace` (persistent volume containing the Gitea working tree) |
| **Persistent volumes** | `/workspace` (git tree), `/cache/{npm,pip,go,cargo}` (package caches) |
| **Resource floor** | 512 MB / 0.25 CPU when idle |
| **Resource ceiling** | 4 GB / 2 CPU during builds (configurable per workspace plan) |
| **Idle suspend** | After 30 min no `shell.exec` activity |
| **Re-wake** | Any `shell.exec` / `fs.*` / `dev_server.*` call |
| **Ports** | 3000–9999 reserved for the AI's dev server, exposed at `https://preview-{ws}-{project}.vibnai.com` via Traefik wildcard |
| **Tenancy** | Inherits per-project Coolify isolation — workspace can never reach into another's dev container |
### Why this shape (and not e2b / Cloud Run / VM-per-task)
- We already have Coolify, per-project Coolify projects, and Coolify
exec primitives. Adding one service per project is zero new infra.
- Persistence (workspace state, package cache, git working tree)
matters more than per-task isolation for our user. Founders return
to projects across sessions.
- Tenant safety is already solved at the Coolify-project layer.
- Cost stays bounded: one container per *active* project, idle-suspended.
- Upgrade path to e2b / Firecracker exists later if needed (replace the
executor, keep the tool surface).
---
## 3. Tool surface
### New tools (the AI's primary working set)
| Tool | Signature | Purpose |
|---|---|---|
| `shell.exec` | `{ cmd, cwd?, timeoutSec?, env? }` | Run any shell command in the dev container. Streams stdout/stderr back. Capped 15 min. |
| `fs.read` | `{ path, ref? }` | Read a file (or directory listing) from `/workspace`. |
| `fs.write` | `{ path, content }` | Create/overwrite a file. |
| `fs.edit` | `{ path, oldString, newString, replaceAll? }` | Aider-style search/replace. Fails if `oldString` not found / not unique. |
| `fs.glob` | `{ pattern, cwd? }` | List files matching a pattern (e.g. `**/*.tsx`). |
| `fs.grep` | `{ pattern, glob?, contextLines? }` | ripgrep-backed code search. |
| `fs.delete` | `{ path }` | Delete a file or directory. |
| `dev_server.start` | `{ cmd, port, name? }` | Start a long-running process (e.g. `npm run dev`). Returns a public preview URL. |
| `dev_server.stop` | `{ id }` | Kill a dev server. |
| `dev_server.list` | — | What's running, on what URL. |
| `ship` | `{ projectId, commitMsg, deploy? }` | `git add . && git commit && git push` to Gitea, then trigger Coolify deploy of the prod app. The "graduate to production" tool. |
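The `fs.edit` contract in the table (fail when `oldString` is missing or not unique) can be sketched as a pure function over file contents. This is a minimal illustration; `applyEdit` and the `EditResult` type are hypothetical names, and an empty `oldString` is treated as "not found" by assumption.

```typescript
type EditResult =
  | { ok: true; content: string; replacements: number }
  | { ok: false; reason: "not_found" | "ambiguous"; matches: number };

// Applies a search/replace edit. Without `replaceAll`, the edit only succeeds
// when `oldString` occurs exactly once, so the target is never ambiguous.
function applyEdit(
  content: string,
  oldString: string,
  newString: string,
  replaceAll: boolean = false,
): EditResult {
  if (oldString === "") {
    return { ok: false, reason: "not_found", matches: 0 };
  }
  // split/join treats both strings literally (no "$&"-style replacement patterns).
  const parts = content.split(oldString);
  const matches = parts.length - 1;
  if (matches === 0) {
    return { ok: false, reason: "not_found", matches };
  }
  if (matches > 1 && !replaceAll) {
    return { ok: false, reason: "ambiguous", matches };
  }
  return { ok: true, content: parts.join(newString), replacements: matches };
}
```

Refusing ambiguous matches pushes the model to include more surrounding context in `oldString`, which is what keeps the edit from landing in the wrong place.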
### Kept (orchestration — these are correctly modeled as APIs)
- `apps.*` — Coolify app CRUD, logs, domains, env vars, etc.
- `databases.*`, `auth.*`, `domains.*`, `storage.*` — infrastructure primitives.
- `projects_get`, `projects_list`, `workspace_describe` — context.
- `github_search`, `github_file`, `http_fetch` — external lookup.
### Deprecated (kept for back-compat, banner in docs)
- `gitea_file_read`, `gitea_file_write`, `gitea_file_delete`,
`gitea_branches_list`, `gitea_branch_create`,
`gitea_repo_create`, `gitea_repo_get`, `gitea_repos_list` — the
AI uses `shell.exec` (`git`/`tea` CLI) and `fs.*` instead.
- `apps.exec` — kept (it's still useful for prod-container debugging),
but deprecated for *dev-time* code work.
**Net change:** 53 tools → ~30 tools, but the new ones compose to do
everything the old ones did and more.
---
## 4. The system prompt rewrite
The AI's prompt today says *"call gitea_file_write to push code."* It
becomes:
> You have a real Linux dev environment for this project at `/workspace`.
> Use `shell.exec` to run any command (npm, git, tea, python, anything).
> Use `fs.edit` for surgical changes, `fs.write` for new files.
>
> Standard loop:
> 1. `shell.exec { cmd: "git status" }` to see what's there.
> 2. Edit / create files via `fs.edit` / `fs.write`.
> 3. `shell.exec { cmd: "npm test" }` (or relevant test runner).
> 4. `dev_server.start` to give the user a live preview URL.
> 5. When the user says "ship it", call `ship` — that pushes and
> triggers the production Coolify deploy.
>
> NEVER call `apps_create` to deploy code that hasn't been tested via
> `shell.exec` first. The dev container is your safety net.
---
## 5. Week-by-week execution
### Week 1 — Foundations (dev container + shell) — **SHIPPED 2026-04-28**
**Goal:** AI can clone a repo, install deps, run a script.
- [x] `vibn-dev/Dockerfile` (Ubuntu 24.04 + git + ripgrep + python3 + mise lazy toolchains). `setup-on-coolify.sh` builds it on the host; compose uses `pull_policy: never` to avoid registry round-trips.
- [x] `lib/dev-container.ts`: ensure / exec / suspend / resume helpers. Backed by `fs_project_dev_containers` (auto-created).
- [x] `devcontainer.{ensure,status,suspend}` MCP tools.
- [x] `shell.exec` + `fs.{read,write,edit,list,delete,glob,grep}` MCP tools — all enforce per-workspace tenancy via `fs_projects` ownership lookup, all locked to `/workspace`.
- [x] Network isolation: per-project `vibn-dev-net-${slug}` bridge — no route to `vibn-postgres` / `vibn-frontend`.
- [x] Kill switch: `/api/admin/path-b/{disable,enable}` flips a feature flag in <10s.
- [x] `vibn-tools.ts`: 11 new Gemini tool defs, smoke test passes (63 tools accepted).
- [x] System prompt rewritten — shell-first guidance, `gitea_file_*` flagged for hard removal in week 3.
**Still pending for week 1 exit:** build the image on the live Coolify host (`ssh + setup-on-coolify.sh`), end-to-end verify `devcontainer.ensure → shell.exec ls` against a real project once the frontend deploy lands.
### Week 2 — Preview URLs + iteration — **PARTIALLY SHIPPED 2026-04-28**
**Goal:** AI starts a dev server, user clicks a preview URL, sees their app.
- [ ] DNS: `*.preview.vibnai.com → coolify-host-ip` in OpenSRS. **Manual step, not yet done.**
- [ ] Traefik wildcard cert via DNS-01 against OpenSRS. **Config staged in `vibn-dev/PREVIEWS.md`, not yet applied to live Traefik.**
- [x] `dev_server.{start,stop,list,logs}` MCP tools. Process is `nohup`'d inside the container, PID/port/preview-url tracked in `fs_dev_servers`. Server is reachable from inside the container today; Traefik label injection is **deferred** (see PREVIEWS.md for the recommended pre-allocated-port-range approach).
- [x] `fs.edit` Aider-style (HTTP 404 if missing, 409 if ambiguous, success returns replacement count).
- [x] Per-container CPU/RAM caps: 1 vCPU / 1 GiB by default. Tier scaling via env var.
- [x] System prompt rewritten with shell-first recipe.
**Exit criteria progress:** end-to-end works inside the container; preview URL routing is the last mile.
### Week 3 — Ship-it path + cleanup — **PARTIALLY SHIPPED 2026-04-28**
**Goal:** the dev container's working tree graduates to production.
- [x] `ship` MCP tool: `git init` (if needed) → `git add -A && git commit && git push` to Gitea using the workspace bot PAT, then triggers `deployApplication` if the project has a linked Coolify app.
- [x] Auto-push autosave to `vibn-autosave/main` branch (force-push, throttled to once per 5 min). Endpoint: `POST /api/admin/path-b/autosave { projectId | sweep:true }`.
- [x] Idle-suspend sweep: `POST /api/admin/path-b/idle-sweep[?minutes=30]`. Wire to a 5-min cron once we trust the suspend path.
- [ ] Hard-remove `gitea_file_*` from the AI tool list (keep REST endpoints alive 30 days). **Deferred to next week so we can A/B the new tools first.**
- [ ] Update `AI_CAPABILITIES.md`. **Deferred — will rewrite once eval data is in.**
**Exit criteria progress:** ship loop is functionally complete. Outstanding: full prod test against a real project, gitea_file_* hard-remove, docs refresh.
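The autosave throttle and idle-suspend check from the week 3 items can be sketched as below. The 5-minute and 30-minute values come from the checklist; the `AutosaveThrottle` class and `isIdle` function are hypothetical, and timestamps are passed in explicitly so the logic can be tested without a clock.

```typescript
const AUTOSAVE_INTERVAL_MS = 5 * 60 * 1000; // at most one autosave push per 5 min
const IDLE_SUSPEND_MINUTES = 30; // default idle window before suspend

// Tracks the last autosave push per project and allows at most one per interval.
class AutosaveThrottle {
  private lastPushMs = new Map<string, number>();

  // Returns true (and records the push) when the project is due for an autosave.
  shouldPush(projectId: string, nowMs: number): boolean {
    const last = this.lastPushMs.get(projectId);
    if (last !== undefined && nowMs - last < AUTOSAVE_INTERVAL_MS) {
      return false;
    }
    this.lastPushMs.set(projectId, nowMs);
    return true;
  }
}

// True when the container has seen no activity for at least `minutes`.
function isIdle(
  lastActivityMs: number,
  nowMs: number,
  minutes: number = IDLE_SUSPEND_MINUTES,
): boolean {
  return nowMs - lastActivityMs >= minutes * 60 * 1000;
}
```

A sweep endpoint would call `isIdle` for each running container and, per the risk table, trigger one final autosave before suspending it.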
### Week 4 — Eval, polish, IDE drop-in
**Goal:** measure that this actually delivers the promised UX, ship the optional graduation path.
- [ ] **Eval harness:** 10 reference prompts (TODO app, marketplace, blog with auth, kanban, image-uploader, AI chatbot, simple e-commerce, dashboard, REST API + DB, static site). Measure: time-to-first-preview, time-to-shipped, AI tool-call count, success rate. Compare to a baseline run on Path A.
- [ ] **Theia drop-in:** expose openvscode-server (already in the image) at `https://ide-{ws}-{project}.vibnai.com`. Optional toggle in chat UI: "Open IDE." Lets a user-becoming-developer drop into the same `/workspace` the AI's been editing.
- [ ] **Bug fixes** found during eval.
- [ ] **Docs:** update Vibn's user-facing pages to reflect the new "describe → live preview in seconds → iterate → ship" flow.
**Exit criteria:** eval shows ≥3× speedup on time-to-first-preview vs.
Path A, ≥80% success rate on the 10 reference prompts.
---
## 6. OSS we will lean on (not reinvent)
| Need | OSS choice | Notes |
|---|---|---|
| Dev container image base | Ubuntu 24.04 + toolchains | We bake & maintain. ~1 GB. |
| In-browser IDE (week 4 graduation path) | `openvscode-server` (`gitpod-io/openvscode-server`, MIT) | Pre-installed in the image. Optional toggle. |
| Edit format | **Aider's search/replace block format** (`Aider-AI/aider`, Apache 2.0) | Borrow the format + error semantics. |
| Process supervision inside the container | `tini` (already standard) + a tiny in-house supervisor for `dev_server.*` | No need for full systemd. |
| Code search inside the container | `ripgrep` (`BurntSushi/ripgrep`, MIT) | Pre-installed. `fs.grep` is a thin wrapper. |
| Git inside the container | `git` + `tea` (Gitea CLI, MIT) | `tea` lets the AI do PR ops without us building gitea_pr_* tools. |
| Reference for end-to-end agent loops | `All-Hands-AI/OpenHands` (MIT) | Read their runtime + tool design. Don't import their code. |
| Reference for fast iteration UX | `bolt.new` (`stackblitz/bolt.new`) | UX north star, not a code source. |
---
## 7. Risks & open questions
| Risk | Mitigation |
|---|---|
| **Dev containers eat money.** 100 active projects × 24/7 = ~$50/mo wasted. | Idle-suspend after 30 min. Resume in <5s. Per-plan caps. Auto-delete suspended-and-untouched volumes after 30 days. |
| **`shell.exec` is the universal escape hatch — security?** AI inside a single workspace's container can do anything that container can do. | (a) Per-project Coolify isolation. (b) **Network policy: dev containers have NO route to internal Vibn services (vibn-postgres, vibn-frontend, Coolify control plane). Implemented via Docker network rules in week 1, not deferred.** (c) Audit log on every `shell.exec` call. (d) Per-container CPU/RAM caps absorb fork-bomb / coin-mining attempts. |
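| **Preview URL leaks.** `https://preview-mark-ceramic-market.vibnai.com` is publicly resolvable. | Default: random suffix in subdomain (e.g. `preview-mark-ceramic-market-7a3f.vibnai.com`, shortened here for readability). The target is ~64 bits of unguessability, which needs a 16-hex-character suffix; a 4-character suffix alone would give only 16 bits. Optional Vibn-session-cookie auth as a paid-tier feature later. |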
| **Hot reload through Traefik.** WebSocket / HMR can be finicky over a reverse proxy. | **Spike on week 1, day 1**: bring up a Vite dev server inside vibn-dev, expose via Traefik, edit a file, verify HMR fires. Failure here is the biggest "things look fine until you actually test" risk; de-risk early. |
| **Image size / pull time on first project.** ~1 GB pull adds 30–60s to first dev container spin-up. | (a) Pre-pull image on every Coolify host on deploy. (b) **Keep base image small (~500 MB: OS + git + ripgrep + supervisord + IDE server). Lazy-install language toolchains via `mise` on first project use.** Prevents the image from bloating to 4 GB six months from now. |
| **Dependency cache poisoning.** Cached `node_modules` from project A bleeds into project B. | Caches are per-project (volume `vibn-dev-cache-{projectId}`). Never share. Take the slower-first-install hit; add a Verdaccio mirror later only if it bothers anyone. |
| **AI keeps calling `gitea_file_*` instead of `shell.exec`.** | **Hard removal from AI's tool list in week 3, not soft deprecation.** Keep REST endpoints alive for a 30-day grace period for any external MCP client. After 30 days, return 410 Gone. The AI has no muscle memory; no graceful migration needed. |
| **What if the user has no Vibn project yet?** | First chat creates a project + provisions its Coolify project + spins up `vibn-dev` lazily. ~10s overhead, one-time. Stream progress to the chat ("creating workspace... installing tools..."). Same UX bolt.new uses while WebContainers boot. |
| **Coolify host disk dies → users lose unshipped `/workspace` work.** | **Auto-push to Gitea `vibn-autosave/main` branch every 5 min of activity, plus before idle-suspend.** Treat Gitea as canonical, container disk as ephemeral. Built in week 1, day 2 (not optional). |
| **Path B turns out to be wrong; we need to revert.** | **Kill-switch admin endpoint (`POST /api/admin/path-b/disable`) flips a feature flag — all new chat sessions go back to Path A; existing dev containers drain.** ~10-min revert window. Built week 1. |
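
The random-suffix mitigation for preview URLs can be sketched as follows. This is a minimal illustration, not the shipped code: `previewHostname` and its parameters are hypothetical names, and it assumes the suffix is drawn from Node's `crypto.randomBytes`. Eight random bytes give 64 bits, which render as 16 hex characters.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper (not from the Vibn codebase): builds a preview
// hostname whose last label segment is a 64-bit random suffix.
function previewHostname(
  owner: string,
  project: string,
  baseDomain = "vibnai.com",
): string {
  // 8 bytes = 64 bits of entropy, hex-encoded to 16 characters.
  const suffix = randomBytes(8).toString("hex");
  return `preview-${owner}-${project}-${suffix}.${baseDomain}`;
}

console.log(previewHostname("mark", "ceramic-market"));
```

The suffix only makes the URL hard to guess; it is not access control, which is why the cookie-auth option stays on the list.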
---
## 8. Success metrics
We're not done until **all four** are true on the eval harness:
| Metric | Target | Today (Path A) |
|---|---|---|
| Time-to-first-preview (10 reference prompts, p50) | ≤ 60 s | ~5 min |
| Iteration loop (small edit → user sees change) p50 | ≤ 15 s | ~3 min |
| Tool calls per "build me X" task (median) | ≤ 30 | ~80 |
| End-to-end success rate (live deployable result) | ≥ 80% | ~50% |
---
## 9. What this changes about the existing roadmap
- **Tier 1.5 ("Code authoring capability") is collapsed into this doc.** C1–C9 mostly disappear (replaced by `shell.exec` + `fs.edit`); C10 ("persistent agent dev workspace") **is** Path B.
- **Tier 1 P5.1–P5.4 are unchanged.** Domains, email, storage, workers: still the right next infra primitives. Path B doesn't replace them; it makes the AI capable enough to actually use them.
- **Tier 2 P6.x** (backups, runtime logs, scoped keys) — unchanged.
- **`gitea_*` tools shipped 2026-04-28** are now legacy. Mark deprecated in week 3. Remove in a future cleanup once telemetry confirms zero usage.
---
## 10. Decision needed before week 1 starts
1. **Approve Path B as the primary architecture for code authoring.** (If no, this doc dies here.)
2. **Approve the dev-container-as-Coolify-service implementation choice.** Alternatives: separate dev-host, e2b self-host, Cloud Run jobs. Picked Coolify-service for zero new infra; flag if you want to revisit.
3. **Approve the deprecation of `gitea_file_*` tools.** They were shipped today; deprecating them within 3 weeks is fine if the path forward is clearer, embarrassing if we keep them around as half-working alternates.
4. **Approve the resource cap defaults** (free: 1 GB / 0.5 CPU, paid: 4 GB / 2 CPU). Or set different numbers.
Once those four are decided, week 1 starts.
---
## How to use this doc
- This is the *architectural* execution plan. The detailed task list
goes into the agent's TodoWrite per-week, not into this file.
- When an item ships, **move it from "planned" to "shipped"** in
[`AI_CAPABILITIES.md`](./AI_CAPABILITIES.md) and link the commit/PR.
- When a risk in §7 turns out to be real, document the mitigation
outcome inline so future readers see what actually happened.
- This doc supersedes the proposed Tier 1.5 in
[`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md). Add a
one-line pointer there once approved.
*(Refer to `lib/ai/vibn-tools.ts` and `app/api/mcp/route.ts` for the live implementation).*


@@ -1,275 +1,11 @@
# Project Page Architecture — Product / Infrastructure / Hosting
# Project Page Architecture
> The plan to collapse the 16-page sidebar mess at
> `/[workspace]/project/[projectId]/*` into 3 founder-friendly
> sections, and to make `/project/<id>` actually reflect what the AI
> is doing in the dev container instead of stale Gitea/prod-Coolify
> data.
>
> **Companion to:** [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md)
> (Path B is the engine; this doc is the dashboard for it).
>
> **Status:** week 1 doc + home-page redesign in flight (2026-04-28).
> **Note:** The UI was heavily refactored. The primary surfaces for a project are now:
---
1. **The Plan Tab (`/plan`):** Contains the project's vision/objective document, tasks, decisions, and raw ideas. The AI acts as a scribe here.
2. **The Product Tab (`/product`):** Lists the live codebases (Gitea) and running images (Docker containers).
3. **The Infrastructure Tab (`/infrastructure`):** Lists the underlying resources (PostgreSQL databases, Redis, etc.) managed by Coolify.
4. **The Hosting Tab (`/hosting`):** Lists live runtime environments, logs, and preview URLs.
5. **The Chat Panel:** Available on all project surfaces as a slide-out, used to orchestrate work.
## 1. Why this exists
Today the project page (`/[workspace]/project/[projectId]`) shows two
tiles — Code + Infrastructure — and links to a sidebar with 16
sub-routes (`build`, `run`, `infrastructure`, `deployment`,
`overview`, `insights`, `analytics`, `prd`, `tasks`, `settings`,
`assist`, `design`, `growth`, `grow`, `mvp-setup`, `code` — the last
of which doesn't exist as a route, so the home tile is a dead link).
Two structural problems:
1. **The sidebar grew without an anchor concept.** Founders have no
mental model of what the 16 pages map to; they just see a list
and click around hoping for the right one. Half the pages are
placeholders ("Coming soon"); the rest overlap.
2. **None of the data sources have been updated for Path B.** The
Code tile reads the Gitea repo (production master branch), but the
AI now writes to the dev container's `/workspace`, often without
pushing for hours. The Infrastructure tile reads production
Coolify apps; new `dev_server.start` previews don't show up
anywhere. So when AI does great work in chat, the project page
doesn't update — the user has to tab back to chat to see anything.
---
## 2. The framing
Three sections, founder-friendly names, every project on Vibn maps
cleanly into all three:
| Section | What it is | Founder asks… |
|---|---|---|
| **Product** | Custom code, design, content built for THIS vision | *"What did I build?"* |
| **Infrastructure** | Reusable, swappable third-party services (auth, db, email, payments…) | *"What do I depend on?"* |
| **Hosting** | Where the product runs and how people reach it (Coolify, domain, observability, cost) | *"Where does it live?"* |
### The boundary rule
> **Custom code = Product. Third-party service = Infrastructure.**
> Runtime + reachability = Hosting.
Concrete edge cases:
- A custom `/api/upload` endpoint that calls S3 → endpoint is
**Product**, S3 bucket + credentials are **Infrastructure**.
- Custom job that sends a welcome email → job is **Product**, the
job runner (Sidekiq/BullMQ) and email service (Resend) are
**Infrastructure**.
- Webhook handler that processes Stripe events → handler is
**Product**, Stripe is **Infrastructure**.
- Coolify scheduled task that runs your code → your code is
**Product**, Coolify itself is **Hosting**.
---
## 3. Charters
### Product
Everything custom-built for this specific vision. The unique IP that
wouldn't exist without this product.
**Includes:**
- Frontend web app
- Marketing site
- Custom backend code & APIs
- Custom business logic
- Custom jobs / runners (the code, not the runner)
- Brand, copy, design system
- The repository itself
- Customer base — the actual users you've earned
**Rule:** if you wrote it for this product, it's Product. If it's
`node_modules` or a third-party SDK, it's not.
### Infrastructure
The reusable, swappable services your product depends on. The
annoying multi-vendor world where you have to pick a provider.
**Includes:**
- Auth provider (Clerk, Pocketbase, Authentik, Google OAuth, …)
- Database (Postgres, MySQL, MongoDB, Redis, …)
- File storage (S3, R2, MinIO)
- Email (Resend, SendGrid, SES)
- Payments (Stripe, Paddle, Lemon Squeezy)
- Analytics (Plausible, PostHog, GA)
- Search (Algolia, Meili, Typesense)
- LLM provider (OpenAI, Anthropic, Gemini, Vertex)
- Queues, maps, SMS, push notifications, …
- Secrets and API keys that wire all of the above
**Rule:** if you could swap the vendor without changing your product
code, it's Infrastructure.
### Hosting
Where the product physically runs and how people reach it.
**Includes:**
- Container runtime (Coolify in our case)
- Domain + DNS + SSL
- CDN / edge
- Observability (logs, errors, uptime)
- Backups
- Monthly cost
**Rule:** it's about *runtime and reachability,* not about what the
software does.
---
## 4. Future sections (deferred)
Add as separate top-level cards once they become real concerns:
- **Models** — for AI-heavy products: which LLMs, which embedding
model, prompt versions, eval scores, cost-per-call.
- **Analytics** — when there are real users worth measuring.
- **Marketing** — campaigns, blog, SEO, social, when there's a
growth motion.
- **Compliance** — Terms, Privacy, GDPR, SOC2, when shipping to
paying customers.
- **Support** — helpdesk, chat, status page, when there are
customers complaining.
- **Team** — when the project has more than one collaborator.
Same charter template each time. Same rule: code = Product,
swappable = Infrastructure, runs/reachable = Hosting, otherwise it
needs its own section.
---
## 5. Mapping today → tomorrow
| Today's page | Where it goes | Notes |
|---|---|---|
| `(home)/page.tsx` | New `(home)/page.tsx` (3-card grid) | Full redesign |
| `code` (404) | `product/` (new) | Stub the route, point home tile at it |
| `build` | Subroute under `product/files` (later) | Heavy 1626 lines; preserve the file tree component |
| `run` | `hosting/` | Production runtime |
| `infrastructure` | `hosting/` | Same data, different name |
| `deployment` | `hosting/deploys` (later) | Deploy history is Hosting |
| `overview` | Subroute under `product/` or merged into home | Decide once we see how home feels |
| `prd` | Subroute under `product/` (vision) | Or its own "Define" section if we add one |
| `tasks` | Subroute under `product/` (roadmap) | Or its own section later |
| `assist` | `product/` (it's emails/chat your product sends) | These ARE product features |
| `design` | `product/design` | Custom for this vision |
| `growth`, `grow`, `analytics`, `insights`, `mvp-setup` | Defer, probably absorbed into a future "Analytics" or "Marketing" section | Many are placeholders today |
| `settings` | Top-right gear (lives outside the 3 sections) | Project-level meta |
**Net:** 16 routes → 3 sections (+ settings). 8+ pages get rationalized
into nothing because they were duplicating their neighbors.
---
## 6. Phased delivery
### Phase 1 — Tab navigation + section stubs (this session)
The three sections are TABS at the project level, not a card-grid
landing page. A founder lands on the project URL and is immediately
inside Product (the default tab); flipping to Infrastructure or
Hosting is one click and stays in the same view. No
intermediate "click a tile to drill in" step.
URL shape:
```
/[workspace]/project/[id] → 308 redirect to /product
/[workspace]/project/[id]/product → Product tab
/[workspace]/project/[id]/infrastructure → Infrastructure tab
/[workspace]/project/[id]/hosting → Hosting tab
```
A shared layout at the project root renders:
- Project header (name, vision, stage pill, settings gear)
- Tab bar (Product · Infrastructure · Hosting) — active tab
highlighted; each tab carries a tiny status dot (green/amber/grey)
- Slot for the active tab's page
The current `(home)/page.tsx` (the two-tile landing) is replaced by
the redirect.
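
As a rough sketch of the URL shape above, the routing decision can be written as a pure function. The names (`resolveProjectRoute`, `Resolution`) are hypothetical, and the real implementation would presumably live in the Next.js route files rather than in a standalone helper.

```typescript
type Tab = "product" | "infrastructure" | "hosting";
const TABS: readonly Tab[] = ["product", "infrastructure", "hosting"];

type Resolution =
  | { kind: "redirect"; status: 308; location: string }
  | { kind: "tab"; tab: Tab }
  | { kind: "other" }; // legacy routes and anything else stay untouched

// Hypothetical helper: decides what the project root layout should do
// for a path shaped like /[workspace]/project/[id]/[section?].
function resolveProjectRoute(pathname: string): Resolution {
  const parts = pathname.split("/").filter(Boolean);
  if (parts.length < 3 || parts[1] !== "project") return { kind: "other" };

  const workspace = parts[0];
  const id = parts[2];

  // Bare project URL: permanent redirect to the default tab.
  if (parts.length === 3) {
    return {
      kind: "redirect",
      status: 308,
      location: `/${workspace}/project/${id}/product`,
    };
  }

  const tab = TABS.find((t) => t === parts[3]);
  return tab !== undefined && parts.length === 4
    ? { kind: "tab", tab }
    : { kind: "other" };
}
```

Returning `other` for unknown sections keeps the sketch consistent with leaving the existing routes alive during migration.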
**Don't kill anything in `(workspace)/`.** Existing 16 routes stay
alive while we migrate. Sidebar still works for them.
### Phase 2 — Wire data sources
- **Product card** reads from the dev container's `/workspace`:
- File count + recent edits via `fs.list` against the project's
dev container
- User count from the project's auth provider (Pocketbase /
Clerk / etc.)
- Frontend URL from `dev_server.list` or production `apps_list`
- **Infrastructure card** reads from Coolify databases, env vars,
and known integrations:
- Database type + size
- Auth provider name
- Wired services (any env var matching `STRIPE_*`, `RESEND_*`,
etc.)
- **Hosting card** reads from Coolify apps + domains + container metrics:
- Production URL, SSL status, last deploy
- Monthly cost (Coolify resource usage × pricing)
- Recent error count (from logs)
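
The "wired services" detection by env var prefix could look roughly like this. It is a sketch under assumptions: `detectWiredServices` and the prefix table are hypothetical, and only the `STRIPE_` and `RESEND_` prefixes come from the list above.

```typescript
// Hypothetical prefix table. Only STRIPE_ and RESEND_ are named in the
// plan; a real table would be extended per known integration.
const SERVICE_PREFIXES: Record<string, string> = {
  STRIPE_: "Stripe",
  RESEND_: "Resend",
};

// Returns the distinct services implied by a list of env var NAMES.
// Values are never read, so secrets stay out of the card's data path.
function detectWiredServices(envKeys: string[]): string[] {
  const found = new Set<string>();
  for (const key of envKeys) {
    for (const prefix of Object.keys(SERVICE_PREFIXES)) {
      if (key.startsWith(prefix)) found.add(SERVICE_PREFIXES[prefix]);
    }
  }
  return Array.from(found).sort();
}

console.log(
  detectWiredServices(["STRIPE_SECRET_KEY", "DATABASE_URL", "RESEND_API_KEY"]),
);
```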
### Phase 3 — Section detail pages
Build each of `/product`, `/infrastructure`, `/hosting` as a real,
useful surface. Each page can have internal subnav for the bits
listed in its charter (e.g., Product has Frontend, Backend, Jobs,
Brand, Customers; Infrastructure has Auth, DB, Storage, Email,
Payments, …).
### Phase 4 — Migration / deletion
Once the new structure is proven, redirect the legacy routes:
- `code` → `product`
- `build` → `product/files`
- `run` → `hosting`
- `infrastructure` → `hosting`
- `deployment` → `hosting/deploys`
- `prd`, `tasks`, `assist``product/...`
- `growth`, `grow`, `analytics`, `insights`, `mvp-setup` → soft-delete
with a tombstone redirect to `product` or to a future section page.
---
## 7. Open questions
- **Where do the chat threads live?** They're a per-project
conversation surface today (right rail in the chat panel). I'd
argue they're not a section — they're *across* sections, like the
AI is. Keep as the persistent right rail.
- **Settings is technically project-level meta**, not one of the
three sections. Where does it surface? Gear icon in the page
header, opens settings as a side sheet or as a separate route.
Decide when we get there.
- **Mobile layout** — three cards stack vertically; no special
layout needed. The section detail pages need a layout pass when
we get to phase 3.
---
## 8. Success criteria
You should be able to look at `/project/<id>` after AI activity in
chat and immediately see:
1. *"What did the AI just build?"* → Product card shows an updated
file count + recent diffs.
2. *"What's it depending on?"* → Infrastructure card shows the new
Postgres, the new Stripe key, etc.
3. *"Is it live?"* → Hosting card shows the dev preview URL or the
production URL with status.
If any of those three answers requires going back to the chat or
checking another page, the redesign hasn't worked.
*(Refer to `vibn-frontend/app/[workspace]/project/[projectId]` for the UI implementation).*


@@ -1,258 +1,9 @@
# Sentry-as-Product — Proposal
# Sentry as a Product (Shipped)
> Today's Sentry wiring catches errors in **the Vibn platform**.
> The bigger opportunity is wiring Sentry into **every project Vibn
> ships**, then feeding those errors back into the user's AI chat.
> Difference between "an AI that codes" and "an AI that owns the
> product."
> **Note:** This spec was implemented in May 2026.
## TL;DR
Today, when a Vibn user's deployed app crashes for real users:
```
real user → site 500s → user closes tab, never tells founder
→ founder finds out hours/days later (or never)
→ AI in Vibn chat has zero idea anything is wrong
```
The fix is to make every Vibn project ship with Sentry pre-wired,
then expose the error feed to the AI as a tool. Total effort:
**~8 hours**, in 4 stages, each independently shippable.
| Stage | Capability | Effort | Unlocks |
|---|---|---|---|
| 1 | Auto-provision a Sentry project per Vibn project on first deploy | ~3 hr | Real-user errors captured at all |
| 2 | Bake Sentry into every scaffold template | ~2 hr | Capture works without user setup |
| 3 | Add `project_recent_errors` MCP tool for the AI | ~2 hr | AI can answer "is anything broken?" |
| 4 | Auto-surface unresolved errors at chat-turn start | ~1 hr | AI proactively offers fixes |
Total: **~8 hr**, no new infra (we already have Sentry org access,
Coolify env API, scaffold templates, MCP tool registry).
---
## Why this is the right next investment
### The current loop is broken at the seam between user and platform
Vibn's value proposition is "the AI is your technical co-founder."
That promise breaks the moment the AI's last commit causes a real
user error and the AI doesn't know about it. The current loop:
```
1. User describes feature in chat
2. AI ships code
3. AI says "deployed, give it a try"
4. (silence)
5. Real users hit edge cases → 500s → bounce
6. Founder eventually notices via support ticket / analytics dip
7. Founder pastes error back to AI
8. AI fixes
```
Steps 4–6 are dead air for the founder, **and the AI cannot help
during them.** This is the gap that separates Vibn from "any IDE
with an LLM."
### What it looks like with this proposal shipped
```
1. User describes feature in chat
2. AI ships code
3. AI says "deployed, give it a try"
4. Real users hit edge cases → 500s → Sentry captures
5. (Founder opens Vibn chat 3 hrs later for unrelated reason)
6. AI: "Hey — checkout has 500'd for 3 users in the last hour
because `customer.email` is undefined on
app/checkout/route.ts:47. Want me to fix it?"
7. AI fixes, deploys, marks issue resolved in Sentry
```
The AI becomes the on-call engineer. This is what "technical
co-founder" actually means and we are 8 hours away from it.
### Why now (not Phase 4)
- The Sentry wiring we just shipped for vibn-frontend gave us:
- A working Sentry org (`vibnai`)
- An auth token with project-management scope
- Verified knowledge that the build args / source maps flow works
- A working `withSentryConfig` recipe in `vibn-frontend/next.config.ts`
- All of those are reusable for stage 1 and 2 of this proposal.
- Doing this **before** the beta means user projects start emitting
error data on day one, so by the time we're debugging real beta
user pain, we have a month of history to reason about.
- Doing it after the beta means we'd have to retroactively
instrument projects that have already been deployed for weeks.
---
## Stage 1 — Auto-provision a Sentry project per Vibn project (~3 hr)
**Goal:** when a user creates a Vibn project, the platform creates a
matching Sentry project under the `vibnai` org and stashes the DSN
+ auth token in Coolify env vars on the user's app.
**What gets built:**
1. **A `provisionSentryProject(projectId, name)` helper** in
`vibn-frontend/lib/integrations/sentry.ts`. Calls Sentry's
`POST /api/0/teams/vibnai/{team}/projects/` with the project
slug, returns the DSN.
2. **Hook into project-create flow** — on first successful deploy,
call the helper and write the resulting DSN + auth token into
Coolify env vars (`NEXT_PUBLIC_SENTRY_DSN`,
`SENTRY_AUTH_TOKEN`) for that app via the same Coolify API we
used today.
3. **Idempotency** — if the Sentry project already exists, fetch
its DSN instead of creating a duplicate. Same project name
convention every time: `vibn-{workspace}-{projectSlug}`.
4. **Storage** — store `sentryProjectSlug` and `sentryAuthTokenId`
on the Postgres `projects` row so we can look them up later
without re-walking the Sentry org.
**Risk:** Sentry's API rate-limits team-project creation. We bypass
this by reading-before-writing, so the only API cost on subsequent
deploys is one GET.
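
A minimal sketch of the read-before-write flow, assuming Sentry's documented project-keys and team-project-create endpoints. `ensureSentryDsn`, `sentrySlug`, and the injected `HttpFn` type are hypothetical names for illustration, not the helper that shipped; the sketch also assumes a newly created project already has a default client key.

```typescript
// Minimal fetch-like signature so the flow can be exercised with a stub.
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<{ status: number; json(): Promise<unknown> }>;

const SENTRY_API = "https://sentry.io/api/0";

// Naming convention from the plan: vibn-{workspace}-{projectSlug}.
function sentrySlug(workspace: string, projectSlug: string): string {
  return `vibn-${workspace}-${projectSlug}`;
}

// Hypothetical helper: returns the public DSN, creating the Sentry
// project only when the initial lookup reports it missing.
async function ensureSentryDsn(
  http: HttpFn,
  token: string,
  org: string,
  team: string,
  slug: string,
): Promise<string> {
  const headers = {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  };
  const keysUrl = `${SENTRY_API}/projects/${org}/${slug}/keys/`;

  // Read first: an existing project costs exactly one GET.
  let res = await http(keysUrl, { method: "GET", headers });
  if (res.status === 404) {
    const created = await http(`${SENTRY_API}/teams/${org}/${team}/projects/`, {
      method: "POST",
      headers,
      body: JSON.stringify({ name: slug, slug }),
    });
    if (created.status >= 300) {
      throw new Error(`Sentry project create failed: ${created.status}`);
    }
    res = await http(keysUrl, { method: "GET", headers });
  }
  if (res.status !== 200) {
    throw new Error(`Sentry key lookup failed: ${res.status}`);
  }
  const keys = (await res.json()) as Array<{ dsn: { public: string } }>;
  if (keys.length === 0) throw new Error("Sentry project has no client keys");
  return keys[0].dsn.public;
}
```

Injecting the HTTP function keeps the idempotency logic testable without touching the real Sentry org.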
**Definition of done:** create a fresh Vibn project → check Sentry
org → see a project named `vibn-{ws}-{slug}` → check Coolify env on
that app → see DSN populated.
---
## Stage 2 — Bake Sentry into every scaffold template (~2 hr)
**Goal:** every Next.js / Vite / etc. starter template Vibn ships
already has Sentry wired up. User does nothing.
**What gets built:**
1. **For each scaffold template in `vibn-frontend/lib/scaffold/`**,
add the same files we shipped today:
- `instrumentation.ts`
- `instrumentation-client.ts`
- `app/global-error.tsx` (Next.js) / equivalent boundary (Vite)
- `next.config.ts` wrapped with `withSentryConfig` (Next.js)
- `vite.config.ts` with `sentryVitePlugin` (Vite)
- `Dockerfile` ARG declarations for `NEXT_PUBLIC_SENTRY_DSN` +
`SENTRY_AUTH_TOKEN`
2. **Add `@sentry/nextjs` (or `@sentry/react` + `@sentry/vite-plugin`)
to each template's `package.json` `dependencies`.**
3. **Document in template README** that Sentry is pre-wired and the
user doesn't need to do anything.
**Risk:** Sentry's wrapper sometimes interacts badly with custom
build configs (e.g. monorepos, custom webpack rules). Mitigation:
the `errorHandler` we set today (`console.warn` instead of throw)
ensures source map upload failures don't break builds.
**Definition of done:** scaffold a fresh Next.js project from Vibn
templates → deploy → throw a test error → see it in Sentry,
de-minified.
---
## Stage 3 — Expose error feed to the AI as MCP tools (~2 hr)
**Goal:** the AI can ask Sentry "what's broken in project X?" and
get a real answer.
**What gets built:**
Three new MCP tools in `vibn-frontend/lib/ai/vibn-tools.ts`:
1. **`project_recent_errors { projectId, since?, limit? }`**
- Returns: `[{ id, title, count, lastSeen, culprit, level }]`
- Default `since`: 24h. Default `limit`: 10.
- Filters to unresolved issues only.
- Implementation: read `sentryProjectSlug` off the project row,
call Sentry's `GET /api/0/projects/{org}/{slug}/issues/`.
2. **`project_error_detail { projectId, issueId }`**
- Returns: `{ stacktrace, breadcrumbs, request, user, replay_url }`
- Implementation: Sentry's `GET /api/0/issues/{id}/events/latest/`.
3. **`project_error_resolve { projectId, issueId }`**
- Side-effect: marks the issue resolved in Sentry.
- Used by the AI after it ships a fix and confirms via tests.
- Implementation: Sentry's `PUT /api/0/issues/{id}/` with
`status: "resolved"`.
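
The post-processing behind `project_recent_errors` might be sketched like this. The `RawIssue` shape lists only the fields this sketch reads and is an assumption about the Sentry issue payload; `summarizeIssues` is a hypothetical name. Filtering happens client-side here so the sketch does not depend on any particular query parameter.

```typescript
interface ErrorSummary {
  id: string;
  title: string;
  count: number;
  lastSeen: string;
  culprit: string;
  level: string;
}

// Assumed subset of a Sentry issue object; not exhaustive.
interface RawIssue {
  id: string;
  title: string;
  count: string | number;
  lastSeen: string; // ISO 8601 timestamp
  culprit: string;
  level: string;
  status: string;
}

// Keeps unresolved issues seen within the window, newest first,
// truncated to `limit`. Defaults mirror the tool spec (24h, 10).
function summarizeIssues(
  raw: RawIssue[],
  now: Date,
  sinceHours = 24,
  limit = 10,
): ErrorSummary[] {
  const cutoff = now.getTime() - sinceHours * 3600000;
  return raw
    .filter((i) => i.status === "unresolved" && Date.parse(i.lastSeen) >= cutoff)
    .sort((a, b) => Date.parse(b.lastSeen) - Date.parse(a.lastSeen))
    .slice(0, limit)
    .map((i) => ({
      id: i.id,
      title: i.title,
      count: Number(i.count),
      lastSeen: i.lastSeen,
      culprit: i.culprit,
      level: i.level,
    }));
}
```

Passing `now` in explicitly keeps the window logic deterministic under test.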
**Auth:** token storage is per-project (from Stage 1's `projects`
row). Each project's AI sees only its own project's errors. No
cross-project leakage.
**Definition of done:** in a Vibn chat for a project with known
errors, ask the AI "any errors lately?" → AI calls
`project_recent_errors` → shows real list.
---
## Stage 4 — Auto-surface unresolved errors at chat-turn start (~1 hr)
**Goal:** the AI doesn't wait to be asked. When the user opens a
chat and there are unresolved errors, the AI mentions them on the
first turn.
**What gets built:**
In `vibn-frontend/app/api/chat/route.ts`, at the start of each chat
turn (before calling the model):
1. Call the same `project_recent_errors` logic Stage 3 exposed.
2. If `count > 0`, prepend a synthetic system message:
```
[PROJECT HEALTH]
{N} unresolved Sentry issues in the last 24 hours:
- {title} (×{count}, last seen {time}) — {culprit}
- ...
If the user's first message is unrelated to these, you may still
proactively mention them: "Quick FYI before we get into that —
{X} has been failing for users."
If their message IS about a broken thing, prefer the matching
Sentry issue's stack trace over guessing.
```
3. Only fire this once per N chat turns (configurable, default 1
per session opening) — we don't want to spam every turn.
**Risk:** false alarms (Sentry issue from yesterday's deploy that
no one cares about anymore) make the AI annoying. Mitigation:
tighten the `since` window to the last 6h, and only surface issues
with `count >= 2` (one-off errors don't count).
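
The health preamble with the noise filter from the mitigation above can be sketched as a pure function. `buildHealthMessage` and `HealthIssue` are hypothetical names; the line format follows the template shown earlier, and the 6-hour window and `count >= 2` threshold are the mitigation's values passed as defaults.

```typescript
interface HealthIssue {
  title: string;
  count: number;
  lastSeen: string;
  culprit: string;
}

// Returns the synthetic system message, or null when nothing clears
// the threshold (in which case no message should be prepended).
// Assumes `issues` was already restricted to the time window.
function buildHealthMessage(
  issues: HealthIssue[],
  windowHours = 6,
  minCount = 2,
): string | null {
  // One-off errors are treated as noise.
  const relevant = issues.filter((i) => i.count >= minCount);
  if (relevant.length === 0) return null;

  const lines = relevant.map(
    (i) => `- ${i.title} (×${i.count}, last seen ${i.lastSeen}) — ${i.culprit}`,
  );
  return [
    "[PROJECT HEALTH]",
    `${relevant.length} unresolved Sentry issues in the last ${windowHours} hours:`,
    ...lines,
  ].join("\n");
}
```

Returning `null` rather than an empty string makes the "do not inject anything" case explicit for the caller in the chat route.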
**Definition of done:** intentionally break a deployed user
project, open chat, type "what's up?" → AI's first response
mentions the issue, with file path.
---
## Out of scope for this proposal
- **User-owned Sentry orgs.** Some users will eventually want their
own Sentry account, not the shared `vibnai` org. Ship-later;
doesn't block the loop. Easy retrofit because storage is already
per-project.
- **Performance / Tracing data.** Sentry also captures spans /
traces. Useful for "this endpoint is slow" but not the urgent
product loop. Ship-later.
- **Front-end UI for errors in Vibn.** A "Health" tab showing the
Sentry feed in the Vibn UI is nice but not required for the AI
loop to work. Ship-later.
---
## Recommendation
Add a **Phase 2.9 (Sentry-as-product loop)** to `BETA_LAUNCH_PLAN.md`
covering Stages 1–4 as a single bundle. Estimate: **8 hr engineering**.
This is the second-highest-leverage item still ahead of beta,
behind only the deploy-failed webhook (which is 30 min). Every
hour spent here directly upgrades the value of every other beta
test session that follows it.
## Architecture
- Sentry is automatically provisioned for every new project (`lib/integrations/sentry.ts`).
- Environment variables (`NEXT_PUBLIC_SENTRY_DSN` and `SENTRY_AUTH_TOKEN`) are injected into the Coolify app.
- The AI has access to `project_recent_errors`, `project_error_detail`, and `project_error_resolve` MCP tools to automatically read, diagnose, and fix exceptions directly from the Sentry API.
- If unhandled exceptions are firing, the AI is prompted at the start of a conversation to address them (`app/api/chat/route.ts`).