docs: add AI harness gaps proposal — orphan-recovery, browser tool, UI state, diff preview

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-01 10:25:50 -07:00
parent 0f0a19e50e
commit 24065f172f
1 changed files with 227 additions and 0 deletions
--- a/AI_HARNESS_GAPS.md
+++ b/AI_HARNESS_GAPS.md
@@ -0,0 +1,227 @@
+# AI Harness Gaps — Proposal
+
+> Four gaps in the Vibn AI experience that are **structural, not promptable**.
+> Each one is responsible for a specific failure pattern visible in real
+> production chat transcripts. None of them are scoped in
+> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
+> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
+> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
+> agent-execution / telemetry-streaming designs.
+>
+> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
+>
+> **Why these four:** they share a common shape — the model is doing what
+> the prompt told it to, and still producing a bad outcome. The fix lives
+> in the *harness around the model*, not in instructions to the model.
+
+---
+
+## TL;DR
+
+| # | Gap | Failure pattern in prod | Fix size |
+|---|---|---|---|
+| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
+| 2 | Browser-driver tool for the AI | "Should be live in 10s" — AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
+| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
+| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |
+
+Total: ~15 hr of work. None require new infra.
+
+---
+
+## Gap 1 — Tool-error recovery middleware (highest ROI)
+
+**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
+`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
+On each attempt it responded by *creating a new service with a new name*,
+not by calling `apps_unstick`. The prompt explicitly tells it not to do
+this and tells it the recovery sequence. The model still did it.
+
+**Why prompt rules fail here:** the model treats the prompt rule as
+soft guidance, one line in a 30k-token document, while the tool result
+is concrete and 200ms-fresh. When tool reality appears to contradict a
+prompt rule, tool reality wins.
+
+**Proposed fix:** middleware in `executeMcpTool` that pattern-matches
+known-recoverable errors and **injects a synthetic system message** into
+the conversation before the next round. An injected instruction sits
+right next to the failing tool result, so the model is far less likely
+to ignore it than a static prompt rule.
+
+```ts
+// In app/api/chat/route.ts, around the executeMcpTool call:
+const errorRecovery = detectKnownError(result);
+if (errorRecovery) {
+  messages.push({
+    role: "system",
+    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
+  });
+}
+```
+
+**Initial recovery rules** (high-confidence, low-false-positive):
+
+| Error signature | Diagnosis | Fix | Antipattern |
+|---|---|---|---|
+| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
+| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
+| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |
+
+**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
+patterns + the injection in the chat route. Each rule is ~10 lines.
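+
+A minimal sketch of what that registry could look like, assuming the tool
+result has already been reduced to its error text before matching. The
+`RecoveryRule` and `Recovery` types and the `RULES` array are hypothetical
+names; only `detectKnownError` and the three rules come from this proposal.
+
+```typescript
+// Hypothetical shape for lib/ai/error-recovery.ts (illustrative only).
+interface RecoveryRule {
+  pattern: RegExp;
+  diagnosis: string;
+  fix: string;
+  antipattern: string;
+}
+
+type Recovery = Omit<RecoveryRule, "pattern">;
+
+// One entry per row of the recovery-rules table above.
+const RULES: RecoveryRule[] = [
+  {
+    pattern: /Conflict\. The container name .+ is already in use/,
+    diagnosis: "Orphan container blocking new boot",
+    fix: "apps_unstick { uuid } then apps_deploy { uuid }",
+    antipattern: "delete and recreate with a new name",
+  },
+  {
+    pattern: /pull access denied|manifest unknown/,
+    diagnosis: "Image not on the host yet",
+    fix: "apps_repair { uuid }",
+    antipattern: "retry deploy without addressing the cause",
+  },
+  {
+    // `.*` so the rule matches with or without a port number in the message.
+    pattern: /port .*is already allocated/,
+    diagnosis: "Another container holds the port",
+    fix: "list containers, identify the holder, then decide",
+    antipattern: "pick a random different port",
+  },
+];
+
+// Returns the first matching recovery, or null when the text matches no rule.
+// Assumption: the caller passes the tool result's error text as a string.
+function detectKnownError(resultText: string): Recovery | null {
+  for (const { pattern, ...recovery } of RULES) {
+    if (pattern.test(resultText)) return recovery;
+  }
+  return null;
+}
+```
+
+Rules are checked in order and the first match wins, so more specific
+signatures would need to sit above broader ones if the list grows.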
+
+**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).
+
+---
+
+## Gap 2 — Browser-driver tool for the AI
+
+**Failure observed:** in the same Twenty thread, the AI said *"It's
+fully deployed, healthy, and I've verified it's returning a 200 OK
+status"* — but the user saw "Unable to Reach Back-end" on the actual
+page. The AI checked Coolify's status reporting, not the rendered app.
+Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
+on the very first load for the DNS to propagate"* — the AI hedged
+because it couldn't load the URL itself.
+
+**Why this matters for beta:** every "I deployed it" claim is unverified
+unless the AI can open the URL. Sentry (planned in P2.3) catches
+errors *after a user hits them*. A browser tool catches errors
+*before any user hits them*.
+
+**Proposed fix:** add a `browser.*` MCP tool surface backed by a
+headless Chromium running on the Coolify host (or in the vibn-dev
+container). Initial tools:
+
+| Tool | Purpose |
+|---|---|
+| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
+| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
+| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
+| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |
+
+**Implementation:** Playwright already has an MCP server (`@playwright/mcp`).
+Wire it as a Coolify service, expose via the same per-workspace MCP
+token Vibn already issues.
+
+**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
+add tool definitions, ~1 hr to wire prompt instructions ("after any
+deploy or `dev_server.start`, call `browser.navigate` to confirm").
+
+**Slot into:** Phase 2 (Stability & visibility) — pairs with the
+runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).
+
+---
+
+## Gap 3 — Live UI state attached to chat messages
+
+**Failure observed:** in the Dr Dave thread, user typed *"are you able
+to give me a preview url?"* The AI didn't know which port the
+Next.js dev server would bind to, what was already running, or
+whether the user was looking at the chat or another tab. It
+guessed and re-discovered everything from scratch.
+
+In the Twenty thread, the user asked *"can you see the different sections?"*,
+meaning the Plan tab sections (Vision/Tasks/Decisions/Ideas). The AI listed
+metadata instead; it had no way to know what was on screen.
+
+**Why prompt rules can't fix this:** the AI literally lacks the
+information.
+
+**Proposed fix:** the chat panel sends a small `uiContext` object
+alongside every user message. Inject into the system prompt as a
+dynamic block (same shape as `activeBlock`):
+
+```ts
+{
+  currentRoute: "/mark-account/project/abc/hosting",
+  currentTab: "hosting",
+  visibleResources: [
+    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
+    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
+  ],
+  lastUserActions: [
+    { at: "2m ago", action: "opened twenty-crm logs" },
+    { at: "5m ago", action: "switched to Hosting tab" },
+  ],
+}
+```
+
+System-prompt block becomes:
+
+> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
+> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
+> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
+> When the user says "this" / "it" / "the URL" — assume they mean
+> something visible in the current viewport unless they name something else.
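+
+One way the prompt block could be rendered from `uiContext`, as a sketch
+only. The `UiContext` interface mirrors the example object above, and
+`renderUiContextBlock` is a hypothetical helper name. It assumes
+`lastUserActions` arrives newest-first and caps it at five entries.
+
+```typescript
+// Mirrors the example uiContext object; field names come from this proposal.
+interface UiContext {
+  currentRoute: string;
+  currentTab: string;
+  visibleResources: { kind: string; uuid: string; name: string }[];
+  lastUserActions: { at: string; action: string }[];
+}
+
+// Hypothetical helper: turns the UI context into a plain-text prompt block.
+function renderUiContextBlock(ctx: UiContext): string {
+  const resources =
+    ctx.visibleResources.map((r) => `\`${r.name}\``).join(", ") || "none";
+  // Assumption: actions are newest-first, so the first five are the latest.
+  const actions =
+    ctx.lastUserActions
+      .slice(0, 5)
+      .map((a) => `${a.action} (${a.at})`)
+      .join(", ") || "none";
+  return [
+    `The user is currently looking at the ${ctx.currentTab} tab (route: ${ctx.currentRoute}).`,
+    `Visible resources: ${resources}.`,
+    `Recent actions: ${actions}.`,
+    `When the user says "this" / "it" / "the URL", assume they mean something ` +
+      `visible in the current viewport unless they name something else.`,
+  ].join("\n");
+}
+```
+
+Because `uiContext` is sent by the client, its strings are best treated
+as untrusted input and length-capped before they reach the system prompt.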
+
+**Effort:** ~3 hr. ~1 hr to wire the chat panel's
+`uiContext` collection (existing route + tab state, last 5 actions
+from a small ring buffer in the panel), ~1 hr to plumb through the
+chat API, ~1 hr to add the prompt block.
+
+**Slot into:** Phase 3 (UX surfaces) — pairs with 3.2 (structured
+errors in chat) and 3.3 (empty-state nudges).
+
+---
+
+## Gap 4 — Diff preview / accept-changes gate
+
+**Failure observed:** none yet, but the surface is exposed today —
+`fs_edit` writes directly to `/workspace` in the dev container. For
+ephemeral exploration this is correct (sub-second iteration is the
+whole Path B point). For changes destined to ship, the user has no
+review surface; they only see what changed after the AI summarizes.
+
+**Why this matters for beta:** the moment a paying user wants to
+"see what the AI changed before it goes live," there's nothing to
+show them. Cursor's whole UX is built on diffs the user accepts.
+
+**Proposed fix:** two-mode `fs_edit` / `fs_write`:
+
+1. **Direct mode (default for dev container):** write immediately. Current
+   behavior. Fine for "make the button blue" iteration.
+2. **Staged mode (default when `ship` is the next likely action):**
+   write to a shadow path, surface a diff in the chat UI, gate the
+   real write on a one-click "Accept" button.
+
+The model decides which mode based on context — or simpler: stage when
+the file is in a "protected" set (e.g. `prisma/schema.prisma`,
+`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
+direct otherwise.
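+
+The simpler "protected set" rule could be sketched as below. The
+`chooseWriteMode` helper and the two constants are hypothetical names, and
+the sketch assumes paths are relative to the workspace root and that the
+three named files are matched only at that root.
+
+```typescript
+type WriteMode = "direct" | "staged";
+
+// Protected files and directories taken from the examples in this proposal.
+const PROTECTED_FILES = new Set([
+  "prisma/schema.prisma",
+  "Dockerfile",
+  "package.json",
+]);
+const PROTECTED_DIRS = ["prod", "migrations"];
+
+// Hypothetical helper: decides whether a write should be staged for review.
+// Assumption: workspacePath is relative to the workspace root.
+function chooseWriteMode(workspacePath: string): WriteMode {
+  // Drop a leading "./" or "/" so "./package.json" matches "package.json".
+  const normalized = workspacePath.replace(/^\.?\/+/, "");
+  if (PROTECTED_FILES.has(normalized)) return "staged";
+  // Look only at directory segments, not at the file name itself.
+  const dirs = normalized.split("/").slice(0, -1);
+  return dirs.some((d) => PROTECTED_DIRS.includes(d)) ? "staged" : "direct";
+}
+```
+
+A path check like this keeps the decision deterministic and auditable,
+which the "model decides from context" option does not.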
+
+**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
+~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
+~1 hr prompt + tool changes.
+
+**Slot into:** Phase 4 (Onboarding & safety) — pairs with 4.5 (auth
+hardening) and 4.6 (compute quotas) as part of "what a stranger
+needs day 1."
+
+---
+
+## Suggested sequencing
+
+If we ship in priority order:
+
+1. **Gap 1 first** — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
+2. **Gap 2 second** — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
+3. **Gap 3 third** — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
+4. **Gap 4 last** — only matters once we have paying users editing prod-bound code. Pre-beta optional.
+
+Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**
+
+---
+
+## How this changes BETA_LAUNCH_PLAN.md
+
+Two new tasks slot into P2:
+
+- **P2.8** Tool-error recovery middleware (Gap 1): blocked on nothing; ship before P2.4.
+- **P2.9** Browser-driver MCP tool (Gap 2): blocked on nothing.
+
+One new task in P3:
+
+- **P3.7** UI-state injection into chat (Gap 3): blocked on nothing.
+
+Gap 4 stays out of beta scope unless eval reveals real damage from
+unstaged edits.