From 24065f172fc8eca2490433108995c8da8369e5f5 Mon Sep 17 00:00:00 2001
From: mawkone
Date: Fri, 1 May 2026 10:25:50 -0700
Subject: [PATCH] =?UTF-8?q?docs:=20add=20AI=20harness=20gaps=20proposal=20?=
 =?UTF-8?q?=E2=80=94=20orphan-recovery,=20browser=20tool,=20UI=20state,=20?=
 =?UTF-8?q?diff=20preview?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Cursor
---
 AI_HARNESS_GAPS.md | 227 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 AI_HARNESS_GAPS.md

diff --git a/AI_HARNESS_GAPS.md b/AI_HARNESS_GAPS.md
new file mode 100644
index 0000000..46bbad7
--- /dev/null
+++ b/AI_HARNESS_GAPS.md
@@ -0,0 +1,227 @@
+# AI Harness Gaps — Proposal
+
+> Four gaps in the Vibn AI experience that are **structural, not promptable**.
+> Each one is responsible for a specific failure pattern visible in real
+> production chat transcripts. None of them are scoped in
+> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
+> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
+> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
+> agent-execution / telemetry-streaming designs.
+>
+> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
+>
+> **Why these four:** they share a common shape — the model is doing what
+> the prompt told it to, and still producing a bad outcome. The fix lives
+> in the *harness around the model*, not in instructions to the model.
+
+---
+
+## TL;DR
+
+| # | Gap | Failure pattern in prod | Fix size |
+|---|---|---|---|
+| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
+| 2 | Browser-driver tool for the AI | "Should be live in 10s" — AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
+| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
+| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |
+
+Total: ~15 hr of work. None require new infra.
+
+---
+
+## Gap 1 — Tool-error recovery middleware (highest ROI)
+
+**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
+`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
+On each attempt it responded by *creating a new service with a new name*,
+not by calling `apps_unstick`. The prompt explicitly tells it not to do
+this and tells it the recovery sequence. The model still did it.
+
+**Why prompt rules fail here:** the model treats the system prompt as
+soft guidance against a 30k-token document; the tool result is concrete
+and 200ms-fresh. When tool reality contradicts prompt rules, tool
+reality wins.
+
+**Proposed fix:** middleware in `executeMcpTool` that pattern-matches
+known-recoverable errors and **injects a synthetic system message** into
+the conversation before the next round. The model can't ignore an
+injected instruction the way it can ignore a static prompt rule.
+
+```ts
+// In app/api/chat/route.ts, around the executeMcpTool call:
+const errorRecovery = detectKnownError(result);
+if (errorRecovery) {
+  messages.push({
+    role: "system",
+    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
+  });
+}
+```
+
+**Initial recovery rules** (high-confidence, low-false-positive):
+
+| Error signature | Diagnosis | Fix | Antipattern |
+|---|---|---|---|
+| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
+| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
+| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |
+
+**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
+patterns + the injection in the chat route. Each rule is ~10 lines.
+
+**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).
+
+---
+
+## Gap 2 — Browser-driver tool for the AI
+
+**Failure observed:** in the same Twenty thread, the AI said *"It's
+fully deployed, healthy, and I've verified it's returning a 200 OK
+status"* — but the user saw "Unable to Reach Back-end" on the actual
+page. The AI checked Coolify's status reporting, not the rendered app.
+Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
+on the very first load for the DNS to propagate"* — the AI hedged
+because it couldn't load the URL itself.
+
+**Why this matters for beta:** every "I deployed it" claim is unverified
+unless the AI can open the URL. Sentry (planned in P2.3) catches
+errors *after a user hits them*. A browser tool catches errors
+*before any user hits them*.
+
+**Proposed fix:** add a `browser.*` MCP tool surface backed by a
+headless Chromium running on the Coolify host (or in the vibn-dev
+container). Initial tools:
+
+| Tool | Purpose |
+|---|---|
+| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
+| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
+| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
+| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |
+
+**Implementation:** Playwright already has an MCP server (`@modelcontextprotocol/server-playwright`).
+Wire it as a Coolify service, expose via the same per-workspace MCP
+token Vibn already issues.
+
+**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
+add tool definitions, ~1 hr to wire prompt instructions ("after any
+deploy or `dev_server.start`, call `browser.navigate` to confirm").
+
+**Slot into:** Phase 2 (Stability & visibility) — pairs with the
+runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).
+
+---
+
+## Gap 3 — Live UI state attached to chat messages
+
+**Failure observed:** in the Dr Dave thread, user typed *"are you able
+to give me a preview url?"* The AI didn't know which port the
+Next.js dev server would bind to, what was already running, or
+whether the user was looking at the chat or another tab. It
+guessed and re-discovered everything from scratch.
+
+In the Twenty thread, *"can you see the different sections?"* — user
+meant Plan tab sections (Vision/Tasks/Decisions/Ideas). AI listed
+metadata. No way to know.
+
+**Why prompt rules can't fix this:** the AI literally lacks the
+information.
+
+**Proposed fix:** the chat panel sends a small `uiContext` object
+alongside every user message. Inject into the system prompt as a
+dynamic block (same shape as `activeBlock`):
+
+```ts
+{
+  currentRoute: "/mark-account/project/abc/hosting",
+  currentTab: "hosting",
+  visibleResources: [
+    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
+    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
+  ],
+  lastUserActions: [
+    { at: "2m ago", action: "opened twenty-crm logs" },
+    { at: "5m ago", action: "switched to Hosting tab" },
+  ],
+}
+```
+
+System-prompt block becomes:
+
+> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
+> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
+> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
+> When the user says "this" / "it" / "the URL" — assume they mean
+> something visible in the current viewport unless they name something else.
+
+**Effort:** ~3 hr. ~1 hr to wire the chat panel's
+`uiContext` collection (existing route + tab state, last 5 actions
+from a small ring buffer in the panel), ~1 hr to plumb through the
+chat API, ~1 hr to add the prompt block.
+
+**Slot into:** Phase 3 (UX surfaces) — pairs with 3.2 (structured
+errors in chat) and 3.3 (empty-state nudges).
+
+---
+
+## Gap 4 — Diff preview / accept-changes gate
+
+**Failure observed:** none yet, but the surface is exposed today —
+`fs_edit` writes directly to `/workspace` in the dev container. For
+ephemeral exploration this is correct (sub-second iteration is the
+whole Path B point). For changes destined to ship, the user has no
+review surface; they only see what changed after the AI summarizes.
+
+**Why this matters for beta:** the moment a paying user wants to
+"see what the AI changed before it goes live," there's nothing to
+show them. Cursor's whole UX is built on diffs the user accepts.
+
+**Proposed fix:** two-mode `fs_edit` / `fs_write`:
+
+1. **Direct mode (default for dev container):** write immediately. Current
+   behavior. Fine for "make the button blue" iteration.
+2. **Staged mode (default when `ship` is the next likely action):**
+   write to a shadow path, surface a diff in the chat UI, gate the
+   real write on a one-click "Accept" button.
+
+The model decides which mode based on context — or simpler: stage when
+the file is in a "protected" set (e.g. `prisma/schema.prisma`,
+`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
+direct otherwise.
+
+**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
+~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
+~1 hr prompt + tool changes.
+
+**Slot into:** Phase 4 (Onboarding & safety) — pairs with 4.5 (auth
+hardening) and 4.6 (compute quotas) as part of "what a stranger
+needs day 1."
+
+---
+
+## Suggested sequencing
+
+If we ship in priority order:
+
+1. **Gap 1 first** — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
+2. **Gap 2 second** — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
+3. **Gap 3 third** — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
+4. **Gap 4 last** — only matters once we have paying users editing prod-bound code. Pre-beta optional.
+
+Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**
+
+---
+
+## How this changes BETA_LAUNCH_PLAN.md
+
+Two new tasks slot in:
+
+- **P2.8** Tool-error recovery middleware (Gap 1) — block on nothing, ship before P2.4.
+- **P2.9** Browser-driver MCP tool (Gap 2) — block on nothing.
+
+One new task in P3:
+
+- **P3.7** UI-state injection into chat (Gap 3) — block on nothing.
+
+Gap 4 stays out of beta scope unless eval reveals real damage from
+unstaged edits.
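
Sketch for Gap 1: the registry that `detectKnownError` reads from might look roughly like the following. This is an illustration built from the three rows of the "Initial recovery rules" table, not the contents of `lib/ai/error-recovery.ts`. The `RecoveryRule` and `Recovery` types, the `toRecoveryMessage` helper, the regular expressions, and the assumption that the tool result has already been reduced to a plain error string are all hypothetical.

```ts
// Hypothetical shape for one recovery rule; field names mirror the
// diagnosis / fix / antipattern placeholders used in the chat-route snippet.
interface RecoveryRule {
  signature: RegExp;
  diagnosis: string;
  fix: string;
  antipattern: string;
}

type Recovery = Omit<RecoveryRule, "signature">;

// Assumed regexes for the three error signatures in the rules table.
// None use the `g` flag, so RegExp.test() stays stateless across calls.
const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/,
    diagnosis: "Orphan container blocking new boot",
    fix: "apps_unstick { uuid } then apps_deploy { uuid }",
    antipattern: "delete and recreate with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/,
    diagnosis: "Image not on the host yet",
    fix: "apps_repair { uuid }",
    antipattern: "retry deploy without addressing the cause",
  },
  {
    signature: /port .*is already allocated/,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

// Returns the first matching rule (minus its regex), or null when the
// error text does not match any known-recoverable signature.
function detectKnownError(errorText: string): Recovery | null {
  for (const { signature, ...recovery } of RECOVERY_RULES) {
    if (signature.test(errorText)) return recovery;
  }
  return null;
}

// Builds the synthetic message in the same format as the chat-route snippet.
function toRecoveryMessage(r: Recovery): { role: "system"; content: string } {
  return {
    role: "system",
    content: `[RECOVERY] ${r.diagnosis}. Required next action: ${r.fix}. Do NOT ${r.antipattern}.`,
  };
}
```

First-match-wins keeps each rule independent and cheap to add; if two signatures could ever overlap, the array order would decide which recovery is injected, so the more specific rule would need to come first.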
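
Sketch for Gap 3: the `uiContext` plumbing could be approximated as below, assuming the object shape shown in the proposal's example. The type names, the `ActionRingBuffer` class, and `renderUiContextBlock` are hypothetical, and the `kind` union only covers the two values that appear in the example. The buffer is a bounded, ring-buffer-style list rather than a fixed-slot ring, which seems sufficient for five entries.

```ts
// Hypothetical types matching the uiContext example object.
interface VisibleResource {
  kind: "app" | "service"; // only the kinds seen in the example; likely incomplete
  uuid: string;
  name: string;
}

interface UserAction {
  at: string; // relative label such as "2m ago", as in the example
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: VisibleResource[];
  lastUserActions: UserAction[];
}

// Keeps only the most recent `capacity` actions, newest first.
class ActionRingBuffer {
  private items: UserAction[] = [];
  private readonly capacity: number;

  constructor(capacity = 5) {
    this.capacity = capacity;
  }

  push(action: UserAction): void {
    this.items.unshift(action);
    if (this.items.length > this.capacity) this.items.pop(); // drop the oldest
  }

  snapshot(): UserAction[] {
    return [...this.items]; // copy, so callers cannot mutate the buffer
  }
}

// Renders the dynamic system-prompt block from a uiContext object.
function renderUiContextBlock(ctx: UiContext): string {
  const names = ctx.visibleResources.map((r) => `\`${r.name}\``).join(", ");
  const actions = ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ");
  return [
    `The user is currently looking at the ${ctx.currentTab} tab (route: \`${ctx.currentRoute}\`).`,
    `Visible resources: ${names || "none"}.`,
    `Recent actions: ${actions || "none"}.`,
    `When the user says "this" / "it" / "the URL", assume they mean something visible in the current viewport unless they name something else.`,
  ].join("\n");
}
```

One thing this sketch does not address: a relative label like "2m ago" goes stale while it sits in the buffer, so a real implementation would probably store a timestamp and format it at send time.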
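
Sketch for Gap 4: the simpler "protected set" rule for choosing between direct and staged writes could be expressed as a small pure function. The function name `chooseWriteMode`, the `WriteMode` type, and the matching rules (exact match for the named files, any-path-segment match for `prod/` and `migrations/`, paths relative to the workspace root) are assumptions made for illustration; the proposal only lists the examples.

```ts
type WriteMode = "direct" | "staged";

// Exact file paths taken from the proposal's "protected" examples.
const PROTECTED_FILES = new Set(["prisma/schema.prisma", "Dockerfile", "package.json"]);

// Directory names from the same examples; matched against any directory
// segment of the path (an assumption about what "anything in" means).
const PROTECTED_DIRS = new Set(["prod", "migrations"]);

// Decides the write mode for a path relative to the workspace root.
function chooseWriteMode(relativePath: string): WriteMode {
  // Normalise Windows separators and a leading "./".
  const normalized = relativePath.replace(/\\/g, "/").replace(/^\.\//, "");
  if (PROTECTED_FILES.has(normalized)) return "staged";

  // Every segment except the last one is a directory name.
  const dirs = normalized.split("/").slice(0, -1);
  return dirs.some((d) => PROTECTED_DIRS.has(d)) ? "staged" : "direct";
}
```

With these rules, `chooseWriteMode("src/components/Button.tsx")` yields `"direct"` and `chooseWriteMode("prisma/migrations/001_init/migration.sql")` yields `"staged"`. A nested `apps/web/package.json` would be written directly under exact-path matching; matching on the basename instead would be the stricter choice if monorepos are in scope.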