mawkone 24065f172f docs: add AI harness gaps proposal — orphan-recovery, browser tool, UI state, diff preview

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-01 10:25:50 -07:00


AI Harness Gaps — Proposal

Four gaps in the Vibn AI experience that are structural, not promptable. Each one is responsible for a specific failure pattern visible in real production chat transcripts. None of them are scoped in AI_PATH_B_EXECUTION_PLAN.md, BETA_LAUNCH_PLAN.md, AI_CAPABILITIES_ROADMAP.md, or the agent-execution / telemetry-streaming designs.

Drafted: 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).

Why these four: they share a common shape — the model is doing what the prompt told it to, and still producing a bad outcome. The fix lives in the harness around the model, not in instructions to the model.

TL;DR

| # | Gap | Failure pattern in prod | Fix size |
|---|-----|-------------------------|----------|
| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
| 2 | Browser-driver tool for the AI | "Should be live in 10s": AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |

Total: ~15 hr of work. None require new infra.

Gap 1 — Tool-error recovery middleware (highest ROI)

Failure observed: in thread d698ef40-… ("Hey there, what can you see about this project?"), the AI hit the error `Conflict. The container name "/postgres-…" is already in use` three separate times. On each attempt it responded by creating a new service with a new name instead of calling apps_unstick. The prompt explicitly tells it not to do this and spells out the recovery sequence. The model still did it.

Why prompt rules fail here: the model treats a rule inside the 30k-token system prompt as soft guidance, while the tool result is concrete and 200ms-fresh. When tool reality contradicts prompt rules, tool reality wins.

Proposed fix: middleware around the executeMcpTool call that pattern-matches known-recoverable errors and injects a synthetic system message into the conversation before the next round. An instruction injected right next to the failing tool result is much harder for the model to ignore than a static prompt rule.

```typescript
// In app/api/chat/route.ts, around the executeMcpTool call:
const errorRecovery = detectKnownError(result);
if (errorRecovery) {
  messages.push({
    role: "system",
    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
  });
}
```

Initial recovery rules (high-confidence, low-false-positive):

| Error signature | Diagnosis | Fix | Antipattern |
|-----------------|-----------|-----|-------------|
| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |

Effort: ~2 hr. New file lib/ai/error-recovery.ts with a registry of patterns + the injection in the chat route. Each rule is ~10 lines.
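A minimal sketch of what the registry in lib/ai/error-recovery.ts might look like, with the three rules from the table above. The type names, the regular expressions, and the choice to take the raw error text as a string are assumptions for illustration; the route snippet above passes the whole tool result, so the caller would need to extract the error text first.

```typescript
// Hypothetical sketch of lib/ai/error-recovery.ts. Rule contents come from the
// recovery table; type names and the string input are assumptions.
interface RecoveryRule {
  /** Pattern matched against the raw tool error text. */
  signature: RegExp;
  diagnosis: string;
  fix: string;
  antipattern: string;
}

type ErrorRecovery = Omit<RecoveryRule, "signature">;

const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/i,
    diagnosis: "Orphan container blocking new boot",
    fix: "call apps_unstick { uuid }, then apps_deploy { uuid }",
    antipattern: "delete and recreate the service with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/i,
    diagnosis: "Image not on the host yet",
    fix: "call apps_repair { uuid }",
    antipattern: "retry the deploy without addressing the cause",
  },
  {
    // The text between "port" and "is" is optional, so both
    // "port 5432 is already allocated" and "port is already allocated" match.
    signature: /port\b.* is already allocated/i,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

/** Returns the first matching rule's guidance, or null for unknown errors. */
function detectKnownError(errorText: string): ErrorRecovery | null {
  for (const { signature, ...recovery } of RECOVERY_RULES) {
    if (signature.test(errorText)) return recovery;
  }
  return null;
}
```

Returning null for anything unrecognized keeps the middleware silent on errors it has no rule for, which is what keeps the false-positive rate low: the model only sees a [RECOVERY] message when a signature matched.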

Slot into: BETA_LAUNCH_PLAN.md Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).

Gap 2 — Browser-driver tool for the AI

Failure observed: in the same Twenty thread, the AI said "It's fully deployed, healthy, and I've verified it's returning a 200 OK status" — but the user saw "Unable to Reach Back-end" on the actual page. The AI checked Coolify's status reporting, not the rendered app. Also visible in the Dr Dave thread: "Note: it might take 10-15 seconds on the very first load for the DNS to propagate" — the AI hedged because it couldn't load the URL itself.

Why this matters for beta: every "I deployed it" claim is unverified unless the AI can open the URL. Sentry (planned in P2.3) catches errors after a user hits them. A browser tool catches errors before any user hits them.

Proposed fix: add a browser.* MCP tool surface backed by a headless Chromium running on the Coolify host (or in the vibn-dev container). Initial tools:

| Tool | Purpose |
|------|---------|
| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |

Implementation: Playwright already has an official MCP server (@playwright/mcp, maintained by Microsoft). Wire it as a Coolify service and expose it via the same per-workspace MCP token Vibn already issues.

Effort: ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to add tool definitions, ~1 hr to wire prompt instructions ("after any deploy or dev_server.start, call browser.navigate to confirm").
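To make the "call browser.navigate to confirm" step concrete, here is a rough sketch of how the harness could turn the tool outputs into a verified / not-verified verdict instead of leaving that judgment to the model. The result shape and the function name are assumptions: the tool table only promises a final URL, status code, and title from browser.navigate, plus client-side errors from browser.console_logs.

```typescript
// Hypothetical result shape for browser.navigate; field names are assumptions.
interface NavigateResult {
  finalUrl: string;
  status: number | null; // null when the page never responded before the timeout
  title: string;
}

interface DeployCheck {
  verified: boolean;
  reasons: string[];
}

/** Decides whether an "it's deployed" claim is backed by what the browser saw. */
function assessDeploy(nav: NavigateResult, consoleErrors: string[]): DeployCheck {
  const reasons: string[] = [];

  if (nav.status === null) {
    reasons.push("no HTTP response before the timeout");
  } else if (nav.status >= 400) {
    reasons.push(`HTTP ${nav.status} from ${nav.finalUrl}`);
  }

  // A 200 response can still render a broken page, so client-side errors
  // count against the deploy as well.
  for (const err of consoleErrors) {
    reasons.push(`console error: ${err}`);
  }

  return { verified: reasons.length === 0, reasons };
}
```

A check like this would likely have flagged the Twenty case only if the failed back-end call surfaced as a console error; a page that renders an error banner without logging anything would still need the screenshot.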

Slot into: Phase 2 (Stability & visibility) — pairs with the runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).

Gap 3 — Live UI state attached to chat messages

Failure observed: in the Dr Dave thread, user typed "are you able to give me a preview url?" The AI didn't know which port the Next.js dev server would bind to, what was already running, or whether the user was looking at the chat or another tab. It guessed and re-discovered everything from scratch.

In the Twenty thread, the user asked "can you see the different sections?" and meant the Plan tab sections (Vision/Tasks/Decisions/Ideas). The AI listed metadata instead; it had no way to know which sections the user meant.

Why prompt rules can't fix this: the AI literally lacks the information.

Proposed fix: the chat panel sends a small uiContext object alongside every user message. Inject into the system prompt as a dynamic block (same shape as activeBlock):

```typescript
{
  currentRoute: "/mark-account/project/abc/hosting",
  currentTab: "hosting",
  visibleResources: [
    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
  ],
  lastUserActions: [
    { at: "2m ago", action: "opened twenty-crm logs" },
    { at: "5m ago", action: "switched to Hosting tab" },
  ],
}
```

System-prompt block becomes:

The user is currently looking at the Hosting tab (route: …/hosting). Visible resources: vibn-frontend, vibn-dev-twenty-crm. Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago). When the user says "this" / "it" / "the URL" — assume they mean something visible in the current viewport unless they name something else.
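One possible way to turn the uiContext object into that prompt block is sketched below. The field names follow the example object; the interface names, the function name, and the "none" fallback for empty lists are assumptions. Unlike the sample block, this version prints the full route rather than an elided one.

```typescript
// Shapes mirror the uiContext example; the names themselves are hypothetical.
interface UiResource {
  kind: string;
  uuid: string;
  name: string;
}

interface UiAction {
  at: string; // relative time as sent by the panel, e.g. "2m ago"
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: UiResource[];
  lastUserActions: UiAction[];
}

/** Renders the dynamic system-prompt block for one user message. */
function renderUiContextBlock(ctx: UiContext): string {
  const tab = ctx.currentTab.charAt(0).toUpperCase() + ctx.currentTab.slice(1);
  const resources = ctx.visibleResources.map((r) => r.name).join(", ") || "none";
  const actions =
    ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ") || "none";

  return [
    `The user is currently looking at the ${tab} tab (route: ${ctx.currentRoute}).`,
    `Visible resources: ${resources}.`,
    `Recent actions: ${actions}.`,
    `When the user says "this" / "it" / "the URL", assume they mean something ` +
      `visible in the current viewport unless they name something else.`,
  ].join(" ");
}
```

Only resource names are rendered here; the uuids stay in the object so tool calls can resolve "this" to a concrete resource without a lookup round trip.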

Effort: ~3 hr. ~1 hr to wire the chat panel's uiContext collection (existing route + tab state, last 5 actions from a small ring buffer in the panel), ~1 hr to plumb through the chat API, ~1 hr to add the prompt block.
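The "small ring buffer in the panel" could be as simple as the sketch below: a fixed-capacity list that drops the oldest entry once it is full. The class and method names are hypothetical, and the newest-first ordering is an assumption taken from the lastUserActions example.

```typescript
// Minimal fixed-capacity buffer for recent UI actions; names are hypothetical.
class ActionRingBuffer<T> {
  private readonly capacity: number;
  private items: T[] = [];

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  /** Records an action, evicting the oldest one when the buffer is full. */
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  /** Returns a copy, newest first, ready to send as lastUserActions. */
  snapshot(): T[] {
    return [...this.items].reverse();
  }
}
```

With a capacity of 5, the panel would call push on each tab switch or resource open and attach snapshot() to the outgoing chat message.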

Slot into: Phase 3 (UX surfaces) — pairs with 3.2 (structured errors in chat) and 3.3 (empty-state nudges).

Gap 4 — Diff preview / accept-changes gate

Failure observed: none yet, but the surface is exposed today — fs_edit writes directly to /workspace in the dev container. For ephemeral exploration this is correct (sub-second iteration is the whole Path B point). For changes destined to ship, the user has no review surface; they only see what changed after the AI summarizes.

Why this matters for beta: the moment a paying user wants to "see what the AI changed before it goes live," there's nothing to show them. Cursor's whole UX is built on diffs the user accepts.

Proposed fix: two-mode fs_edit / fs_write:

Direct mode (default for dev container): write immediately. Current behavior. Fine for "make the button blue" iteration.
Staged mode (default when ship is the next likely action): write to a shadow path, surface a diff in the chat UI, gate the real write on a one-click "Accept" button.

The model decides which mode based on context — or simpler: stage when the file is in a "protected" set (e.g. prisma/schema.prisma, Dockerfile, package.json, anything in prod/ or migrations/), direct otherwise.
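The simpler "protected set" rule can be sketched as a pure function over the path. The protected entries are the ones named above; the matching rules (exact path for prisma/schema.prisma, basename for Dockerfile and package.json, any directory segment for prod/ and migrations/) and all identifiers are assumptions.

```typescript
type WriteMode = "direct" | "staged";

// Protected set taken from the examples in the text; how each entry is
// matched is an assumption of this sketch.
const PROTECTED_PATHS = new Set(["prisma/schema.prisma"]);
const PROTECTED_FILES = new Set(["Dockerfile", "package.json"]);
const PROTECTED_DIRS = new Set(["prod", "migrations"]);

/** Picks the write mode for a path given relative to the workspace root. */
function chooseWriteMode(workspaceRelativePath: string): WriteMode {
  // Normalize separators and strip a leading "./" or "/".
  const normalized = workspaceRelativePath
    .replace(/\\/g, "/")
    .replace(/^\.?\/+/, "");
  const segments = normalized.split("/").filter((s) => s.length > 0);
  const basename = segments[segments.length - 1] ?? "";
  const dirs = segments.slice(0, -1);

  if (PROTECTED_PATHS.has(normalized)) return "staged";
  if (PROTECTED_FILES.has(basename)) return "staged";
  if (dirs.some((d) => PROTECTED_DIRS.has(d))) return "staged";
  return "direct";
}
```

Keeping the decision in the harness rather than in the model matches the theme of the other gaps: the gate applies even when the model misjudges whether an edit is prod-bound.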

Effort: ~6 hr. ~2 hr backend (shadow write + apply endpoint), ~3 hr UI (diff renderer in the chat panel, accept/reject buttons), ~1 hr prompt + tool changes.
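The backend half (shadow write + apply endpoint) might reduce to something like the in-memory sketch below. Everything here is hypothetical: a Map stands in for the workspace filesystem, and a real version would persist shadow files and compute a line diff for the UI. The one design point it illustrates is refusing to apply a staged edit when the file changed after staging, for example through a direct-mode write.

```typescript
interface StagedEdit {
  id: number;
  path: string;
  before: string | null; // null when the file does not exist yet
  after: string;
}

type AcceptResult = "applied" | "conflict" | "not_found";

// In-memory stand-in for the shadow-write store; all names are hypothetical.
class StagedEditStore {
  private readonly files: Map<string, string>;
  private readonly pending = new Map<number, StagedEdit>();
  private nextId = 1;

  constructor(files: Map<string, string>) {
    this.files = files;
  }

  /** Records the proposed content without touching the real file. */
  stage(path: string, after: string): StagedEdit {
    const edit: StagedEdit = {
      id: this.nextId++,
      path,
      before: this.files.get(path) ?? null,
      after,
    };
    this.pending.set(edit.id, edit);
    return edit;
  }

  /** Applies the edit only if the file still matches what the diff was based on. */
  accept(id: number): AcceptResult {
    const edit = this.pending.get(id);
    if (!edit) return "not_found";
    const current = this.files.get(edit.path) ?? null;
    if (current !== edit.before) return "conflict";
    this.files.set(edit.path, edit.after);
    this.pending.delete(id);
    return "applied";
  }

  /** Discards the edit; returns false if the id was unknown. */
  reject(id: number): boolean {
    return this.pending.delete(id);
  }
}
```

The before/after pair on each StagedEdit is what the chat panel's diff renderer would consume, and the "Accept" button would map to accept(id).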

Slot into: Phase 4 (Onboarding & safety) — pairs with 4.5 (auth hardening) and 4.6 (compute quotas) as part of "what a stranger needs day 1."

Suggested sequencing

If we ship in priority order:

Gap 1 first — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
Gap 2 second — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
Gap 3 third — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
Gap 4 last — only matters once we have paying users editing prod-bound code. Pre-beta optional.

Total effort to ship 1+2+3 (the meaningful UX wins): ~9 hours.

How this changes BETA_LAUNCH_PLAN.md

Two new tasks slot into P2:

P2.8 Tool-error recovery middleware (Gap 1): blocked on nothing, ship before P2.4.
P2.9 Browser-driver MCP tool (Gap 2): blocked on nothing.

One new task in P3:

P3.7 UI-state injection into chat (Gap 3): blocked on nothing.

Gap 4 stays out of beta scope unless eval reveals real damage from unstaged edits.
