AI Harness Gaps — Proposal
Four gaps in the Vibn AI experience that are structural, not promptable. Each one is responsible for a specific failure pattern visible in real production chat transcripts. None of them are scoped in AI_PATH_B_EXECUTION_PLAN.md, BETA_LAUNCH_PLAN.md, AI_CAPABILITIES_ROADMAP.md, or the agent-execution / telemetry-streaming designs.
Drafted: 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
Why these four: they share a common shape — the model is doing what the prompt told it to, and still producing a bad outcome. The fix lives in the harness around the model, not in instructions to the model.
TL;DR
| # | Gap | Failure pattern in prod | Fix size |
|---|---|---|---|
| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
| 2 | Browser-driver tool for the AI | "Should be live in 10s" — AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
| 4 | Diff preview / accept-changes gate | fs_edit writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |
Total: ~15 hr of work. None require new infra.
Gap 1 — Tool-error recovery middleware (highest ROI)
Failure observed: in thread d698ef40-… ("Hey there, what can you see about this project?"), the AI hit the same error three separate times:
Conflict. The container name "/postgres-…" is already in use
Each time it responded by creating a new service with a new name,
not by calling apps_unstick. The prompt explicitly tells it not to do
this and spells out the recovery sequence. The model did it anyway.
Why prompt rules fail here: to the model, a system-prompt rule is soft guidance buried in a 30k-token document, while the tool result is concrete and 200 ms fresh. When tool reality contradicts prompt rules, tool reality wins.
Proposed fix: middleware in executeMcpTool that pattern-matches
known-recoverable errors and injects a synthetic system message into
the conversation before the next round. An instruction injected right
next to the failing tool result is far harder for the model to ignore
than a static prompt rule.
// In app/api/chat/route.ts, around the executeMcpTool call:
const errorRecovery = detectKnownError(result);
if (errorRecovery) {
  messages.push({
    role: "system",
    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
  });
}
Initial recovery rules (high-confidence, low-false-positive):
| Error signature | Diagnosis | Fix | Antipattern |
|---|---|---|---|
| Conflict. The container name … is already in use | Orphan container blocking new boot | apps_unstick { uuid } then apps_deploy { uuid } | Delete and recreate with a new name |
| pull access denied / manifest unknown | Image not on the host yet | apps_repair { uuid } | Retry deploy without addressing the cause |
| port … is already allocated | Another container holds the port | List containers, identify holder, decide | Pick a random different port |
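The rule table above could be encoded as a small registry. The following is a minimal sketch of what lib/ai/error-recovery.ts might contain, not a settled design: the RecoveryRule shape, the regexes, and the formatRecoveryMessage helper are assumptions made for illustration. Only the diagnosis, fix, and antipattern text come from the table.

```typescript
// Hypothetical registry for lib/ai/error-recovery.ts.
// The interface shape and the regexes are assumptions; rule text is from the table.
interface RecoveryRule {
  signature: RegExp; // matched against the raw tool-result text
  diagnosis: string;
  fix: string;
  antipattern: string; // phrased to follow "Do NOT ..."
}

const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/i,
    diagnosis: "Orphan container blocking new boot",
    fix: "apps_unstick { uuid } then apps_deploy { uuid }",
    antipattern: "delete and recreate with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/i,
    diagnosis: "Image not on the host yet",
    fix: "apps_repair { uuid }",
    antipattern: "retry deploy without addressing the cause",
  },
  {
    // Loosened so it also matches messages with nothing between "port" and "is".
    signature: /port\b.*? is already allocated/i,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

// Returns the first matching rule, or null when the result is not a known error.
// Non-string results are serialized so nested error fields are searched too.
function detectKnownError(result: unknown): RecoveryRule | null {
  const text = typeof result === "string" ? result : String(JSON.stringify(result));
  return RECOVERY_RULES.find((rule) => rule.signature.test(text)) ?? null;
}

// Builds the synthetic system message in the format used by the chat-route snippet.
function formatRecoveryMessage(rule: RecoveryRule): string {
  return `[RECOVERY] ${rule.diagnosis}. Required next action: ${rule.fix}. Do NOT ${rule.antipattern}.`;
}
```

Keeping the regexes narrow is what keeps the false-positive rate low: a result that matches nothing returns null, and the chat route injects nothing.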
Effort: ~2 hr. New file lib/ai/error-recovery.ts with a registry of
patterns + the injection in the chat route. Each rule is ~10 lines.
Slot into: BETA_LAUNCH_PLAN.md Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).
Gap 2 — Browser-driver tool for the AI
Failure observed: in the same Twenty thread, the AI said "It's fully deployed, healthy, and I've verified it's returning a 200 OK status" — but the user saw "Unable to Reach Back-end" on the actual page. The AI checked Coolify's status reporting, not the rendered app. Also visible in the Dr Dave thread: "Note: it might take 10-15 seconds on the very first load for the DNS to propagate" — the AI hedged because it couldn't load the URL itself.
Why this matters for beta: every "I deployed it" claim is unverified unless the AI can open the URL. Sentry (planned in P2.3) catches errors after a user hits them. A browser tool catches errors before any user hits them.
Proposed fix: add a browser.* MCP tool surface backed by a
headless Chromium running on the Coolify host (or in the vibn-dev
container). Initial tools:
| Tool | Purpose |
|---|---|
| browser.navigate { url, timeoutMs? } | Load the URL, return final URL + status code + page title |
| browser.screenshot { url } | Visual confirmation. Return base64 PNG (or store in GCS) |
| browser.console_logs { url } | Capture client-side JS errors (the TypeError: reading 'z'/'j'/'aa' from BETA P2.2 would be findable this way) |
| browser.fetch { url, headers? } | HTTP-level smoke test. Subset of http_fetch but always from inside Vibn's network |
Implementation: Playwright already has an MCP server (Microsoft's @playwright/mcp package).
Wire it as a Coolify service, expose via the same per-workspace MCP
token Vibn already issues.
Effort: ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
add tool definitions, ~1 hr to wire prompt instructions ("after any
deploy or dev_server.start, call browser.navigate to confirm").
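To make the "confirm after deploy" step concrete, here is a rough sketch of how the harness might turn browser tool output into a go/no-go verdict before the AI is allowed to say "it's live". The NavigateResult and DeployVerdict shapes, the assessDeploy name, and the empty-title heuristic are all assumptions for illustration. The proposal only specifies that browser.navigate returns a final URL, status code, and page title.

```typescript
// Hypothetical result shape for browser.navigate; field names are assumptions
// based on the tool table (final URL + status code + page title).
interface NavigateResult {
  finalUrl: string;
  status: number;
  title: string;
}

interface DeployVerdict {
  ok: boolean;
  reasons: string[]; // empty when ok is true
}

// Decide whether a freshly deployed URL may be reported as live.
// consoleErrors is assumed to be the output of browser.console_logs.
function assessDeploy(nav: NavigateResult, consoleErrors: string[]): DeployVerdict {
  const reasons: string[] = [];
  if (nav.status < 200 || nav.status >= 400) {
    reasons.push(`HTTP ${nav.status} from ${nav.finalUrl}`);
  }
  // Assumed heuristic: a page with no title probably did not render.
  if (nav.title.trim() === "") {
    reasons.push("page rendered without a title");
  }
  if (consoleErrors.length > 0) {
    reasons.push(`${consoleErrors.length} client-side error(s), first: ${consoleErrors[0]}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

The console-error check is aimed at cases like the Twenty thread, where the status was 200 OK and the page was still broken. It only catches them if the failure surfaces as a client-side error, so a screenshot or text check may still be needed.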
Slot into: Phase 2 (Stability & visibility) — pairs with the runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).
Gap 3 — Live UI state attached to chat messages
Failure observed: in the Dr Dave thread, user typed "are you able to give me a preview url?" The AI didn't know which port the Next.js dev server would bind to, what was already running, or whether the user was looking at the chat or another tab. It guessed and re-discovered everything from scratch.
In the Twenty thread, "can you see the different sections?" — user meant Plan tab sections (Vision/Tasks/Decisions/Ideas). AI listed metadata. No way to know.
Why prompt rules can't fix this: the AI literally lacks the information.
Proposed fix: the chat panel sends a small uiContext object
alongside every user message. Inject into the system prompt as a
dynamic block (same shape as activeBlock):
{
  currentRoute: "/mark-account/project/abc/hosting",
  currentTab: "hosting",
  visibleResources: [
    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
  ],
  lastUserActions: [
    { at: "2m ago", action: "opened twenty-crm logs" },
    { at: "5m ago", action: "switched to Hosting tab" },
  ],
}
System-prompt block becomes:
The user is currently looking at the Hosting tab (route: …/hosting). Visible resources: vibn-frontend, vibn-dev-twenty-crm. Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago). When the user says "this" / "it" / "the URL", assume they mean something visible in the current viewport unless they name something else.
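Rendering that block from the uiContext object is mostly string assembly. The sketch below shows one possible shape. The type names and the renderUiContextBlock function are assumptions, and the field names mirror the example object shown earlier.

```typescript
// Types mirror the uiContext example; the names themselves are assumptions.
interface VisibleResource {
  kind: string;
  uuid: string;
  name: string;
}

interface UserAction {
  at: string; // already humanized, e.g. "2m ago"
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: VisibleResource[];
  lastUserActions: UserAction[];
}

// Turn the per-message uiContext into the dynamic system-prompt block.
function renderUiContextBlock(ctx: UiContext): string {
  const resources = ctx.visibleResources.map((r) => r.name).join(", ") || "none";
  const actions =
    ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ") || "none";
  return [
    `The user is currently looking at the ${ctx.currentTab} tab (route: ${ctx.currentRoute}).`,
    `Visible resources: ${resources}.`,
    `Recent actions: ${actions}.`,
    `When the user says "this" / "it" / "the URL", assume they mean something visible in the current viewport unless they name something else.`,
  ].join(" ");
}
```

Empty lists render as "none" rather than being omitted, so the model can tell "nothing visible" apart from "no context sent".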
Effort: ~3 hr. ~1 hr to wire the chat panel's
uiContext collection (existing route + tab state, last 5 actions
from a small ring buffer in the panel), ~1 hr to plumb through the
chat API, ~1 hr to add the prompt block.
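The "last 5 actions from a small ring buffer" part could look roughly like the following. This is an illustrative sketch only: the ActionRingBuffer class name and its API are assumptions, and the panel would presumably store timestamps and humanize them ("2m ago") at send time.

```typescript
// Fixed-capacity circular buffer: once full, each push overwrites the oldest entry.
// Class name and API are hypothetical.
class ActionRingBuffer<T> {
  private readonly slots: (T | undefined)[];
  private next = 0; // index the next push writes to
  private count = 0; // number of filled slots, capped at capacity

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.slots = new Array<T | undefined>(capacity).fill(undefined);
  }

  push(item: T): void {
    this.slots[this.next] = item;
    this.next = (this.next + 1) % this.slots.length;
    this.count = Math.min(this.count + 1, this.slots.length);
  }

  // Newest first, matching the ordering of lastUserActions in the example object.
  snapshot(): T[] {
    const out: T[] = [];
    for (let i = 1; i <= this.count; i++) {
      const idx = (this.next - i + this.slots.length) % this.slots.length;
      out.push(this.slots[idx] as T);
    }
    return out;
  }
}
```

With a capacity of 5, the panel would call push on each tab switch or resource open and send snapshot() as lastUserActions with every message.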
Slot into: Phase 3 (UX surfaces) — pairs with 3.2 (structured errors in chat) and 3.3 (empty-state nudges).
Gap 4 — Diff preview / accept-changes gate
Failure observed: none yet, but the surface is exposed today —
fs_edit writes directly to /workspace in the dev container. For
ephemeral exploration this is correct (sub-second iteration is the
whole Path B point). For changes destined to ship, the user has no
review surface; they only see what changed after the AI summarizes.
Why this matters for beta: the moment a paying user wants to "see what the AI changed before it goes live," there's nothing to show them. Cursor's whole UX is built on diffs the user accepts.
Proposed fix: two-mode fs_edit / fs_write:
- Direct mode (default for dev container): write immediately. Current behavior. Fine for "make the button blue" iteration.
- Staged mode (default when ship is the next likely action): write to a shadow path, surface a diff in the chat UI, gate the real write on a one-click "Accept" button.
The model decides which mode based on context — or simpler: stage when
the file is in a "protected" set (e.g. prisma/schema.prisma,
Dockerfile, package.json, anything in prod/ or migrations/),
direct otherwise.
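The simpler "protected set" rule can be sketched as a pure function. The protected entries come from the examples above. The matching behavior is an assumption for illustration: file entries match at any depth, so apps/web/Dockerfile counts, and any directory segment named prod or migrations protects everything beneath it. The chooseWriteMode name is hypothetical.

```typescript
type WriteMode = "direct" | "staged";

// Entries taken from the proposal's examples; the matching rules are assumptions.
const PROTECTED_FILES = ["prisma/schema.prisma", "Dockerfile", "package.json"];
const PROTECTED_DIRS = new Set(["prod", "migrations"]);

// Decide whether an fs_edit / fs_write to this path should be staged for review.
function chooseWriteMode(workspacePath: string): WriteMode {
  // Normalize separators and strip any leading "./" or "/".
  const rel = workspacePath.replace(/\\/g, "/").replace(/^(\.\/|\/)+/, "");

  // A protected file matches at the root or nested under any directory.
  const isProtectedFile = PROTECTED_FILES.some(
    (p) => rel === p || rel.endsWith(`/${p}`),
  );
  if (isProtectedFile) return "staged";

  // Only directory segments count, so a file named "prod.ts" stays direct.
  const dirSegments = rel.split("/").slice(0, -1);
  return dirSegments.some((d) => PROTECTED_DIRS.has(d)) ? "staged" : "direct";
}
```

A deterministic rule like this keeps the decision out of the model's hands, in line with the rest of the proposal: the harness decides, not the prompt.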
Effort: ~6 hr. ~2 hr backend (shadow write + apply endpoint), ~3 hr UI (diff renderer in the chat panel, accept/reject buttons), ~1 hr prompt + tool changes.
Slot into: Phase 4 (Onboarding & safety) — pairs with 4.5 (auth hardening) and 4.6 (compute quotas) as part of "what a stranger needs day 1."
Suggested sequencing
If we ship in priority order:
- Gap 1 first — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
- Gap 2 second — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
- Gap 3 third — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
- Gap 4 last — only matters once we have paying users editing prod-bound code. Pre-beta optional.
Total effort to ship 1+2+3 (the meaningful UX wins): ~9 hours.
How this changes BETA_LAUNCH_PLAN.md
Two new tasks slot into P2:
- P2.8 Tool-error recovery middleware (Gap 1) — block on nothing, ship before P2.4.
- P2.9 Browser-driver MCP tool (Gap 2) — block on nothing.
One new task in P3:
- P3.7 UI-state injection into chat (Gap 3) — block on nothing.
Gap 4 stays out of beta scope unless eval reveals real damage from unstaged edits.