# AI Harness Gaps — Proposal

> Four gaps in the Vibn AI experience that are **structural, not promptable**.
> Each one is responsible for a specific failure pattern visible in real
> production chat transcripts. None of them are scoped in
> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
> agent-execution / telemetry-streaming designs.
>
> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
>
> **Why these four:** they share a common shape — the model is doing what
> the prompt told it to, and still producing a bad outcome. The fix lives
> in the *harness around the model*, not in instructions to the model.

---

## TL;DR

| # | Gap | Failure pattern in prod | Fix size |
|---|---|---|---|
| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
| 2 | Browser-driver tool for the AI | "Should be live in 10s" — AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |

Total: ~15 hr of work. None require new infra.

---

## Gap 1 — Tool-error recovery middleware (highest ROI)

**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
On each attempt it responded by *creating a new service with a new name*,
not by calling `apps_unstick`. The prompt explicitly tells it not to do
this and tells it the recovery sequence. The model still did it.

**Why prompt rules fail here:** the model treats a system-prompt rule as
soft guidance, one line in a 30k-token document; the tool result is concrete
and 200ms-fresh. When tool reality contradicts prompt rules, tool
reality wins.

**Proposed fix:** middleware around `executeMcpTool` that pattern-matches
known-recoverable errors and **injects a synthetic system message** into
the conversation before the next round. An injected instruction lands
right next to the failing tool result, so it is much harder for the model
to ignore than a static rule earlier in the prompt.

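```ts
// In app/api/chat/route.ts, around the executeMcpTool call:
// detectKnownError returns a falsy value when no recovery rule matches the tool result.
const errorRecovery = detectKnownError(result);
if (errorRecovery) {
  // Pushed before the next model round, so it sits directly after the failing tool result.
  messages.push({
    role: "system",
    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
  });
}
```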

**Initial recovery rules** (high-confidence, low-false-positive):

| Error signature | Diagnosis | Fix | Antipattern |
|---|---|---|---|
| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |

**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
patterns + the injection in the chat route. Each rule is ~10 lines.
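
The registry described above could look roughly like the following. This is a minimal sketch, not the shipped file: `ErrorRecovery`, `RecoveryRule`, and `RECOVERY_RULES` are hypothetical names, the regular expressions are assumptions derived from the error signatures in the table, and the tool result is assumed to be a string or a JSON-serializable value.

```ts
// Sketch of lib/ai/error-recovery.ts (type and constant names are assumptions).
interface ErrorRecovery {
  diagnosis: string;
  fix: string;
  antipattern: string; // phrased to follow "Do NOT ..." in the injected message
}

interface RecoveryRule extends ErrorRecovery {
  signature: RegExp; // matched against the tool result rendered as text
}

const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/i,
    diagnosis: "Orphan container blocking new boot",
    fix: "call apps_unstick { uuid }, then apps_deploy { uuid }",
    antipattern: "delete and recreate the service with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/i,
    diagnosis: "Image not on the host yet",
    fix: "call apps_repair { uuid }",
    antipattern: "retry the deploy without addressing the cause",
  },
  {
    signature: /port .+ is already allocated/i,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

// Returns the first matching recovery, or null when the result is not a known error.
function detectKnownError(result: unknown): ErrorRecovery | null {
  const text = typeof result === "string" ? result : JSON.stringify(result) ?? "";
  for (const { signature, ...recovery } of RECOVERY_RULES) {
    if (signature.test(text)) return recovery;
  }
  return null;
}
```

Keeping the rules as plain data means a new rule is one more array entry, and the first match wins, so more specific signatures should be listed before broader ones.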

**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility) — fits next to 2.4 (deployment-failed webhook).

---

## Gap 2 — Browser-driver tool for the AI

**Failure observed:** in the same Twenty thread, the AI said *"It's
fully deployed, healthy, and I've verified it's returning a 200 OK
status"* — but the user saw "Unable to Reach Back-end" on the actual
page. The AI checked Coolify's status reporting, not the rendered app.
Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
on the very first load for the DNS to propagate"* — the AI hedged
because it couldn't load the URL itself.

**Why this matters for beta:** every "I deployed it" claim is unverified
unless the AI can open the URL. Sentry (planned in P2.3) catches
errors *after a user hits them*. A browser tool catches errors
*before any user hits them*.

**Proposed fix:** add a `browser.*` MCP tool surface backed by a
headless Chromium running on the Coolify host (or in the vibn-dev
container). Initial tools:

| Tool | Purpose |
|---|---|
| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |

**Implementation:** Playwright already has an official MCP server (`@playwright/mcp`).
Wire it as a Coolify service and expose it via the same per-workspace MCP
token Vibn already issues.

**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
add tool definitions, ~1 hr to wire prompt instructions ("after any
deploy or `dev_server.start`, call `browser.navigate` to confirm").
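
The "navigate after every deploy" rule could also be backed by a harness-side check rather than the prompt alone. The following is a minimal sketch under stated assumptions, not the Playwright MCP API: `BrowserTools`, `NavigateResult`, `DeployVerdict`, `verifyDeploy`, and the 15-second default timeout are all hypothetical, and the browser calls are injected so the logic can be exercised without a real browser.

```ts
// Hypothetical result of browser.navigate (fields follow the tool table above).
interface NavigateResult {
  finalUrl: string;
  statusCode: number;
  title: string;
}

// Hypothetical wrapper around the browser.* tools; injected so it can be faked.
interface BrowserTools {
  navigate(url: string, timeoutMs?: number): Promise<NavigateResult>;
  consoleLogs(url: string): Promise<string[]>; // client-side error messages only
}

interface DeployVerdict {
  ok: boolean;
  reasons: string[]; // human-readable failures to hand back to the model
}

async function verifyDeploy(
  url: string,
  browser: BrowserTools,
  timeoutMs = 15000,
): Promise<DeployVerdict> {
  const reasons: string[] = [];
  try {
    const page = await browser.navigate(url, timeoutMs);
    if (page.statusCode < 200 || page.statusCode >= 400) {
      reasons.push(`HTTP ${page.statusCode} at ${page.finalUrl}`);
    }
    // A 2xx status is not enough: the page can load and still fail client-side.
    const errors = await browser.consoleLogs(url);
    if (errors.length > 0) {
      reasons.push(`${errors.length} console error(s), first: ${errors[0]}`);
    }
  } catch (err) {
    reasons.push(`navigation failed: ${err instanceof Error ? err.message : String(err)}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

A verdict with `ok: false` gives the harness concrete text to return as the tool result, so an "it's live" claim is only made after the page has actually been loaded.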

**Slot into:** Phase 2 (Stability & visibility) — pairs with the
runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).

---

## Gap 3 — Live UI state attached to chat messages

**Failure observed:** in the Dr Dave thread, the user typed *"are you able
to give me a preview url?"* The AI didn't know which port the
Next.js dev server would bind to, what was already running, or
whether the user was looking at the chat or another tab. It
guessed and re-discovered everything from scratch.

In the Twenty thread, the user asked *"can you see the different sections?"* and
meant the Plan tab's sections (Vision/Tasks/Decisions/Ideas). The AI listed
metadata instead; it had no way to know which sections were meant.

**Why prompt rules can't fix this:** the AI literally lacks the
information.

**Proposed fix:** the chat panel sends a small `uiContext` object
alongside every user message. Inject into the system prompt as a
dynamic block (same shape as `activeBlock`):

```ts
// Example payload the chat panel sends with each user message.
const uiContext = {
  currentRoute: "/mark-account/project/abc/hosting",
  currentTab: "hosting",
  visibleResources: [
    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
  ],
  lastUserActions: [
    { at: "2m ago", action: "opened twenty-crm logs" },
    { at: "5m ago", action: "switched to Hosting tab" },
  ],
};
```

System-prompt block becomes:

> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
> When the user says "this" / "it" / "the URL" — assume they mean
> something visible in the current viewport unless they name something else.
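
Turning the `uiContext` object into that prompt block is a small pure function. The sketch below is one possible shape, assuming the field names from the example payload; `UiContext`, `UiResource`, `UiAction`, and `renderUiContextBlock` are hypothetical names, and the exact wording of the block is illustrative.

```ts
interface UiResource {
  kind: string;
  uuid: string;
  name: string;
}

interface UiAction {
  at: string; // already formatted by the panel, e.g. "2m ago"
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: UiResource[];
  lastUserActions: UiAction[];
}

// Renders the dynamic system-prompt block; empty lists fall back to "none".
function renderUiContextBlock(ctx: UiContext): string {
  const resources =
    ctx.visibleResources.map((r) => `\`${r.name}\``).join(", ") || "none";
  const actions =
    ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ") || "none";
  return [
    `The user is currently looking at the ${ctx.currentTab} tab (route: ${ctx.currentRoute}).`,
    `Visible resources: ${resources}.`,
    `Recent actions: ${actions}.`,
    `When the user says "this" / "it" / "the URL", assume they mean something visible in the current viewport unless they name something else.`,
  ].join("\n");
}
```

Because the payload comes from the client, the server would likely want to validate it and cap field lengths before it reaches the prompt.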

**Effort:** ~3 hr. ~1 hr to wire the chat panel's
`uiContext` collection (existing route + tab state, last 5 actions
from a small ring buffer in the panel), ~1 hr to plumb through the
chat API, ~1 hr to add the prompt block.
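
The "last 5 actions" ring buffer mentioned above is only a few lines. A minimal sketch, assuming actions are recorded with a millisecond timestamp and formatted as relative minutes when the message is sent; `ActionRing`, `RecordedAction`, and `formatAgo` are hypothetical names.

```ts
interface RecordedAction {
  at: number; // epoch milliseconds
  action: string;
}

// Relative time in whole minutes, matching the "2m ago" style of the payload.
function formatAgo(at: number, now: number): string {
  const minutes = Math.max(0, Math.floor((now - at) / 60000));
  return minutes === 0 ? "just now" : `${minutes}m ago`;
}

class ActionRing {
  private readonly slots: RecordedAction[] = [];
  private readonly capacity: number;
  private start = 0; // index of the oldest entry once the buffer is full

  constructor(capacity = 5) {
    this.capacity = capacity;
  }

  record(action: string, at: number = Date.now()): void {
    const entry: RecordedAction = { at, action };
    if (this.slots.length < this.capacity) {
      this.slots.push(entry);
    } else {
      // Overwrite the oldest entry and advance the start pointer.
      this.slots[this.start] = entry;
      this.start = (this.start + 1) % this.capacity;
    }
  }

  // Newest first, which is the order used in the uiContext example.
  recent(): RecordedAction[] {
    const oldestFirst = [
      ...this.slots.slice(this.start),
      ...this.slots.slice(0, this.start),
    ];
    return oldestFirst.reverse();
  }

  // Shape expected by uiContext.lastUserActions.
  toUiActions(now: number = Date.now()): { at: string; action: string }[] {
    return this.recent().map((e) => ({ at: formatAgo(e.at, now), action: e.action }));
  }
}
```

The panel would call `record(...)` on tab switches and similar events, then attach `toUiActions()` to the outgoing message.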

**Slot into:** Phase 3 (UX surfaces) — pairs with 3.2 (structured
errors in chat) and 3.3 (empty-state nudges).

---

## Gap 4 — Diff preview / accept-changes gate

**Failure observed:** none yet, but the surface is exposed today —
`fs_edit` writes directly to `/workspace` in the dev container. For
ephemeral exploration this is correct (sub-second iteration is the
whole Path B point). For changes destined to ship, the user has no
review surface; they only see what changed after the AI summarizes.

**Why this matters for beta:** the moment a paying user wants to
"see what the AI changed before it goes live," there's nothing to
show them. Cursor's whole UX is built on diffs the user accepts.

**Proposed fix:** two-mode `fs_edit` / `fs_write`:

1. **Direct mode (default for dev container):** write immediately. Current
   behavior. Fine for "make the button blue" iteration.
2. **Staged mode (default when `ship` is the next likely action):**
   write to a shadow path, surface a diff in the chat UI, gate the
   real write on a one-click "Accept" button.

The model decides which mode based on context — or simpler: stage when
the file is in a "protected" set (e.g. `prisma/schema.prisma`,
`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
direct otherwise.
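
The simpler "protected set" rule can be sketched as a pure function. The file and directory names come from the examples above; `WriteMode`, `PROTECTED_FILES`, `PROTECTED_DIRS`, and `chooseWriteMode` are hypothetical names, and matching `Dockerfile` and `package.json` in any directory is an assumption about the intended behavior.

```ts
type WriteMode = "direct" | "staged";

// Matched as a full path or as a path suffix on a segment boundary.
const PROTECTED_FILES = ["prisma/schema.prisma", "Dockerfile", "package.json"];
// Matched against any directory segment of the path.
const PROTECTED_DIRS = ["prod", "migrations"];

function chooseWriteMode(filePath: string): WriteMode {
  // Drop empty and "." segments so "/workspace//x" and "./x" normalize cleanly.
  const segments = filePath.split("/").filter((s) => s !== "" && s !== ".");
  const normalized = segments.join("/");

  // Only directory segments count; a file merely named "prod" is not protected.
  const inProtectedDir = segments
    .slice(0, -1)
    .some((dir) => PROTECTED_DIRS.includes(dir));

  const isProtectedFile = PROTECTED_FILES.some(
    (suffix) => normalized === suffix || normalized.endsWith("/" + suffix),
  );

  return inProtectedDir || isProtectedFile ? "staged" : "direct";
}
```

For example, `chooseWriteMode("/workspace/prisma/schema.prisma")` stages the write, while `chooseWriteMode("src/components/Button.tsx")` writes directly. A static rule like this keeps the decision out of the model's hands, which fits the theme of the other gaps.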

**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
~1 hr prompt + tool changes.

**Slot into:** Phase 4 (Onboarding & safety) — pairs with 4.5 (auth
hardening) and 4.6 (compute quotas) as part of "what a stranger
needs day 1."

---

## Suggested sequencing

If we ship in priority order:

1. **Gap 1 first** — kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
2. **Gap 2 second** — closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because it's no longer blind.
3. **Gap 3 third** — tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
4. **Gap 4 last** — only matters once we have paying users editing prod-bound code. Pre-beta optional.

Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**

---

## How this changes BETA_LAUNCH_PLAN.md

Two new tasks slot into P2:

- **P2.8** Tool-error recovery middleware (Gap 1): blocked on nothing; ship before P2.4.
- **P2.9** Browser-driver MCP tool (Gap 2): blocked on nothing.

One new task slots into P3:

- **P3.7** UI-state injection into chat (Gap 3): blocked on nothing.

Gap 4 stays out of beta scope unless eval reveals real damage from
unstaged edits.