docs: add AI harness gaps proposal — orphan-recovery, browser tool, UI state, diff preview
Co-authored-by: Cursor <cursoragent@cursor.com>
227
AI_HARNESS_GAPS.md
Normal file
@@ -0,0 +1,227 @@
# AI Harness Gaps: Proposal

> Four gaps in the Vibn AI experience that are **structural, not promptable**.
> Each one is responsible for a specific failure pattern visible in real
> production chat transcripts. None of them are scoped in
> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
> agent-execution / telemetry-streaming designs.
>
> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
>
> **Why these four:** they share a common shape: better prompting does not
> help, because the model either follows the prompt and still produces a bad
> outcome, or overrides the prompt when a fresh tool result points elsewhere.
> The fix lives in the *harness around the model*, not in instructions to the model.


---

## TL;DR

| # | Gap | Failure pattern in prod | Fix size |
|---|---|---|---|
| 1 | Tool-error recovery middleware | Orphan twenty-* services (4 shipped). Model keeps deleting and recreating despite an explicit prompt rule against it. | ~2 hr |
| 2 | Browser-driver tool for the AI | "Should be live in 10s": the AI ships URLs without ever loading them; the user discovers the 502. | ~4 hr |
| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |

Total: ~15 hr of work. None require new infra.


---

## Gap 1: Tool-error recovery middleware (highest ROI)

**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
On each attempt it responded by *creating a new service with a new name*,
not by calling `apps_unstick`. The prompt explicitly forbids this and
spells out the recovery sequence. The model still did it.

**Why prompt rules fail here:** the model treats the system prompt as
soft guidance buried in a 30k-token document, whereas the tool result is
concrete and 200ms-fresh. When tool reality contradicts prompt rules, tool
reality wins.

**Proposed fix:** middleware around the `executeMcpTool` call that pattern-matches
known-recoverable errors and **injects a synthetic system message** into
the conversation before the next round. An injected instruction arrives
right next to the failing tool result, so it is much harder for the model
to ignore than a static prompt rule.

```ts
// In app/api/chat/route.ts, right after the executeMcpTool call.
// `result` is the tool result; `messages` is the running conversation.
const errorRecovery = detectKnownError(result);
if (errorRecovery) {
  // Inject the recovery instruction so the model sees it on the next round.
  messages.push({
    role: "system",
    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
  });
}
```

**Initial recovery rules** (high-confidence, low-false-positive):

| Error signature | Diagnosis | Fix | Antipattern |
|---|---|---|---|
| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |

**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
patterns, plus the injection in the chat route. Each rule is ~10 lines.

**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility), next to 2.4 (deployment-failed webhook).

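A minimal sketch of what `lib/ai/error-recovery.ts` might contain, assuming the tool error has already been reduced to a string (the chat-route snippet passes `result` directly, so the real signature may differ). The `RecoveryRule` shape, the regular expressions, and `formatRecoveryMessage` are illustrative names, not existing code; the three rules mirror the table above.

```typescript
// Hypothetical sketch of lib/ai/error-recovery.ts. All names are illustrative.
// In a real module these declarations would be exported.

interface RecoveryRule {
  /** Regex tested against the raw tool error text. */
  signature: RegExp;
  diagnosis: string;
  fix: string;
  antipattern: string;
}

type Recovery = Omit<RecoveryRule, "signature">;

// One entry per row of the recovery-rules table; first match wins.
const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/i,
    diagnosis: "Orphan container blocking new boot",
    fix: "call apps_unstick { uuid }, then apps_deploy { uuid }",
    antipattern: "delete and recreate the service with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/i,
    diagnosis: "Image not on the host yet",
    fix: "call apps_repair { uuid }",
    antipattern: "retry the deploy without addressing the cause",
  },
  {
    signature: /port .+ is already allocated/i,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

/** Returns the first matching recovery, or null when the error is not recognized. */
function detectKnownError(errorText: string): Recovery | null {
  for (const rule of RECOVERY_RULES) {
    if (rule.signature.test(errorText)) {
      const { diagnosis, fix, antipattern } = rule;
      return { diagnosis, fix, antipattern };
    }
  }
  return null;
}

/** Builds the synthetic system message in the same shape as the route snippet. */
function formatRecoveryMessage(r: Recovery): string {
  return `[RECOVERY] ${r.diagnosis}. Required next action: ${r.fix}. Do NOT ${r.antipattern}.`;
}
```

Returning `null` for unrecognized errors keeps the middleware a no-op by default, which is what makes a small, high-confidence rule set safe to ship first.
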

---

## Gap 2: Browser-driver tool for the AI

**Failure observed:** in the same Twenty thread, the AI said *"It's
fully deployed, healthy, and I've verified it's returning a 200 OK
status"*, but the user saw "Unable to Reach Back-end" on the actual
page. The AI checked Coolify's status reporting, not the rendered app.
Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
on the very first load for the DNS to propagate"*. The AI hedged
because it couldn't load the URL itself.

**Why this matters for beta:** every "I deployed it" claim is unverified
unless the AI can open the URL. Sentry (planned in P2.3) catches
errors *after a user hits them*. A browser tool catches errors
*before any user hits them*.

**Proposed fix:** add a `browser.*` MCP tool surface backed by a
headless Chromium running on the Coolify host (or in the vibn-dev
container). Initial tools:

| Tool | Purpose |
|---|---|
| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |

**Implementation:** Playwright already has an MCP server (Microsoft's `@playwright/mcp`).
Wire it as a Coolify service and expose it via the same per-workspace MCP
token Vibn already issues. Its built-in tool names may not match the table
above one-to-one, so a thin mapping layer may be needed.

**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
add tool definitions, ~1 hr to wire prompt instructions ("after any
deploy or `dev_server.start`, call `browser.navigate` to confirm").

**Slot into:** Phase 2 (Stability & visibility), paired with the
runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).

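To make the "confirm after deploy" step concrete, here is a minimal sketch of how the harness could turn a navigation result into a verdict, so that a 200 status alone never counts as verification. The `NavigateResult` shape is an assumption (the table only promises final URL, status code, and title; `bodyText` and `consoleErrors` are hypothetical extras), and the marker list beyond "Unable to Reach Back-end" is illustrative.

```typescript
// Hypothetical result shape; the real browser.* tools may return something different.
interface NavigateResult {
  finalUrl: string;
  status: number;          // HTTP status of the final response
  title: string;           // document title after load
  bodyText: string;        // visible page text (assumed to be available)
  consoleErrors: string[]; // e.g. from a browser.console_logs-style call
}

interface DeployVerdict {
  verified: boolean;
  reasons: string[]; // empty when verified
}

// Text that signals a rendered error page even when the HTTP status is 200.
// Only the first marker comes from the transcripts; the others are examples.
const ERROR_MARKERS = ["Unable to Reach Back-end", "502 Bad Gateway", "Application error"];

function verifyDeploy(result: NavigateResult): DeployVerdict {
  const reasons: string[] = [];

  if (result.status < 200 || result.status >= 300) {
    reasons.push(`HTTP status ${result.status}`);
  }

  // Case-insensitive search across the title and the visible text.
  const haystack = `${result.title}\n${result.bodyText}`.toLowerCase();
  for (const marker of ERROR_MARKERS) {
    if (haystack.indexOf(marker.toLowerCase()) !== -1) {
      reasons.push(`page shows "${marker}"`);
    }
  }

  if (result.consoleErrors.length > 0) {
    reasons.push(`${result.consoleErrors.length} console error(s)`);
  }

  return { verified: reasons.length === 0, reasons };
}
```

The Twenty failure above is exactly the case where the status check passes and the marker check fails, which is why the verdict combines both.
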

---

## Gap 3: Live UI state attached to chat messages

**Failure observed:** in the Dr Dave thread, the user typed *"are you able
to give me a preview url?"* The AI didn't know which port the
Next.js dev server would bind to, what was already running, or
whether the user was looking at the chat or another tab. It
guessed and re-discovered everything from scratch.

In the Twenty thread, the user asked *"can you see the different sections?"*,
meaning the Plan tab sections (Vision/Tasks/Decisions/Ideas). The AI listed
metadata. It had no way to know.

**Why prompt rules can't fix this:** the AI literally lacks the
information.

**Proposed fix:** the chat panel sends a small `uiContext` object
alongside every user message. The chat API injects it into the system
prompt as a dynamic block (same shape as `activeBlock`):

```ts
// Example `uiContext` payload sent alongside a user message.
const uiContext = {
  currentRoute: "/mark-account/project/abc/hosting",
  currentTab: "hosting",
  visibleResources: [
    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
  ],
  lastUserActions: [
    { at: "2m ago", action: "opened twenty-crm logs" },
    { at: "5m ago", action: "switched to Hosting tab" },
  ],
};
```

The system-prompt block becomes:

> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
> When the user says "this" / "it" / "the URL", assume they mean
> something visible in the current viewport unless they name something else.

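One way the prompt block could be produced from the payload is sketched below. The `UiContext` type and the `renderUiContextBlock` helper are hypothetical names; the sketch prints the full route rather than an abbreviated one and falls back to "none" for empty lists, both of which are assumptions rather than decisions made in this proposal.

```typescript
// Hypothetical types matching the example payload above.
interface VisibleResource { kind: string; uuid: string; name: string; }
interface UserAction { at: string; action: string; }
interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: VisibleResource[];
  lastUserActions: UserAction[];
}

/** Renders the dynamic system-prompt block, one sentence per line. */
function renderUiContextBlock(ctx: UiContext): string {
  // "hosting" -> "Hosting"
  const tab = ctx.currentTab.charAt(0).toUpperCase() + ctx.currentTab.slice(1);
  const resources = ctx.visibleResources.map((r) => "`" + r.name + "`").join(", ");
  const actions = ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ");

  return [
    `The user is currently looking at the **${tab} tab** (route: \`${ctx.currentRoute}\`).`,
    `Visible resources: ${resources || "none"}.`,
    `Recent actions: ${actions || "none"}.`,
    'When the user says "this" / "it" / "the URL", assume they mean something visible in the current viewport unless they name something else.',
  ].join("\n");
}
```

Keeping the renderer a pure function of the payload makes it easy to snapshot-test the exact text the model will see.
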
**Effort:** ~3 hr. ~1 hr to wire the chat panel's
`uiContext` collection (existing route + tab state, last 5 actions
from a small ring buffer in the panel), ~1 hr to plumb it through the
chat API, ~1 hr to add the prompt block.

**Slot into:** Phase 3 (UX surfaces), paired with 3.2 (structured
errors in chat) and 3.3 (empty-state nudges).

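The "small ring buffer" for the last 5 actions could look roughly like the sketch below. `ActionRing` and `formatAgo` are hypothetical names, and storing timestamps as epoch milliseconds with relative labels computed at send time is an assumption made so the labels stay accurate.

```typescript
// Hypothetical in-panel buffer for recent user actions.
interface RecordedAction { at: number; action: string; } // at: epoch milliseconds

class ActionRing {
  private items: RecordedAction[] = [];
  private readonly capacity: number;

  constructor(capacity: number = 5) {
    this.capacity = capacity;
  }

  /** Appends an action and drops the oldest one once capacity is exceeded. */
  record(action: string, at: number): void {
    this.items.push({ at, action });
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  /** Newest first, with relative labels such as "2m ago". */
  snapshot(now: number): { at: string; action: string }[] {
    return this.items
      .slice()
      .reverse()
      .map((item) => ({ at: formatAgo(now - item.at), action: item.action }));
  }
}

function formatAgo(elapsedMs: number): string {
  const minutes = Math.floor(elapsedMs / 60000);
  if (minutes < 1) return "just now";
  if (minutes < 60) return `${minutes}m ago`;
  return `${Math.floor(minutes / 60)}h ago`;
}
```

The output of `snapshot` has the same `{ at, action }` shape as `lastUserActions` in the example payload, so it can be dropped into `uiContext` as-is.
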

---

## Gap 4: Diff preview / accept-changes gate

**Failure observed:** none yet, but the surface is exposed today:
`fs_edit` writes directly to `/workspace` in the dev container. For
ephemeral exploration this is correct (sub-second iteration is the
whole point of Path B). For changes destined to ship, the user has no
review surface; they only see what changed after the AI summarizes it.

**Why this matters for beta:** the moment a paying user wants to
"see what the AI changed before it goes live," there's nothing to
show them. Cursor's whole UX is built on diffs the user accepts.

**Proposed fix:** two-mode `fs_edit` / `fs_write`:

1. **Direct mode (default for dev container):** write immediately. Current
   behavior. Fine for "make the button blue" iteration.
2. **Staged mode (default when `ship` is the next likely action):**
   write to a shadow path, surface a diff in the chat UI, gate the
   real write on a one-click "Accept" button.

The model decides which mode to use based on context. A simpler alternative: stage when
the file is in a "protected" set (e.g. `prisma/schema.prisma`,
`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
and write directly otherwise.

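The simpler "protected set" rule can be sketched as a pure function. `chooseWriteMode` is a hypothetical name, the lists contain only the examples named above, and the matching rules (paths relative to `/workspace`, bare file names matched at any depth, directory names matched as whole path segments) are assumptions.

```typescript
type WriteMode = "direct" | "staged";

// Illustrative protected set, taken from the examples in the text.
const PROTECTED_FILES = ["prisma/schema.prisma", "Dockerfile", "package.json"];
const PROTECTED_DIRS = ["prod", "migrations"];

/** Decides whether an edit should be written immediately or staged for review. */
function chooseWriteMode(filePath: string): WriteMode {
  // Normalize separators and strip any leading "./" or "/".
  const normalized = filePath.replace(/\\/g, "/").replace(/^(\.\/|\/)+/, "");
  const segments = normalized.split("/");
  const fileName = segments[segments.length - 1];

  // Entries with a slash must match the whole relative path;
  // bare names such as "package.json" match at any depth.
  const isProtectedFile = PROTECTED_FILES.some(
    (p) => normalized === p || (p.indexOf("/") === -1 && fileName === p),
  );

  // Any directory segment named "prod" or "migrations" protects the file.
  const inProtectedDir = segments
    .slice(0, -1)
    .some((segment) => PROTECTED_DIRS.indexOf(segment) !== -1);

  return isProtectedFile || inProtectedDir ? "staged" : "direct";
}
```

Matching directories by whole segment means a file like `src/prod-utils.ts` still takes the fast direct path.
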
**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
~1 hr prompt + tool changes.

**Slot into:** Phase 4 (Onboarding & safety), paired with 4.5 (auth
hardening) and 4.6 (compute quotas) as part of "what a stranger
needs day 1."


---

## Suggested sequencing

If we ship in priority order:

1. **Gap 1 first.** Kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
2. **Gap 2 second.** Closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because the AI is no longer blind.
3. **Gap 3 third.** Tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
4. **Gap 4 last.** Only matters once we have paying users editing prod-bound code. Optional pre-beta.

Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**


---

## How this changes BETA_LAUNCH_PLAN.md

Two new tasks slot into P2:

- **P2.8** Tool-error recovery middleware (Gap 1): blocked on nothing, ship before P2.4.
- **P2.9** Browser-driver MCP tool (Gap 2): blocked on nothing.

One new task in P3:

- **P3.7** UI-state injection into chat (Gap 3): blocked on nothing.

Gap 4 stays out of beta scope unless eval reveals real damage from
unstaged edits.
