# AI Harness Gaps — Proposal

> Four gaps in the Vibn AI experience that are **structural, not promptable**.
> Each one is responsible for a specific failure pattern visible in real
> production chat transcripts. None of them is scoped in
> [`AI_PATH_B_EXECUTION_PLAN.md`](./AI_PATH_B_EXECUTION_PLAN.md),
> [`BETA_LAUNCH_PLAN.md`](./BETA_LAUNCH_PLAN.md),
> [`AI_CAPABILITIES_ROADMAP.md`](./AI_CAPABILITIES_ROADMAP.md), or the
> agent-execution / telemetry-streaming designs.
>
> **Drafted:** 2026-04-30 (after a transcript review of the Dr Dave + Twenty CRM threads).
>
> **Why these four:** they share a common shape: no amount of prompt wording
> prevents the bad outcome, because the model either overrides a static rule
> with fresh tool output or lacks the information or capability to comply.
> The fix lives in the *harness around the model*, not in instructions to the model.

---

## TL;DR

| # | Gap | Failure pattern in prod | Fix size |
|---|---|---|---|
| 1 | Tool-error recovery middleware | Orphan `twenty-*` services (4 shipped). Model keeps delete-and-recreating despite explicit prompt rule against it. | ~2 hr |
| 2 | Browser-driver tool for the AI | "Should be live in 10s": AI ships URLs without ever loading them; user discovers the 502. | ~4 hr |
| 3 | Live UI state attached to chat messages | "this isn't working" / "fix the URL" with no signal of which "this". AI guesses, often wrong. | ~3 hr |
| 4 | Diff preview / accept-changes gate | `fs_edit` writes straight to the dev container with no review surface. Fine for sub-second iteration; bad for prod-bound edits. | ~6 hr |

Total: ~15 hr of work. None require new infra.

---

## Gap 1 — Tool-error recovery middleware (highest ROI)

**Failure observed:** in thread `d698ef40-…` ("Hey there, what can you see about this project?"), the AI hit
`Conflict. The container name "/postgres-…" is already in use` **three separate times**.
On each attempt it responded by *creating a new service with a new name*,
not by calling `apps_unstick`. The prompt explicitly tells it not to do
this and spells out the recovery sequence. The model still did it.

**Why prompt rules fail here:** the rule is one line of soft guidance
buried in a 30k-token system prompt; the tool result is concrete
and 200ms-fresh. When tool reality contradicts prompt rules, tool
reality wins.

**Proposed fix:** middleware around the `executeMcpTool` call that pattern-matches
known-recoverable errors and **injects a synthetic system message** into
the conversation before the next round. An instruction injected right
after the failing tool result is far harder for the model to ignore than a
static prompt rule, because it is as recent and as specific as the error itself.

```ts
// In app/api/chat/route.ts, around the executeMcpTool call:
// detectKnownError returns the matching recovery rule, or a falsy value
// when the tool result is not a known-recoverable error.
const errorRecovery = detectKnownError(result);
if (errorRecovery) {
  // Appended after the tool result so the model sees it in the next round.
  messages.push({
    role: "system",
    content: `[RECOVERY] ${errorRecovery.diagnosis}. Required next action: ${errorRecovery.fix}. Do NOT ${errorRecovery.antipattern}.`,
  });
}
```

**Initial recovery rules** (high-confidence, low-false-positive):

| Error signature | Diagnosis | Fix | Antipattern |
|---|---|---|---|
| `Conflict. The container name … is already in use` | Orphan container blocking new boot | `apps_unstick { uuid }` then `apps_deploy { uuid }` | Delete and recreate with a new name |
| `pull access denied` / `manifest unknown` | Image not on the host yet | `apps_repair { uuid }` | Retry deploy without addressing the cause |
| `port … is already allocated` | Another container holds the port | List containers, identify holder, decide | Pick a random different port |

**Effort:** ~2 hr. New file `lib/ai/error-recovery.ts` with a registry of
patterns + the injection in the chat route. Each rule is ~10 lines.

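
A minimal sketch of what the pattern registry could look like, assuming the tool result has already been reduced to its error text. The `RecoveryRule` shape, the `RECOVERY_RULES` name, and the exact regular expressions are illustrative assumptions derived from the table above, not the existing implementation; only `detectKnownError` and the `[RECOVERY]` message format come from the proposal.

```typescript
// Hypothetical rule shape; field names mirror the columns of the rules table.
interface RecoveryRule {
  signature: RegExp;
  diagnosis: string;
  fix: string;
  antipattern: string;
}

// Assumed registry contents, transcribed from the three initial rules.
const RECOVERY_RULES: RecoveryRule[] = [
  {
    signature: /Conflict\. The container name .+ is already in use/,
    diagnosis: "Orphan container blocking new boot",
    fix: "call apps_unstick { uuid }, then apps_deploy { uuid }",
    antipattern: "delete and recreate the service with a new name",
  },
  {
    signature: /pull access denied|manifest unknown/,
    diagnosis: "Image not on the host yet",
    fix: "call apps_repair { uuid }",
    antipattern: "retry the deploy without addressing the cause",
  },
  {
    signature: /port .+ is already allocated/,
    diagnosis: "Another container holds the port",
    fix: "list containers, identify the holder, then decide",
    antipattern: "pick a random different port",
  },
];

// Returns the first rule whose signature matches, or null for unknown errors.
function detectKnownError(errorText: string): RecoveryRule | null {
  for (const rule of RECOVERY_RULES) {
    if (rule.signature.test(errorText)) {
      return rule;
    }
  }
  return null;
}

// Builds the synthetic system message in the format the proposal uses.
function formatRecoveryMessage(rule: RecoveryRule): string {
  return `[RECOVERY] ${rule.diagnosis}. Required next action: ${rule.fix}. Do NOT ${rule.antipattern}.`;
}
```

Keeping each rule as plain data means a new error signature is a one-entry change, and unknown errors fall through untouched, which is what keeps the false-positive rate low.
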
**Slot into:** `BETA_LAUNCH_PLAN.md` Phase 2 (Stability & visibility), next to 2.4 (deployment-failed webhook).

---

## Gap 2 — Browser-driver tool for the AI

**Failure observed:** in the same Twenty thread, the AI said *"It's
fully deployed, healthy, and I've verified it's returning a 200 OK
status"*, but the user saw "Unable to Reach Back-end" on the actual
page. The AI checked Coolify's status reporting, not the rendered app.
Also visible in the Dr Dave thread: *"Note: it might take 10-15 seconds
on the very first load for the DNS to propagate"*. The AI hedged
because it couldn't load the URL itself.

**Why this matters for beta:** every "I deployed it" claim is unverified
unless the AI can open the URL. Sentry (planned in P2.3) catches
errors *after a user hits them*. A browser tool catches errors
*before any user hits them*.

**Proposed fix:** add a `browser.*` MCP tool surface backed by a
headless Chromium running on the Coolify host (or in the vibn-dev
container). Initial tools:

| Tool | Purpose |
|---|---|
| `browser.navigate { url, timeoutMs? }` | Load the URL, return final URL + status code + page title |
| `browser.screenshot { url }` | Visual confirmation. Return base64 PNG (or store in GCS) |
| `browser.console_logs { url }` | Capture client-side JS errors (the `TypeError: reading 'z'/'j'/'aa'` from BETA P2.2 would be findable this way) |
| `browser.fetch { url, headers? }` | HTTP-level smoke test. Subset of `http_fetch` but always from inside Vibn's network |

**Implementation:** Playwright already has an MCP server (`@playwright/mcp`, published by Microsoft).
Wire it as a Coolify service, expose via the same per-workspace MCP
token Vibn already issues.

**Effort:** ~4 hr. ~2 hr to deploy Playwright as a service, ~1 hr to
add tool definitions, ~1 hr to wire prompt instructions ("after any
deploy or `dev_server.start`, call `browser.navigate` to confirm").

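
To make "call `browser.navigate` to confirm" concrete, here is a rough sketch of how the harness could turn a navigation result into a verified / not-verified verdict before the AI is allowed to claim a deploy is live. The `NavigationResult` shape is an assumption based on the fields the tool table lists (final URL, status code, page title) plus the output of `browser.console_logs`; `judgeDeploy` and its pass criteria are hypothetical.

```typescript
// Hypothetical result shape: fields named in the tool table for
// browser.navigate, plus console errors gathered via browser.console_logs.
interface NavigationResult {
  finalUrl: string;
  status: number;
  title: string;
  consoleErrors: string[];
}

interface DeployVerdict {
  verified: boolean;
  reasons: string[]; // empty when verified
}

// Assumed criteria: a 2xx/3xx response, a non-empty title, no client-side errors.
function judgeDeploy(nav: NavigationResult): DeployVerdict {
  const reasons: string[] = [];
  if (nav.status < 200 || nav.status >= 400) {
    reasons.push(`HTTP status ${nav.status}`);
  }
  if (nav.title.trim() === "") {
    reasons.push("page has no title (possibly a blank or error page)");
  }
  if (nav.consoleErrors.length > 0) {
    reasons.push(`${nav.consoleErrors.length} client-side error(s), first: ${nav.consoleErrors[0]}`);
  }
  return { verified: reasons.length === 0, reasons };
}
```

A check like this would still miss failures that only appear as rendered text, such as the "Unable to Reach Back-end" page, unless the failed request also surfaces as a console error; that is where `browser.screenshot` or a text assertion would have to come in.
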
**Slot into:** Phase 2 (Stability & visibility). Pairs with the
runtime error chase (2.1, 2.2) and the Sentry wiring (2.3).

---

## Gap 3 — Live UI state attached to chat messages

**Failure observed:** in the Dr Dave thread, the user typed *"are you able
to give me a preview url?"* The AI didn't know which port the
Next.js dev server would bind to, what was already running, or
whether the user was looking at the chat or another tab. It
guessed and re-discovered everything from scratch.

In the Twenty thread, the user asked *"can you see the different sections?"*
and meant the Plan tab sections (Vision/Tasks/Decisions/Ideas). The AI listed
metadata instead; it had no way to know what the user was looking at.

**Why prompt rules can't fix this:** the AI literally lacks the
information.

**Proposed fix:** the chat panel sends a small `uiContext` object
alongside every user message. The chat API injects it into the system prompt as a
dynamic block (same shape as `activeBlock`):

```ts
{
  currentRoute: "/mark-account/project/abc/hosting",
  currentTab: "hosting",
  visibleResources: [
    { kind: "app", uuid: "y4cs…", name: "vibn-frontend" },
    { kind: "service", uuid: "igcp…", name: "vibn-dev-twenty-crm" },
  ],
  lastUserActions: [
    { at: "2m ago", action: "opened twenty-crm logs" },
    { at: "5m ago", action: "switched to Hosting tab" },
  ],
}
```

The system-prompt block becomes:

> The user is currently looking at the **Hosting tab** (route: `…/hosting`).
> Visible resources: `vibn-frontend`, `vibn-dev-twenty-crm`.
> Recent actions: opened twenty-crm logs (2m ago), switched to Hosting (5m ago).
> When the user says "this" / "it" / "the URL", assume they mean
> something visible in the current viewport unless they name something else.

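
One way the server side could turn the `uiContext` object into that block is sketched below. The `UiContext` interface is inferred from the example object, and `renderUiContextBlock` is a hypothetical helper name; the wording of the output follows the block quoted above, in plain text and with the full route rather than the abbreviated one.

```typescript
// Types inferred from the example uiContext object; not an existing API.
interface VisibleResource {
  kind: string;
  uuid: string;
  name: string;
}

interface UserAction {
  at: string;
  action: string;
}

interface UiContext {
  currentRoute: string;
  currentTab: string;
  visibleResources: VisibleResource[];
  lastUserActions: UserAction[];
}

// Renders the dynamic system-prompt block, one sentence per line.
function renderUiContextBlock(ctx: UiContext): string {
  const resources =
    ctx.visibleResources.map((r) => `\`${r.name}\``).join(", ") || "none";
  const actions =
    ctx.lastUserActions.map((a) => `${a.action} (${a.at})`).join(", ") || "none";
  return [
    `The user is currently looking at the ${ctx.currentTab} tab (route: \`${ctx.currentRoute}\`).`,
    `Visible resources: ${resources}.`,
    `Recent actions: ${actions}.`,
    `When the user says "this" / "it" / "the URL", assume they mean something visible in the current viewport unless they name something else.`,
  ].join("\n");
}
```

Because these values arrive from the client and end up in the system prompt, a real implementation would presumably validate or truncate them before rendering.
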
**Effort:** ~3 hr. ~1 hr to wire the chat panel's
`uiContext` collection (existing route + tab state, last 5 actions
from a small ring buffer in the panel), ~1 hr to plumb through the
chat API, ~1 hr to add the prompt block.

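
The "small ring buffer" for the last 5 actions could be as simple as the following sketch. `ActionRingBuffer` is a hypothetical name, and returning entries newest-first is an assumption made to match the ordering of `lastUserActions` in the example object.

```typescript
// Fixed-capacity buffer that keeps only the most recent entries.
class ActionRingBuffer<T> {
  private readonly capacity: number;
  private readonly slots: T[] = [];
  private next = 0; // index the next push will write to
  private count = 0; // number of filled slots, capped at capacity

  constructor(capacity: number) {
    if (capacity < 1 || Math.floor(capacity) !== capacity) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Overwrites the oldest entry once the buffer is full.
  push(item: T): void {
    this.slots[this.next] = item;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Returns the stored entries, newest first.
  toArray(): T[] {
    const out: T[] = [];
    for (let i = 1; i <= this.count; i++) {
      const idx = (this.next - i + this.capacity) % this.capacity;
      out.push(this.slots[idx]);
    }
    return out;
  }
}
```

The panel would call `push` on each tracked interaction and send `toArray()` as `lastUserActions` with the next chat message, so memory use stays constant however long the session runs.
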
**Slot into:** Phase 3 (UX surfaces). Pairs with 3.2 (structured
errors in chat) and 3.3 (empty-state nudges).

---

## Gap 4 — Diff preview / accept-changes gate

**Failure observed:** none yet, but the surface is exposed today:
`fs_edit` writes directly to `/workspace` in the dev container. For
ephemeral exploration this is correct (sub-second iteration is the
whole Path B point). For changes destined to ship, the user has no
review surface; they only learn what changed from the AI's after-the-fact summary.

**Why this matters for beta:** the moment a paying user wants to
"see what the AI changed before it goes live," there's nothing to
show them. Cursor's whole UX is built on diffs the user accepts.

**Proposed fix:** two-mode `fs_edit` / `fs_write`:

1. **Direct mode (default for dev container):** write immediately. Current
   behavior. Fine for "make the button blue" iteration.
2. **Staged mode (default when `ship` is the next likely action):**
   write to a shadow path, surface a diff in the chat UI, gate the
   real write on a one-click "Accept" button.

The model could decide which mode to use based on context. A simpler alternative,
which keeps the decision in the harness rather than in the model: stage when
the file is in a "protected" set (e.g. `prisma/schema.prisma`,
`Dockerfile`, `package.json`, anything in `prod/` or `migrations/`),
and write directly otherwise.

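
The protected-set rule is small enough to sketch in full. The file and directory names below are the examples given in the text; `chooseEditMode`, the constant names, and the choice to match protected files at any depth (so a nested `package.json` also stages) are assumptions for illustration.

```typescript
type EditMode = "direct" | "staged";

// Example entries taken from the text; a real list would likely be configurable.
const PROTECTED_FILES = ["prisma/schema.prisma", "Dockerfile", "package.json"];
const PROTECTED_DIRS = ["prod", "migrations"];

function chooseEditMode(workspacePath: string): EditMode {
  // Normalise separators and drop a leading "./" or "/" so paths compare relatively.
  const normalized = workspacePath.replace(/\\/g, "/").replace(/^(\.\/|\/)+/, "");
  const segments = normalized.split("/");

  // Protected file at the root or nested under any directory.
  const isProtectedFile = PROTECTED_FILES.some(
    (file) => normalized === file || normalized.slice(-(file.length + 1)) === "/" + file,
  );

  // Any directory segment (not the file name itself) named prod or migrations.
  const inProtectedDir = segments
    .slice(0, -1)
    .some((segment) => PROTECTED_DIRS.indexOf(segment) !== -1);

  return isProtectedFile || inProtectedDir ? "staged" : "direct";
}
```

Because the rule is a pure function of the path, `fs_edit` and `fs_write` can share it, and the default stays direct for everything outside the protected set.
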
**Effort:** ~6 hr. ~2 hr backend (shadow write + apply endpoint),
~3 hr UI (diff renderer in the chat panel, accept/reject buttons),
~1 hr prompt + tool changes.

**Slot into:** Phase 4 (Onboarding & safety). Pairs with 4.5 (auth
hardening) and 4.6 (compute quotas) as part of "what a stranger
needs day 1."

---

## Suggested sequencing

If we ship in priority order:

1. **Gap 1 first.** Kills the worst pattern in prod for ~2 hr of work. Should be ahead of any new feature in Phase 2.
2. **Gap 2 second.** Closes the verify-deploy loop. Multiplies the value of every subsequent AI-shipped change because the AI is no longer shipping blind.
3. **Gap 3 third.** Tighter conversational UX. Once 1 and 2 work, the remaining UX cliff is "AI doesn't know what I'm looking at."
4. **Gap 4 last.** Only matters once we have paying users editing prod-bound code. Pre-beta optional.

Total effort to ship 1+2+3 (the meaningful UX wins): **~9 hours.**

---

## How this changes BETA_LAUNCH_PLAN.md

Two new tasks slot into P2:

- **P2.8** Tool-error recovery middleware (Gap 1). Blocked on nothing; ship before P2.4.
- **P2.9** Browser-driver MCP tool (Gap 2). Blocked on nothing.

One new task in P3:

- **P3.7** UI-state injection into chat (Gap 3). Blocked on nothing.

Gap 4 stays out of beta scope unless eval reveals real damage from
unstaged edits.