This repository has been archived on 2026-06-07. You can view files and clone it. You cannot open issues or pull requests or push a commit.
Files
master-ai/may-19-ai-review.md

601 lines
21 KiB
Markdown

# Vibn Chat Harness — Fix Checklist
Work through items in order. Each fix has a clear **What**, **Where**,
**How**, and **Verify** section. Don't skip the verify step — many of
these fixes interact with each other and silent failures will
compound.
Mark `[x]` as you complete each item. If you can't complete an item,
add a short note under it explaining why and move on.
---
## Phase 1 — Backend fixes (highest leverage; do these first)
These three fix the failure modes the prompt currently promises but
the backend doesn't deliver. Until they're done, the prompt's hard
rules are partly fiction.
### [ ] 1. Add `sha256` and `bytes` to `fs.write` and `fs.edit` responses
**What:** The prompt's hard rules tell the model to cite `sha256` and
`bytes` as evidence of file changes. The tools don't return those
fields today, so the model is looking for evidence that doesn't exist.
**Where:** `app/api/mcp/route.ts` — functions `toolFsWrite` and
`toolFsEdit`.
**How:**
- In `toolFsWrite`, after the `runFsCmd` success branch, before
returning, compute the sha256 of `content` and return it alongside
`bytesWritten` renamed to `bytes`:
```ts
import { createHash } from "crypto";
// ...
const bytes = Buffer.byteLength(content, "utf8");
const sha256 = createHash("sha256").update(content, "utf8").digest("hex");
return NextResponse.json({
result: { ok: true, path, bytes, sha256 },
});
```
- In `toolFsEdit`, you don't have the final content in memory. Add a
second command that prints the sha + bytes after the edit:
```ts
const cmd = `python3 -c "$(printf %s ${shq(pyB64)} | base64 -d)" <<< "$(printf %s ${shq(b64)} | base64 -d)" && echo "---" && sha256sum ${shq(path)} | cut -d' ' -f1 && wc -c < ${shq(path)}`;
```
Then parse the trailing two lines after `---` for sha and bytes.
- Update the response shape:
```ts
return NextResponse.json({
result: { ok: true, path, replacements, bytes, sha256 },
});
```
**Verify:**
- [ ] Call `fs_write` with `{ path: "test.txt", content: "hello" }`.
Confirm response contains `sha256` (64 hex chars) and `bytes: 5`.
- [ ] Call `fs_edit` to change the same file. Confirm response
contains a new `sha256` and updated `bytes`.
- [ ] Replay a turn that does `fs_write` followed by `fs_read` of the
same file in chat. The model should now produce text like
"Updated `test.txt` (sha=a3f5c2…, 5b)" instead of a bare claim.
---
### [ ] 2. Add project-slug scoping to `normalizeFsPath`
**What:** The prompt tells the model to use paths like `src/app/page.tsx`
and claims the tool layer rejects writes outside the project root.
The tool layer does NOT do this today. It resolves all relative paths
under `/workspace` (workspace-level), so `fs_write { path: "src/app/page.tsx" }`
ends up at `/workspace/src/app/page.tsx` — the ghost file from the
failing session. Five different path conventions were used for the
same file in one session because nothing enforces the rule.
**Where:** `app/api/mcp/route.ts` — function `normalizeFsPath` and
every caller in `toolFsRead`, `toolFsWrite`, `toolFsEdit`,
`toolFsList`, `toolFsDelete`, `toolFsGlob`, `toolFsGrep`, `toolFsTree`,
`toolRequestVisualQA`, `toolGenerateMedia`.
**How:**
- Change `normalizeFsPath` to accept an optional `projectSlug`:
```ts
function normalizeFsPath(
p: string,
projectSlug?: string,
): string | NextResponse {
if (!p || typeof p !== "string") {
return NextResponse.json(
{ error: 'Param "path" is required' },
{ status: 400 },
);
}
const projectRoot = projectSlug ? `${FS_ROOT}/${projectSlug}` : FS_ROOT;
let abs: string;
if (p.startsWith("/")) {
abs = p;
} else {
abs = `${projectRoot}/${p}`.replace(/\/+/g, "/");
}
const norm = abs.replace(/\/[^/]+\/\.\.(?=\/|$)/g, "").replace(/\/+/g, "/");
// When projectSlug is set, REJECT paths outside the project root.
if (projectSlug) {
if (!norm.startsWith(projectRoot) && norm !== projectRoot) {
return NextResponse.json(
{
ok: false,
error: `PATH_OUTSIDE_PROJECT: path "${p}" resolves to "${norm}" which is outside the active project at "${projectRoot}". Did you mean "${projectRoot}/${p.replace(/^\/+/, "")}"?`,
},
{ status: 400 },
);
}
} else {
// Workspace-level fallback (legacy behaviour)
if (!norm.startsWith(FS_ROOT) && norm !== FS_ROOT) {
return NextResponse.json(
{ error: `Path "${p}" is outside ${FS_ROOT}; use shell.exec for system paths.` },
{ status: 400 },
);
}
}
return norm;
}
```
- In every fs_* tool that already calls `resolveProjectOr404`, pass
the slug:
```ts
const path = normalizeFsPath(String(params.path ?? ""), project.slug);
```
- `toolFsRead`, `toolFsWrite`, `toolFsEdit`, `toolFsDelete`,
`toolRequestVisualQA`, `toolGenerateMedia` all already have
`project` in scope — pass `project.slug`.
- `toolFsList`, `toolFsGlob`, `toolFsGrep`, `toolFsTree` use
`params.path` or `params.cwd` — same treatment, pass `project.slug`.
**Verify:**
- [ ] From a project-scoped thread, call `fs_write { path: "/workspace/src/app/page.tsx", content: "x" }`.
Expect `PATH_OUTSIDE_PROJECT` error.
- [ ] From the same thread, call `fs_write { path: "src/app/page.tsx", content: "x" }`.
Expect success at `/workspace/<slug>/src/app/page.tsx`.
- [ ] Confirm `dev_server_start` with `command: "npm run dev"` runs
from the project root, not `/workspace`. (This is mostly already
true via dev-container logic; just confirm.)
---
### [ ] 3. Fix the broken `plan-extract` block
**What:** The fire-and-forget `plan-extract` block in
`app/api/chat/route.ts` has a syntax error — the `try` block builds a
transcript and then hits `}` followed by `catch` with no actual call
to `autoExtractPlanUpdates`. The body of the try is missing. Either
the auto-extraction was intentionally removed (in which case the
dead transcript-building code should also be deleted) or it was
accidentally truncated (in which case the call needs to be restored).
**Where:** `app/api/chat/route.ts`, around the second fire-and-forget
block (after the title/summary block, before `emit({ type: "done" })`).
**How:**
- Decide first: do we want auto-extraction to run? If YES, restore
the call:
```ts
(async () => {
try {
if (!threadProjectId) return;
const allMessages = [...history, finalMsg];
if (allMessages.length < 2) return;
const transcript = allMessages
.map((m) => {
const text =
typeof m.content === "string"
? m.content
: JSON.stringify(m.content);
return `${m.role.toUpperCase()}: ${text.slice(0, 1200)}`;
})
.join("\n\n");
const result = await autoExtractPlanUpdates(
threadProjectId,
transcript,
);
if (result) {
console.log(
"[chat] plan-extract:",
`${result.tasks} tasks, ${result.decisions} decisions, vision=${result.vision}`,
);
}
} catch (err) {
console.warn("[chat] plan-extract failed (non-fatal):", err);
}
})().catch(() => {});
```
And re-add the `import { autoExtractPlanUpdates } from "@/lib/ai/plan-extract";`
at the top of the file.
- If NO (you removed it intentionally), delete the entire IIFE
including the transcript-building so the file compiles cleanly.
**Verify:**
- [ ] Run `tsc --noEmit` on the file. Confirm no syntax errors.
- [ ] If auto-extraction restored: have a chat that mentions a
decision (e.g. "let's use Postgres"). Confirm a new entry appears
in the project's `plan.decisions` with `confidence: "auto"`.
- [ ] Tail prod logs for `[chat] plan-extract:` — should fire on
every turn with content.
---
## Phase 2 — Prompt fixes (now that the backend matches)
These bring the prompt into line with what the tools actually do.
### [ ] 4. Fix the `apps_containers_list` typo in the prompt
**What:** The troubleshooting section references `apps_containers_list`
but the actual tool is `apps_containers_ps`. The model will call a
tool that doesn't exist.
**Where:** `app/api/chat/route.ts`, inside `buildSystemPrompt`, in the
"## Troubleshooting" section.
**How:**
- Find: `apps_logs { uuid } + apps_containers_list { uuid }`
- Replace: `apps_logs { uuid } + apps_containers_ps { uuid }`
**Verify:**
- [ ] Grep the prompt for `apps_containers_list` — no matches.
- [ ] Grep for `apps_containers_ps` — should appear in
troubleshooting and at least once in the apps section.
---
### [ ] 5. Soften the `ok` field rule
**What:** The current rule says "If `ok` is false (or absent, or
`exitCode` is non-zero, or `healthCheck.status` is >= 400) the
operation FAILED." The "or absent" clause is wrong — many tools
return data without an `ok` field (e.g. `projects_get`, `apps_list`,
`databases_get`). The model will treat every read as a failure.
**Where:** `buildSystemPrompt`, "Hard rules" section, the "Trust the
`ok` field" bullet.
**How:**
Replace the current rule with:
```
- **Read tool results carefully.** A tool FAILED when ANY of these
signals are present: `ok: false`, `error: "..."`, a non-zero
`exitCode`, or a `healthCheck.status` >= 400. If NONE of those
signals are present, look at the actual content of the response
to decide whether the operation succeeded. Many read-only tools
return data directly without an `ok` field — that's not a failure.
```
**Verify:**
- [ ] Pick a recent thread where the agent called `projects_get` or
`apps_list`. Confirm the agent didn't treat the response as a
failure (look at its post-tool text — should be a normal summary,
not "the operation failed").
---
### [ ] 6. Tighten the status-nudge threshold
**What:** Current thresholds are `roundsSinceText >= 8 ||
toolCallsSinceText >= 12`. With `MAX_TOOL_ROUNDS = 8`, the round-based
nudge can never fire (loop ends first). The tool-call threshold of 12
is also too lenient — users typed "test" / "hello" by round 4-5 of
silence in the failing session.
**Where:** `app/api/chat/route.ts`, near the top of the main while
loop, the `isSilent` constant.
**How:**
```ts
const isSilent = roundsSinceText >= 3 || toolCallsSinceText >= 6;
```
**Verify:**
- [ ] Replay a chat that triggers 6+ tool calls without text. Confirm
the `[STATUS NUDGE]` system addendum is injected before the next
round.
- [ ] Confirm the model produces a one-line status sentence in
response to the nudge.
---
### [ ] 7. Update the path-convention guidance in the prompt
**What:** After Fix 2 ships, the path convention is now enforced. The
prompt should state this plainly without the "cd into your project"
workaround.
**Where:** `buildSystemPrompt`, inside `activeBlock`, the path
guidance section. Also inside "Dev servers" → "Directory" bullet.
**How:**
Replace the "Directory" bullet under "Dev servers":
```
- **Directory:** Tool paths are scoped to your project root
automatically. Pass `command: "npm run dev"` directly — no `cd`
prefix needed. The tool rejects any `fs_*` write outside
`/workspace/<slug>/`.
```
And after the "Project repo is auto-cloned" paragraph in
`activeBlock`, add:
```
**Path convention for fs_* tools:** Pass paths relative to the
project root — `src/app/page.tsx`, NOT `/workspace/<slug>/src/app/page.tsx`
and NOT `<slug>/src/app/page.tsx`. The tool layer rejects writes
outside the project root with a `PATH_OUTSIDE_PROJECT` error
suggesting the corrected path.
```
**Verify:**
- [ ] In a fresh chat, ask the agent to "edit the homepage". Confirm
the first `fs_read` call uses `src/app/page.tsx` (no slug prefix,
no `/workspace/` prefix).
---
## Phase 3 — Polish and safety nets
These are lower-priority but each removes a small foot-gun.
### [ ] 8. Add `fs_tree` recommendation to first-turn behavior
**What:** The agent ran `fs_tree` 5 times and `fs_glob` 9+ times in
the failing session, re-discovering paths it should have learned once.
The tool description already says "ALWAYS call this first" but the
prompt doesn't reinforce it.
**Where:** `buildSystemPrompt`, in the "Writing code" section.
**How:**
Add this near the top of "Writing code — dev container is the default":
```
**Orient yourself once.** On the first code-modifying turn of a
chat, call `fs_tree` once to learn the repo layout. Don't re-run it
on every turn — the layout doesn't change between user messages.
```
**Verify:**
- [ ] Manual review of the next 5 sessions: confirm `fs_tree` is
called at most once per chat (not per turn).
---
### [ ] 9. Add `browser_navigate` and `browser_console` as verification primitives
**What:** The backend has `browser_navigate` and `browser_console`
tools that headlessly render a page and capture console errors. The
prompt never mentions them. These are the missing post-deploy
verification step that the `healthCheck` field gestures at.
**Where:** `buildSystemPrompt`, in the "Dev servers" subsection or as
its own section after "Visual QA".
**How:**
Add a new bullet right after "Visual QA" guidance:
```
**Verify the page actually renders:**
- After `dev_server_start` returns a `previewUrl` AND `healthCheck.status === 200`,
for any UI-facing turn, call `browser_console { url: previewUrl }` to
capture frontend console errors. Hydration errors, missing assets,
and uncaught exceptions show up here even when the server is
technically "running".
- If `browser_console` returns errors, fix them with `fs_edit`
before declaring done. A green `healthCheck` plus a clean console
is the real "done" signal for UI work.
- Skip this for backend / SQL / config-only changes.
```
**Verify:**
- [ ] On the next UI-modifying chat, confirm `browser_console` is
called once after `dev_server_start`.
- [ ] Confirm any errors it returns get acknowledged in the agent's
reply.
---
### [ ] 10. Add the market research stack to the prompt
**What:** The backend exposes six market research tools
(`market_categories_suggest`, `market_research_run`, `market_seo_analyze`,
`tech_stack_analyze`, `market_competitor_research`,
`market_aggregate_insights`). The prompt never mentions them. For a
non-technical-founder product, these are some of the highest-leverage
tools — they answer "should I build for dentists or summer camps?"
with real TAM counts.
**Where:** `buildSystemPrompt`, add a new section after "Common
questions → tools" or before "How to deploy".
**How:**
Add this section:
```
## Helping the user pick what to build
Vibn has a market-research toolkit for non-technical founders who
need data on their target niche. Use it when the user is undecided,
validating an idea, or comparing markets:
- **"How big is the market for X in <location>?"** → `market_categories_suggest { niche }` to
propose Google Business categories, then `market_research_run` after
the user approves. Returns TAM count, sample domains, and review
data. NOTE: `market_research_run` costs real money — always confirm
with the user and pass `user_explicitly_approved: true`.
- **"What are competitors spending on Google Ads?"** → `market_seo_analyze { domain }`.
Returns organic traffic, paid traffic, ad spend, and top paid
keywords. Use to tell the user how aggressive a market is.
- **"What software do these businesses already use?"** → `tech_stack_analyze { urls, software_category_id }`.
Detects WordPress, Shopify, named competitors, and any custom
domains/scripts you pass. Use to find "X businesses use WordPress
but lack Y" market gaps.
- **"What are customers complaining about?"** → `market_aggregate_insights { category, location }`.
Returns top review topics — use the actual words customers use as
marketing copy and value-prop seeds.
- **"Who are the players in this niche?"** → `market_competitor_research { niche }`.
Returns proprietary competitors with pricing AND open-source
alternatives that could be forked.
These are conversational research tools — they don't build anything.
Use them BEFORE scaffolding when the user is exploring direction;
SKIP them once the user has committed to building.
```
**Verify:**
- [ ] In a fresh chat, ask the agent "should I build for dentists or
summer camps?". Confirm it proposes using `market_categories_suggest`
or `market_aggregate_insights` rather than guessing.
---
### [ ] 11. Surface `apps_exec`, `auth_create`, `generate_media`, `storage_*` briefly
**What:** Several capabilities the agent has are completely absent
from the prompt. The agent doesn't know it can run commands inside
production containers, deploy real auth servers, generate images, or
wire S3 storage. One-line mentions are enough.
**Where:** `buildSystemPrompt`, scattered across existing sections.
**How:**
- Under "Common questions → tools", add:
```
- "Run a migration / psql in prod" → `apps_exec { uuid, command }`.
- "Generate a hero image / illustration" → `generate_media { prompt, type, outputPath }`.
- "Wire up file storage / uploads" → `storage_provision` (if not already), then `storage_inject_env { uuid }`.
```
- In the "Decision defaults" section, augment the auth bullet:
```
- **Auth:** NextAuth with email magic-link for in-app auth.
Deploy a separate Pocketbase / Authentik / Keycloak service via
`auth_create { provider }` only if the user needs SSO, multi-app
SSO, or admin user management.
```
**Verify:**
- [ ] Ask the agent "can you run a SQL migration on prod?". Confirm
it references `apps_exec`.
- [ ] Ask "I need a hero image for the landing page". Confirm it
references `generate_media`.
---
### [ ] 12. Flip the `fs_edit` preference to match the tool's documentation
**What:** The tool description says `startLine`/`endLine` is
preferred and `oldString` is the fallback. The prompt currently says
the opposite ("prefer `oldString` for small replacements"). The tool
is authoritative — match it.
**Where:** `buildSystemPrompt`, in the "Iterate" bullet under "Writing
code", the `fs_edit` guidance.
**How:**
Replace the current `fs_edit` guidance with:
```
- `fs_read` / `fs_write` / `fs_edit { path, oldString, newString, startLine, endLine }`.
**For `fs_edit`:** prefer `startLine`/`endLine` (deterministic; never
fails on duplicate strings). Use `oldString` only when you cannot
read the file first to get line numbers — and when you do, include
2-3 lines of surrounding context for uniqueness. If `fs_edit` keeps
failing, do NOT escape to `shell_exec` with patch scripts — read
the file fresh with `fs_read`, get the line numbers, and try again.
```
**Verify:**
- [ ] Confirm next 3 `fs_edit` calls in the wild use `startLine`/`endLine`,
not `oldString`.
---
## Phase 4 — Monitoring (after Phases 1-3 land)
Once the changes are in production, watch these for a week. Tune if
the numbers don't move.
### [ ] 13. Track the recovery-summary fire rate
**What:** The `needsRecovery` path runs when a turn ends badly (hit
the round cap, hit a loop, or last tool returned failure). It should
fire on <10% of turns. If it fires more often, the cap is too low or
the model is hitting real bugs.
**How:** Add a metric. In the `needsRecovery` block, before calling
`callVibnChat`, emit:
```ts
console.log("[chat] recovery_fired", {
turnId,
reason: loopBreakReason ? "loop"
: round >= MAX_TOOL_ROUNDS ? "round_cap"
: assistantText.trim().length === 0 ? "no_text"
: "tool_failure",
toolCalls: assistantToolCalls.length,
});
```
**Verify:**
- [ ] Aggregate over 1 week. If `round_cap` is the dominant reason
and the turns look legitimate, raise `MAX_TOOL_ROUNDS` to 10 or 12.
- [ ] If `loop` is dominant, the fingerprinter may need tuning.
---
### [ ] 14. Track conversational-guard fire rate
**What:** The first-turn conversational guard (regex match on user
message → no tools on round 1) is the biggest single behavioral
change in this revision. We want to know how often it fires and
whether it ever causes a problem.
**How:** Before the loop, after computing `firstMessageIsConversational`:
```ts
console.log("[chat] turn_start", {
turnId,
firstMessageIsConversational,
messagePreview: message.trim().slice(0, 80),
});
```
**Verify:**
- [ ] Aggregate over 1 week. Should fire on ~30-40% of first messages
in a chat.
- [ ] Spot-check 10 cases where it fired: confirm none were
legitimate "build me X" requests being miscategorized.
---
### [ ] 15. Track `PATH_OUTSIDE_PROJECT` rejections
**What:** After Fix 2 ships, this error tells you how often the model
was about to write to the wrong path. Should taper toward zero as the
prompt guidance + enforcement settles in.
**How:** Server-side log in `normalizeFsPath` when the project-scoped
rejection fires.
**Verify:**
- [ ] Week 1: count rejections per day.
- [ ] Week 2: count should be lower (model has internalized the rule
via the error messages it's seen).
---
## What this checklist deliberately doesn't include
A few things from earlier reviews that I'm intentionally leaving off:
- **Phase-aware behavior.** Phases were removed from the product; the
prompt no longer references them. No work needed.
- **Codebase summary auto-generation.** Lower priority once Fix 2
ships (paths can no longer drift) and Fix 8 lands (one `fs_tree`
per chat instead of nine).
- **Tool history reconstruction for DeepSeek compatibility.** Already
shipped in the current code via the `_rawToolResults` compact
summary. No additional work.
Don't add work for these unless a clear failure mode appears in
production after Phases 1-3 land.