Files
vibn-frontend/suggested-fixes.md

718 lines
27 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Vibn AI Chat Route — Fix Instructions
This document specifies fixes to the `POST /api/chat` route and the tools it
orchestrates. The goal is to eliminate the "hallucinated tool calls" failure
mode where the assistant confidently claims to have edited files, restarted
servers, and applied design changes — but in reality wrote to a ghost path,
left a zombie process, or never invoked the tool at all.
Each fix is self-contained, numbered, and ends with verification steps. Apply
in order. Fixes 1 and 2 are the root causes; the rest are guardrails that
only become reliable after 1 and 2 land.
---
## Fix 1 — Make `dev_server_start` and `fs_*` tools project-path-aware
### Problem
The system prompt instructs the model to `cd <slug> &&` before every dev
server command and to write files under `/workspace/<slug>/`. Nothing
enforces this. When the model forgets — or hallucinates that it
remembered — `fs_write` silently creates a file at `/workspace/src/app/page.tsx`
while the running app lives at `/workspace/<slug>/src/app/page.tsx`. The
file exists; it just has zero relationship to the server.
### Required changes
In the tool implementations (likely in `@/lib/ai/vibn-tools` or wherever
`executeMcpTool` dispatches):
1. **Add a `projectSlug` parameter to the tool execution context.** Plumb
`activeProject.slug` from the chat route into `executeMcpTool`. It is
already passed `activeProject?.id`; add slug alongside.
2. **For `fs_read`, `fs_write`, `fs_edit`, `fs_glob`, `fs_grep`, `fs_list`,
`fs_delete`:** when `projectSlug` is set on the execution context, resolve
the `path` argument as follows:
- If `path` starts with `/workspace/<slug>/`, accept as-is.
- If `path` is relative (no leading slash), resolve relative to
`/workspace/<slug>/`.
- If `path` starts with `/workspace/` but **not** `/workspace/<slug>/`,
**reject the call** with an error:
`PATH_OUTSIDE_PROJECT: refusing to write to '<path>' because it is
outside the active project at '/workspace/<slug>/'. Did you mean
'/workspace/<slug>/<remainder>'?`
- If `path` starts with any other absolute path, reject the same way.
3. **For `dev_server_start`:** when `projectSlug` is set on the execution
context, automatically prepend `cd /workspace/<slug>/ && ` to the command
if the command does not already start with `cd /workspace/<slug>` or
`cd <slug>`. The model's prompt-level instruction to `cd` first becomes
belt-and-suspenders; the tool guarantees the working directory.
4. **For `shell_exec`:** do NOT auto-prepend `cd` (the model legitimately runs
commands at `/workspace` root for git ops on multiple repos). Instead,
when `projectSlug` is set and the command does not contain `cd <slug>` or
`cd /workspace/<slug>` anywhere, append a one-time warning to the tool
result: `[NOTE] You ran a shell command without cd'ing into
/workspace/<slug>. If you intended to operate on the active project, the
next call should start with 'cd <slug> && '.` Do not block — just inform.
### Verification
- Call `fs_write { path: '/workspace/src/app/page.tsx', content: 'x' }` from
a project-scoped thread. Expect `PATH_OUTSIDE_PROJECT` error.
- Call `fs_write { path: 'src/app/page.tsx', content: 'x' }` from a
project-scoped thread. Expect success at `/workspace/<slug>/src/app/page.tsx`.
- Call `dev_server_start { command: 'npm run dev', port: 3000 }` from a
project-scoped thread. Expect the underlying exec to run from
`/workspace/<slug>/`, not `/workspace/`.
---
## Fix 2 — Add verification data to mutation tool results
### Problem
The model claims "I just rewrote `page.tsx`" with no way for itself or the
harness to verify it. The next turn, the model sees its own confident claim
in history and treats it as fact. Hallucination compounds across turns.
### Required changes
Edit the return shape of these tools (likely in `@/lib/ai/vibn-tools`):
1. **`fs_write` and `fs_edit`** must return a JSON object including:
```json
{
"ok": true,
"path": "/workspace/<slug>/src/app/page.tsx",
"bytes": 4821,
"sha256": "a3f5...",
"mtime": "2026-05-16T18:42:11Z",
"preview": "first 200 chars of the new file content"
}
```
On failure, return `{ "ok": false, "error": "..." }` with no `sha256`.
2. **`dev_server_start`** must, on successful launch, perform a server-side
`curl -sS -o - -w '\n[status=%{http_code}]' http://localhost:<port>/`
with a 5s timeout, and include the result in the tool response:
```json
{
"ok": true,
"previewUrl": "https://...",
"port": 3000,
"healthCheck": {
"status": 200,
"bodyBytes": 18432,
"bodyHash": "b7c2...",
"bodySnippet": "<!DOCTYPE html><html..."
}
}
```
If the curl fails or returns non-2xx, set `ok: false` and surface
`healthCheck.status` plus the body snippet (often contains the Next.js
error overlay HTML, which is gold for diagnosis).
3. **`shell_exec`** already returns stdout/stderr; ensure exit code is
present as a top-level `exitCode` field, not buried in a string.
### Verification
- After a successful `fs_write`, the tool result must contain a `sha256`
field. Re-read the same file with `fs_read` and confirm the content hash
matches.
- After a successful `dev_server_start`, the tool result must contain a
`healthCheck.status` of 200 and a `bodyHash`. If the server is actually
serving a stale 502, `status` must reflect that — not be reported as
success.
---
## Fix 3 — Feed verified tool history back into the model's context
### Problem
In `/api/chat`, the history-loading code strips `toolCalls` from historical
assistant messages and never reloads tool results:
```js
if (msg.role === "assistant" && msg.toolCalls?.length) {
msg.toolCalls = undefined;
}
```
So on turn N+1 the model sees its own past prose ("I rewrote `page.tsx`")
with no record of whether the underlying tool actually fired. The model's
own claims become its only "evidence" of past actions.
### Required changes
In the chat route, in the history-loading block:
1. Keep the `toolCalls = undefined` stripping for OpenAI-compatibility
reasons (the comment is correct — DeepSeek rejects orphan tool_calls).
But **replace it with a compact summary appended to the assistant's
content**, before the trim:
```js
if (msg.role === "assistant" && msg.toolCalls?.length) {
const rawResults = msg._rawToolResults ?? [];
const summary = msg.toolCalls.map((tc) => {
const tr = rawResults.find((r) => r.name === tc.name);
let resultSig = "(no result captured)";
if (tr) {
try {
const parsed = typeof tr.result === "string"
? JSON.parse(tr.result)
: tr.result;
if (parsed && typeof parsed === "object") {
if (parsed.ok === false) {
resultSig = `ERROR: ${parsed.error ?? "unknown"}`;
} else if (parsed.sha256) {
resultSig = `ok bytes=${parsed.bytes} sha=${parsed.sha256.slice(0, 8)}`;
} else if (parsed.previewUrl) {
resultSig = `ok previewUrl=${parsed.previewUrl} health=${parsed.healthCheck?.status ?? "?"}`;
} else if (parsed.uuid) {
resultSig = `ok uuid=${parsed.uuid}`;
} else {
resultSig = "ok";
}
}
} catch {
resultSig = String(tr.result).slice(0, 80);
}
}
const argSig = JSON.stringify(tc.args ?? {}).slice(0, 100);
return ` - ${tc.name}(${argSig}) → ${resultSig}`;
}).join("\n");
const suffix = `\n\n[tools executed this turn:\n${summary}\n]`;
msg.content = (msg.content ?? "") + suffix;
msg.toolCalls = undefined;
}
```
2. This requires `_rawToolResults` to be persisted on every assistant
message — the code already does this (see the `_rawToolResults` block
near the end of the stream handler). No schema change needed.
### Verification
- Send a message, let the model edit a file, then send a follow-up. In
the second turn's history (log it server-side or inspect via the model's
reasoning), the assistant's previous message must end with a
`[tools executed this turn: ...]` block listing the actual tool calls
and their `sha256` / `previewUrl` / error result.
- Confirm DeepSeek and Gemini both accept the modified history without
the orphan-tool_calls error.
---
## Fix 4 — Broaden loop detection to semantic patterns
### Problem
Current detection fingerprints `tool_name|args` and bails on 3 identical
calls. In real loops the args differ slightly: `shell_exec pkill -f next`
→ `shell_exec npm run dev` → `shell_exec curl localhost:3000` → repeat.
Different `command` strings, no fingerprint collision, loop runs to the
30-round cap.
### Required changes
In `/api/chat`, replace the fingerprint construction:
```js
function fingerprintToolCall(tc) {
if (tc.name === "shell_exec") {
const cmd = String(tc.args?.command ?? "").trim();
// First non-cd verb (pkill, npm, curl, etc.)
const verb = cmd
.split("&&")
.map((s) => s.trim())
.find((s) => !s.startsWith("cd "))
?.split(/\s+/)[0] ?? "shell";
return `shell_exec:${verb}`;
}
if (tc.name === "fs_write" || tc.name === "fs_edit" || tc.name === "fs_read") {
return `${tc.name}:${tc.args?.path ?? ""}`;
}
if (tc.name === "dev_server_start" || tc.name === "dev_server_stop" ||
tc.name === "dev_server_logs" || tc.name === "dev_server_list") {
return `dev_server:${tc.args?.port ?? "?"}`;
}
if (tc.name === "apps_get" || tc.name === "apps_logs" ||
tc.name === "apps_deploy" || tc.name === "apps_unstick") {
return `${tc.name}:${tc.args?.uuid ?? ""}`;
}
const argSig = JSON.stringify(tc.args ?? {}).slice(0, 80);
return `${tc.name}:${argSig}`;
}
```
Then in the loop body:
```js
for (const tc of resp.toolCalls) {
toolFingerprints.push(fingerprintToolCall(tc));
}
// Sliding window of 10 (was 8); threshold 3 stays the same
const window = toolFingerprints.slice(-10);
const counts = new Map();
for (const fp of window) counts.set(fp, (counts.get(fp) ?? 0) + 1);
const repeated = [...counts.entries()].find(([, n]) => n >= 3);
if (repeated) {
loopBreakReason = `Repeated ${repeated[0]} ${repeated[1]}× in last 10 calls`;
}
```
### Verification
- Simulate the log's failure pattern: `shell_exec pkill -f next` →
`shell_exec npm run dev` → `shell_exec curl localhost:3000` → repeat
three times. The new fingerprinter should report
`Repeated shell_exec:pkill 3× in last 10 calls` and trigger
`loopBreakReason`.
- Confirm a legitimate fan-out (e.g. 5 different `fs_write` to 5
different files) does NOT trigger because each fingerprint is unique.
---
## Fix 5 — Lower silent-stretch nudge threshold to 2 rounds OR 5 tool calls
### Problem
Current `roundsSinceText >= 4` is too lenient. By round 4 the user has
already typed "are you there?" — exactly what happened at messages
1222 of the failing session. Also, `roundsSinceText` counts rounds,
not tool calls: one round with 8 parallel `fs_write` calls still reads
as a single round.
### Required changes
In `/api/chat`, track both round-count and tool-count since last text:
```js
let roundsSinceText = 0;
let toolCallsSinceText = 0;
```
After each model response:
```js
if (resp.text) {
// ... existing emit ...
roundsSinceText = 0;
toolCallsSinceText = 0;
} else if (resp.toolCalls.length) {
roundsSinceText++;
toolCallsSinceText += resp.toolCalls.length;
}
```
And update the nudge condition:
```js
const isSilent = roundsSinceText >= 2 || toolCallsSinceText >= 5;
let extraSystem = isSilent
? "\n\n[STATUS NUDGE] You have run "
+ `${toolCallsSinceText} tool call(s) over ${roundsSinceText} round(s) `
+ "without sending the user any text. Before any more tool calls, "
+ "send ONE short sentence describing what you are currently working "
+ "on and why. The user is staring at silent tool pills."
: "";
```
### Verification
- Trigger a turn where the model fires 6 parallel tool calls in one round
with no text. The next round must receive the `[STATUS NUDGE]` system
injection.
- Trigger a turn where the model fires 1 tool call per round across 3
rounds with no text. The third round must receive the nudge.
---
## Fix 6 — Broaden recovery-summary trigger beyond `assistantText.length === 0`
### Problem
The post-loop recovery summary only fires when the model produced **no
text at all**. In the failing log, the model produced lots of confident
text — it just happened to be wrong about what its tools did. The
recovery path never engaged.
### Required changes
In `/api/chat`, replace the `needsRecovery` condition:
```js
// Inspect the last 3 tool results for any explicit failure (ok: false,
// exit code != 0, http status >= 400). If the model wrote text but its
// final operations failed, we still want a forced honest summary.
function lastToolResultsHadFailure(messages, lookback = 3) {
const toolMsgs = messages.filter((m) => m.role === "tool").slice(-lookback);
for (const tm of toolMsgs) {
const raw = typeof tm.content === "string" ? tm.content : "";
try {
const parsed = JSON.parse(raw);
if (parsed.ok === false) return true;
if (typeof parsed.exitCode === "number" && parsed.exitCode !== 0) return true;
if (parsed.healthCheck?.status && parsed.healthCheck.status >= 400) return true;
if (typeof parsed.error === "string" && parsed.error.length > 0) return true;
} catch {
// non-JSON result, skip
}
}
return false;
}
const lastTurnHadTools =
messages.length > 0 && messages[messages.length - 1].role === "tool";
const needsRecovery =
!aborted &&
lastTurnHadTools &&
(round >= MAX_TOOL_ROUNDS ||
!!loopBreakReason ||
assistantText.trim().length === 0 ||
lastToolResultsHadFailure(messages));
```
When the recovery summary fires due to `lastToolResultsHadFailure`, the
injected `[RECOVERY]` system message should also note the failure:
```js
const failureNote = lastToolResultsHadFailure(messages)
? "Your last tool calls returned failures or non-2xx health checks. "
+ "Do NOT claim those operations succeeded. "
: "";
const reason = loopBreakReason
? `LOOP DETECTED: ${loopBreakReason}. Stop trying that approach. `
: round >= MAX_TOOL_ROUNDS
? "You hit the tool-round cap. "
: "";
// ... use `${reason}${failureNote}` in the [RECOVERY] prompt ...
```
### Verification
- Stub `dev_server_start` to return `{ ok: false, error: "EADDRINUSE" }`.
Send a request that causes the model to call it. Confirm a recovery
summary fires even if the model already wrote upbeat text.
- Confirm the recovery prompt includes the failure-note text in that case.
---
## Fix 7 — Lower `MAX_TOOL_ROUNDS` from 30 to 15
### Problem
30 rounds is too generous. In the failing log, none of the painful turns
that went past round 15 produced anything useful — they were just the
model trying to recover from its own hallucinations. The cap should
encourage handoff to the user, not extended thrashing.
### Required changes
In `/api/chat`:
```js
const MAX_TOOL_ROUNDS = 15;
```
The "[WARNING] You only have N tool calls left" injection at
`MAX_TOOL_ROUNDS - round <= 3` will naturally fire at round 12, which
is still plenty of headroom for the legitimate Path-B scaffold workflows
(scaffold → install → configure → start = ~7 tool calls).
### Verification
- Confirm a normal "build me a Next.js app and start the dev server"
request completes in well under 15 rounds.
- Confirm the warning prompt fires at round 13 (cap minus 3 minus 1
due to comparison being `<= 3` with round already incremented).
---
## Fix 8 — Add a hard-rule prompt clause forbidding unverified mutation claims
### Problem
The current "Never fake success" prompt clause is vibes, not a checkable
rule. The model has no instruction telling it *how* to know whether it
faked success.
### Required changes
In `buildSystemPrompt`, in the "Hard rules (non-negotiable)" section,
add the following clauses:
```
- **Cite the tool result, don't claim from memory.** Before stating
"I edited X" or "the server is running," you must point to a tool
result from THIS turn. If you can't, say "I have not yet made that
change — running the tool now" and then run it. A claim without a
citable tool result is a hallucination.
- **Trust the `ok` field.** Tool results carry an explicit `ok:
true|false`. If `ok` is false (or absent, or `exitCode` is non-zero,
or `healthCheck.status` is >= 400), the operation FAILED. Do not
describe a failed operation as successful. Report the error
verbatim.
- **`fs_write` and `fs_edit` results carry a `sha256` and `bytes`
field on success.** When you tell the user a file was changed,
include the byte count or the first 6 chars of the sha as evidence:
"Updated `page.tsx` (4.8kb, sha=a3f5c2…)." This protects both of
you from drift.
- **`dev_server_start` results carry a `healthCheck` field on
success.** Before telling the user "the preview is ready," confirm
`healthCheck.status === 200`. If it's 502 or empty, the server
isn't actually serving — report that, don't paper over it.
```
### Verification
- Inspect a sample of model outputs after this fix lands. They should
now include phrases like "Updated `page.tsx` (sha=a3f5c2…)" or
"Preview is up (health 200)" rather than bare "I just rewrote
page.tsx."
- Send a request that causes `dev_server_start` to return
`healthCheck.status: 502`. The model must NOT say "preview is
ready" — it must surface the 502.
---
## Fix 9 — Make `autoExtractPlanUpdates` non-authoritative until user-confirmed
### Problem
`autoExtractPlanUpdates` is fire-and-forget after every turn and writes
directly into `plan.decisions`, which the next turn's system prompt then
treats as a hard "DO NOT re-litigate" instruction. A single misfire (e.g.
the model says "I logged this as our Global Design System" → extraction
agrees → next turn locks in a decision that was never user-confirmed)
poisons the project state durably.
### Required changes
1. In the plan schema, add a `confidence` and `source` field to each
decision/task/idea written by auto-extraction:
```ts
{ title, choice, why, confidence: "auto" | "user", source: "extracted" | "explicit" }
```
2. In `autoExtractPlanUpdates`, write new items with `confidence: "auto"`,
`source: "extracted"`.
3. In `buildSystemPrompt`, in `decisionsBlock`, split decisions into two
lists:
```
**Decisions confirmed by the user (do not re-litigate):**
- <items where confidence === "user">
**Decisions inferred from prior chat (treat as tentative — confirm
with the user before relying on them):**
- <items where confidence === "auto">
```
4. When the user explicitly confirms (the model calls `plan_decision_log`
with the user's exact words, or the user clicks a confirm UI in the
Plan tab), upgrade `confidence` to `"user"`.
### Verification
- Trigger an auto-extraction that writes a decision. Confirm it appears
in the "tentative" list in the next turn's system prompt, not the
authoritative list.
- Have the user confirm the decision. Confirm `confidence` flips to
`"user"` and it moves to the authoritative list.
---
## Fix 10 — Surface auto-commit result in the user reply when something shipped
### Problem
`commitAndPushIfDirty` runs fire-and-forget AFTER the assistant message is
persisted. The model never sees the commit SHA or success state in-context,
so it cannot honestly report "committed as abc123." It just hopes.
### Required changes
Move the commit step to run BEFORE the final `emit({ type: "done" })`,
and wait for its result with a tight timeout (e.g. 8s). Stream the result
to the client as a structured event:
```js
if (activeProject?.id && activeProject?.slug && typeof activeProject?.giteaCloneUrl === "string") {
try {
await ensureProjectRepoCloned({ /* ... */ }).catch(() => null);
const firstSentence = (assistantText || "")
.split(/(?<=[.!?])\s+/)[0]?.trim()?.slice(0, 180);
const commitMessage = firstSentence || "AI checkpoint";
const commitPromise = commitAndPushIfDirty({
projectId: activeProject.id,
projectSlug: activeProject.slug,
message: commitMessage,
});
const timeoutPromise = new Promise((resolve) =>
setTimeout(() => resolve({ committed: false, reason: "timeout" }), 8000)
);
const result = await Promise.race([commitPromise, timeoutPromise]);
if (result.committed) {
emit({ type: "commit", sha: result.sha, pushed: result.pushed });
} else if (result.reason && result.reason !== "clean" && result.reason !== "no_repo") {
emit({ type: "commit_failed", reason: result.reason });
}
} catch (err) {
emit({ type: "commit_failed", reason: String(err) });
}
}
```
The client should render `commit` events as a small inline badge ("✓
abc123f"). The model itself can be made aware on the NEXT turn via the
tool-history rewrite from Fix 3 — include the commit SHA in the
synthetic `[tools executed this turn: ...]` block.
### Verification
- Make a file edit. Confirm the chat reply ends with a visible commit SHA
badge in the UI.
- On a clean turn (no file changes), confirm no commit event fires.
- On a commit timeout (simulate by stubbing `commitAndPushIfDirty` to
hang), confirm the response still completes within ~8s with a
`commit_failed: reason=timeout` event.
---
## Fix 11 — Populate and front-load `buildCodebaseSummary` in the system prompt
### Problem
Evidence from the failing session: the agent ran `fs_tree` 5 times and
`fs_glob "**/page.tsx"` 9+ times across one session, repeatedly
re-discovering the same file location. Path tallies across the session
showed FIVE different path conventions used for the homepage:
`getacquired-2-0/src/app/page.tsx` (70x), `src/app/page.tsx` (30x),
empty string `""` (19x), `/workspace/getacquired-2-0/src/app/page.tsx`
(10x), and `app/page.tsx` (1x). The agent never learned which one the
tools expected because the codebase summary that should have told it
was missing or empty.
Separately, project plan context (vision, audience, brief) is technically
included in the system prompt, but the agent only called `plan_get` ONCE
in 65 turns — and only after the user explicitly asked "can you look at
the plan please to understand the market we are serving." For 16 turns
before that, the agent designed a Wall Street aesthetic for a marketplace
that was actually selling HVAC and plumbing businesses, because the
audience field was buried in a long `activeBlock` after operational
instructions.
### Required changes
1. **Make `buildCodebaseSummary` actually return content.** Inspect
`@/lib/ai/project-context/codebase-summary` and verify it:
- Detects the project root by finding `package.json` under
`/workspace/<slug>/`.
- Reads framework signals (Next.js vs Vite, React vs Vue) from
`package.json` dependencies.
- Lists 1020 most-important file paths grouped by purpose
(app/, components/, lib/, public/, api/, schema/).
- Returns a deterministic block under ~1500 chars, e.g.:
```
[CODEBASE — /workspace/<slug>/]
Framework: Next.js 16.2 (App Router) + Tailwind + Sentry
Package manager: npm (lockfile: package-lock.json)
Entry points:
- src/app/page.tsx (homepage)
- src/app/layout.tsx (root layout)
- src/app/api/auth/[...nextauth]/route.ts
UI components:
- src/shared/components/layout/Navbar.tsx
- src/shared/components/layout/Footer.tsx
Styles: src/app/globals.css, tailwind.config.ts
Path convention for fs_*: paths are relative to /workspace/<slug>/.
'src/app/page.tsx' is correct. Do NOT prefix with the slug.
```
- On a fresh project before scaffold, return a short placeholder:
`[CODEBASE — /workspace/<slug>/] Empty repo; ready for scaffold.`
2. **Cache the codebase summary aggressively** — recompute only when
`package.json` or directory structure changes (track mtime). The
current implementation likely awaits a fresh computation on every
turn, which is both slow and a source of "summary unavailable" stub
returns when the dev container isn't fully clone-ready yet.
3. **Front-load the audience/vision/brief.** In `buildSystemPrompt`,
move the project-identity block (name, audience, vision, brief
snippet) to the TOP of the system prompt, BEFORE the long
"How Vibn is structured / Common questions → tools / How to deploy"
section. The current ordering buries the most-decision-relevant
context after ~3000 chars of operational scaffolding.
4. **Add an explicit "if you don't know the codebase, read this first"
reminder** at the top of the codebase block:
```
[CODEBASE] This block tells you the repo layout. You should NOT
need fs_tree or fs_glob "**/page.tsx" to find common files — they're
listed below. Use fs_tree only when you suspect the layout has
changed or you need to explore a subdirectory not listed here.
```
### Verification
- After this fix, replay a "modify the homepage" request against a
populated project. The agent should call `fs_read` on `src/app/page.tsx`
directly with the correct relative path, with no preceding `fs_tree`
or `fs_glob` discovery calls.
- Inspect the rendered system prompt for a fresh project-scoped thread.
Confirm the `[CODEBASE — ...]` block is present, lists at least the
app entry points, and explicitly states the path convention.
- Replay the failing session's first user message ("apply your design
skills to the home page"). The agent should reference the project's
audience (SMB / Main Street buyers) in its first response, without
needing a `plan_get` call, because the brief is now front-loaded.
---
## Rollout order
1. **Fix 11 (codebase summary populated and front-loaded)** — landed
first because it eliminates the 10+ redundant discovery calls per
turn and gives the agent ground truth about path conventions before
Fix 1's enforcement starts rejecting wrong-path writes.
2. Fix 1 (path scoping enforcement) — eliminates the ghost-file class
of bug entirely. Should land alongside Fix 11 so the agent knows
what the rules are before they start rejecting writes.
3. Fix 2 (verification data in tool results) — lands the schema the rest
of the fixes depend on.
4. Fix 3 (history rewrite with tool summaries) — only valuable AFTER Fix
2 is producing sha256/healthCheck fields.
5. Fix 8 (hard-rule prompt clauses citing sha256/healthCheck) — depends
on Fixes 2 and 3.
6. Fix 4 (semantic loop detection) — independent, can ship anytime.
7. Fix 5 (silent-stretch nudge threshold) — independent.
8. Fix 6 (recovery on tool failure) — depends on Fix 2 for `ok` and
`healthCheck` fields.
9. Fix 7 (lower MAX_TOOL_ROUNDS to 15) — independent, ship last after
confirming legitimate workflows still fit.
10. Fix 9 (auto-extraction confidence tiers) — independent.
11. Fix 10 (sync commit with streamed result) — independent.
After all eleven land, replay the failing session's prompts against the
new harness and confirm:
- The agent references the project's audience and brief in its first
response, without a `plan_get` call (Fix 11).
- No redundant `fs_tree` / `fs_glob "**/page.tsx"` discovery sequences
(Fix 11).
- No "ghost file" path scoping errors (Fix 1).
- No "I just rewrote page.tsx" claim without a citable sha256 (Fixes 2,
3, 8).
- No 20+ round runaway loops (Fixes 4, 7).
- No silent stretches >2 rounds (Fix 5).
- Failed `dev_server_start` calls produce honest recovery prose, not
"preview is ready" hallucinations (Fixes 6, 8).