601 lines
21 KiB
Markdown
601 lines
21 KiB
Markdown
# Vibn Chat Harness — Fix Checklist
|
|
|
|
Work through items in order. Each fix has a clear **What**, **Where**,
|
|
**How**, and **Verify** section. Don't skip the verify step — many of
|
|
these fixes interact with each other and silent failures will
|
|
compound.
|
|
|
|
Mark `[x]` as you complete each item. If you can't complete an item,
|
|
add a short note under it explaining why and move on.
|
|
|
|
---
|
|
|
|
## Phase 1 — Backend fixes (highest leverage; do these first)
|
|
|
|
These three fix the failure modes the prompt currently promises but
|
|
the backend doesn't deliver. Until they're done, the prompt's hard
|
|
rules are partly fiction.
|
|
|
|
### [ ] 1. Add `sha256` and `bytes` to `fs.write` and `fs.edit` responses
|
|
|
|
**What:** The prompt's hard rules tell the model to cite `sha256` and
|
|
`bytes` as evidence of file changes. The tools don't return those
|
|
fields today, so the model is looking for evidence that doesn't exist.
|
|
|
|
**Where:** `app/api/mcp/route.ts` — functions `toolFsWrite` and
|
|
`toolFsEdit`.
|
|
|
|
**How:**
|
|
- In `toolFsWrite`, after the `runFsCmd` success branch, before
|
|
returning, compute the sha256 of `content` and return it alongside
|
|
`bytesWritten` renamed to `bytes`:
|
|
```ts
|
|
import { createHash } from "crypto";
|
|
// ...
|
|
const bytes = Buffer.byteLength(content, "utf8");
|
|
const sha256 = createHash("sha256").update(content, "utf8").digest("hex");
|
|
return NextResponse.json({
|
|
result: { ok: true, path, bytes, sha256 },
|
|
});
|
|
```
|
|
- In `toolFsEdit`, you don't have the final content in memory. Add a
|
|
second command that prints the sha + bytes after the edit:
|
|
```ts
|
|
const cmd = `python3 -c "$(printf %s ${shq(pyB64)} | base64 -d)" <<< "$(printf %s ${shq(b64)} | base64 -d)" && echo "---" && sha256sum ${shq(path)} | cut -d' ' -f1 && wc -c < ${shq(path)}`;
|
|
```
|
|
Then parse the trailing two lines after `---` for sha and bytes.
|
|
- Update the response shape:
|
|
```ts
|
|
return NextResponse.json({
|
|
result: { ok: true, path, replacements, bytes, sha256 },
|
|
});
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Call `fs_write` with `{ path: "test.txt", content: "hello" }`.
|
|
Confirm response contains `sha256` (64 hex chars) and `bytes: 5`.
|
|
- [ ] Call `fs_edit` to change the same file. Confirm response
|
|
contains a new `sha256` and updated `bytes`.
|
|
- [ ] Replay a turn that does `fs_write` followed by `fs_read` of the
|
|
same file in chat. The model should now produce text like
|
|
"Updated `test.txt` (sha=a3f5c2…, 5b)" instead of a bare claim.
|
|
|
|
---
|
|
|
|
### [ ] 2. Add project-slug scoping to `normalizeFsPath`
|
|
|
|
**What:** The prompt tells the model to use paths like `src/app/page.tsx`
|
|
and claims the tool layer rejects writes outside the project root.
|
|
The tool layer does NOT do this today. It resolves all relative paths
|
|
under `/workspace` (workspace-level), so `fs_write { path: "src/app/page.tsx" }`
|
|
ends up at `/workspace/src/app/page.tsx` — the ghost file from the
|
|
failing session. Five different path conventions were used for the
|
|
same file in one session because nothing enforces the rule.
|
|
|
|
**Where:** `app/api/mcp/route.ts` — function `normalizeFsPath` and
|
|
every caller in `toolFsRead`, `toolFsWrite`, `toolFsEdit`,
|
|
`toolFsList`, `toolFsDelete`, `toolFsGlob`, `toolFsGrep`, `toolFsTree`,
|
|
`toolRequestVisualQA`, `toolGenerateMedia`.
|
|
|
|
**How:**
|
|
|
|
- Change `normalizeFsPath` to accept an optional `projectSlug`:
|
|
```ts
|
|
function normalizeFsPath(
|
|
p: string,
|
|
projectSlug?: string,
|
|
): string | NextResponse {
|
|
if (!p || typeof p !== "string") {
|
|
return NextResponse.json(
|
|
{ error: 'Param "path" is required' },
|
|
{ status: 400 },
|
|
);
|
|
}
|
|
const projectRoot = projectSlug ? `${FS_ROOT}/${projectSlug}` : FS_ROOT;
|
|
let abs: string;
|
|
if (p.startsWith("/")) {
|
|
abs = p;
|
|
} else {
|
|
abs = `${projectRoot}/${p}`.replace(/\/+/g, "/");
|
|
}
|
|
const norm = abs.replace(/\/[^/]+\/\.\.(?=\/|$)/g, "").replace(/\/+/g, "/");
|
|
|
|
// When projectSlug is set, REJECT paths outside the project root.
|
|
if (projectSlug) {
|
|
if (!norm.startsWith(projectRoot) && norm !== projectRoot) {
|
|
return NextResponse.json(
|
|
{
|
|
ok: false,
|
|
error: `PATH_OUTSIDE_PROJECT: path "${p}" resolves to "${norm}" which is outside the active project at "${projectRoot}". Did you mean "${projectRoot}/${p.replace(/^\/+/, "")}"?`,
|
|
},
|
|
{ status: 400 },
|
|
);
|
|
}
|
|
} else {
|
|
// Workspace-level fallback (legacy behaviour)
|
|
if (!norm.startsWith(FS_ROOT) && norm !== FS_ROOT) {
|
|
return NextResponse.json(
|
|
{ error: `Path "${p}" is outside ${FS_ROOT}; use shell.exec for system paths.` },
|
|
{ status: 400 },
|
|
);
|
|
}
|
|
}
|
|
return norm;
|
|
}
|
|
```
|
|
- In every fs_* tool that already calls `resolveProjectOr404`, pass
|
|
the slug:
|
|
```ts
|
|
const path = normalizeFsPath(String(params.path ?? ""), project.slug);
|
|
```
|
|
- `toolFsRead`, `toolFsWrite`, `toolFsEdit`, `toolFsDelete`,
|
|
`toolRequestVisualQA`, `toolGenerateMedia` all already have
|
|
`project` in scope — pass `project.slug`.
|
|
- `toolFsList`, `toolFsGlob`, `toolFsGrep`, `toolFsTree` use
|
|
`params.path` or `params.cwd` — same treatment, pass `project.slug`.
|
|
|
|
**Verify:**
|
|
- [ ] From a project-scoped thread, call `fs_write { path: "/workspace/src/app/page.tsx", content: "x" }`.
|
|
Expect `PATH_OUTSIDE_PROJECT` error.
|
|
- [ ] From the same thread, call `fs_write { path: "src/app/page.tsx", content: "x" }`.
|
|
Expect success at `/workspace/<slug>/src/app/page.tsx`.
|
|
- [ ] Confirm `dev_server_start` with `command: "npm run dev"` runs
|
|
from the project root, not `/workspace`. (This is mostly already
|
|
true via dev-container logic; just confirm.)
|
|
|
|
---
|
|
|
|
### [ ] 3. Fix the broken `plan-extract` block
|
|
|
|
**What:** The fire-and-forget `plan-extract` block in
|
|
`app/api/chat/route.ts` has a syntax error — the `try` block builds a
|
|
transcript and then hits `}` followed by `catch` with no actual call
|
|
to `autoExtractPlanUpdates`. The body of the try is missing. Either
|
|
the auto-extraction was intentionally removed (in which case the
|
|
dead transcript-building code should also be deleted) or it was
|
|
accidentally truncated (in which case the call needs to be restored).
|
|
|
|
**Where:** `app/api/chat/route.ts`, around the second fire-and-forget
|
|
block (after the title/summary block, before `emit({ type: "done" })`).
|
|
|
|
**How:**
|
|
|
|
- Decide first: do we want auto-extraction to run? If YES, restore
|
|
the call:
|
|
```ts
|
|
(async () => {
|
|
try {
|
|
if (!threadProjectId) return;
|
|
const allMessages = [...history, finalMsg];
|
|
if (allMessages.length < 2) return;
|
|
const transcript = allMessages
|
|
.map((m) => {
|
|
const text =
|
|
typeof m.content === "string"
|
|
? m.content
|
|
: JSON.stringify(m.content);
|
|
return `${m.role.toUpperCase()}: ${text.slice(0, 1200)}`;
|
|
})
|
|
.join("\n\n");
|
|
const result = await autoExtractPlanUpdates(
|
|
threadProjectId,
|
|
transcript,
|
|
);
|
|
if (result) {
|
|
console.log(
|
|
"[chat] plan-extract:",
|
|
`${result.tasks} tasks, ${result.decisions} decisions, vision=${result.vision}`,
|
|
);
|
|
}
|
|
} catch (err) {
|
|
console.warn("[chat] plan-extract failed (non-fatal):", err);
|
|
}
|
|
})().catch(() => {});
|
|
```
|
|
And re-add the `import { autoExtractPlanUpdates } from "@/lib/ai/plan-extract";`
|
|
at the top of the file.
|
|
- If NO (you removed it intentionally), delete the entire IIFE
|
|
including the transcript-building so the file compiles cleanly.
|
|
|
|
**Verify:**
|
|
- [ ] Run `tsc --noEmit` on the file. Confirm no syntax errors.
|
|
- [ ] If auto-extraction restored: have a chat that mentions a
|
|
decision (e.g. "let's use Postgres"). Confirm a new entry appears
|
|
in the project's `plan.decisions` with `confidence: "auto"`.
|
|
- [ ] Tail prod logs for `[chat] plan-extract:` — should fire on
|
|
every turn with content.
|
|
|
|
---
|
|
|
|
## Phase 2 — Prompt fixes (now that the backend matches)
|
|
|
|
These bring the prompt into line with what the tools actually do.
|
|
|
|
### [ ] 4. Fix the `apps_containers_list` typo in the prompt
|
|
|
|
**What:** The troubleshooting section references `apps_containers_list`
|
|
but the actual tool is `apps_containers_ps`. The model will call a
|
|
tool that doesn't exist.
|
|
|
|
**Where:** `app/api/chat/route.ts`, inside `buildSystemPrompt`, in the
|
|
"## Troubleshooting" section.
|
|
|
|
**How:**
|
|
- Find: `apps_logs { uuid } + apps_containers_list { uuid }`
|
|
- Replace: `apps_logs { uuid } + apps_containers_ps { uuid }`
|
|
|
|
**Verify:**
|
|
- [ ] Grep the prompt for `apps_containers_list` — no matches.
|
|
- [ ] Grep for `apps_containers_ps` — should appear in
|
|
troubleshooting and at least once in the apps section.
|
|
|
|
---
|
|
|
|
### [ ] 5. Soften the `ok` field rule
|
|
|
|
**What:** The current rule says "If `ok` is false (or absent, or
|
|
`exitCode` is non-zero, or `healthCheck.status` is >= 400) the
|
|
operation FAILED." The "or absent" clause is wrong — many tools
|
|
return data without an `ok` field (e.g. `projects_get`, `apps_list`,
|
|
`databases_get`). The model will treat every read as a failure.
|
|
|
|
**Where:** `buildSystemPrompt`, "Hard rules" section, the "Trust the
|
|
`ok` field" bullet.
|
|
|
|
**How:**
|
|
|
|
Replace the current rule with:
|
|
|
|
```
|
|
- **Read tool results carefully.** A tool FAILED when ANY of these
|
|
signals are present: `ok: false`, `error: "..."`, a non-zero
|
|
`exitCode`, or a `healthCheck.status` >= 400. If NONE of those
|
|
signals are present, look at the actual content of the response
|
|
to decide whether the operation succeeded. Many read-only tools
|
|
return data directly without an `ok` field — that's not a failure.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Pick a recent thread where the agent called `projects_get` or
|
|
`apps_list`. Confirm the agent didn't treat the response as a
|
|
failure (look at its post-tool text — should be a normal summary,
|
|
not "the operation failed").
|
|
|
|
---
|
|
|
|
### [ ] 6. Tighten the status-nudge threshold
|
|
|
|
**What:** Current thresholds are `roundsSinceText >= 8 ||
|
|
toolCallsSinceText >= 12`. With `MAX_TOOL_ROUNDS = 8`, the round-based
|
|
nudge can never fire (loop ends first). The tool-call threshold of 12
|
|
is also too lenient — users typed "test" / "hello" by round 4-5 of
|
|
silence in the failing session.
|
|
|
|
**Where:** `app/api/chat/route.ts`, near the top of the main while
|
|
loop, the `isSilent` constant.
|
|
|
|
**How:**
|
|
|
|
```ts
|
|
const isSilent = roundsSinceText >= 3 || toolCallsSinceText >= 6;
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Replay a chat that triggers 6+ tool calls without text. Confirm
|
|
the `[STATUS NUDGE]` system addendum is injected before the next
|
|
round.
|
|
- [ ] Confirm the model produces a one-line status sentence in
|
|
response to the nudge.
|
|
|
|
---
|
|
|
|
### [ ] 7. Update the path-convention guidance in the prompt
|
|
|
|
**What:** After Fix 2 ships, the path convention is now enforced. The
|
|
prompt should state this plainly without the "cd into your project"
|
|
workaround.
|
|
|
|
**Where:** `buildSystemPrompt`, inside `activeBlock`, the path
|
|
guidance section. Also inside "Dev servers" → "Directory" bullet.
|
|
|
|
**How:**
|
|
|
|
Replace the "Directory" bullet under "Dev servers":
|
|
|
|
```
|
|
- **Directory:** Tool paths are scoped to your project root
|
|
automatically. Pass `command: "npm run dev"` directly — no `cd`
|
|
prefix needed. The tool rejects any `fs_*` write outside
|
|
`/workspace/<slug>/`.
|
|
```
|
|
|
|
And after the "Project repo is auto-cloned" paragraph in
|
|
`activeBlock`, add:
|
|
|
|
```
|
|
**Path convention for fs_* tools:** Pass paths relative to the
|
|
project root — `src/app/page.tsx`, NOT `/workspace/<slug>/src/app/page.tsx`
|
|
and NOT `<slug>/src/app/page.tsx`. The tool layer rejects writes
|
|
outside the project root with a `PATH_OUTSIDE_PROJECT` error
|
|
suggesting the corrected path.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] In a fresh chat, ask the agent to "edit the homepage". Confirm
|
|
the first `fs_read` call uses `src/app/page.tsx` (no slug prefix,
|
|
no `/workspace/` prefix).
|
|
|
|
---
|
|
|
|
## Phase 3 — Polish and safety nets
|
|
|
|
These are lower-priority but each removes a small foot-gun.
|
|
|
|
### [ ] 8. Add `fs_tree` recommendation to first-turn behavior
|
|
|
|
**What:** The agent ran `fs_tree` 5 times and `fs_glob` 9+ times in
|
|
the failing session, re-discovering paths it should have learned once.
|
|
The tool description already says "ALWAYS call this first" but the
|
|
prompt doesn't reinforce it.
|
|
|
|
**Where:** `buildSystemPrompt`, in the "Writing code" section.
|
|
|
|
**How:**
|
|
|
|
Add this near the top of "Writing code — dev container is the default":
|
|
|
|
```
|
|
**Orient yourself once.** On the first code-modifying turn of a
|
|
chat, call `fs_tree` once to learn the repo layout. Don't re-run it
|
|
on every turn — the layout doesn't change between user messages.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Manual review of the next 5 sessions: confirm `fs_tree` is
|
|
called at most once per chat (not per turn).
|
|
|
|
---
|
|
|
|
### [ ] 9. Add `browser_navigate` and `browser_console` as verification primitives
|
|
|
|
**What:** The backend has `browser_navigate` and `browser_console`
|
|
tools that headlessly render a page and capture console errors. The
|
|
prompt never mentions them. These are the missing post-deploy
|
|
verification step that the `healthCheck` field gestures at.
|
|
|
|
**Where:** `buildSystemPrompt`, in the "Dev servers" subsection or as
|
|
its own section after "Visual QA".
|
|
|
|
**How:**
|
|
|
|
Add a new bullet right after "Visual QA" guidance:
|
|
|
|
```
|
|
**Verify the page actually renders:**
|
|
- After `dev_server_start` returns a `previewUrl` AND `healthCheck.status === 200`,
|
|
for any UI-facing turn, call `browser_console { url: previewUrl }` to
|
|
capture frontend console errors. Hydration errors, missing assets,
|
|
and uncaught exceptions show up here even when the server is
|
|
technically "running".
|
|
- If `browser_console` returns errors, fix them with `fs_edit`
|
|
before declaring done. A green `healthCheck` plus a clean console
|
|
is the real "done" signal for UI work.
|
|
- Skip this for backend / SQL / config-only changes.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] On the next UI-modifying chat, confirm `browser_console` is
|
|
called once after `dev_server_start`.
|
|
- [ ] Confirm any errors it returns get acknowledged in the agent's
|
|
reply.
|
|
|
|
---
|
|
|
|
### [ ] 10. Add the market research stack to the prompt
|
|
|
|
**What:** The backend exposes six market research tools
|
|
(`market_categories_suggest`, `market_research_run`, `market_seo_analyze`,
|
|
`tech_stack_analyze`, `market_competitor_research`,
|
|
`market_aggregate_insights`). The prompt never mentions them. For a
|
|
non-technical-founder product, these are some of the highest-leverage
|
|
tools — they answer "should I build for dentists or summer camps?"
|
|
with real TAM counts.
|
|
|
|
**Where:** `buildSystemPrompt`, add a new section after "Common
|
|
questions → tools" or before "How to deploy".
|
|
|
|
**How:**
|
|
|
|
Add this section:
|
|
|
|
```
|
|
## Helping the user pick what to build
|
|
|
|
Vibn has a market-research toolkit for non-technical founders who
|
|
need data on their target niche. Use it when the user is undecided,
|
|
validating an idea, or comparing markets:
|
|
|
|
- **"How big is the market for X in <location>?"** → `market_categories_suggest { niche }` to
|
|
propose Google Business categories, then `market_research_run` after
|
|
the user approves. Returns TAM count, sample domains, and review
|
|
data. NOTE: `market_research_run` costs real money — always confirm
|
|
with the user and pass `user_explicitly_approved: true`.
|
|
- **"What are competitors spending on Google Ads?"** → `market_seo_analyze { domain }`.
|
|
Returns organic traffic, paid traffic, ad spend, and top paid
|
|
keywords. Use to tell the user how aggressive a market is.
|
|
- **"What software do these businesses already use?"** → `tech_stack_analyze { urls, software_category_id }`.
|
|
Detects WordPress, Shopify, named competitors, and any custom
|
|
domains/scripts you pass. Use to find "X businesses use WordPress
|
|
but lack Y" market gaps.
|
|
- **"What are customers complaining about?"** → `market_aggregate_insights { category, location }`.
|
|
Returns top review topics — use the actual words customers use as
|
|
marketing copy and value-prop seeds.
|
|
- **"Who are the players in this niche?"** → `market_competitor_research { niche }`.
|
|
Returns proprietary competitors with pricing AND open-source
|
|
alternatives that could be forked.
|
|
|
|
These are conversational research tools — they don't build anything.
|
|
Use them BEFORE scaffolding when the user is exploring direction;
|
|
SKIP them once the user has committed to building.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] In a fresh chat, ask the agent "should I build for dentists or
|
|
summer camps?". Confirm it proposes using `market_categories_suggest`
|
|
or `market_aggregate_insights` rather than guessing.
|
|
|
|
---
|
|
|
|
### [ ] 11. Surface `apps_exec`, `auth_create`, `generate_media`, `storage_*` briefly
|
|
|
|
**What:** Several capabilities the agent has are completely absent
|
|
from the prompt. The agent doesn't know it can run commands inside
|
|
production containers, deploy real auth servers, generate images, or
|
|
wire S3 storage. One-line mentions are enough.
|
|
|
|
**Where:** `buildSystemPrompt`, scattered across existing sections.
|
|
|
|
**How:**
|
|
|
|
- Under "Common questions → tools", add:
|
|
```
|
|
- "Run a migration / psql in prod" → `apps_exec { uuid, command }`.
|
|
- "Generate a hero image / illustration" → `generate_media { prompt, type, outputPath }`.
|
|
- "Wire up file storage / uploads" → `storage_provision` (if not already), then `storage_inject_env { uuid }`.
|
|
```
|
|
|
|
- In the "Decision defaults" section, augment the auth bullet:
|
|
```
|
|
- **Auth:** NextAuth with email magic-link for in-app auth.
|
|
Deploy a separate Pocketbase / Authentik / Keycloak service via
|
|
`auth_create { provider }` only if the user needs SSO, multi-app
|
|
SSO, or admin user management.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Ask the agent "can you run a SQL migration on prod?". Confirm
|
|
it references `apps_exec`.
|
|
- [ ] Ask "I need a hero image for the landing page". Confirm it
|
|
references `generate_media`.
|
|
|
|
---
|
|
|
|
### [ ] 12. Flip the `fs_edit` preference to match the tool's documentation
|
|
|
|
**What:** The tool description says `startLine`/`endLine` is
|
|
preferred and `oldString` is the fallback. The prompt currently says
|
|
the opposite ("prefer `oldString` for small replacements"). The tool
|
|
is authoritative — match it.
|
|
|
|
**Where:** `buildSystemPrompt`, in the "Iterate" bullet under "Writing
|
|
code", the `fs_edit` guidance.
|
|
|
|
**How:**
|
|
|
|
Replace the current `fs_edit` guidance with:
|
|
|
|
```
|
|
- `fs_read` / `fs_write` / `fs_edit { path, oldString, newString, startLine, endLine }`.
|
|
**For `fs_edit`:** prefer `startLine`/`endLine` (deterministic; never
|
|
fails on duplicate strings). Use `oldString` only when you cannot
|
|
read the file first to get line numbers — and when you do, include
|
|
2-3 lines of surrounding context for uniqueness. If `fs_edit` keeps
|
|
failing, do NOT escape to `shell_exec` with patch scripts — read
|
|
the file fresh with `fs_read`, get the line numbers, and try again.
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Confirm next 3 `fs_edit` calls in the wild use `startLine`/`endLine`,
|
|
not `oldString`.
|
|
|
|
---
|
|
|
|
## Phase 4 — Monitoring (after Phases 1-3 land)
|
|
|
|
Once the changes are in production, watch these for a week. Tune if
|
|
the numbers don't move.
|
|
|
|
### [ ] 13. Track the recovery-summary fire rate
|
|
|
|
**What:** The `needsRecovery` path runs when a turn ends badly (hit
|
|
the round cap, hit a loop, or last tool returned failure). It should
|
|
fire on <10% of turns. If it fires more often, the cap is too low or
|
|
the model is hitting real bugs.
|
|
|
|
**How:** Add a metric. In the `needsRecovery` block, before calling
|
|
`callVibnChat`, emit:
|
|
|
|
```ts
|
|
console.log("[chat] recovery_fired", {
|
|
turnId,
|
|
reason: loopBreakReason ? "loop"
|
|
: round >= MAX_TOOL_ROUNDS ? "round_cap"
|
|
: assistantText.trim().length === 0 ? "no_text"
|
|
: "tool_failure",
|
|
toolCalls: assistantToolCalls.length,
|
|
});
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Aggregate over 1 week. If `round_cap` is the dominant reason
|
|
and the turns look legitimate, raise `MAX_TOOL_ROUNDS` to 10 or 12.
|
|
- [ ] If `loop` is dominant, the fingerprinter may need tuning.
|
|
|
|
---
|
|
|
|
### [ ] 14. Track conversational-guard fire rate
|
|
|
|
**What:** The first-turn conversational guard (regex match on user
|
|
message → no tools on round 1) is the biggest single behavioral
|
|
change in this revision. We want to know how often it fires and
|
|
whether it ever causes a problem.
|
|
|
|
**How:** Before the loop, after computing `firstMessageIsConversational`:
|
|
|
|
```ts
|
|
console.log("[chat] turn_start", {
|
|
turnId,
|
|
firstMessageIsConversational,
|
|
messagePreview: message.trim().slice(0, 80),
|
|
});
|
|
```
|
|
|
|
**Verify:**
|
|
- [ ] Aggregate over 1 week. Should fire on ~30-40% of first messages
|
|
in a chat.
|
|
- [ ] Spot-check 10 cases where it fired: confirm none were
|
|
legitimate "build me X" requests being miscategorized.
|
|
|
|
---
|
|
|
|
### [ ] 15. Track `PATH_OUTSIDE_PROJECT` rejections
|
|
|
|
**What:** After Fix 2 ships, this error tells you how often the model
|
|
was about to write to the wrong path. Should taper toward zero as the
|
|
prompt guidance + enforcement settles in.
|
|
|
|
**How:** Server-side log in `normalizeFsPath` when the project-scoped
|
|
rejection fires.
|
|
|
|
**Verify:**
|
|
- [ ] Week 1: count rejections per day.
|
|
- [ ] Week 2: count should be lower (model has internalized the rule
|
|
via the error messages it's seen).
|
|
|
|
---
|
|
|
|
## What this checklist deliberately doesn't include
|
|
|
|
A few things from earlier reviews that I'm intentionally leaving off:
|
|
|
|
- **Phase-aware behavior.** Phases were removed from the product; the
|
|
prompt no longer references them. No work needed.
|
|
- **Codebase summary auto-generation.** Lower priority once Fix 2
|
|
ships (paths can no longer drift) and Fix 8 lands (one `fs_tree`
|
|
per chat instead of nine).
|
|
- **Tool history reconstruction for DeepSeek compatibility.** Already
|
|
shipped in the current code via the `_rawToolResults` compact
|
|
summary. No additional work.
|
|
|
|
Don't add work for these unless a clear failure mode appears in
|
|
production after Phases 1-3 land. |