fix(preview): stop refresh-flicker false-restarts + harden dev container & agent

- isDevServerListening: key off curl EXIT CODE not response time. The 2s
  max-time treated a busy/compiling-but-listening dev server as DEAD, so ensure
  restarted a healthy server on every refresh -> cold compile -> the
  502/no-CSS/broken-images/perfect flicker. Now dead only when BOTH localhost
  and 0.0.0.0 refuse the connection (curl exit 7).
- ensure route: liveness probe is fail-safe (try/catch) -> never 500s or
  needlessly restarts on a probe error; trusts the DB flag instead.
- dev container: reconcile dead orphan containers before resume/start so a
  leftover name no longer triggers 'container name already in use' -> Traefik
  gateway timeout.
- dev container: inject AUTH_SECRET / NEXTAUTH_SECRET / AUTH_TRUST_HOST so
  scaffolded NextAuth apps stop throwing [auth][error] MissingSecret in preview.
- chat prompt: don't bounce a healthy dev server; only claim actions a tool
  actually performed (no hallucinated DB deletes); NextAuth previews pre-wired.
- intent budgets: route 'not appearing/showing/missing' to diagnose; bump
  status_check 12->16, diagnose 15->22 so investigations don't hit the cap.
2026-06-12 18:05:16 -07:00
parent 514f11e80d
commit 0f212c750b
3 changed files with 81 additions and 13 deletions


@@ -58,8 +58,8 @@ const TOOL_BUDGETS: Record<TurnIntent, number> = {
 // 5/8 were cutting these off at the cap before the model could answer
 // (telemetry showed 100% round_cap on these turns). Raised so a read-only
 // investigation can actually finish.
-status_check: 12,
-diagnose: 15,
+status_check: 16,
+diagnose: 22,
 small_fix: 18,
 feature_build: 40,
 deploy: 25,
@@ -94,7 +94,7 @@ function classifyTurnIntent(message: string): TurnIntent {
 // a 40-round build task. Without these, "I get a gateway timeout" falls through
 // to feature_build and burns 40 rounds looping on a dead dev server.
 if (
-/(why|broken|error|blank|not loading|fail|bug|issue|doesn't work|isn't working|fix|time.?out|tim(?:es?|ed|ing) out|gateway|5[0-2][0-9]|connection (refused|reset|failed)|unreachable|can.?t connect|cannot connect|not respond)/.test(
+/(why|broken|error|blank|not loading|not (showing|appearing|visible|rendering)|isn'?t (showing|appearing|visible|rendering)|missing|disappeared|gone|won'?t load|fail|bug|issue|doesn't work|isn't working|fix|time.?out|tim(?:es?|ed|ing) out|gateway|5[0-2][0-9]|connection (refused|reset|failed)|unreachable|can.?t connect|cannot connect|not respond)/.test(
 m,
 )
 )
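
The effect of the widened pattern and the raised budget can be sketched in isolation. This is a minimal illustration, not the project's `classifyTurnIntent`: `SketchIntent`, `SKETCH_BUDGETS`, `classifyTurnIntentSketch`, and `toolBudgetFor` are hypothetical names, only two intents are modeled, and lowercasing the message before matching is an assumption about how `m` is derived.

```typescript
// Hypothetical two-intent model; the real TurnIntent union has more members.
type SketchIntent = "diagnose" | "feature_build";

// Budget values copied from the new side of the TOOL_BUDGETS hunk.
const SKETCH_BUDGETS: Record<SketchIntent, number> = {
  diagnose: 22,
  feature_build: 40,
};

// Pattern copied from the new side of the classifyTurnIntent hunk.
const DIAGNOSE_PATTERN =
  /(why|broken|error|blank|not loading|not (showing|appearing|visible|rendering)|isn'?t (showing|appearing|visible|rendering)|missing|disappeared|gone|won'?t load|fail|bug|issue|doesn't work|isn't working|fix|time.?out|tim(?:es?|ed|ing) out|gateway|5[0-2][0-9]|connection (refused|reset|failed)|unreachable|can.?t connect|cannot connect|not respond)/;

function classifyTurnIntentSketch(message: string): SketchIntent {
  // Assumption: the message is normalized to lower case before matching.
  const m = message.toLowerCase();
  // Anything that does not look like a problem report falls through to a build.
  return DIAGNOSE_PATTERN.test(m) ? "diagnose" : "feature_build";
}

function toolBudgetFor(message: string): number {
  return SKETCH_BUDGETS[classifyTurnIntentSketch(message)];
}
```

Under these assumptions, "my new page is not showing" lands in `diagnose` with a 22-round cap, where the old pattern would have let it fall through to the 40-round build path.
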
@@ -427,10 +427,10 @@ If the user tells you the preview is blank, not loading, or shows nothing:
 **HMR through the proxy (apply when scaffolding):**
 - **Vite (verified working):** in \`vite.config\` set \`server: { host: '0.0.0.0', port: <3000-3009>, strictPort: true, hmr: { clientPort: 443, protocol: 'wss', host: '<the previewUrl host, no protocol>' } }\`. The \`hmr.host\` is REQUIRED — without it Vite's HMR client can guess the wrong host and the WS handshake fails through Traefik. Default localhost binding looks fine locally but breaks HMR through the proxy.
-- **Next dev:** \`next dev -H 0.0.0.0 --no-turbopack\` (WSS HMR works automatically through the proxy without extra config). **Always use \`--no-turbopack\`** — Turbopack\'s per-route lazy compilation causes cold-start 503s in the remote container (the health probe passes on \`/\` but unvisited routes hang on first hit until Turbopack compiles them). webpack compiles all routes upfront and is significantly more stable in a containerised environment.
+- **Next dev:** \`npx next dev -H 0.0.0.0 --no-turbopack\` (WSS HMR works automatically through the proxy without extra config). **Always use \`--no-turbopack\`** — Turbopack\'s per-route lazy compilation causes cold-start 503s in the remote container (the health probe passes on \`/\` but unvisited routes hang on first hit until Turbopack compiles them). webpack compiles all routes upfront and is significantly more stable in a containerised environment.
 - **Express / plain Node:** bind \`0.0.0.0\` (we set \`HOST=0.0.0.0\` env, but verify your framework respects it).
-**Build-me-X recipe:** \`devcontainer_ensure\`\`apps_templates_scaffold { templateName }\` (if matching "dashboard" or "pitch-deck") OR \`shell_exec npx create-next-app@latest . --yes\`\`fs_edit\` / \`fs_write\` to customize → **wire Sentry (see below)** → \`dev_server_start { command: 'next dev -H 0.0.0.0 --no-turbopack', port: 3000 }\` and **share the previewUrl in your reply — that's the turn's stopping point**. When the user says "ship it", call \`ship { projectId, commitMsg }\` (commits to Gitea and triggers prod deploy in one shot). If a project is multi-service (frontend + API + worker), pick the user-facing service (usually the frontend) and start ITS dev server first, even if the others aren't done yet — a clickable shell beats a complete-but-invisible stack.
+**Build-me-X recipe:** \`devcontainer_ensure\`\`apps_templates_scaffold { templateName }\` (if matching "dashboard" or "pitch-deck") OR \`shell_exec npx create-next-app@latest . --yes\`\`fs_edit\` / \`fs_write\` to customize → **wire Sentry (see below)** → \`dev_server_start { command: 'npx next dev -H 0.0.0.0 --no-turbopack', port: 3000 }\` and **share the previewUrl in your reply — that's the turn's stopping point**. When the user says "ship it", call \`ship { projectId, commitMsg }\` (commits to Gitea and triggers prod deploy in one shot). If a project is multi-service (frontend + API + worker), pick the user-facing service (usually the frontend) and start ITS dev server first, even if the others aren't done yet — a clickable shell beats a complete-but-invisible stack.
 **Sentry is auto-provisioned per Vibn project.** When you scaffold a Next.js or Vite app, wire Sentry from day one so the user gets de-minified error capture + Session Replay on first deploy. The DSN (\`NEXT_PUBLIC_SENTRY_DSN\`) and shared org auth token (\`SENTRY_AUTH_TOKEN\`) are injected into the Coolify app's env automatically by \`apps_create\` — you don't set them. Get the project's Sentry slug from \`projects_get { projectId }\` (field: \`sentry.slug\`); pass it to \`withSentryConfig({ org: "vibnai", project: "<slug>", ... })\`. The reference recipe (instrumentation.ts, instrumentation-client.ts, app/global-error.tsx, next.config.ts wrapper, Dockerfile ARG declarations) is in \`vibn-frontend/lib/scaffold/sentry-snippets.ts\` — read it once via \`fs_*\` if you're unsure, then copy the snippets into the user's project verbatim. Skip Sentry for non-app projects (CLIs, library-only repos).
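
The Vite settings quoted in the prompt text above can be written out as a plain config object. This is a sketch with placeholders: `preview.example.com` stands in for the previewUrl host, and `3000` is one arbitrary port from the 3000-3009 range. Neither value comes from the commit.

```typescript
// Sketch of the `server` block the prompt asks for. In a real vite.config.ts
// this object would be the file's default export.
const config = {
  server: {
    host: "0.0.0.0", // bind all interfaces so the proxy can reach the server
    port: 3000, // placeholder: any port in the 3000-3009 range
    strictPort: true, // fail instead of silently moving to another port
    hmr: {
      clientPort: 443, // browser connects to the proxy's TLS port
      protocol: "wss", // secure WebSocket through the proxy
      host: "preview.example.com", // placeholder: previewUrl host, no protocol
    },
  },
};
```

Vite accepts a plain object as its config, so the sketch needs no import. Wrapping the object in `defineConfig` only adds editor type hints.
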
@@ -491,6 +491,9 @@ The project's requirements, features list, specifications, and backlog checklist
 - Long-running ops (deploys, DNS, DB provisioning) take 15 min — tell the user up front. Don't tight-loop polling.
 - After \`ship\` or \`apps_deploy\`, the result is authoritative. Don't call \`gitea_*\` / \`shell_exec\` / \`apps_*\` to "verify" — read the response and report.
 - Never fake success. Never imply something worked if it didn't.
+- **Don't bounce a healthy dev server.** If \`dev_server_list\` or a \`healthCheck.status === 200\` shows it serving, the server is HEALTHY — a user saying "I don't see it," a cached 502, or a transient client timeout is NOT a reason to \`dev_server_stop\`/restart. Restarting a working server churns the container and CAUSES the gateway timeouts you're trying to fix. Re-verify by fetching the actual preview URL first; only restart if the port genuinely isn't answering.
+- **Only claim what a tool actually did.** If you lack a tool for what the user asked (e.g. deleting a Vibn platform-provisioned database — \`databases_delete\` only removes Coolify-managed DBs and returns \`[]\` for platform DBs), say so plainly and stop. NEVER report a delete / create / merge / deploy as done without a matching success result from THIS turn. Claiming you deleted something you cannot delete destroys trust faster than any bug.
+- **NextAuth previews are pre-wired.** The dev container already injects \`AUTH_SECRET\` / \`NEXTAUTH_SECRET\` / \`AUTH_TRUST_HOST\`, so never add MissingSecret hacks in the preview. When you add a page or scaffold auth, make sure \`src/middleware.ts\` does not redirect the new route to \`/login\`: protect by an explicit public-route allowlist (\`/\`, the new page, \`/api/auth/*\`, static assets), not block-everything.
 ${activeBlock}${briefBlock}## Current workspace projects
 ${projectsText}
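
The allowlist rule in the NextAuth bullet above could look roughly like the following. It is a sketch of the decision only, not a middleware implementation: `isPublicRoute` and the `/pricing` page are hypothetical, and treating `/_next/` and `/favicon.ico` as the "static assets" is an assumption about a typical Next.js app.

```typescript
// Exact-match public pages. "/pricing" is a hypothetical newly added page.
const PUBLIC_PATHS: string[] = ["/", "/pricing"];

// Prefixes that must never be redirected to /login.
// Assumption: static assets live under /_next/ as in a default Next.js build.
const PUBLIC_PREFIXES: string[] = ["/api/auth/", "/_next/"];

function isPublicRoute(pathname: string): boolean {
  if (PUBLIC_PATHS.indexOf(pathname) !== -1) return true;
  if (pathname === "/favicon.ico") return true;
  return PUBLIC_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}
```

A middleware built on this would redirect to `/login` only when `isPublicRoute` returns `false` and there is no session, so a new page stays reachable once it is added to `PUBLIC_PATHS`.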


@@ -92,7 +92,21 @@ export async function POST(
 // which is the #1 cause of "preview was up, now it's a 502". Only return
 // `running` if the port truly answers; otherwise fall through and resurrect.
 if (active?.state === "running") {
-const alive = await isDevServerListening(projectId, active.port);
+// Fail SAFE: if the liveness probe itself errors (SSH down, transient
+// hiccup, etc.) we must NOT 500 the route or needlessly restart a server
+// that may be fine — either would break/flicker the preview. Default to
+// trusting the DB flag and only treat the server as dead on a definitive
+// "not listening" result.
+let alive = true;
+try {
+alive = await isDevServerListening(projectId, active.port);
+} catch (err) {
+console.error(
+"[dev-server/ensure] liveness probe errored; trusting DB state:",
+err instanceof Error ? err.message : err,
+);
+alive = true;
+}
 if (alive) {
 return NextResponse.json({
 status: "running",
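
The fail-safe pattern in this hunk can be isolated into a small wrapper. This is a sketch under the assumption that the probe is any async function returning a boolean; `probeOrTrust` is a hypothetical name and does not exist in the commit.

```typescript
// Runs a liveness probe and treats a probe *error* as "alive", so that only
// a definitive `false` result can trigger a restart.
async function probeOrTrust(
  probe: () => Promise<boolean>,
): Promise<boolean> {
  try {
    return await probe();
  } catch (err) {
    // The probe itself failed (for example, the transport was down).
    // Log it and keep trusting the recorded state.
    console.error(
      "liveness probe errored; trusting recorded state:",
      err instanceof Error ? err.message : err,
    );
    return true;
  }
}
```

The trade-off is that a server that is dead while the probe is also failing stays marked as running until a later probe succeeds. The commit accepts that in exchange for never restarting a healthy server on a probe error.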


@@ -31,7 +31,8 @@ import {
 getService,
 } from "@/lib/coolify";
 import { execInCoolifyApp, type ExecInAppResult } from "@/lib/coolify-exec";
-import { isCoolifySshConfigured } from "@/lib/coolify-ssh";
+import { isCoolifySshConfigured, runOnCoolifyHost } from "@/lib/coolify-ssh";
+import { createHash } from "node:crypto";
 import {
 ensureProjectCoolifyProject,
 getProjectCoolifyUuid,
@@ -143,6 +144,36 @@ function projectPreviewToken(projectId: string): string {
 return Buffer.from(projectId).toString("hex").slice(0, 8);
 }
+// Deterministic, hard-to-guess secret per project, injected into the dev
+// container so scaffolded NextAuth apps never throw `[auth][error] MissingSecret`
+// in the PREVIEW. This is for dev/preview only — production auth uses the real
+// AUTH_SECRET on the deployed Coolify app, never this. Deterministic so it
+// survives container restarts without needing a DB column.
+function devAuthSecret(projectId: string): string {
+const salt = process.env.VIBN_DEV_AUTH_SALT ?? "vibn-dev-auth-v1";
+return createHash("sha256").update(`${salt}:${projectId}`).digest("hex");
+}
+// Before (re)starting a dev container, clear any DEAD orphan container that is
+// still holding this service's Coolify-assigned name. Coolify names every
+// container of a resource with the resource uuid as a suffix (e.g.
+// `vibn-dev-<uuid>`); a prior suspend/deploy can leave an exited container under
+// that name, so the next start fails with "Conflict. The container name … is
+// already in use" and Traefik loses its backend (the user sees a gateway
+// timeout). We remove ONLY non-running containers (exited/created/dead) — never
+// a live one — so a healthy container is never killed. Best-effort + SSH-gated.
+async function reconcileDevContainerOrphans(
+serviceUuid: string,
+): Promise<void> {
+if (!serviceUuid || !isCoolifySshConfigured()) return;
+const nameFilter = `name=-${serviceUuid}$`;
+const cmd =
+`docker ps -a --filter '${nameFilter}' ` +
+`--filter status=exited --filter status=created --filter status=dead -q ` +
+`| xargs -r docker rm -f`;
+await runOnCoolifyHost(cmd, { timeoutMs: 15_000 }).catch(() => {});
+}
 function renderDevCompose(projectSlug: string, projectId: string): string {
 // Image distribution: we build vibn-dev on the Coolify host once
 // (see /vibn-dev/setup-on-coolify.sh) and reference it locally.
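
The determinism that `devAuthSecret` relies on can be checked with a standalone sketch. `devAuthSecretSketch` is a hypothetical copy that takes the salt as a parameter instead of reading `process.env`, so the fallback and a private salt can be compared side by side.

```typescript
import { createHash } from "node:crypto";

// Same derivation as the hunk above: sha256 over "<salt>:<projectId>", hex-encoded.
// The default salt mirrors the fallback string in the source.
function devAuthSecretSketch(
  projectId: string,
  salt: string = "vibn-dev-auth-v1",
): string {
  return createHash("sha256").update(`${salt}:${projectId}`).digest("hex");
}

// Same inputs always give the same 64-character hex string, so the value
// survives container restarts without being stored anywhere.
const first = devAuthSecretSketch("project-a");
const second = devAuthSecretSketch("project-a");
const otherProject = devAuthSecretSketch("project-b");
const privateSalt = devAuthSecretSketch("project-a", "some-private-salt");
```

One consequence worth noting: with the fallback salt, which is visible in the source, the secret can be derived by anyone who knows the project id. It is only hard to guess when `VIBN_DEV_AUTH_SALT` is set to a private value.
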
@@ -196,6 +227,13 @@ function renderDevCompose(projectSlug: string, projectId: string): string {
 - VIBN_PROJECT_ID=${projectId}
 - VIBN_PREVIEW_TOKEN=${token}
 - VIBN_DEV_CONTAINER=1
+# Make scaffolded NextAuth apps work in the preview out of the box.
+# AUTH_SECRET (NextAuth v5) / NEXTAUTH_SECRET (v4) prevent the
+# "[auth][error] MissingSecret" crash; AUTH_TRUST_HOST lets v5 trust the
+# Traefik-proxied preview host. Dev/preview only — prod uses its own secret.
+- AUTH_SECRET=${devAuthSecret(projectId)}
+- NEXTAUTH_SECRET=${devAuthSecret(projectId)}
+- AUTH_TRUST_HOST=true
 networks:
 - vibn-dev-net
 - coolify
@@ -386,6 +424,9 @@ export async function resumeDevContainer(projectId: string): Promise<void> {
 const row = await getDevContainerRow(projectId);
 if (!row) return;
 if (row.state === "running") return;
+// Clear any dead orphan holding the container name so the start can't fail
+// with a "container name already in use" conflict (which strands Traefik).
+await reconcileDevContainerOrphans(row.service_uuid);
 await startService(row.service_uuid);
 await query(
 `UPDATE fs_project_dev_containers
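
The cleanup command that `reconcileDevContainerOrphans` runs before this start can be built and inspected without touching a host. This is a sketch: `buildOrphanCleanupCommand` is a hypothetical helper, and the character check on the uuid is an added precaution that the commit does not include.

```typescript
// Builds the shell pipeline used to remove dead containers whose name ends in
// "-<serviceUuid>". Returns null for a uuid that is not safe to interpolate.
function buildOrphanCleanupCommand(serviceUuid: string): string | null {
  // Assumption (not in the commit): only accept characters that cannot break
  // out of the single-quoted filter argument.
  if (!/^[A-Za-z0-9_-]+$/.test(serviceUuid)) return null;
  const nameFilter = `name=-${serviceUuid}$`;
  return (
    `docker ps -a --filter '${nameFilter}' ` +
    // Only non-running states are listed, so a live container never matches.
    `--filter status=exited --filter status=created --filter status=dead -q ` +
    // -r: do not run `docker rm` at all when the list is empty.
    `| xargs -r docker rm -f`
  );
}
```

Because the status filters exclude `running`, the `-f` on `docker rm` never applies to a healthy container. It only forces removal of whatever the `docker ps` stage already selected.
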
@@ -808,15 +849,25 @@ export async function isDevServerListening(
 port: number,
 ): Promise<boolean> {
 try {
+// CRITICAL: distinguish "port not listening" (truly dead) from "listening but
+// slow to respond" (ALIVE — a Next.js/Vite dev server mid route-compile can
+// take many seconds to answer `/`). We must NOT treat slowness as death:
+// doing so made `ensure` restart a healthy-but-busy server on every refresh,
+// and each restart cold-compiles, flickering the preview through
+// 502 -> no-CSS -> broken-images -> perfect.
+//
+// We therefore key off curl's EXIT CODE, not response time. Exit 7
+// (CURLE_COULDNT_CONNECT) is the only definitive "nothing is bound to this
+// port" signal. A slow response yields exit 28 (timeout) / 52 / 56 etc., all
+// of which mean the socket accepted us => the server is up. Dead only when
+// BOTH localhost and 0.0.0.0 refuse the connection.
 const r = await execInDevContainer({
 projectId,
 command:
-`code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 2 --connect-timeout 2 ` +
-`"http://localhost:${port}/" 2>/dev/null || ` +
-`curl -sS -o /dev/null -w '%{http_code}' --max-time 2 --connect-timeout 2 ` +
-`"http://0.0.0.0:${port}/" 2>/dev/null || printf '000'); ` +
-`[ "$code" != "000" ] && [ -n "$code" ] && echo LIVE || echo DEAD`,
-timeoutMs: 8_000,
+`curl -s -o /dev/null --connect-timeout 2 --max-time 4 "http://localhost:${port}/" 2>/dev/null; a=$?; ` +
+`curl -s -o /dev/null --connect-timeout 2 --max-time 4 "http://0.0.0.0:${port}/" 2>/dev/null; b=$?; ` +
+`if [ "$a" = "7" ] && [ "$b" = "7" ]; then echo DEAD; else echo LIVE; fi`,
+timeoutMs: 12_000,
 });
 return /LIVE/.test(r.stdout);
 } catch {
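
The decision rule encoded in the shell one-liner above reduces to a pure function over the two curl exit codes. The sketch below restates it in TypeScript. `isListeningFromExitCodes` is a hypothetical name; the exit-code meanings in the comments are curl's documented ones.

```typescript
// curl exit code for "failed to connect to host" (nothing bound to the port).
const CURLE_COULDNT_CONNECT = 7;

// Dead only when BOTH probes were refused; any other combination means at
// least one socket accepted the connection, so the server is considered up.
function isListeningFromExitCodes(
  exitLocalhost: number,
  exitAnyAddress: number,
): boolean {
  const bothRefused =
    exitLocalhost === CURLE_COULDNT_CONNECT &&
    exitAnyAddress === CURLE_COULDNT_CONNECT;
  return !bothRefused;
}

// 0 = success, 28 = operation timed out, 7 = connection refused.
const healthy = isListeningFromExitCodes(0, 0);
const busyCompiling = isListeningFromExitCodes(28, 28);
const onlyOneRefused = isListeningFromExitCodes(7, 28);
const dead = isListeningFromExitCodes(7, 7);
```

Exit 28 also covers a connect-phase timeout, so this rule leans toward reporting "alive" when in doubt, which matches the commit's stated goal of never restarting a server that might be healthy.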