fix(devcontainer): self-heal stuck provisioning state, stop AI poll-loop

Before this fix, devcontainer.status was a read-only DB query that
returned whatever state the row currently held. The state only flips
provisioning→running via touchActivity() inside execInDevContainer.
That created a deadlock: the AI polls devcontainer.status waiting
for 'running', but the row never flips until something else execs
into the container. Caught live in the 2026-05-01 smoke test
(manifest project): the AI fired devcontainer.status three times in
a row, hit the loop guard, and surfaced the dead end to the user.

Two fixes:

1. getDevContainerStatus() now does a cheap 'true' exec probe when
   the row says 'provisioning'. If the probe lands, it flips the
   row to 'running' via touchActivity and reports selfHealed=true.
   If the probe fails or is skipped (Coolify SSH not configured) AND
   the row is older than 120s, it reports likelyFailed=true so
   callers can stop polling and escalate.
   Also returns ageSeconds for the AI to reason about wait windows.
   Coolify's own service status is not used because dev containers
   have no fqdn/healthcheck and Coolify reports running:unknown for
   any such service forever.

2. New error-recovery rule 'devcontainer-still-provisioning' that
   fires whenever a status response contains state:'provisioning'.
   Tells the AI to send one status message, wait 15s, and prefer
   shell.exec (which lazy-provisions and proves reachability) over
   another devcontainer.status call. Explicit antipattern: do not
   poll status in a tight loop.
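
A minimal caller-side sketch of how the two fixes could fit together.
Only the field names (state, ageSeconds, selfHealed, likelyFailed) and
the 15s wait come from this commit; the StatusResponse type, the
NextAction union, and decideNextAction() are hypothetical names used
for illustration, and the state list is assumed from the states
mentioned in the diff.

```typescript
// Hypothetical helper: not part of this commit.
// State names are assumed from the ones mentioned in the diff.
type DevContainerState = 'provisioning' | 'running' | 'suspended' | 'absent';

interface StatusResponse {
  exists: boolean;
  state: DevContainerState;
  ageSeconds?: number;
  selfHealed?: boolean;
  likelyFailed?: boolean;
}

type NextAction =
  | { kind: 'proceed' }
  | { kind: 'escalate'; reason: string }
  | { kind: 'wait-then-exec'; waitSeconds: number }
  | { kind: 'not-covered'; state: DevContainerState };

function decideNextAction(status: StatusResponse): NextAction {
  // 'running' covers both the normal case and a self-healed row.
  if (status.state === 'running') return { kind: 'proceed' };

  if (status.state === 'provisioning') {
    if (status.likelyFailed) {
      // Past the grace window: stop polling and tell the user.
      const age = status.ageSeconds ?? 'unknown';
      return { kind: 'escalate', reason: `stuck in provisioning for ${age}s` };
    }
    // Still booting: wait once, then prefer shell.exec over another status call.
    return { kind: 'wait-then-exec', waitSeconds: 15 };
  }

  // 'absent' and 'suspended' are outside what this commit describes.
  return { kind: 'not-covered', state: status.state };
}
```

The point of returning a single action per status response is that
the caller never has a reason to call devcontainer.status again in a
tight loop.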

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-01 13:46:23 -07:00
parent 7ce8909555
commit 8c5fbad782
2 changed files with 75 additions and 3 deletions


@@ -434,12 +434,70 @@ export async function getDevContainerStatus(projectId: string): Promise<{
   exists: boolean;
   state: DevContainerRow['state'] | 'absent';
   serviceUuid: string | null;
+  /** Seconds since the row was created; useful for AI to decide whether to keep polling. */
+  ageSeconds?: number;
+  /** Set when state was just self-healed by this call. */
+  selfHealed?: boolean;
+  /** Set when state is stuck in provisioning past the grace window (likely failed). */
+  likelyFailed?: boolean;
 }> {
   const row = await getDevContainerRow(projectId);
   if (!row) return { exists: false, state: 'absent', serviceUuid: null };
-  // Optional: poke Coolify for fresh state. Skipped for now to keep this
-  // hot path cheap; consumers that care can call getService(uuid) directly.
-  return { exists: true, state: row.state, serviceUuid: row.service_uuid };
+  const ageMs = Date.now() - row.created_at.getTime();
+  const ageSeconds = Math.floor(ageMs / 1000);
+  // If we already think it's running or suspended, return as-is. The
+  // touchActivity() call inside execInDevContainer keeps the row honest.
+  if (row.state !== 'provisioning') {
+    return { exists: true, state: row.state, serviceUuid: row.service_uuid, ageSeconds };
+  }
+  // State is 'provisioning'. The naive read-only return here used to
+  // create a deadlock: the AI polls status forever waiting for a flip
+  // that only happens via execInDevContainer. So instead, probe with
+  // a cheap `true` exec. If it succeeds, mark running and return.
+  // Coolify's service status alone isn't enough: Coolify reports
+  // 'running:unknown' for any service without a healthcheck/fqdn,
+  // which is every dev container. The exec is the source of truth.
+  if (isCoolifySshConfigured()) {
+    try {
+      const probe = await execInCoolifyApp({
+        appUuid: row.service_uuid,
+        service: 'vibn-dev',
+        command: 'true',
+        user: 'vibn',
+        timeoutMs: 5_000,
+      });
+      if (probe.exitCode === 0) {
+        await touchActivity(projectId);
+        return {
+          exists: true,
+          state: 'running',
+          serviceUuid: row.service_uuid,
+          ageSeconds,
+          selfHealed: true,
+        };
+      }
+    } catch {
+      // Exec failed; container probably not yet up. Fall through
+      // to age-based likelyFailed heuristic.
+    }
+  }
+  // If we've been "provisioning" for >120s, the container is almost
+  // certainly stuck (image pull failure, scheduling failure, etc.).
+  // Surface that distinct from "still booting" so the AI can stop
+  // polling and tell the user instead of looping.
+  const likelyFailed = ageSeconds > 120;
+  return {
+    exists: true,
+    state: row.state,
+    serviceUuid: row.service_uuid,
+    ageSeconds,
+    likelyFailed,
+  };
 }
 // Re-export getService so route handlers can pull live Coolify status
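
The probe-then-classify branch in the hunk above can be exercised without Coolify by injecting the exec and the row update. This is only a sketch under that assumption: classifyProvisioning, ProbeResult, ProvisioningVerdict, and the probe/markRunning parameters are hypothetical names standing in for execInCoolifyApp and touchActivity, and the 120s default mirrors the threshold in the diff.

```typescript
// Hypothetical, dependency-injected sketch; not the project's actual code.
type ProbeResult = { exitCode: number };

interface ProvisioningVerdict {
  state: 'provisioning' | 'running';
  selfHealed?: boolean;
  likelyFailed?: boolean;
}

async function classifyProvisioning(
  ageSeconds: number,
  probe: () => Promise<ProbeResult>, // stands in for the `true` exec
  markRunning: () => Promise<void>,  // stands in for touchActivity()
  graceSeconds = 120,
): Promise<ProvisioningVerdict> {
  try {
    const result = await probe();
    if (result.exitCode === 0) {
      // The exec landed, so the container is reachable: flip the row.
      await markRunning();
      return { state: 'running', selfHealed: true };
    }
  } catch {
    // Probe threw: treat it as "not reachable yet" and fall through.
  }
  // Strictly greater than the grace window, matching `ageSeconds > 120`.
  return { state: 'provisioning', likelyFailed: ageSeconds > graceSeconds };
}
```

Note the boundary: a row that is exactly 120 seconds old is still reported as booting, and only a non-zero exit code or a thrown error from the probe reaches the age check.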