fix(ai): stop the AI from forking duplicate services to escape errors

Three changes that compound to fix the "4 orphan twenty-* services"
problem we just hit:

1. apps_create is now idempotent within a project. If a service from
   the same template already exists in the same Vibn projectId, return
   it with alreadyExisted: true instead of creating a clone. Pass
   { force: true } to opt out for legitimate dev/staging duplicates.

2. New apps_unstick tool. SSH-cleans orphan Docker containers
   matching the resource UUID so a deploy that hit "Conflict.
   The container name X is already in use" can recover without
   deleting the entire service.

3. System prompt hardened with two new hard rules:
   - ALWAYS apps_list before apps_create (idempotency in spirit, not
     just at the API boundary)
   - NEVER delete-and-recreate a service to escape an error. The
     recovery for container conflicts is apps_unstick + apps_deploy.
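
For readers skimming the log, the dedup rule in change 1 can be sketched roughly as below. This is a minimal in-memory illustration, not the shipped handler: the `Service` shape, the `services` array, and the `svc-N` uuid scheme are hypothetical stand-ins for the Coolify lookups the real code performs.

```typescript
// Hypothetical in-memory model; the real handler queries Coolify instead.
interface Service { uuid: string; name: string; template: string; projectId: string }
interface CreateParams { projectId: string; template: string; name?: string; force?: boolean }
interface CreateResult { uuid: string; name: string; template: string; alreadyExisted: boolean }

const services: Service[] = [];
let nextId = 1;

function appsCreate(params: CreateParams): CreateResult {
  const { projectId, template, force = false } = params;

  // Dedup: same template in the same project returns the existing service,
  // unless the caller opted out with force.
  if (!force) {
    const existing = services.find(
      (s) => s.projectId === projectId && s.template === template,
    );
    if (existing) {
      return { uuid: existing.uuid, name: existing.name, template, alreadyExisted: true };
    }
  }

  // No match (or force was set): create a fresh service.
  const created: Service = {
    uuid: `svc-${nextId++}`, // illustrative id scheme only
    name: params.name ?? template,
    template,
    projectId,
  };
  services.push(created);
  return { uuid: created.uuid, name: created.name, template, alreadyExisted: false };
}
```

Under these assumptions, a second call with the same projectId and template returns the first service with alreadyExisted: true, while { force: true } or a different projectId creates a new one.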

Already removed the 3 duplicate twenty-* services from prod
(kept twenty-live, the freshest healthy instance). Frees ~9 GB RAM on the host.

Made-with: Cursor
2026-04-29 20:27:52 -07:00
parent 14d0b04112
commit 3d525afdf7
3 changed files with 172 additions and 0 deletions

@@ -233,7 +233,10 @@ When you write to Plan, the user does NOT need a long acknowledgment. "Logged th
## Hard rules (non-negotiable)
- ALWAYS pass \`projectId\` to \`apps_create\` and \`databases_create\`. If the user didn't say which project, infer from context (active project, last-mentioned, only one in workspace) — only ask if genuinely ambiguous.
- ALWAYS call \`apps_list { projectId }\` BEFORE \`apps_create\` to check if the thing already exists. \`apps_create\` is idempotent within a project (returns \`alreadyExisted: true\` for duplicate templates), but you should check first so the user sees you being thoughtful — not "deploy stuff and hope."
- ALWAYS call \`apps_templates_search\` BEFORE \`apps_create\` when the user names a known third-party app. Hand-rolling a Dockerfile when a maintained template exists is how supply-chain bugs ship.
- **NEVER delete-and-recreate a service to escape an error.** When a deploy fails with "Conflict. The container name … is already in use" or any orphan-container symptom, the recovery is: \`apps_unstick { uuid }\` → \`apps_deploy { uuid }\`. Deleting the service to side-step the conflict creates a new uuid with new container names AND leaves the orphan running AND forks a duplicate stack. We've shipped 4 orphan twenty-* services this way before. Don't repeat it.
- **If a deploy fails twice in a row with the same error, STOP.** Don't loop. Surface the error and the two recovery attempts you've already tried, and ask the user how to proceed.
- Destructive ops (\`*_delete\`, \`*_volumes_wipe\`) require \`confirm\` equal to the resource's exact name. Always fetch the name first with a \`*_get\` call. Confirm with the user before executing irreversible deletes unless they explicitly said "delete X".
- Long-running ops (deploys, DNS provisioning, db provisioning) take 15 min. Tell the user up front so they don't think you're stuck. Don't poll in a tight loop — it wastes tool rounds.
- After a \`ship\` or \`apps.deploy\`, the result is authoritative. Don't call gitea_*, shell_exec, or apps_* to "verify" — read the response and report.
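
The two recovery rules above (unstick then redeploy, and stop after the same error twice in a row) describe a control flow that can be sketched in code. This is only an illustration of the intended behaviour, not part of the prompt or the tool layer: the `Tools` interface, `deployWithRecovery`, the `Outcome` shape, and the `maxAttempts` bound are all hypothetical.

```typescript
// Hypothetical tool signatures; the real tools return richer payloads.
interface DeployResult { ok: boolean; error?: string }
interface Tools {
  appsDeploy(uuid: string): Promise<DeployResult>;
  appsUnstick(uuid: string): Promise<void>;
}
interface Outcome { status: 'deployed' | 'needs-user'; attempts: number; error?: string }

// Matches the Docker conflict message quoted in the rules above.
const CONTAINER_CONFLICT = /container name .* is already in use/i;

async function deployWithRecovery(
  tools: Tools,
  uuid: string,
  maxAttempts = 3, // assumed safety bound, not taken from the source
): Promise<Outcome> {
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await tools.appsDeploy(uuid);
    if (res.ok) return { status: 'deployed', attempts: attempt };

    const error = res.error ?? 'unknown error';
    // Same error twice in a row, or an error with no automatic recovery:
    // stop and surface it instead of looping.
    if (error === lastError || !CONTAINER_CONFLICT.test(error)) {
      return { status: 'needs-user', attempts: attempt, error };
    }
    lastError = error;
    // Container-name conflict: clean the orphans, then retry the SAME uuid.
    await tools.appsUnstick(uuid);
  }
  return { status: 'needs-user', attempts: maxAttempts, error: lastError };
}
```

Note that the sketch never deletes or recreates anything: every retry targets the original uuid, which is the point of the rule.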

@@ -293,6 +293,8 @@ export async function POST(request: Request) {
return await toolAppsContainersPs(principal, params);
case 'apps.repair':
return await toolAppsRepair(principal, params);
case 'apps.unstick':
return await toolAppsUnstick(principal, params);
case 'apps.templates.list':
return await toolAppsTemplatesList(params);
case 'apps.templates.search':
@@ -1298,6 +1300,34 @@ async function toolAppsCreate(principal: Principal, params: Record<string, any>)
}, { status: 404 });
}
// ── Idempotency: don't fan out duplicate services into the same
// Coolify project. If a service with the same template already
// exists, return it instead of creating a 4th twenty-* clone.
// Use force=true to bypass dedup only when the caller really wants
// multiple instances (e.g. dev/staging copies of the same template).
const force = params.force === true || params.force === 'true';
if (params.projectId && !force) {
const existing = await findExistingTemplateService(
targetCoolifyProjectUuid,
templateSlug,
);
if (existing) {
await linkIfRequested(existing.uuid, 'service');
return NextResponse.json({
result: {
uuid: existing.uuid,
name: existing.name,
template: templateSlug,
alreadyExisted: true,
summaryHint:
`A "${templateSlug}" service already exists in this project as "${existing.name}" (uuid ${existing.uuid}). Returning it instead of creating a duplicate. ` +
`If the user wanted a SECOND independent instance, re-call apps_create with { force: true }. ` +
`If the existing one is broken, call apps_unstick { uuid } and then apps_deploy { uuid } — DO NOT delete-and-recreate.`,
},
});
}
}
const appName = slugify(String(params.name ?? templateSlug));
const fqdn = resolveFqdn(params.domain, ws.slug, appName);
if (fqdn instanceof NextResponse) return fqdn;
@@ -3858,3 +3888,121 @@ async function toolPlanDecisionLog(principal: Principal, params: Record<string,
await writePlanForProject(projectId, plan);
return NextResponse.json({ result: { ok: true, decision, summaryHint: `Decision logged to Plan → Decisions. Brief acknowledgment only.` } });
}
// ── Idempotency + unstick helpers ─────────────────────────────────────────────
/**
* Find an existing service in the given Coolify project that was created
* from the same template. Used to short-circuit duplicate apps_create
* calls when the AI doesn't realize Twenty CRM is already deployed.
*/
async function findExistingTemplateService(
coolifyProjectUuid: string,
templateSlug: string,
): Promise<{ uuid: string; name: string } | null> {
try {
const services = await listServicesInProject(coolifyProjectUuid);
for (const s of services) {
// Coolify stores the template slug as `service_type` on the row.
// Some older services may have it under `type`. Match either, and
// fall back to a substring match on the raw compose file if neither is set.
const type =
(s as any).service_type ||
(s as any).type ||
(typeof (s as any).docker_compose_raw === 'string' &&
(s as any).docker_compose_raw.includes(templateSlug)
? templateSlug
: null);
if (type === templateSlug) {
return { uuid: s.uuid, name: s.name };
}
}
} catch (e) {
console.warn('[findExistingTemplateService] failed', e);
}
return null;
}
/**
* apps.unstick — recover a service stuck on a "container name already
* in use" Docker conflict. Force-removes the orphan containers (and
* optionally their anonymous volumes), then returns. Caller should then re-call
* apps.deploy to bring the stack back up.
*
* This is the RIGHT recovery path. The WRONG one (and what the AI was
* doing before the system-prompt update) is to delete the service and
* recreate a new one with a fresh UUID — which side-steps the conflict
* by creating new container names but leaves the orphan running and
* forks a duplicate copy of the stack.
*/
async function toolAppsUnstick(principal: Principal, params: Record<string, any>) {
const uuid = String(params.uuid ?? '').trim();
if (!uuid) return NextResponse.json({ error: 'uuid required' }, { status: 400 });
if (!isCoolifySshConfigured()) {
return NextResponse.json({
error:
'Coolify SSH is not configured on this deploy. Cannot reach the host to clean orphan containers.',
}, { status: 503 });
}
// Resolve the resource to confirm tenancy + grab its name.
let resourceName = '';
let kind: 'application' | 'service' | 'database' = 'application';
try {
const app = await getApplicationInWorkspace(uuid, principal.workspace);
if (app) { resourceName = app.name; kind = 'application'; }
} catch {}
if (!resourceName) {
try {
const svc = await getServiceInWorkspace(uuid, principal.workspace);
if (svc) { resourceName = svc.name; kind = 'service'; }
} catch {}
}
if (!resourceName) {
try {
const db = await getDatabaseInWorkspace(uuid, principal.workspace);
if (db) { resourceName = db.name; kind = 'database'; }
} catch {}
}
if (!resourceName) {
return NextResponse.json({ error: 'Resource not found in this workspace' }, { status: 404 });
}
const wipeVolumes = params.wipeVolumes === true || params.wipeVolumes === 'true';
// All Coolify-managed containers for a resource carry its UUID as a
// suffix on the container name (e.g. postgres-<uuid>, twenty-<uuid>,
// worker-<uuid>). One docker rm -f against any name ending in -<uuid>
// catches every container in the stack.
const filter = `name=-${uuid}$`;
const cmd = wipeVolumes
? `docker ps -a --filter '${filter}' -q | xargs -r docker rm -f -v`
: `docker ps -a --filter '${filter}' -q | xargs -r docker rm -f`;
let removed: string[] = [];
let stderr = '';
try {
const result = await runOnCoolifyHost(
`docker ps -a --filter '${filter}' --format '{{.Names}}' | tee /tmp/unstick-${uuid}.txt; ` + cmd
);
removed = (result.stdout || '').split('\n').filter(Boolean).filter((l) => l.includes(`-${uuid}`));
stderr = result.stderr || '';
} catch (e) {
return NextResponse.json({
error: `Failed to clean orphan containers: ${e instanceof Error ? e.message : String(e)}`,
}, { status: 500 });
}
return NextResponse.json({
result: {
uuid,
name: resourceName,
kind,
removedContainers: removed,
wipeVolumes,
stderr: stderr || undefined,
summaryHint:
removed.length === 0
? `No orphan containers found for ${resourceName} (uuid ${uuid}). The conflict may be elsewhere — check apps_logs.`
: `Cleaned ${removed.length} orphan container(s) for ${resourceName}: ${removed.join(', ')}. Now call apps_deploy { uuid: "${uuid}" } to bring the stack back up. Do NOT delete the service.`,
},
});
}
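
Because `toolAppsUnstick` interpolates the uuid into a shell command, one possible hardening is to validate its shape before building the string. The workspace lookup in the handler already limits the uuid to known resources, so this is defence in depth rather than a required fix. The sketch below assumes resource UUIDs are short alphanumeric identifiers (hyphens allowed); that pattern is a guess and should be checked against real Coolify UUIDs. `buildUnstickCommand` and `parseRemovedNames` are hypothetical helpers, not functions from this commit.

```typescript
// Assumption: UUIDs contain only letters, digits, and hyphens.
const SAFE_UUID = /^[A-Za-z0-9-]{8,64}$/;

// Builds the same docker pipeline as the handler, but refuses any uuid
// that could break out of the single-quoted filter argument.
function buildUnstickCommand(uuid: string, wipeVolumes = false): string {
  if (!SAFE_UUID.test(uuid)) {
    throw new Error(`Refusing to build a shell command from uuid ${JSON.stringify(uuid)}`);
  }
  const filter = `name=-${uuid}$`;
  const rm = wipeVolumes ? 'docker rm -f -v' : 'docker rm -f';
  return `docker ps -a --filter '${filter}' -q | xargs -r ${rm}`;
}

// Keeps only lines that look like container names for this resource.
// Uses endsWith rather than includes, which is slightly stricter than
// the handler's filter; container IDs echoed by docker rm are dropped.
function parseRemovedNames(stdout: string, uuid: string): string[] {
  return stdout
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.endsWith(`-${uuid}`));
}
```

With these helpers the handler could build `cmd` via `buildUnstickCommand(uuid, wipeVolumes)` and return a 400 when it throws, keeping the rest of the flow unchanged.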

@@ -244,6 +244,27 @@ Auto-domain {name}.{workspace}.vibnai.com is assigned automatically.`,
required: ['uuid', 'fqdn', 'publicAppName'],
},
},
{
name: 'apps_unstick',
description: `Recover a service stuck on a Docker "container name already in use" conflict. Force-removes orphan containers (everything matching name suffix -<uuid>) so the next apps_deploy can boot clean.
USE THIS — DO NOT delete-and-recreate the service. Deleting and re-creating produces a NEW uuid + NEW container names, which side-steps the conflict but leaves the orphan running AND forks a duplicate copy of the stack. We've burned ourselves on this before (4 orphan twenty-* services, 12GB RAM eaten).
Recipe when a deploy fails with "Conflict. The container name X is already in use":
1. apps_unstick { uuid: "<service-uuid>" }
2. apps_deploy { uuid: "<service-uuid>" }
3. apps_get { uuid: "<service-uuid>" } to confirm fqdn / status.
Pass wipeVolumes: true ONLY if the user explicitly said "nuke the data".`,
parameters: {
type: 'OBJECT',
properties: {
uuid: { type: 'STRING', description: 'The Coolify service / app / database UUID.' },
wipeVolumes: { type: 'BOOLEAN', description: 'If true, also remove anonymous volumes (data loss). Default false.' },
},
required: ['uuid'],
},
},
{
name: 'apps_templates_list',
description: 'Browse the Coolify one-click template catalog (320+ apps: CRMs, AI tools, CMSes, dashboards, databases). Each is deployable via apps_create with { template: slug }.',