Document Upload - Chunking Removed ✅
Issue Found
Despite the Collector/Extractor refactor, document uploads were still auto-chunking files into semantic pieces.
What Was Happening (Before)
```typescript
// upload-document/route.ts
const chunks = chunkDocument(content, {
  maxChunkSize: 2000,
  chunkOverlap: 200,
});
for (const [i, chunk] of chunks.entries()) {
  await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```
Result:
- 1 file upload → 5-10 separate knowledge_items
- Each chunk stored as separate record
- Auto-chunking contradicted Extractor AI's collaborative approach
What Happens Now (After)
```typescript
// upload-document/route.ts
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['document', 'uploaded', 'pending_extraction'],
  },
});
```
Result:
- 1 file upload → 1 knowledge_item
- Whole document stored intact
- Tagged as `pending_extraction`
- Extractor AI will review and collaboratively chunk
Files Changed
1. app/api/projects/[projectId]/knowledge/upload-document/route.ts
Removed:
- `chunkDocument()` import and calls
- Loop creating multiple knowledge_items
- Chunk metadata tracking
Added:
- Single knowledge_item creation with full content
- `pending_extraction` tag
- Status tracking in contextSources
Before:
```typescript
const chunks = chunkDocument(content, {...});
for (const [i, chunk] of chunks.entries()) {
  const knowledgeItem = await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```
After:
```typescript
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['pending_extraction'],
  },
});
```
2. app/[workspace]/project/[projectId]/context/page.tsx
Changed UI text:
- Before: "Documents will be automatically chunked and processed for AI context."
- After: "Documents will be stored for the Extractor AI to review and process."
User Experience Changes
Upload Flow (Now):
- User uploads `project-spec.md`
- File saved to Firebase Storage
- Whole document stored as 1 knowledge_item
- Appears in Context page as "project-spec.md"
- Tagged `pending_extraction`
Extraction Flow (Later):
- User says "Is that everything?" → AI transitions
- Extractor AI mode activates
- AI reads whole documents
- AI asks: "I see this section about user roles - is this important for V1?"
- User confirms: "Yes, that's critical"
- AI calls `/api/projects/{id}/knowledge/chunk-insight`
- Creates targeted chunk as `extracted_insight`
- Chunks stored in AlloyDB for retrieval
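The chunk-insight call in the steps above could be sketched as a small request builder. This is illustrative only: the payload field names (`knowledgeItemId`, `content`, `tags`) and the helper itself are assumptions, not the actual API contract; only the route shape and the `extracted_insight` tag come from this document.

```typescript
// Hypothetical payload for the chunk-insight endpoint. Field names are
// assumptions for illustration; only the route and the "extracted_insight"
// tag are taken from this document.
interface ChunkInsightPayload {
  knowledgeItemId: string; // parent document, e.g. "abc123"
  content: string;         // the user-confirmed excerpt to chunk
  tags: string[];
}

// Build the URL and JSON body for a chunk-insight request, making sure
// the "extracted_insight" tag is always present.
function buildChunkInsightRequest(
  projectId: string,
  payload: ChunkInsightPayload
): { url: string; body: string } {
  const tags = payload.tags.includes('extracted_insight')
    ? payload.tags
    : [...payload.tags, 'extracted_insight'];
  return {
    url: `/api/projects/${projectId}/knowledge/chunk-insight`,
    body: JSON.stringify({ ...payload, tags }),
  };
}
```

A caller would then POST `body` to `url`; the key point is that each call creates one targeted chunk rather than a batch.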
Why This Matters
Before (Auto-chunking):
- ❌ System guessed what's important
- ❌ Over-chunked irrelevant sections
- ❌ Polluted vector database with noise
- ❌ User had no control
After (Collaborative):
- ✅ Extractor AI asks before chunking
- ✅ Only important sections chunked
- ✅ User confirms what matters for V1
- ✅ Clean, relevant vector database
API Response Changes
Before:
```json
{
  "success": true,
  "chunkCount": 8,
  "knowledgeItemIds": ["id1", "id2", "id3", ...]
}
```
After:
```json
{
  "success": true,
  "knowledgeItemId": "single_id",
  "status": "stored",
  "message": "Document stored. Extractor AI will review and chunk important sections."
}
```
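The two response shapes can be written as TypeScript types, with a small guard to detect the new single-item format. The type names and the guard are illustrative sketches assuming only the fields shown in the examples above.

```typescript
// Old batch-chunking response shape (removed).
interface UploadResponseBefore {
  success: boolean;
  chunkCount: number;
  knowledgeItemIds: string[];
}

// New single-document response shape.
interface UploadResponseAfter {
  success: boolean;
  knowledgeItemId: string;
  status: 'stored';
  message: string;
}

// Narrow an unknown upload response to the new single-item shape.
function isSingleItemResponse(res: unknown): res is UploadResponseAfter {
  return (
    typeof res === 'object' &&
    res !== null &&
    typeof (res as UploadResponseAfter).knowledgeItemId === 'string' &&
    (res as UploadResponseAfter).status === 'stored'
  );
}
```

A client can use the guard to handle responses from both the old and new endpoint versions during rollout.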
Database Structure
Firestore - knowledge_items:
```json
{
  "id": "abc123",
  "projectId": "proj456",
  "sourceType": "imported_document",
  "title": "project-spec.md",
  "content": "< FULL DOCUMENT CONTENT >",
  "sourceMeta": {
    "filename": "project-spec.md",
    "tags": ["document", "uploaded", "pending_extraction"],
    "url": "https://storage.googleapis.com/..."
  }
}
```
Firestore - contextSources:
```json
{
  "type": "document",
  "name": "project-spec.md",
  "summary": "Document (5423 characters) - pending extraction",
  "metadata": {
    "knowledgeItemId": "abc123",
    "status": "pending_extraction"
  }
}
```
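A sketch of the two Firestore document shapes as TypeScript interfaces, plus a small helper for the pending-extraction check. Field names mirror the JSON examples above; the interface names and the helper are assumptions for illustration.

```typescript
// Shape of a knowledge_items document (mirrors the JSON example above).
interface KnowledgeItem {
  id: string;
  projectId: string;
  sourceType: string;       // e.g. "imported_document"
  title: string;
  content: string;          // full document text, not a chunk
  sourceMeta: {
    filename: string;
    tags: string[];         // e.g. ["document", "uploaded", "pending_extraction"]
    url: string;            // storage download URL
  };
}

// Shape of a contextSources entry (mirrors the JSON example above).
interface ContextSource {
  type: string;             // e.g. "document"
  name: string;
  summary: string;
  metadata: {
    knowledgeItemId: string; // back-reference to knowledge_items
    status: string;          // e.g. "pending_extraction"
  };
}

// Hypothetical helper: is this document still awaiting Extractor review?
function isPendingExtraction(item: KnowledgeItem): boolean {
  return item.sourceMeta.tags.includes('pending_extraction');
}
```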
Testing Checklist
- Remove chunking logic from upload endpoint
- Update UI text to reflect new behavior
- Verify whole document is stored
- Confirm `pending_extraction` tag is set
- Test document upload with 3 files
- Verify Collector checklist updates
- Test Extractor AI reads full documents
- Test `/chunk-insight` API creates extracted chunks
Related Documentation
- TABLE_STAKES_IMPLEMENTATION.md - Full feature implementation
- COLLECTOR_EXTRACTOR_REFACTOR.md - Refactor rationale
- QA_FIXES_APPLIED.md - QA testing results
Status
- ✅ Auto-chunking removed
- ✅ UI text updated
- ✅ Server restarted
- 🔄 Ready for testing
The upload flow now correctly stores whole documents and defers chunking to the collaborative Extractor AI phase.