# Document Upload - Chunking Removed ✅

## Issue Found

Despite the Collector/Extractor refactor, document uploads were still auto-chunking files into semantic pieces.

## What Was Happening (Before)

```typescript
// upload-document/route.ts
const chunks = chunkDocument(content, {
  maxChunkSize: 2000,
  chunkOverlap: 200,
});

for (const [i, chunk] of chunks.entries()) {
  await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```

**Result:**

- 1 file upload → 5-10 separate knowledge_items
- Each chunk stored as a separate record
- Auto-chunking contradicted the Extractor AI's collaborative approach
## What Happens Now (After)

```typescript
// upload-document/route.ts
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['document', 'uploaded', 'pending_extraction'],
  },
});
```

**Result:**

- 1 file upload → 1 knowledge_item
- Whole document stored intact
- Tagged as `pending_extraction`
- Extractor AI will review and collaboratively chunk
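The `pending_extraction` tag is what the later Extractor phase keys on. A minimal sketch of selecting unprocessed documents — the `KnowledgeItem` interface here is a hypothetical mirror of the record created above, not the project's real type:

```typescript
// Hypothetical shape mirroring the record created above; the project's
// real KnowledgeItem type may differ.
interface KnowledgeItem {
  id: string;
  title: string;
  content: string;
  sourceMeta: { tags: string[] };
}

// Return only documents the Extractor AI has not yet reviewed.
function pendingExtraction(items: KnowledgeItem[]): KnowledgeItem[] {
  return items.filter((item) =>
    item.sourceMeta.tags.includes('pending_extraction'),
  );
}
```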
---

## Files Changed
### 1. `app/api/projects/[projectId]/knowledge/upload-document/route.ts`

**Removed:**

- `chunkDocument()` import and calls
- Loop creating multiple knowledge_items
- Chunk metadata tracking

**Added:**

- Single knowledge_item creation with full content
- `pending_extraction` tag
- Status tracking in contextSources

**Before:**

```typescript
const chunks = chunkDocument(content, {...});
for (const [i, chunk] of chunks.entries()) {
  const knowledgeItem = await createKnowledgeItem({
    title: `${file.name} (chunk ${i + 1}/${chunks.length})`,
    content: chunk.content,
  });
}
```
**After:**

```typescript
const knowledgeItem = await createKnowledgeItem({
  title: file.name,
  content: content, // Whole document
  sourceMeta: {
    tags: ['pending_extraction'],
  },
});
```
### 2. `app/[workspace]/project/[projectId]/context/page.tsx`

**Changed UI text:**

- **Before:** "Documents will be automatically chunked and processed for AI context."
- **After:** "Documents will be stored for the Extractor AI to review and process."

---
## User Experience Changes

### Upload Flow (Now):

1. User uploads `project-spec.md`
2. File saved to Firebase Storage
3. **Whole document** stored as 1 knowledge_item
4. Appears in Context page as "project-spec.md"
5. Tagged `pending_extraction`
### Extraction Flow (Later):

1. User says "Is that everything?" → AI transitions
2. Extractor AI mode activates
3. AI reads whole documents
4. AI asks: "I see this section about user roles - is this important for V1?"
5. User confirms: "Yes, that's critical"
6. AI calls `/api/projects/{id}/knowledge/chunk-insight`
7. Creates a targeted chunk tagged `extracted_insight`
8. Chunks stored in AlloyDB for retrieval
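Steps 6-7 above can be sketched as a payload builder. The field names (`knowledgeItemId`, `content`, `tags`) are illustrative assumptions — the actual body accepted by the chunk-insight route is defined by that route, not here:

```typescript
// Illustrative request body for the chunk-insight call; field names are
// assumptions, not the route's documented contract.
interface ChunkInsightPayload {
  knowledgeItemId: string; // the whole-document item the section came from
  content: string;         // the section the user confirmed as important
  tags: string[];
}

function buildChunkInsightPayload(
  sourceItemId: string,
  confirmedSection: string,
): ChunkInsightPayload {
  return {
    knowledgeItemId: sourceItemId,
    content: confirmedSection,
    tags: ['extracted_insight'], // matches step 7 above
  };
}
```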
---

## Why This Matters

### Before (Auto-chunking):

- ❌ System guessed what was important
- ❌ Over-chunked irrelevant sections
- ❌ Polluted the vector database with noise
- ❌ User had no control

### After (Collaborative):

- ✅ Extractor AI asks before chunking
- ✅ Only important sections are chunked
- ✅ User confirms what matters for V1
- ✅ Clean, relevant vector database
---

## API Response Changes

### Before:

```json
{
  "success": true,
  "chunkCount": 8,
  "knowledgeItemIds": ["id1", "id2", "id3", ...]
}
```

### After:

```json
{
  "success": true,
  "knowledgeItemId": "single_id",
  "status": "stored",
  "message": "Document stored. Extractor AI will review and chunk important sections."
}
```
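Callers migrating to the new shape can narrow the parsed response before use. A sketch of a type guard for the single-item response shown above:

```typescript
// Shape of the new upload response, taken from the example above.
interface UploadResponse {
  success: boolean;
  knowledgeItemId: string;
  status: string;
  message: string;
}

// Narrow an unknown parsed JSON value to the new response shape.
// The old multi-chunk shape (chunkCount/knowledgeItemIds) fails this check.
function isUploadResponse(value: unknown): value is UploadResponse {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.success === 'boolean' &&
    typeof v.knowledgeItemId === 'string' &&
    typeof v.status === 'string' &&
    typeof v.message === 'string'
  );
}
```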
---

## Database Structure

### Firestore - knowledge_items:

```json
{
  "id": "abc123",
  "projectId": "proj456",
  "sourceType": "imported_document",
  "title": "project-spec.md",
  "content": "< FULL DOCUMENT CONTENT >",
  "sourceMeta": {
    "filename": "project-spec.md",
    "tags": ["document", "uploaded", "pending_extraction"],
    "url": "https://storage.googleapis.com/..."
  }
}
```

### Firestore - contextSources:

```json
{
  "type": "document",
  "name": "project-spec.md",
  "summary": "Document (5423 characters) - pending extraction",
  "metadata": {
    "knowledgeItemId": "abc123",
    "status": "pending_extraction"
  }
}
```
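The `summary` string in contextSources can be derived directly from the stored content's length. A small sketch reproducing the wording of the example record (treat the exact phrasing as illustrative):

```typescript
// Build the contextSources summary for a document awaiting extraction,
// mirroring the example record above.
function pendingSummary(content: string): string {
  return `Document (${content.length} characters) - pending extraction`;
}
```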
---

## Testing Checklist

- [x] Remove chunking logic from upload endpoint
- [x] Update UI text to reflect new behavior
- [x] Verify whole document is stored
- [x] Confirm `pending_extraction` tag is set
- [ ] Test document upload with 3 files
- [ ] Verify Collector checklist updates
- [ ] Test Extractor AI reads full documents
- [ ] Test `/chunk-insight` API creates extracted chunks
---

## Related Documentation

- `TABLE_STAKES_IMPLEMENTATION.md` - Full feature implementation
- `COLLECTOR_EXTRACTOR_REFACTOR.md` - Refactor rationale
- `QA_FIXES_APPLIED.md` - QA testing results

---
## Status

✅ **Auto-chunking removed**

✅ **UI text updated**

✅ **Server restarted**

🔄 **Ready for testing**

The upload flow now correctly stores whole documents and defers chunking to the collaborative Extractor AI phase.