Deduplication Use Cases (DUP) - ARCHIVED/FOUNDATIONS
Module Status: Archived as a standalone document task.
Orchestrator Role: These capabilities have been repurposed for the Visual Kanban to allow users to "Hide/Stack" duplicate enquiries and track conversation history across multiple legal entities.
📖 What is Deduplication?
Deduplication (or "dedup") is the process of finding and removing duplicate documents from your system. Think of it like cleaning up your phone's photo gallery — finding all those duplicate photos taking up space and keeping just one copy.
Why Does It Matter?
| Problem | Impact |
| --- | --- |
| Storage costs | Duplicates waste expensive storage space |
| Confusion | Multiple copies lead to version control nightmares |
| Compliance risk | Outdated duplicates may contain incorrect information |
| Search clutter | Finding the right document becomes harder |
Real-World Example: A hospital with 100,000 patient documents might have 15% duplicates — that's 15,000 files wasting storage and causing confusion. Deduplication can reclaim that space and ensure staff always find the right document.
⚠️ What's NOT in MVP
> [!IMPORTANT]
> The MVP focuses only on deduplication. The following features are planned for later phases:
| Feature | Phase | Timeline |
| --- | --- | --- |
| AI Chat Assistant | Phase 3 | Month 6-8 |
| Document Summarization | Phase 3 | Month 6-8 |
| Semantic Search | Phase 1 | Week 7-12 |
| Auto-Classification | Phase 1 | Week 7-12 |
| Workflow Automation | Phase 2 | Month 3-6 |
| E-Signature Integration | Phase 2 | Month 3-6 |
📖 How Deduplication Works
flowchart LR
A[📄 Document Upload] --> B{Same file exists?}
B -->|Yes| C[🔴 Exact Duplicate]
B -->|No| D{Similar content?}
D -->|Yes| E[🟡 Near Duplicate]
D -->|No| F[🟢 Unique Document]
C --> G[📊 Dedup Report]
E --> G
F --> H[✅ Store Document]
Step-by-step:
- Upload — A new document arrives in the system
- Fingerprint Check — We create a digital fingerprint (hash) and check if an identical file exists
- Content Comparison — If not identical, we compare the actual content for similar documents
- Report — All duplicates are flagged for review or automatic cleanup
📖 Three Types of Duplicates
| Type | What It Means | Example | How We Detect It |
| --- | --- | --- | --- |
| Exact Duplicate | Identical file, bit-for-bit | Same invoice uploaded twice | Digital fingerprint (hash) |
| Near Duplicate | Same content, minor differences | Invoice with updated date | Content comparison (AI) |
| Visual Duplicate | Same image, different size/quality | Logo resized or cropped | Image comparison |
📖 Key Terms Glossary
| Technical Term | Plain English |
| --- | --- |
| Hash | A digital fingerprint: a code generated from a file's contents that is, in practice, unique to those contents |
| Embedding | A content signature that captures the meaning of a document |
| Similarity Score | How alike two documents are, expressed as a percentage (0-100%) |
| Perceptual Hash (pHash) | A fingerprint for images that works even if the image is resized |
| Vector Database | A specialized database for finding similar items quickly |
🔧 Technical Use Cases
The following sections contain detailed technical specifications for developers.
Use Case Quick Reference
| ID | Title | Priority |
| --- | --- | --- |
| DUP-001 | Compute File Hash (MD5/SHA256) | P1 |
| DUP-002 | Hide Duplicate Cards | P1 |
| DUP-003 | Compute Document Embedding | P1 |
| DUP-004 | Find Near-Duplicates | P1 |
| DUP-005 | Compute Visual Similarity (Images) | P2 |
| DUP-006 | Merge Duplicate Records | P2 |
| DUP-007 | Archive/Delete Duplicates | P2 |
| DUP-008 | Generate Dedup Report | P2 |
UC-DUP-001: Compute File Hash
Overview
| Field | Value |
| --- | --- |
| ID | DUP-001 |
| Title | Compute File Hash (MD5/SHA256) |
| Actor | System |
| Priority | P1 (MVP Phase 1) |
Description
Calculate cryptographic hashes of uploaded files for exact duplicate detection.
Steps
- Read file in chunks (64KB)
- Update MD5 and SHA256 hash objects
- Finalize and store both hashes
Output
{
"hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
"hash_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
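The steps above can be sketched in Python as follows. This is a minimal illustration rather than the system's actual implementation; the function name `compute_file_hashes` and the dictionary return shape are assumptions chosen to mirror the Output sample.

```python
import hashlib
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # 64 KB, as in the steps above


def compute_file_hashes(stream: BinaryIO) -> dict:
    """Read a binary stream in chunks and return its MD5 and SHA-256 digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    # iter() with a sentinel stops when read() returns b"" at end of file.
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        md5.update(chunk)
        sha256.update(chunk)
    return {"hash_md5": md5.hexdigest(), "hash_sha256": sha256.hexdigest()}
```

Both digests are updated in the same pass, so the file is read only once. Note that the values in the Output sample are the digests of empty input, which makes them a convenient sanity check: `compute_file_hashes(io.BytesIO(b""))` should return exactly that pair.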
Acceptance Criteria
UC-DUP-002: Hide Duplicate Cards
Overview
| Field | Value |
| --- | --- |
| ID | DUP-002 |
| Title | Hide Duplicate Cards |
| Actor | System / Board UI |
| Priority | P1 (MVP Phase 1) |
Description
Instead of rejecting duplicates, the system detects them and hides or stacks them in the Visual Kanban so the user sees only unique information.
Steps
- Compute hash of incoming email/attachment.
- Query database for existing match.
- If a match is found:
  - Mark the new item with `is_hidden_duplicate = true`.
  - Link it to the primary (original) item.
  - UI action: do not spawn a new card; optionally increment "Dup Count" on the original card.
Output
{
"is_duplicate": true,
"action": "hide_from_board",
"primary_card_id": "card_123"
}
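A rough sketch of the hide-and-link flow is shown below. It uses an in-memory dictionary in place of the database query, and the `Card` and `Board` classes, the `dup_count` field, and the `"create_card"` action for non-duplicates are all hypothetical names introduced for illustration; only the duplicate response shape comes from the Output sample.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Card:
    card_id: str
    content_hash: str
    dup_count: int = 0  # hypothetical counter behind the "Dup Count" badge
    hidden_duplicates: List[str] = field(default_factory=list)


class Board:
    """In-memory stand-in for the database lookup and the Kanban board."""

    def __init__(self) -> None:
        self.cards_by_hash: Dict[str, Card] = {}

    def ingest(self, item_id: str, content_hash: str) -> dict:
        primary = self.cards_by_hash.get(content_hash)
        if primary is None:
            # No match: the item becomes a visible card of its own.
            self.cards_by_hash[content_hash] = Card(item_id, content_hash)
            return {"is_duplicate": False, "action": "create_card",
                    "primary_card_id": item_id}
        # Match found: link to the primary card and do not spawn a new one.
        primary.hidden_duplicates.append(item_id)
        primary.dup_count += 1
        return {"is_duplicate": True, "action": "hide_from_board",
                "primary_card_id": primary.card_id}
```

Keeping the hidden item linked to its primary, rather than discarding it, is what allows the board to stack duplicates and show a count without losing the underlying records.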
Acceptance Criteria
UC-DUP-003: Compute Document Embedding
Overview
| Field | Value |
| --- | --- |
| ID | DUP-003 |
| Title | Compute Document Embedding |
| Actor | Embedding Worker |
| Priority | P1 (MVP Phase 2) |
Description
Generate vector embedding from document text for semantic similarity detection.
Steps
- Retrieve extracted text from document
- Truncate/chunk if >512 tokens
- Pass through embedding model
- Store embedding in vector database
Model Configuration
| Setting | Value |
| --- | --- |
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Max tokens | 512 |
| Pooling | Mean |
Output
{
"document_id": "doc_abc",
"embedding_id": "emb_123",
"dimensions": 384,
"model": "all-MiniLM-L6-v2"
}
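The chunking step can be illustrated without loading the model. The sketch below is an assumption-heavy outline: whitespace splitting stands in for the model's real tokenizer, `embed_chunk` is a hypothetical callable that wraps whatever embedding model is configured, and averaging the chunk vectors is only one possible way to combine chunks (the source does not say how chunks are combined, and the "Mean" pooling setting most likely refers to token pooling inside the model).

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 512  # limit from the Model Configuration table

Vector = List[float]


def chunk_tokens(tokens: Sequence[str], max_tokens: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive windows of at most max_tokens."""
    return [list(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors element by element."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def embed_document(text: str, embed_chunk: Callable[[List[str]], Vector]) -> Vector:
    """Embed each chunk with the supplied callable and mean-pool the results."""
    # Assumption: whitespace tokens stand in for the model's tokenizer.
    tokens = text.split()
    chunks = chunk_tokens(tokens) or [[]]  # keep one empty chunk for empty text
    return mean_pool([embed_chunk(chunk) for chunk in chunks])
```

Passing the embedder in as a callable keeps the chunking logic testable with a stub and leaves the choice of model library open.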
Acceptance Criteria
UC-DUP-004: Find Near-Duplicates
Overview
| Field | Value |
| --- | --- |
| ID | DUP-004 |
| Title | Find Near-Duplicates |
| Actor | System |
| Priority | P1 (MVP Phase 2) |
Description
Identify documents with similar content using embedding similarity.
Steps
- Retrieve document embedding
- Query vector store for similar documents
- Apply similarity threshold (configurable)
- Return ranked list of candidates
Similarity Thresholds
| Threshold | Interpretation |
| --- | --- |
| >0.99 | Exact content duplicate |
| 0.95-0.99 | Near-duplicate (minor edits) |
| 0.85-0.95 | Similar document |
| <0.85 | Different document |
Output
{
"document_id": "doc_abc",
"near_duplicates": [
{
"id": "doc_xyz",
"similarity": 0.97,
"title": "Invoice #123 (v2)"
},
{
"id": "doc_def",
"similarity": 0.92,
"title": "Invoice #123 (draft)"
}
]
}
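As a sketch of the query and threshold steps, the following brute-force version scores every stored vector with cosine similarity. It assumes cosine similarity is the metric (the source says only "embedding similarity"), uses a plain dictionary in place of the vector store, and resolves the shared band edges by treating 0.95 as near-duplicate and 0.85 as similar; all function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def interpret(score: float) -> str:
    """Map a score to the bands in the Similarity Thresholds table."""
    if score > 0.99:
        return "exact_content_duplicate"
    if score >= 0.95:
        return "near_duplicate"
    if score >= 0.85:
        return "similar_document"
    return "different_document"


def find_near_duplicates(query: Sequence[float],
                         store: Dict[str, Sequence[float]],
                         threshold: float = 0.85) -> List[Tuple[str, float]]:
    """Return (document ID, score) pairs at or above threshold, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

A real vector database would replace the linear scan with an approximate nearest-neighbour index, but the thresholding and ranking would look much the same.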
Acceptance Criteria
UC-DUP-005: Compute Visual Similarity
Overview
| Field | Value |
| --- | --- |
| ID | DUP-005 |
| Title | Compute Visual Similarity (Images) |
| Actor | System |
| Priority | P2 (MVP Phase 3) |
Description
Detect duplicate images even with different resolutions or minor edits using perceptual hashing.
Hash Types
| Algorithm | Use Case |
| --- | --- |
| pHash | Photo similarity |
| dHash | Difference hash (fast) |
| aHash | Average hash (robust) |
Steps
- Resize image to standard size (8x8 or 16x16)
- Convert to grayscale
- Compute perceptual hash
- Store hash for comparison
- Query for similar hashes (Hamming distance)
Output
{
"document_id": "doc_img123",
"phash": "8f0f0f0f0f0f0f0f",
"similar_images": [
{
"id": "doc_img456",
"hamming_distance": 2,
"similarity": 0.97
}
]
}
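The hashing and comparison steps can be sketched for the simplest of the three algorithms, aHash. The example below assumes the image has already been resized to 8x8 and converted to grayscale (both need an imaging library and are omitted), and it assumes similarity is derived as one minus the Hamming distance divided by the hash length, which is consistent with the Output sample but not stated in the source.

```python
from typing import Sequence

HASH_BITS = 64  # an 8x8 grid yields a 64-bit hash


def average_hash(pixels: Sequence[Sequence[int]]) -> str:
    """aHash of an 8x8 grayscale grid: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    if len(flat) != HASH_BITS:
        raise ValueError("expected an 8x8 grid")
    mean = sum(flat) / HASH_BITS
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return f"{bits:016x}"  # 16 hex digits, same width as the sample "phash"


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def similarity(hash_a: str, hash_b: str) -> float:
    """Assumed mapping from Hamming distance to a 0-1 similarity score."""
    return 1 - hamming_distance(hash_a, hash_b) / HASH_BITS
```

Under this assumed mapping, a Hamming distance of 2 on a 64-bit hash gives 1 - 2/64 = 0.96875, which rounds to the 0.97 shown in the Output sample.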
Acceptance Criteria
UC-DUP-006: Merge Duplicate Records
Overview
| Field | Value |
| --- | --- |
| ID | DUP-006 |
| Title | Merge Duplicate Records |
| Actor | User |
| Priority | P2 |
Description
Allow users to manually merge duplicate documents, combining metadata and tags.
Steps
- User selects documents to merge
- Choose primary document
- Merge tags from all documents
- Update references to point to primary
- Archive or delete secondary documents
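The merge steps above might look like the following once the user has picked a primary document. The `Document` class and the `references` mapping (referrer ID to the document ID it points at) are hypothetical structures used only to make the sketch self-contained, and the secondaries are archived rather than deleted as one of the two options the steps allow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Document:
    doc_id: str
    tags: Set[str] = field(default_factory=set)
    archived: bool = False


def merge_documents(primary: Document,
                    secondaries: List[Document],
                    references: Dict[str, str]) -> Document:
    """Fold secondaries into primary and repoint references at the primary."""
    merged_ids = {doc.doc_id for doc in secondaries}
    for doc in secondaries:
        primary.tags |= doc.tags  # union of tags from all merged documents
        doc.archived = True       # archive instead of deleting
    # Only values change here, so mutating during iteration is safe.
    for referrer, target in references.items():
        if target in merged_ids:
            references[referrer] = primary.doc_id
    return primary
```

Repointing references before the secondaries disappear from view avoids dangling links from cards or other records that still mention the old document IDs.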
Acceptance Criteria
UC-DUP-007: Archive or Delete Duplicates
Overview
| Field | Value |
| --- | --- |
| ID | DUP-007 |
| Title | Archive or Delete Duplicates |
| Actor | User, Admin |
| Priority | P2 |
Description
Remove or archive confirmed duplicate documents.
Options
| Action | Behavior |
| --- | --- |
| Archive | Move to archive, retain metadata |
| Delete | Soft delete, recoverable |
| Purge | Hard delete, permanent |
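One way to model the three options is as state changes on a record, with only Purge removing data. This is an illustrative sketch: the `Record` class, the state names, and the `restore` helper are assumptions, and a dictionary stands in for the document store.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    ARCHIVE = "archive"
    DELETE = "delete"
    PURGE = "purge"


@dataclass
class Record:
    doc_id: str
    metadata: dict
    state: str = "active"  # "active", "archived", or "deleted"


def apply_action(store: Dict[str, Record], doc_id: str, action: Action) -> Optional[Record]:
    """Apply one of the table's actions; returns None once a record is purged."""
    record = store[doc_id]
    if action is Action.ARCHIVE:
        record.state = "archived"  # metadata is retained on the record
    elif action is Action.DELETE:
        record.state = "deleted"   # soft delete: record kept so it can be restored
    else:
        del store[doc_id]          # hard delete: nothing left to recover
        return None
    return record


def restore(store: Dict[str, Record], doc_id: str) -> Record:
    """Undo a soft delete; purged records are gone and raise KeyError."""
    record = store[doc_id]
    if record.state != "deleted":
        raise ValueError(f"{doc_id} is not soft-deleted")
    record.state = "active"
    return record
```

Treating Delete as a flag rather than a removal is what makes it recoverable, while Purge is the only path that actually drops the record.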
Acceptance Criteria
UC-DUP-008: Generate Dedup Report
Overview
| Field | Value |
| --- | --- |
| ID | DUP-008 |
| Title | Generate Dedup Report |
| Actor | User, Admin |
| Priority | P2 |
Description
Generate a report of detected duplicates and storage savings.
Report Contents
| Section | Details |
| --- | --- |
| Summary | Total docs, duplicates found, storage saved |
| Exact Duplicates | List with file sizes |
| Near-Duplicates | List with similarity scores |
| Recommendations | Suggested actions |
Output
{
"generated_at": "2024-01-15T10:00:00Z",
"summary": {
"total_documents": 10000,
"exact_duplicates": 1500,
"near_duplicates": 800,
"potential_savings_gb": 45.2
},
"exact_duplicate_groups": [...],
"near_duplicate_groups": [...]
}
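A report with this shape could be assembled roughly as follows. The sketch makes several assumptions that the source does not confirm: each group is a list of `{"id": ..., "size_bytes": ...}` entries whose first element is the copy to keep, potential savings count only exact duplicates, and a gigabyte is taken as 10^9 bytes.

```python
from datetime import datetime, timezone
from typing import Dict, List


def build_report(total_documents: int,
                 exact_groups: List[List[Dict]],
                 near_groups: List[List[Dict]]) -> dict:
    """Summarize duplicate groups; the first entry of each group is kept."""
    exact_dupes = [doc for group in exact_groups for doc in group[1:]]
    near_dupes = [doc for group in near_groups for doc in group[1:]]
    # Assumption: only exact duplicates are counted as reclaimable space.
    saved_bytes = sum(doc["size_bytes"] for doc in exact_dupes)
    return {
        "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "summary": {
            "total_documents": total_documents,
            "exact_duplicates": len(exact_dupes),
            "near_duplicates": len(near_dupes),
            "potential_savings_gb": round(saved_bytes / 1e9, 1),
        },
        "exact_duplicate_groups": exact_groups,
        "near_duplicate_groups": near_groups,
    }
```

Counting only the redundant copies in each group, not the retained one, keeps the duplicate totals and the savings figure from overstating what cleanup would actually remove.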
Acceptance Criteria