
Deduplication Use Cases (DUP) - ARCHIVED/FOUNDATIONS

Module Status: Archived as a standalone document task. Orchestrator Role: These capabilities have been repurposed for the Visual Kanban to allow users to "Hide/Stack" duplicate enquiries and track conversation history across multiple legal entities.


📖 What is Deduplication?

Deduplication (or "dedup") is the process of finding and removing duplicate documents from your system. Think of it like cleaning up your phone's photo gallery — finding all those duplicate photos taking up space and keeping just one copy.

Why Does It Matter?

| Problem | Impact |
| --- | --- |
| Storage costs | Duplicates waste expensive storage space |
| Confusion | Multiple copies lead to version control nightmares |
| Compliance risk | Outdated duplicates may contain incorrect information |
| Search clutter | Finding the right document becomes harder |

Real-World Example: A hospital with 100,000 patient documents might have 15% duplicates — that's 15,000 files wasting storage and causing confusion. Deduplication can reclaim that space and ensure staff always find the right document.

⚠️ What's NOT in MVP

> [!IMPORTANT]
> The MVP focuses only on deduplication. The following features are planned for later phases:

| Feature | Phase | Timeline |
| --- | --- | --- |
| AI Chat Assistant | Phase 3 | Month 6-8 |
| Document Summarization | Phase 3 | Month 6-8 |
| Semantic Search | Phase 1 | Week 7-12 |
| Auto-Classification | Phase 1 | Week 7-12 |
| Workflow Automation | Phase 2 | Month 3-6 |
| E-Signature Integration | Phase 2 | Month 3-6 |

📖 How Deduplication Works

```mermaid
flowchart LR
    A[📄 Document Upload] --> B{Same file exists?}
    B -->|Yes| C[🔴 Exact Duplicate]
    B -->|No| D{Similar content?}
    D -->|Yes| E[🟡 Near Duplicate]
    D -->|No| F[🟢 Unique Document]
    C --> G[📊 Dedup Report]
    E --> G
    F --> H[✅ Store Document]
```

Step-by-step:

  1. Upload — A new document arrives in the system
  2. Fingerprint Check — We create a digital fingerprint (hash) and check if an identical file exists
  3. Content Comparison — If not identical, we compare the actual content for similar documents
  4. Report — All duplicates are flagged for review or automatic cleanup
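
A minimal Python sketch of this pipeline, assuming an in-memory hash index; `find_similar_content` is a hypothetical placeholder for the embedding-based comparison detailed in DUP-003/004:

```python
import hashlib

seen_hashes: dict[str, str] = {}   # sha256 digest -> document id (stand-in for a real index)

def find_similar_content(data: bytes) -> list[str]:
    """Placeholder for the embedding-based content comparison (see DUP-003/004)."""
    return []

def check_document(doc_id: str, data: bytes) -> str:
    """Classify an incoming document as an exact duplicate, near duplicate, or unique."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return f"exact duplicate of {seen_hashes[digest]}"
    seen_hashes[digest] = doc_id
    return "near duplicate" if find_similar_content(data) else "unique"

print(check_document("doc_1", b"hello"))   # unique
print(check_document("doc_2", b"hello"))   # exact duplicate of doc_1
```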

📖 Three Types of Duplicates

| Type | What It Means | Example | How We Detect It |
| --- | --- | --- | --- |
| Exact Duplicate | Identical file, bit-for-bit | Same invoice uploaded twice | Digital fingerprint (hash) |
| Near Duplicate | Same content, minor differences | Invoice with updated date | Content comparison (AI) |
| Visual Duplicate | Same image, different size/quality | Logo resized or cropped | Image comparison |

📖 Key Terms Glossary

| Technical Term | Plain English |
| --- | --- |
| Hash | A digital fingerprint — a unique code generated from a file's contents |
| Embedding | A content signature that captures the meaning of a document |
| Similarity Score | How alike two documents are, expressed as a percentage (0-100%) |
| Perceptual Hash (pHash) | A fingerprint for images that works even if the image is resized |
| Vector Database | A specialized database for finding similar items quickly |

🔧 Technical Use Cases

The following sections contain detailed technical specifications for developers.


Use Case Quick Reference

| ID | Title | Priority |
| --- | --- | --- |
| DUP-001 | Compute File Hash (MD5/SHA256) | P1 |
| DUP-002 | Hide Duplicate Cards | P1 |
| DUP-003 | Compute Document Embedding | P1 |
| DUP-004 | Find Near-Duplicates | P1 |
| DUP-005 | Compute Visual Similarity (Images) | P2 |
| DUP-006 | Merge Duplicate Records | P2 |
| DUP-007 | Archive/Delete Duplicates | P2 |
| DUP-008 | Generate Dedup Report | P2 |

UC-DUP-001: Compute File Hash

Overview

| Field | Value |
| --- | --- |
| ID | DUP-001 |
| Title | Compute File Hash (MD5/SHA256) |
| Actor | System |
| Priority | P1 (MVP Phase 1) |

Description

Calculate cryptographic hashes of uploaded files for exact duplicate detection.

Steps

  1. Read file in chunks (64KB)
  2. Update MD5 and SHA256 hash objects
  3. Finalize and store both hashes
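
A minimal sketch of these steps using Python's standard hashlib; the function name and return shape are illustrative:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB chunks, as in step 1

def compute_hashes(path: str) -> dict[str, str]:
    """Stream the file in chunks and return both MD5 and SHA256 digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            md5.update(chunk)
            sha256.update(chunk)
    return {"hash_md5": md5.hexdigest(), "hash_sha256": sha256.hexdigest()}
```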

Output

```json
{
  "hash_md5": "d41d8cd98f00b204e9800998ecf8427e",
  "hash_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
```

Acceptance Criteria

  • Hash computed for all uploaded files
  • Processing time <1s for 100MB files
  • Both MD5 and SHA256 stored

UC-DUP-002: Hide Duplicate Cards

Overview

| Field | Value |
| --- | --- |
| ID | DUP-002 |
| Title | Hide Duplicate Cards |
| Actor | System / Board UI |
| Priority | P1 (MVP Phase 1) |

Description

Instead of rejecting duplicates, the system detects them and hides or stacks them in the Visual Kanban so the user sees only unique information.

Steps

  1. Compute hash of incoming email/attachment.
  2. Query database for existing match.
  3. If Match Found:
    • Mark new item with is_hidden_duplicate = true.
    • Link to primary (original) item.
    • UI Action: Do not spawn a new card; optionally increment "Dup Count" on original card.
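
A sketch of this decision logic in Python, assuming a pre-built `cards_by_hash` lookup from the card store (a hypothetical structure); the return payload mirrors the Output below:

```python
def handle_incoming(item_hash: str, cards_by_hash: dict[str, str]) -> dict:
    """Hide the new item if an existing card already carries the same hash."""
    primary_card_id = cards_by_hash.get(item_hash)
    if primary_card_id is not None:
        return {
            "is_duplicate": True,
            "action": "hide_from_board",          # UI: do not spawn a new card
            "primary_card_id": primary_card_id,   # link back to the original card
        }
    return {"is_duplicate": False, "action": "create_card"}
```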

Output

```json
{
  "is_duplicate": true,
  "action": "hide_from_board",
  "primary_card_id": "card_123"
}
```

Acceptance Criteria

  • Duplicate emails do not clutter the Kanban board.
  • Users can toggle "Show Duplicates" filter if needed.
  • Original card indicates how many duplicates it holds (e.g., "Contains 2 duplicates").

UC-DUP-003: Compute Document Embedding

Overview

| Field | Value |
| --- | --- |
| ID | DUP-003 |
| Title | Compute Document Embedding |
| Actor | Embedding Worker |
| Priority | P1 (MVP Phase 2) |

Description

Generate vector embedding from document text for semantic similarity detection.

Steps

  1. Retrieve extracted text from document
  2. Truncate/chunk if >512 tokens
  3. Pass through embedding model
  4. Store embedding in vector database

Model Configuration

| Setting | Value |
| --- | --- |
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Max tokens | 512 |
| Pooling | Mean |
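
A minimal sketch using the sentence-transformers library with the configuration above; the chunking policy and function name are illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim output, mean pooling by default

def embed_document(text: str) -> list[float]:
    """Return a 384-dim embedding for the extracted document text."""
    # Input beyond the model's max sequence length is truncated;
    # long documents may instead be split into chunks and embedded separately.
    vector = model.encode(text, normalize_embeddings=True)
    return vector.tolist()
```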

Output

```json
{
  "document_id": "doc_abc",
  "embedding_id": "emb_123",
  "dimensions": 384,
  "model": "all-MiniLM-L6-v2"
}
```

Acceptance Criteria

  • Embeddings generated for all text-extracted documents
  • Stored in Qdrant for similarity search
  • Processing time <2s per document

UC-DUP-004: Find Near-Duplicates

Overview

| Field | Value |
| --- | --- |
| ID | DUP-004 |
| Title | Find Near-Duplicates |
| Actor | System |
| Priority | P1 (MVP Phase 2) |

Description

Identify documents with similar content using embedding similarity.

Steps

  1. Retrieve document embedding
  2. Query vector store for similar documents
  3. Apply similarity threshold (configurable)
  4. Return ranked list of candidates

Similarity Thresholds

| Threshold | Interpretation |
| --- | --- |
| >0.99 | Exact content duplicate |
| 0.95-0.99 | Near-duplicate (minor edits) |
| 0.85-0.95 | Similar document |
| <0.85 | Different document |
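
A sketch of the lookup against Qdrant (the vector store named in DUP-003) using the qdrant-client package; the collection name, URL, and default threshold are illustrative:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")   # assumed local Qdrant instance

def find_near_duplicates(embedding: list[float], threshold: float = 0.95) -> list[dict]:
    """Return documents whose similarity to the query vector exceeds the threshold."""
    hits = client.search(
        collection_name="documents",        # illustrative collection name
        query_vector=embedding,
        limit=10,
        score_threshold=threshold,          # Qdrant drops lower-scoring hits
    )
    return [{"id": hit.id, "similarity": hit.score} for hit in hits]
```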

Output

```json
{
  "document_id": "doc_abc",
  "near_duplicates": [
    {
      "id": "doc_xyz",
      "similarity": 0.97,
      "title": "Invoice #123 (v2)"
    },
    {
      "id": "doc_def",
      "similarity": 0.92,
      "title": "Invoice #123 (draft)"
    }
  ]
}
```

Acceptance Criteria

  • Near-duplicates detected with >90% precision
  • Query time <100ms
  • Threshold is configurable

UC-DUP-005: Compute Visual Similarity

Overview

| Field | Value |
| --- | --- |
| ID | DUP-005 |
| Title | Compute Visual Similarity (Images) |
| Actor | System |
| Priority | P2 (MVP Phase 3) |

Description

Detect duplicate images even with different resolutions or minor edits using perceptual hashing.

Hash Types

| Algorithm | Use Case |
| --- | --- |
| pHash | Perceptual hash (DCT-based), most robust to edits |
| dHash | Difference hash, fast gradient-based comparison |
| aHash | Average hash, simplest and quickest to compute |

Steps

  1. Resize image to standard size (8x8 or 16x16)
  2. Convert to grayscale
  3. Compute perceptual hash
  4. Store hash for comparison
  5. Query for similar hashes (Hamming distance)
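
A sketch of these steps using the Pillow and imagehash packages; the similarity conversion assumes the default 64-bit hash, so a Hamming distance of 2 maps to roughly 0.97 as in the Output below:

```python
from PIL import Image
import imagehash

def perceptual_hash(path: str) -> imagehash.ImageHash:
    """pHash internally resizes and grayscales the image before hashing."""
    return imagehash.phash(Image.open(path))

def image_similarity(path_a: str, path_b: str) -> float:
    """Convert the Hamming distance between two 64-bit hashes into a 0-1 similarity."""
    distance = perceptual_hash(path_a) - perceptual_hash(path_b)  # Hamming distance
    return 1 - distance / 64.0
```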

Output

```json
{
  "document_id": "doc_img123",
  "phash": "8f0f0f0f0f0f0f0f",
  "similar_images": [
    {
      "id": "doc_img456",
      "hamming_distance": 2,
      "similarity": 0.97
    }
  ]
}
```

Acceptance Criteria

  • Detects resized duplicates
  • Detects cropped images
  • Tolerates minor quality changes

UC-DUP-006: Merge Duplicate Records

Overview

| Field | Value |
| --- | --- |
| ID | DUP-006 |
| Title | Merge Duplicate Records |
| Actor | User |
| Priority | P2 |

Description

Allow users to manually merge duplicate documents, combining metadata and tags.

Steps

  1. User selects documents to merge
  2. Choose primary document
  3. Merge tags from all documents
  4. Update references to point to primary
  5. Archive or delete secondary documents
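
A sketch of the metadata merge, assuming each record is a dict carrying an `id` and a `tags` list (an illustrative shape, not the actual schema):

```python
def merge_records(primary: dict, duplicates: list[dict]) -> dict:
    """Combine tags from all duplicates onto the primary record, without duplicate tags."""
    merged_tags = list(primary.get("tags", []))
    for dup in duplicates:
        for tag in dup.get("tags", []):
            if tag not in merged_tags:   # preserve order, skip repeats
                merged_tags.append(tag)
    primary["tags"] = merged_tags
    primary["merged_from"] = [d["id"] for d in duplicates]   # merge history for the log
    return primary
```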

Acceptance Criteria

  • Tags are combined without duplicates
  • References are updated
  • Merge history is logged

UC-DUP-007: Archive or Delete Duplicates

Overview

| Field | Value |
| --- | --- |
| ID | DUP-007 |
| Title | Archive or Delete Duplicates |
| Actor | User, Admin |
| Priority | P2 |

Description

Remove or archive confirmed duplicate documents.

Options

| Action | Behavior |
| --- | --- |
| Archive | Move to archive, retain metadata |
| Delete | Soft delete, recoverable |
| Purge | Hard delete, permanent |
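
A sketch of how the three actions could differ, assuming a simple status field on each record; the field and function names are illustrative:

```python
def archive(record: dict) -> dict:
    """Keep metadata but mark the record as archived."""
    record["status"] = "archived"
    return record

def soft_delete(record: dict) -> dict:
    """Mark as deleted while keeping the record so the action can be reversed."""
    record["status"] = "deleted"
    return record

def purge(records: dict[str, dict], record_id: str) -> None:
    """Remove the record permanently; storage is reclaimed afterwards."""
    records.pop(record_id, None)
```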

Acceptance Criteria

  • Duplicates can be archived
  • Deletion is reversible (soft delete)
  • Storage is reclaimed after purge

UC-DUP-008: Generate Dedup Report

Overview

| Field | Value |
| --- | --- |
| ID | DUP-008 |
| Title | Generate Dedup Report |
| Actor | User, Admin |
| Priority | P2 |

Description

Generate a report of detected duplicates and storage savings.

Report Contents

| Section | Details |
| --- | --- |
| Summary | Total docs, duplicates found, storage saved |
| Exact Duplicates | List with file sizes |
| Near-Duplicates | List with similarity scores |
| Recommendations | Suggested actions |
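
A sketch of how the summary figures in the Output below could be derived from exact-duplicate groups; the per-document fields (`hash_sha256`, `size_bytes`) are an assumed shape:

```python
from collections import defaultdict

def dedup_summary(documents: list[dict]) -> dict:
    """Group documents by hash; every copy beyond the first counts as reclaimable storage."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in documents:
        groups[doc["hash_sha256"]].append(doc)

    exact_duplicates = 0
    savings_bytes = 0
    for docs in groups.values():
        if len(docs) > 1:
            exact_duplicates += len(docs) - 1
            savings_bytes += sum(d["size_bytes"] for d in docs[1:])

    return {
        "total_documents": len(documents),
        "exact_duplicates": exact_duplicates,
        "potential_savings_gb": round(savings_bytes / 1024**3, 1),
    }
```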

Output

```json
{
  "generated_at": "2024-01-15T10:00:00Z",
  "summary": {
    "total_documents": 10000,
    "exact_duplicates": 1500,
    "near_duplicates": 800,
    "potential_savings_gb": 45.2
  },
  "exact_duplicate_groups": [...],
  "near_duplicate_groups": [...]
}
```

Acceptance Criteria

  • Report includes all duplicate types
  • Storage savings calculated
  • Export to CSV/PDF available

← Back to Use Cases | Previous: Ingestion | Next: Classification →