# Observability & Operations Overview
This document details the monitoring, logging, and operational readiness strategy for Pebble Orchestrator (Pebble IQ).
## Scope Breakdown (from 1,850h Estimate)
| Operational Component | Hours | Coverage |
|---|---|---|
| CI/CD & Security | 52h | GitHub Actions, Docker builds, secret management |
| Backups, Monitoring, DR | 52h | PostgreSQL backups, Redis snapshots, Prometheus/Grafana |
| Reporting & Audit | 72h | Activity Stream, compliance logs, executive dashboards |
| Total Operations | 176h | ~9.5% of total budget |
## 1. Audit Logging (Activity Stream)

### What's Being Logged

Every business event is captured in the `activity_stream` table as an immutable event:
| Event Type | Payload Example | Retention |
|---|---|---|
| `EmailReceived` | `{message_id, from, subject, stream}` | 7 years (compliance) |
| `LeadClassified` | `{ai_intent, confidence, manual_override}` | 7 years |
| `DraftApproved` | `{user_id, draft_id, edits_made}` | 7 years |
| `TallyLookup` | `{product, stock_returned, timestamp}` | 1 year |
| `CompanyValidated` | `{cin, pan, gstin, source}` | 7 years |
### Compliance Features
- **Who Did What**: Every action is tied to a `user_id` or `ai_agent_id`
- **Tamper-Proof**: Append-only table, no updates or deletes (see the model sketch below)
- **Searchable**: Indexed by entity (Company, Email, Lead) for audit queries
- **Export**: CSV/Excel export for regulators
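For reference, here is a minimal sketch of what the `activity_stream` model could look like with SQLAlchemy (Alembic, used elsewhere in the stack, would own the migration). Column names beyond the payload examples above are assumptions; tamper-proofing itself would be enforced at the database level by granting the application role INSERT/SELECT only.

```python
# Minimal sketch of the append-only activity_stream model (SQLAlchemy).
# Column names beyond the payload examples above are assumptions.
from datetime import datetime, timezone

from sqlalchemy import JSON, Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ActivityStreamEvent(Base):
    __tablename__ = "activity_stream"

    id = Column(Integer, primary_key=True)
    event_type = Column(String, nullable=False, index=True)   # e.g. "EmailReceived"
    entity_type = Column(String, nullable=False, index=True)  # Company / Email / Lead
    entity_id = Column(String, nullable=False, index=True)    # audit queries by entity
    actor_id = Column(String, nullable=False)                 # user_id or ai_agent_id
    payload = Column(JSON, nullable=False)                    # event-specific fields
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
```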
## 2. MLOps (AI Model Monitoring)

### Current Scope (Included in 576h AI Budget)
| MLOps Component | Effort | Status |
|---|---|---|
| Training Pipeline | ~40h | Part of AI Intelligence (132h) |
| Inference Monitoring | ~30h | Part of AI Eval (100h) |
| Drift Detection | ~20h | Basic threshold alerts |
| Retraining | ~30h | Manual trigger for POC |
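To make the "basic threshold alerts" row concrete, a check like the one below could run on a schedule over recent `LeadClassified` events. The field names mirror the payload example in section 1; the thresholds are illustrative assumptions.

```python
# Illustrative threshold-based drift check; thresholds are assumptions.
def check_classification_drift(recent_events: list[dict],
                               min_confidence: float = 0.75,
                               max_override_rate: float = 0.20) -> list[str]:
    """Return alert messages when confidence or override rate crosses a threshold."""
    alerts: list[str] = []
    if not recent_events:
        return alerts

    avg_conf = sum(e["confidence"] for e in recent_events) / len(recent_events)
    override_rate = sum(1 for e in recent_events if e["manual_override"]) / len(recent_events)

    if avg_conf < min_confidence:
        alerts.append(f"Avg classification confidence {avg_conf:.2f} < {min_confidence}")
    if override_rate > max_override_rate:
        alerts.append(f"Manual override rate {override_rate:.0%} > {max_override_rate:.0%}")
    return alerts
```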
### What's NOT in MVP (Phase 4)
- Automated drift detection (Evidently AI)
- A/B testing infrastructure (multiple model versions)
- Feature store (centralized entity embeddings)
- Auto-retraining pipelines
### Key MVP Metrics (Prometheus)

```
# Email Classification
pebble_ai_classification_latency_seconds
pebble_ai_classification_confidence_score
pebble_ai_classification_override_rate

# Tally Integration (Phase 1)
pebble_tally_lookup_success_rate
pebble_tally_lookup_latency_seconds

# LangGraph Agents (Phase 1)
pebble_draft_generation_latency_seconds
pebble_draft_approval_rate
```
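A hedged sketch of wiring up the classification metrics with the `prometheus_client` library. `classify_email()` and the counter name are placeholders; in practice the override *rate* would typically be computed in PromQL from a labelled counter rather than exported directly.

```python
# Sketch: instrumenting email classification with prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

CLASSIFICATION_LATENCY = Histogram(
    "pebble_ai_classification_latency_seconds",
    "Time spent classifying one inbound email",
)
CLASSIFICATIONS = Counter(
    "pebble_ai_classifications_total",  # placeholder name
    "Email classifications performed",
    ["overridden"],  # override rate = overridden="True" / total
)

def classify_email(email: dict) -> dict:
    # Placeholder for the real LLM classifier call.
    return {"ai_intent": "stock_enquiry", "confidence": 0.92, "manual_override": False}

def classify_with_metrics(email: dict) -> dict:
    with CLASSIFICATION_LATENCY.time():  # observes elapsed seconds on exit
        result = classify_email(email)
    CLASSIFICATIONS.labels(overridden=str(result["manual_override"])).inc()
    return result

start_http_server(9100)  # exposes /metrics for Prometheus to scrape
```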
## 3. DevOps & Infrastructure

### CI/CD Pipeline (52h)

Includes:

- GitHub Actions workflows (test → build → deploy)
- Docker multi-stage builds
- Staging vs Production environments
- Automated DB migrations (Alembic)
- Secret rotation (Vault or GH Secrets)

**Example Workflow:**
```yaml
# .github/workflows/ci.yml (abridged sketch)
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest --cov --cov-fail-under=80  # fail the build below 80% coverage
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t pebble-api:${{ github.sha }} .
  deploy-staging:
    if: github.ref == 'refs/heads/develop'
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: kubectl apply -f k8s/staging/
```
### Monitoring & Alerting (52h)

**Stack:**

- **Metrics**: Prometheus + Grafana
- **Logs**: Loki (or ELK)
- **APM**: Sentry (error tracking; see the snippet below)
- **Uptime**: UptimeRobot (external)

**Key Dashboards:**

1. **Email Pipeline**: Emails ingested/hour, classification rate, error rate
2. **Tally Health (Phase 1)**: XML failures, stock lookup latency
3. **CRM Performance**: API p95 latency, DB connection pool
4. **LangGraph Agents (Phase 1)**: Draft generation success rate, human override %
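For the APM piece, Sentry initialization is a few lines at process startup; the DSN and sample rate below are placeholders.

```python
# Minimal Sentry setup for the API process; DSN and sample rate are placeholders.
import sentry_sdk

sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project>",  # placeholder DSN
    environment="staging",
    traces_sample_rate=0.1,  # sample 10% of transactions to keep APM cheap
)
```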
### Disaster Recovery (Included in 52h)
| Asset | Backup Frequency | Recovery Time |
|---|---|---|
| PostgreSQL | Every 6 hours | < 1 hour |
| Redis | Daily snapshot | < 30 min |
| Email Attachments (MinIO) | Daily | < 2 hours |
| Activity Stream | Every 1 hour | < 30 min (critical) |
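A sketch of what the 6-hourly PostgreSQL backup job might look like, pushing dumps into the same MinIO deployment that stores email attachments. The endpoint, bucket, and credentials are placeholder assumptions, and scheduling would live in cron; this complements, rather than replaces, WAL-based point-in-time recovery.

```python
# Hedged sketch of the 6-hourly PostgreSQL backup job; endpoint, bucket,
# and credentials are placeholders.
import subprocess
from datetime import datetime, timezone

from minio import Minio

def backup_postgres(db_url: str, bucket: str = "pebble-backups") -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/pebble_{stamp}.dump"

    # Custom-format dump so pg_restore can do selective restores later.
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_path}", db_url],
        check=True,
    )

    client = Minio("minio.internal:9000", access_key="...", secret_key="...",
                   secure=False)  # placeholder internal endpoint
    client.fput_object(bucket, f"postgres/{stamp}.dump", dump_path)
    return dump_path
```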
## 4. Reporting & Executive Dashboards (72h)

### Operational Dashboards

**Sales Ops Dashboard**

- Emails processed today/week
- AI classification accuracy (% overrides)
- Lead conversion funnel
- Staff response time (P50, P95)

**Owner Dashboard (Phase 1 - Tally)**

- Draft approval rate by staff member
- Most common customer queries
- Tally sync health
### Compliance Reports

**Quarterly Audit Export**

- All AI decisions (classification + drafts)
- Manual overrides with justification
- Failed validations (CIN/PAN/GST)

**GDPR/Data Subject Requests**

- Export all data for a given email/company (see the sketch below)
- Delete request workflow (soft-delete with audit trail)
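A data-subject export can be a straightforward query over the Activity Stream, as in the sketch below; the table and column names follow the model sketch in section 1 and are assumptions.

```python
# Sketch: export every Activity Stream event for one entity to CSV.
# Table/column names follow the earlier model sketch (assumptions).
import csv
import json

from sqlalchemy import create_engine, text

def export_entity_events(db_url: str, entity_id: str, out_path: str) -> None:
    engine = create_engine(db_url)
    query = text(
        "SELECT event_type, actor_id, payload, created_at "
        "FROM activity_stream WHERE entity_id = :eid ORDER BY created_at"
    )
    with engine.connect() as conn, open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event_type", "actor_id", "payload", "created_at"])
        for row in conn.execute(query, {"eid": entity_id}):
            writer.writerow([row.event_type, row.actor_id,
                             json.dumps(row.payload), row.created_at])
```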
## 5. Security & Access Control

### Authentication (Included in CRM 184h)

- JWT-based auth (sketched after this list)
- Role-based access (Admin, Sales, Ops, Read-Only)
- Session timeout (30 min inactivity)
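A minimal JWT issue/verify sketch with PyJWT. The 30-minute `exp` claim approximates the inactivity timeout (true inactivity handling needs refresh logic), and the secret source is an assumption.

```python
# Minimal JWT sketch with PyJWT; secret handling is a placeholder.
from datetime import datetime, timedelta, timezone

import jwt  # pip install PyJWT

SECRET = "load-me-from-the-environment"  # placeholder; never hard-code
ALGO = "HS256"

def issue_token(user_id: str, role: str) -> str:
    claims = {
        "sub": user_id,
        "role": role,  # Admin / Sales / Ops / Read-Only
        "exp": datetime.now(timezone.utc) + timedelta(minutes=30),
    }
    return jwt.encode(claims, SECRET, algorithm=ALGO)

def verify_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError once the 30-minute window lapses.
    return jwt.decode(token, SECRET, algorithms=[ALGO])
```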
### Security Hardening (52h Operations)
- SQL injection prevention (parameterized queries)
- CORS policy for CRM UI
- Rate limiting (100 req/min per user; sketched after this list)
- Secrets in env vars (never in code)
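The 100 req/min limit maps naturally onto a fixed-window counter in Redis, which is already part of the stack; a sketch, with the key naming as an assumption:

```python
# Fixed-window rate limiter sketch backed by Redis; key naming is an assumption.
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"ratelimit:{user_id}"
    count = r.incr(key)          # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window_s)  # start the 60-second window on first hit
    return count <= limit
```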
### Penetration Testing (Included in QA 240h)
- OWASP Top 10 checklist
- Third-party pentest (if budget allows)
- Vulnerability scanning (Snyk/Trivy)
## 6. What's NOT Included (Future Enhancements)
| Feature | Phase | Reasoning |
|---|---|---|
| Kubernetes Orchestration | Post-MVP | Docker Compose sufficient for POC |
| Multi-Region Deployment | Phase 3 | Single India region for now |
| Advanced MLOps (Feature Store) | Phase 4 | Manual retraining OK for POC |
| Real-Time Anomaly Detection | Phase 4 | Threshold alerts sufficient |
| Chaos Engineering | Post-MVP | Not critical for 8-week POC |
## 7. Cost Monitoring (Not in Scope)

**Note**: The 1,850h estimate does NOT include ongoing operational costs:

- Cloud hosting (AWS/Azure/GCP)
- LLM API costs (Azure OpenAI tokens)
- Third-party SaaS (Sentry, monitoring tools)

**Estimated Monthly Run Cost (Phase 1):**

- Compute: $200-500/month (2-3 VMs)
- Database: $100/month (managed PostgreSQL)
- LLM API: $50-200/month (depends on email volume)
- **Total: ~$350-800/month** (sum of the component ranges)
## Summary: Is 176h Enough for Ops?

**Yes for the MVP**, because:

- We're not building Kubernetes (Docker Compose is sufficient for the POC)
- The Activity Stream is a simple append-only table
- Prometheus + Grafana are OSS (low setup time)
- Disaster recovery is PostgreSQL point-in-time recovery

**But Phase 1 (Tally) may need +20-30h** for:

- Tally sync monitoring dashboard
- Draft approval analytics
- Staff performance tracking (who's using AI drafts?)