DevOps & Infrastructure (DVO)¶
Module Purpose: The platform's life support systems. It ensures the Pebble Orchestrator is secure, available, and recoverable. It covers automated provisioning, backups, and real-time observability.
[!IMPORTANT] Production Standard: All infrastructure is defined as code (IaC) to ensure 4-hour Recovery Time Objective (RTO).
Use Case Quick Reference¶
| ID | Title | Priority |
|---|---|---|
| US-DVO-001 | Automated User Provisioning | P1 |
| US-DVO-002 | Automated Backups & Retention | P1 |
| US-DVO-003 | Real-time Uptime Monitoring | P1 |
| US-DVO-004 | Disaster Recovery Procedure | P1 |
US-DVO-001: Automated User Provisioning¶
What It Does¶
Syncs users from the central Pebble Auth system to all integrated tools (like Plane.so) automatically. When a new sales rep is hired in the CRM, they instantly get access to the Kanban board without manual IT intervention.
Who: System Admin / HR Trigger
When: On User Creation
How It Works¶
- Trigger: New user account created in Pebble (Django).
- Action:
- Fires a
User.createdsignal. - Calls Plane API to create a workspace member with specific role (Member/Admin).
- Generates and emails an invitation link.
- Offboarding: Disabling Pebble account automatically revokes tokens for Plane.
US-DVO-002: Automated Backups & Retention¶
What It Does¶
Protects against data corruption or accidental deletion. It performs nightly snapshots of the entire stack and offloads them to secure, off-site storage.
Who: DevOps Automation
When: Nightly (2 AM)
How It Works¶
- Database: Runs
pg_dumpon PostgreSQL (Pebble + Plane). - Storage: Offloads compressed dumps to S3/MinIO.
- Retention:
- Nightly backups kept for 30 days.
- Monthly snapshots kept for 1 year.
- Alerting: If backup fails to upload, sends urgent Slack ping to IT.
US-DVO-003: Real-time Uptime Monitoring¶
What It Does¶
The "Dashlight". Monitors the health of all microservices and triggers alerts before users notice a slowdown.
Who: System Watchdog
When: Continuous
How It Works¶
- Heartbeat: Polls
/healthendpoints of Email Listener, Classifier, and Plane. - Latency: Tracks API response times (Target: < 500ms).
- Dashboard: Grafana displays:
- Green = Healthy.
- Yellow = Degraded (Slow).
- Red = Down.
- Escalation: If down for > 5 mins, triggers SMS to On-call Engineer.
US-DVO-004: Disaster Recovery Procedure¶
What It Does¶
The "Emergency Break". A documented and tested procedure to rebuild the entire platform from scratch in a new region/server in under 4 hours.
Who: IT Recovery Team
When: Critical System Failure
How It Works¶
- Infrastructure: Deploy using Terraform scripts to AWS/Azure/DigitalOcean.
- Restore: Pull latest data dump from DVO-002.
- Verification: Automated test suite runs to check Ingestion -> Classification flow.
- Cutover: Update DNS records to points to new environment.