# Backup & Disaster Recovery
This guide covers backup strategies, recovery procedures, and RTO/RPO targets for Datafly Signal deployments.
## What to Back Up
Datafly Signal stores persistent data in three places:
| Store | Data | Criticality | Recovery Impact |
|---|---|---|---|
| PostgreSQL | Organisations, users, sources, integrations, pipelines, audit logs | Critical | Full data loss without backup |
| Kafka | In-flight events (topics with retention) | Important | Events reprocessed from retention window |
| Redis | Pipeline key cache, session cache, rate limit counters | Low | Automatically rebuilt on restart |
Redis data is ephemeral and does not require backup. Pipeline key lookups and session data are rebuilt from PostgreSQL on restart. Event data in Kafka is transient — once delivered to vendors, events are no longer needed.
## RTO / RPO Targets
| Tier | RPO (Data Loss) | RTO (Recovery Time) |
|---|---|---|
| Standard | 1 hour | 4 hours |
| Enhanced | 15 minutes | 1 hour |
| High Availability | 0 (synchronous replication) | 15 minutes |
These targets apply to the PostgreSQL database — the single source of truth for all configuration and operational data.
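As an operational sanity check, the age of the most recent backup can be compared against the tier's RPO target. A minimal sketch, assuming GNU `date` and that the last-backup timestamp is supplied by your backup tooling (the function name and example timestamps are illustrative):

```shell
#!/usr/bin/env bash
# Compare the age of the latest backup against an RPO target (in minutes).
# Assumes GNU date; the timestamp would come from your backup system.
rpo_check() {
  local last_backup="$1"   # ISO-8601 timestamp of the last completed backup
  local rpo_minutes="$2"   # RPO target in minutes
  local now age
  now=$(date -u +%s)
  age=$(( (now - $(date -u -d "$last_backup" +%s)) / 60 ))
  if [ "$age" -le "$rpo_minutes" ]; then
    echo "OK: last backup ${age}m old (RPO ${rpo_minutes}m)"
  else
    echo "VIOLATION: last backup ${age}m old exceeds RPO ${rpo_minutes}m"
  fi
}

# Example: Standard tier (60-minute RPO)
rpo_check "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" 60
```

Wired into your monitoring, a `VIOLATION` result is worth alerting on before an incident forces the question.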
## PostgreSQL Backup

### AWS RDS
RDS provides automated daily backups. Ensure these are enabled:
```bash
# Check backup settings
aws rds describe-db-instances \
  --db-instance-identifier datafly-postgres \
  --query 'DBInstances[0].{BackupRetention:BackupRetentionPeriod,BackupWindow:PreferredBackupWindow}'

# Enable automated backups (if not already)
aws rds modify-db-instance \
  --db-instance-identifier datafly-postgres \
  --backup-retention-period 7 \
  --preferred-backup-window "02:00-03:00"
```

Point-in-time recovery:
```bash
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier datafly-postgres \
  --target-db-instance-identifier datafly-postgres-recovery \
  --restore-time "2025-01-15T10:30:00Z"
```

Manual snapshot:
```bash
aws rds create-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier datafly-pre-upgrade-$(date +%Y%m%d)
```

### GCP Cloud SQL
Cloud SQL provides automated daily backups:
```bash
# Check backup configuration
gcloud sql instances describe datafly-postgres \
  --format="value(settings.backupConfiguration)"

# Enable automated backups
gcloud sql instances patch datafly-postgres \
  --backup-start-time="02:00" \
  --enable-point-in-time-recovery
```

Point-in-time recovery:
```bash
gcloud sql instances clone datafly-postgres datafly-postgres-recovery \
  --point-in-time="2025-01-15T10:30:00Z"
```

Manual backup:
```bash
gcloud sql backups create --instance=datafly-postgres \
  --description="Pre-upgrade backup"
```

### Azure Database for PostgreSQL
```bash
# Check backup configuration (built-in, 7-35 days retention)
az postgres flexible-server show \
  --resource-group datafly-rg --name datafly-postgres \
  --query backup

# Point-in-time restore
az postgres flexible-server restore \
  --resource-group datafly-rg \
  --name datafly-postgres-recovery \
  --source-server datafly-postgres \
  --restore-time "2025-01-15T10:30:00Z"
```

### Self-Managed PostgreSQL
Use `pg_dump` for logical backups:
```bash
# Full backup
pg_dump -h localhost -U datafly -d datafly \
  --format=custom --file=datafly-$(date +%Y%m%d).dump

# Restore
pg_restore -h localhost -U datafly -d datafly \
  --clean --if-exists datafly-20250115.dump
```

For continuous archiving (enhanced RPO), configure WAL archiving:
```ini
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
```

## Kafka Backup
Kafka event data is transient. Once events are delivered to vendor APIs, they are no longer needed. Backup is optional and primarily useful for:
- Replay during incident recovery
- Audit and compliance requirements
### Topic Retention
Configure Kafka topic retention to meet your replay requirements:
| Topic | Recommended Retention | Purpose |
|---|---|---|
| `events.raw` | 7 days | Replay from ingestion |
| `events.processed` | 7 days | Replay from processing |
| `events.delivery.*` | 3 days | Redeliver to vendors |
| `events.dlq` | 30 days | Failed delivery investigation |
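The retention windows above translate into `retention.ms` values. The sketch below computes them and prints the corresponding `kafka-configs.sh` invocations rather than executing them — the bootstrap server address is a placeholder, bash 4+ is assumed for associative arrays, and wildcard topics (`events.delivery.*`) must be configured per concrete topic name:

```shell
#!/usr/bin/env bash
# Convert retention days to retention.ms and print the kafka-configs.sh
# commands that would apply them. Commands are printed, not executed.
declare -A retention_days=(
  [events.raw]=7
  [events.processed]=7
  [events.dlq]=30
)
for topic in "${!retention_days[@]}"; do
  ms=$(( retention_days[$topic] * 24 * 60 * 60 * 1000 ))
  printf 'kafka-configs.sh --bootstrap-server kafka:9092 --alter \\\n'
  printf '  --entity-type topics --entity-name %s --add-config retention.ms=%s\n' \
    "$topic" "$ms"
done
```

Review the printed commands, substitute your own bootstrap server, then run them against each cluster.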
### Kafka MirrorMaker (Cross-Region DR)
For multi-region disaster recovery, use Kafka MirrorMaker 2 to replicate topics to a standby cluster:
```properties
# mm2.properties
clusters = primary, standby
primary.bootstrap.servers = primary-kafka:9092
standby.bootstrap.servers = standby-kafka:9092
primary->standby.enabled = true
primary->standby.topics = events\\..*
```

## Disaster Recovery Procedures
### Scenario 1: Single Pod Failure

**Impact:** Minimal. Kubernetes automatically restarts the pod.

**Recovery:** Automatic. PodDisruptionBudgets ensure at least one replica remains available.

**Action required:** None. Monitor for repeated restarts (see Observability).
### Scenario 2: Node Failure

**Impact:** Pods on the failed node are rescheduled to other nodes.

**Recovery:** Automatic within minutes if the cluster has sufficient capacity.

**Action required:** Verify all pods are running after rescheduling:

```bash
kubectl get pods -n datafly -o wide
```

### Scenario 3: Database Failure
**Impact:** Critical. Management API and Identity Hub cannot function. Event ingestion and processing continue using cached data for a limited time.

**Recovery:**

- For managed services (RDS, Cloud SQL, Azure DB): use point-in-time recovery to a new instance
- Update the connection string in your secrets
- Restart affected pods:

```bash
kubectl rollout restart deployment/management-api -n datafly
kubectl rollout restart deployment/identity-hub -n datafly
```
### Scenario 4: Kafka Failure

**Impact:** Event pipeline stalls. Events are buffered at the Ingestion Gateway (in-memory) for a short period.

**Recovery:**

- For managed services (MSK, Confluent Cloud, Event Hubs): follow the provider's incident recovery process
- Once Kafka is back, consumers resume from their committed offsets
- Check consumer lag to verify recovery:

```bash
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=20
```
### Scenario 5: Full Cluster Loss

**Impact:** Complete outage.

**Recovery:**

- Provision a new Kubernetes cluster (use your Terraform or cloud CLI scripts)
- Restore PostgreSQL from the latest backup
- Provision new Kafka and Redis instances (or verify managed services are accessible)
- Re-install Datafly Signal via Helm with the same values
- Verify event delivery resumes

Estimated recovery time: 1-4 hours, depending on infrastructure provisioning time.
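The steps above are easier to follow mid-incident when captured as a runbook skeleton. A sketch, where every step is a placeholder echo to be replaced with your real provisioning, restore, and Helm commands:

```shell
#!/usr/bin/env bash
# Full-cluster-loss runbook skeleton. Each step only echoes its action;
# substitute your actual Terraform, restore, and Helm invocations.
set -euo pipefail

steps=(
  "provision a new Kubernetes cluster (Terraform or cloud CLI)"
  "restore PostgreSQL from the latest backup"
  "provision or verify Kafka and Redis"
  "re-install Datafly Signal via Helm with the same values"
  "verify event delivery resumes"
)

i=1
for s in "${steps[@]}"; do
  echo "[dr] step $i/${#steps[@]}: $s"
  i=$((i + 1))
done
```

Keeping the ordering in version control means the sequence is reviewed with every infrastructure change, not reconstructed from memory during an outage.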
## Pre-Upgrade Backup Checklist
Before any upgrade, run this checklist:
```bash
# 1. Snapshot the database
# AWS:
aws rds create-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier pre-upgrade-$(date +%Y%m%d%H%M)

# 2. Record the current Helm release
helm get values datafly -n datafly > values-backup-$(date +%Y%m%d).yaml
helm history datafly -n datafly

# 3. Record current pod versions
kubectl get pods -n datafly -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# 4. Check that Kafka consumer lag is zero
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=5

# 5. Proceed with the upgrade
helm upgrade datafly ...
```

## Testing Your Backups
Regularly verify that your backups are restorable:
- Monthly: Restore a database snapshot to a test instance and verify data integrity
- Quarterly: Perform a full disaster recovery drill — restore from backup, deploy Datafly, verify end-to-end event flow
- Before major upgrades: Create and verify a snapshot before proceeding
Untested backups are not backups. Schedule regular restore tests to ensure your recovery procedures work when you need them.
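A lightweight addition to restore tests is to record a checksum when each dump is written and verify it before restoring. A sketch — the dump file here is simulated with `echo`, whereas in practice it would come from `pg_dump`, and the paths are illustrative:

```shell
#!/usr/bin/env bash
# Record a checksum at backup time, verify it before any restore.
set -euo pipefail
backup_dir=$(mktemp -d)
dump="$backup_dir/datafly.dump"

# At backup time: write the dump (simulated here) and record its checksum
echo "simulated pg_dump output" > "$dump"
sha256sum "$dump" > "$dump.sha256"

# Before restore: confirm the file still matches the recorded checksum
if sha256sum --check --quiet "$dump.sha256"; then
  echo "checksum OK: safe to pg_restore"
else
  echo "checksum MISMATCH: do not restore" >&2
  exit 1
fi
```

This catches silent corruption in transit or at rest, which a restore drill alone may only surface after hours of work.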
## Next Steps
- Set up Observability for monitoring backup health
- Review Upgrades for the pre-upgrade backup process
- See Troubleshooting for recovery from common failures