
Backup & Disaster Recovery

This guide covers backup strategies, recovery procedures, and RTO/RPO targets for Datafly Signal deployments.

What to Back Up

Datafly Signal stores persistent data in three places:

| Store | Data | Criticality | Recovery Impact |
|---|---|---|---|
| PostgreSQL | Organisations, users, sources, integrations, pipelines, audit logs | Critical | Full data loss without backup |
| Kafka | In-flight events (topics with retention) | Important | Events reprocessed from retention window |
| Redis | Pipeline key cache, session cache, rate limit counters | Low | Automatically rebuilt on restart |

Redis data is ephemeral and does not require backup: pipeline key lookups and session data are rebuilt from PostgreSQL on restart. Kafka event data is transient; once delivered to vendors, events are no longer needed.

RTO / RPO Targets

| Tier | RPO (Data Loss) | RTO (Recovery Time) |
|---|---|---|
| Standard | 1 hour | 4 hours |
| Enhanced | 15 minutes | 1 hour |
| High Availability | 0 (synchronous replication) | 15 minutes |

These targets apply to the PostgreSQL database — the single source of truth for all configuration and operational data.

PostgreSQL Backup

AWS RDS

RDS provides automated daily backups. Ensure these are enabled:

# Check backup settings
aws rds describe-db-instances \
  --db-instance-identifier datafly-postgres \
  --query 'DBInstances[0].{BackupRetention:BackupRetentionPeriod,BackupWindow:PreferredBackupWindow}'
 
# Enable automated backups (if not already)
aws rds modify-db-instance \
  --db-instance-identifier datafly-postgres \
  --backup-retention-period 7 \
  --preferred-backup-window "02:00-03:00"

Point-in-time recovery:

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier datafly-postgres \
  --target-db-instance-identifier datafly-postgres-recovery \
  --restore-time "2025-01-15T10:30:00Z"

Manual snapshot:

aws rds create-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier datafly-pre-upgrade-$(date +%Y%m%d)

GCP Cloud SQL

Cloud SQL provides automated daily backups:

# Check backup configuration
gcloud sql instances describe datafly-postgres \
  --format="value(settings.backupConfiguration)"
 
# Enable automated backups
gcloud sql instances patch datafly-postgres \
  --backup-start-time="02:00" \
  --enable-point-in-time-recovery

Point-in-time recovery:

gcloud sql instances clone datafly-postgres datafly-postgres-recovery \
  --point-in-time="2025-01-15T10:30:00Z"

Manual backup:

gcloud sql backups create --instance=datafly-postgres \
  --description="Pre-upgrade backup"

Azure Database for PostgreSQL

# Check backup configuration (built-in, 7-35 days retention)
az postgres flexible-server show \
  --resource-group datafly-rg --name datafly-postgres \
  --query backup
 
# Point-in-time restore
az postgres flexible-server restore \
  --resource-group datafly-rg \
  --name datafly-postgres-recovery \
  --source-server datafly-postgres \
  --restore-time "2025-01-15T10:30:00Z"

Self-Managed PostgreSQL

Use pg_dump for logical backups:

# Full backup
pg_dump -h localhost -U datafly -d datafly \
  --format=custom --file=datafly-$(date +%Y%m%d).dump
 
# Restore
pg_restore -h localhost -U datafly -d datafly \
  --clean --if-exists datafly-20250115.dump

For continuous archiving (enhanced RPO), configure WAL archiving:

# postgresql.conf
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
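
To recover from a base backup plus archived WAL, PostgreSQL 12+ uses a `restore_command` and a `recovery.signal` file. A minimal sketch, assuming WAL was archived to `/backup/wal` as configured above (the target time is illustrative):

```
# postgresql.conf on the recovery instance (PostgreSQL 12+)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2025-01-15 10:30:00 UTC'
```

After setting these, create an empty `recovery.signal` file in the data directory and start the server; PostgreSQL replays archived WAL up to the target time.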

Kafka Backup

Kafka event data is transient. Once events are delivered to vendor APIs, they are no longer needed. Backup is optional and primarily useful for:

  • Replay during incident recovery
  • Audit and compliance requirements

Topic Retention

Configure Kafka topic retention to meet your replay requirements:

| Topic | Recommended Retention | Purpose |
|---|---|---|
| events.raw | 7 days | Replay from ingestion |
| events.processed | 7 days | Replay from processing |
| events.delivery.* | 3 days | Redeliver to vendors |
| events.dlq | 30 days | Failed delivery investigation |
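
Retention values from the table above can be applied with the standard `kafka-configs` tool. A sketch for the raw events topic; the bootstrap address is an assumption for your environment:

```shell
# Set 7-day retention on events.raw (7 * 24 * 3600 * 1000 = 604800000 ms).
# Adjust --bootstrap-server for your cluster.
kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --entity-type topics --entity-name events.raw \
  --add-config retention.ms=604800000

# Verify the setting took effect
kafka-configs.sh --bootstrap-server kafka:9092 \
  --describe --entity-type topics --entity-name events.raw
```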

Kafka MirrorMaker (Cross-Region DR)

For multi-region disaster recovery, use Kafka MirrorMaker 2 to replicate topics to a standby cluster:

# mm2.properties
clusters = primary, standby
primary.bootstrap.servers = primary-kafka:9092
standby.bootstrap.servers = standby-kafka:9092
primary->standby.enabled = true
primary->standby.topics = events\\..*
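
MirrorMaker 2 ships with the Kafka distribution and can be started in dedicated mode with the properties file above (the path assumes a standard Kafka install):

```shell
# Run MirrorMaker 2 in dedicated mode with the config above
bin/connect-mirror-maker.sh mm2.properties
```

By default, replicated topics appear on the standby cluster prefixed with the source cluster alias (e.g. `primary.events.raw`), so consumers on the standby must account for the prefix during failover.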

Disaster Recovery Procedures

Scenario 1: Single Pod Failure

Impact: Minimal. Kubernetes automatically restarts the pod.

Recovery: Automatic. PodDisruptionBudgets ensure at least one replica remains available.

Action required: None. Monitor for repeated restarts (see Observability).
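
One way to spot repeated restarts is to sort pods by restart count (a sketch; the `sort-by` path assumes single-container pods):

```shell
# Pods with the most restarts are listed last
kubectl get pods -n datafly \
  --sort-by='.status.containerStatuses[0].restartCount'
```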

Scenario 2: Node Failure

Impact: Pods on the failed node are rescheduled to other nodes.

Recovery: Automatic within minutes if cluster has sufficient capacity.

Action required: Verify all pods are running after rescheduling:

kubectl get pods -n datafly -o wide

Scenario 3: Database Failure

Impact: Critical. Management API and Identity Hub cannot function. Event ingestion and processing continue using cached data for a limited time.

Recovery:

  1. For managed services (RDS, Cloud SQL, Azure DB): use point-in-time recovery to a new instance
  2. Update the connection string in your secrets
  3. Restart affected pods:
    kubectl rollout restart deployment/management-api -n datafly
    kubectl rollout restart deployment/identity-hub -n datafly
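
Step 2 can be scripted. A sketch, assuming the connection string lives in a secret named `datafly-db` under a `DATABASE_URL` key (both names are assumptions; check your Helm values):

```shell
# Point the app at the recovered instance by rewriting the secret in place
# (secret name, key name, and endpoint are assumptions for your environment)
kubectl create secret generic datafly-db -n datafly \
  --from-literal=DATABASE_URL="postgres://datafly:<password>@datafly-postgres-recovery:5432/datafly" \
  --dry-run=client -o yaml | kubectl apply -f -
```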

Scenario 4: Kafka Failure

Impact: Event pipeline stalls. Events are buffered at the Ingestion Gateway (in-memory) for a short period.

Recovery:

  1. For managed services (MSK, Confluent Cloud, Event Hubs): follow the provider’s incident recovery process
  2. Once Kafka is back, consumers will resume from their committed offsets
  3. Check consumer lag to verify recovery:
    kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=20
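
If you have shell access to a broker, consumer lag can also be read directly with `kafka-consumer-groups` (the group name here is an assumption; list groups first if unsure):

```shell
# List consumer groups, then inspect per-partition lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group event-processor   # LAG column should trend toward 0
```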

Scenario 5: Full Cluster Loss

Impact: Complete outage.

Recovery:

  1. Provision a new Kubernetes cluster (use your Terraform or cloud CLI scripts)
  2. Restore PostgreSQL from the latest backup
  3. Provision new Kafka and Redis instances (or verify managed services are accessible)
  4. Re-install Datafly Signal via Helm with the same values
  5. Verify event delivery resumes

Estimated recovery time: 1-4 hours depending on infrastructure provisioning time.
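
The database-restore and re-install steps can be sketched end to end. The snapshot identifier, Helm chart reference, and values file name below are assumptions carried over from the pre-upgrade checklist:

```shell
# Restore the database from the latest snapshot (AWS example;
# see the provider sections above for GCP and Azure equivalents)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier "$LATEST_SNAPSHOT_ID"

# Re-install with the values captured during the last backup
# (chart reference and values file name are assumptions)
helm install datafly datafly/datafly -n datafly --create-namespace \
  -f values-backup-20250115.yaml

# Watch pods come up, then verify event delivery resumes
kubectl get pods -n datafly -w
```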

Pre-Upgrade Backup Checklist

Before any upgrade, run this checklist:

# 1. Snapshot the database
# AWS:
aws rds create-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier pre-upgrade-$(date +%Y%m%d%H%M)
 
# 2. Record current Helm release
helm get values datafly -n datafly > values-backup-$(date +%Y%m%d).yaml
helm history datafly -n datafly
 
# 3. Record current pod versions
kubectl get pods -n datafly -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
 
# 4. Check Kafka consumer lag is zero
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=5
 
# 5. Proceed with upgrade
helm upgrade datafly ...

Testing Your Backups

Regularly verify that your backups are restorable:

  1. Monthly: Restore a database snapshot to a test instance and verify data integrity
  2. Quarterly: Perform a full disaster recovery drill — restore from backup, deploy Datafly, verify end-to-end event flow
  3. Before major upgrades: Create and verify a snapshot before proceeding
⚠️ Untested backups are not backups. Schedule regular restore tests to ensure your recovery procedures work when you need them.
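
The monthly restore test can be partially automated. A sketch for RDS that restores a snapshot to a throwaway instance and runs a basic sanity query (instance names, endpoint, table name, and credentials are all assumptions):

```shell
# Restore the snapshot under test to a temporary instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier datafly-restore-test \
  --db-snapshot-identifier "$SNAPSHOT_ID"
aws rds wait db-instance-available \
  --db-instance-identifier datafly-restore-test

# Sanity-check a row count against production expectations
# (endpoint and table name are assumptions)
psql "host=<restore-test-endpoint> user=datafly dbname=datafly" \
  -c "SELECT count(*) FROM pipelines;"

# Clean up the temporary instance
aws rds delete-db-instance \
  --db-instance-identifier datafly-restore-test \
  --skip-final-snapshot
```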

Next Steps