
Backup & Disaster Recovery

This guide covers backup strategies, recovery procedures, and RTO/RPO targets for Datafly Signal deployments.

What to Back Up

Datafly Signal stores persistent data in three places:

| Store | Data | Criticality | Recovery Impact |
|---|---|---|---|
| PostgreSQL | Organisations, users, sources, integrations, pipelines, audit logs | Critical | Full data loss without backup |
| Kafka | In-flight events (topics with retention) | Important | Events reprocessed from retention window |
| Redis | Pipeline key cache, session cache, rate limit counters | Low | Automatically rebuilt on restart |

Redis data is ephemeral and does not require backup: pipeline key lookups and session data are rebuilt from PostgreSQL on restart. Kafka event data is transient; once delivered to vendors, events are no longer needed.

RTO / RPO Targets

| Tier | RPO (Data Loss) | RTO (Recovery Time) |
|---|---|---|
| Standard | 1 hour | 4 hours |
| Enhanced | 15 minutes | 1 hour |
| High Availability | 0 (synchronous replication) | 15 minutes |

These targets apply to the PostgreSQL database — the single source of truth for all configuration and operational data.

PostgreSQL Backup

AWS RDS

RDS provides automated daily backups. Ensure these are enabled:

# Check backup settings
aws rds describe-db-instances \
  --db-instance-identifier datafly-postgres \
  --query 'DBInstances[0].{BackupRetention:BackupRetentionPeriod,BackupWindow:PreferredBackupWindow}'
 
# Enable automated backups (if not already)
aws rds modify-db-instance \
  --db-instance-identifier datafly-postgres \
  --backup-retention-period 7 \
  --preferred-backup-window "02:00-03:00"

Point-in-time recovery:

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier datafly-postgres \
  --target-db-instance-identifier datafly-postgres-recovery \
  --restore-time "2025-01-15T10:30:00Z"

Manual snapshot:

aws rds create-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier datafly-pre-upgrade-$(date +%Y%m%d)

GCP Cloud SQL

Cloud SQL provides automated daily backups:

# Check backup configuration
gcloud sql instances describe datafly-postgres \
  --format="value(settings.backupConfiguration)"
 
# Enable automated backups
gcloud sql instances patch datafly-postgres \
  --backup-start-time="02:00" \
  --enable-point-in-time-recovery

Point-in-time recovery:

gcloud sql instances clone datafly-postgres datafly-postgres-recovery \
  --point-in-time="2025-01-15T10:30:00Z"

Manual backup:

gcloud sql backups create --instance=datafly-postgres \
  --description="Pre-upgrade backup"

Azure Database for PostgreSQL

# Check backup configuration (built-in, 7-35 days retention)
az postgres flexible-server show \
  --resource-group datafly-rg --name datafly-postgres \
  --query backup
 
# Point-in-time restore
az postgres flexible-server restore \
  --resource-group datafly-rg \
  --name datafly-postgres-recovery \
  --source-server datafly-postgres \
  --restore-time "2025-01-15T10:30:00Z"

Self-Managed PostgreSQL

Use pg_dump for logical backups:

# Full backup
pg_dump -h localhost -U datafly -d datafly \
  --format=custom --file=datafly-$(date +%Y%m%d).dump
 
# Restore
pg_restore -h localhost -U datafly -d datafly \
  --clean --if-exists datafly-20250115.dump

For continuous archiving (enhanced RPO), configure WAL archiving:

# postgresql.conf
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
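
To recover from a base backup plus archived WAL, PostgreSQL 12+ uses a `restore_command` and a `recovery.signal` file. A minimal sketch, assuming WAL was archived to `/backup/wal` as configured above (the target time is illustrative):

```
# postgresql.conf on the recovery instance (PostgreSQL 12+)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2025-01-15 10:30:00 UTC'
```

After setting these, create an empty `recovery.signal` file in the data directory and start the server; PostgreSQL replays archived WAL up to the target time.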

Kafka Backup

Kafka event data is transient. Once events are delivered to vendor APIs, they are no longer needed. Backup is optional and primarily useful for:

  • Replay during incident recovery
  • Audit and compliance requirements

Topic Retention

Configure Kafka topic retention to meet your replay requirements:

| Topic | Recommended Retention | Purpose |
|---|---|---|
| events.raw | 7 days | Replay from ingestion |
| events.processed | 7 days | Replay from processing |
| events.delivery.* | 3 days | Redeliver to vendors |
| events.dlq | 30 days | Failed delivery investigation |
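
Retention values from the table above can be applied with the standard `kafka-configs` tool. A sketch for the raw events topic; the bootstrap address is an assumption for your environment:

```shell
# Set 7-day retention on events.raw (7 * 24 * 3600 * 1000 = 604800000 ms).
# Adjust --bootstrap-server for your cluster.
kafka-configs.sh --bootstrap-server kafka:9092 \
  --alter --entity-type topics --entity-name events.raw \
  --add-config retention.ms=604800000

# Verify the setting took effect
kafka-configs.sh --bootstrap-server kafka:9092 \
  --describe --entity-type topics --entity-name events.raw
```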

Kafka MirrorMaker (Cross-Region DR)

For multi-region disaster recovery, use Kafka MirrorMaker 2 to replicate topics to a standby cluster:

# mm2.properties
clusters = primary, standby
primary.bootstrap.servers = primary-kafka:9092
standby.bootstrap.servers = standby-kafka:9092
primary->standby.enabled = true
primary->standby.topics = events\\..*
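
MirrorMaker 2 ships with the Kafka distribution and can be started in dedicated mode with the properties file above (the path assumes a standard Kafka install):

```shell
# Run MirrorMaker 2 in dedicated mode with the config above
bin/connect-mirror-maker.sh mm2.properties
```

By default, replicated topics appear on the standby cluster prefixed with the source cluster alias (e.g. `primary.events.raw`), so consumers on the standby must account for the prefix during failover.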

Disaster Recovery Procedures

Scenario 1: Single Pod Failure

Impact: Minimal. Kubernetes automatically restarts the pod.

Recovery: Automatic. PodDisruptionBudgets ensure at least one replica remains available.

Action required: None. Monitor for repeated restarts (see Observability).
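
One way to spot repeated restarts is to sort pods by restart count (a sketch; the `sort-by` path assumes single-container pods):

```shell
# Pods with the most restarts are listed last
kubectl get pods -n datafly \
  --sort-by='.status.containerStatuses[0].restartCount'
```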

Scenario 2: Node Failure

Impact: Pods on the failed node are rescheduled to other nodes.

Recovery: Automatic within minutes if cluster has sufficient capacity.

Action required: Verify all pods are running after rescheduling:

kubectl get pods -n datafly -o wide

Scenario 3: Database Failure

Impact: Critical. Management API and Identity Hub cannot function. Event ingestion and processing continue using cached data for a limited time.

Recovery:

  1. For managed services (RDS, Cloud SQL, Azure DB): use point-in-time recovery to a new instance
  2. Update the connection string in your secrets
  3. Restart affected pods:
    kubectl rollout restart deployment/management-api -n datafly
    kubectl rollout restart deployment/identity-hub -n datafly
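
Step 2 can be scripted. A sketch, assuming the connection string lives in a secret named `datafly-db` under a `DATABASE_URL` key (both names are assumptions; check your Helm values):

```shell
# Point the app at the recovered instance by rewriting the secret in place
# (secret name, key name, and endpoint are assumptions for your environment)
kubectl create secret generic datafly-db -n datafly \
  --from-literal=DATABASE_URL="postgres://datafly:<password>@datafly-postgres-recovery:5432/datafly" \
  --dry-run=client -o yaml | kubectl apply -f -
```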

Scenario 4: Kafka Failure

Impact: Event pipeline stalls. Events are buffered at the Ingestion Gateway (in-memory) for a short period.

Recovery:

  1. For managed services (MSK, Confluent Cloud, Event Hubs): follow the provider’s incident recovery process
  2. Once Kafka is back, consumers will resume from their committed offsets
  3. Check consumer lag to verify recovery:
    kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=20
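
If you have shell access to a broker, consumer lag can also be read directly with `kafka-consumer-groups` (the group name here is an assumption; list groups first if unsure):

```shell
# List consumer groups, then inspect per-partition lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group event-processor   # LAG column should trend toward 0
```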

Scenario 5: Full Cluster Loss

Impact: Complete outage.

Recovery:

  1. Provision a new Kubernetes cluster (use your Terraform or cloud CLI scripts)
  2. Restore PostgreSQL from the latest backup
  3. Provision new Kafka and Redis instances (or verify managed services are accessible)
  4. Re-install Datafly Signal via Helm with the same values
  5. Verify event delivery resumes

Estimated recovery time: 1-4 hours depending on infrastructure provisioning time.
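
The database-restore and re-install steps can be sketched end to end. The snapshot identifier, Helm chart reference, and values file name below are assumptions carried over from the pre-upgrade checklist:

```shell
# Restore the database from the latest snapshot (AWS example;
# see the provider sections above for GCP and Azure equivalents)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier "$LATEST_SNAPSHOT_ID"

# Re-install with the values captured during the last backup
# (chart reference and values file name are assumptions)
helm install datafly datafly/datafly -n datafly --create-namespace \
  -f values-backup-20250115.yaml

# Watch pods come up, then verify event delivery resumes
kubectl get pods -n datafly -w
```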

Pre-Upgrade Backup Checklist

Before any upgrade, run this checklist:

# 1. Snapshot the database
# AWS:
aws rds create-db-snapshot \
  --db-instance-identifier datafly-postgres \
  --db-snapshot-identifier pre-upgrade-$(date +%Y%m%d%H%M)
 
# 2. Record current Helm release
helm get values datafly -n datafly > values-backup-$(date +%Y%m%d).yaml
helm history datafly -n datafly
 
# 3. Record current pod versions
kubectl get pods -n datafly -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
 
# 4. Check Kafka consumer lag is zero
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=5
 
# 5. Proceed with upgrade
helm upgrade datafly ...

Testing Your Backups

Regularly verify that your backups are restorable:

  1. Monthly: Restore a database snapshot to a test instance and verify data integrity
  2. Quarterly: Perform a full disaster recovery drill — restore from backup, deploy Datafly, verify end-to-end event flow
  3. Before major upgrades: Create and verify a snapshot before proceeding
⚠️ Untested backups are not backups. Schedule regular restore tests to ensure your recovery procedures work when you need them.
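
The monthly restore test can be partially automated. A sketch for RDS that restores a snapshot to a throwaway instance and runs a basic sanity query (instance names, endpoint, table name, and credentials are all assumptions):

```shell
# Restore the snapshot under test to a temporary instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier datafly-restore-test \
  --db-snapshot-identifier "$SNAPSHOT_ID"
aws rds wait db-instance-available \
  --db-instance-identifier datafly-restore-test

# Sanity-check a row count against production expectations
# (endpoint and table name are assumptions)
psql "host=<restore-test-endpoint> user=datafly dbname=datafly" \
  -c "SELECT count(*) FROM pipelines;"

# Clean up the temporary instance
aws rds delete-db-instance \
  --db-instance-identifier datafly-restore-test \
  --skip-final-snapshot
```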

Next Steps