# Troubleshooting
Common issues and their resolutions for Datafly Signal deployments.
## Installation Issues

### Preflight check fails
The preflight Job runs before installation and validates connectivity to required services. If it fails, Helm will abort the install.
**Diagnosis:**

```bash
# Check preflight Job logs
kubectl logs -n datafly job/datafly-preflight

# Check Job status
kubectl describe job/datafly-preflight -n datafly
```

**Common causes:**
| Error | Cause | Resolution |
|---|---|---|
| `kafka: connection refused` | Kafka not reachable from cluster | Check security groups, VPC peering, and broker endpoint |
| `redis: connection refused` | Redis not reachable from cluster | Check security groups, ensure Redis is in same VPC |
| `postgresql: connection refused` | Database not reachable | Check security groups, verify connection string |
| `kubernetes: version too old` | K8s version < 1.28 | Upgrade your cluster to 1.28+ |
| `dns: resolution failed` | Ingress hostname not resolvable | Create DNS records before installing, or skip DNS check |
To skip specific checks (not recommended for production):

```yaml
preflight:
  checks:
    dns: false       # Skip DNS check if records aren't created yet
    outbound: false  # Skip outbound check in air-gapped environments
```

### Pods stuck in Pending
**Diagnosis:**

```bash
kubectl describe pod <pod-name> -n datafly
```

**Common causes:**
- Insufficient resources: Nodes don’t have enough CPU/memory. Scale up your node pool or use smaller resource requests.
- No matching node: Node selector or affinity rules prevent scheduling. Check for taints and tolerations.
- PVC pending: If using persistent volumes, ensure the storage class exists.
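A few quick checks can usually identify which of these constraints is blocking scheduling; these are generic Kubernetes diagnostics, not Datafly-specific commands:

```shell
# Compare node capacity against what is already allocated
kubectl describe nodes | grep -A 8 "Allocated resources"

# List any taints that could be repelling the pod
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'

# If a PVC is pending, confirm its storage class exists
kubectl get pvc -n datafly
kubectl get storageclass
```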
### Pods stuck in ImagePullBackOff

```bash
kubectl describe pod <pod-name> -n datafly | grep -A5 "Events"
```

**Resolution:**
- Verify the image tag exists: `docker pull ghcr.io/datafly/ingestion-gateway:v1.2.0`
- Check `imagePullSecrets` in your values file
- Ensure network policies allow egress to the container registry
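If the images live in a private registry, a pull secret can be created and referenced from the values file; the secret name `datafly-regcred` below is just an example:

```shell
# Create a registry pull secret (substitute real credentials)
kubectl create secret docker-registry datafly-regcred \
  --namespace datafly \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<token>
```

Then reference the secret name under `imagePullSecrets` in your values file and upgrade the release.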
## Connectivity Issues

### Events not reaching the Ingestion Gateway

**Symptoms:** Browser console shows network errors when sending events. `curl` to the ingestion endpoint returns an error.

**Diagnosis:**

```bash
# Check the ingress resource
kubectl get ingress -n datafly
kubectl describe ingress -n datafly

# Check the ingestion gateway service
kubectl get svc -n datafly
kubectl get endpoints ingestion-gateway -n datafly

# Test from inside the cluster
kubectl run test --rm -it --image=curlimages/curl -- \
  curl -v http://ingestion-gateway:8080/healthz
```

**Common causes:**
| Symptom | Cause | Resolution |
|---|---|---|
| No ADDRESS on ingress | Load balancer not provisioned | Check ingress controller logs, cloud provider quotas |
| 502 Bad Gateway | Pods not ready | Check readiness probes, pod logs |
| 404 Not Found | Wrong path configuration | Verify `ingress.hosts[].paths` in values |
| Connection timeout | DNS not pointing to load balancer | Check DNS records, TTL may need time to propagate |
| TLS error | Certificate not issued | Check cert-manager logs, verify ClusterIssuer |
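When DNS is the suspect, compare the address the ingress actually received with what the public record resolves to; the hostname below is a placeholder:

```shell
# Address assigned to the ingress by the load balancer
kubectl get ingress -n datafly \
  -o jsonpath='{.items[0].status.loadBalancer.ingress}{"\n"}'

# What the public record resolves to — the two should match
dig +short ingest.example.com
```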
### Services can’t connect to Kafka

**Symptoms:** Event Processor or Delivery Workers log `connection refused` or `dial tcp: i/o timeout` for Kafka.

```bash
# Check pod logs
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=50

# Test connectivity from a pod
kubectl run test --rm -it --image=busybox -- \
  nc -zv <kafka-broker-host> 9092
```

**Resolution:**
- Verify the `KAFKA_BROKERS` environment variable is correct
- Check security group rules allow traffic from K8s nodes to Kafka brokers
- For MSK with TLS: ensure `KAFKA_SECURITY_PROTOCOL=TLS` or `SASL_SSL`
- For Event Hubs: ensure `KAFKA_SECURITY_PROTOCOL=SASL_SSL` and `KAFKA_SASL_MECHANISM=PLAIN`
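`nc` only proves the port is open; for TLS listeners it also helps to verify the handshake itself. A sketch using a throwaway pod — the image and broker host are placeholders:

```shell
# A completed handshake prints the broker's certificate details
kubectl run tls-test --rm -i --restart=Never --image=alpine/openssl --command -- \
  sh -c 'echo | openssl s_client -connect <kafka-broker-host>:9092 -brief'
```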
### Services can’t connect to PostgreSQL

```bash
# Test from inside the cluster
kubectl run test --rm -it --image=postgres:16 -- \
  psql "postgresql://datafly:password@db-host:5432/datafly?sslmode=require" -c "SELECT 1"
```

**Resolution:**
- Verify `DATABASE_URL` in the secret
- Check security groups allow traffic from K8s nodes on port 5432
- For Cloud SQL with private IP: ensure the GKE cluster is in the same VPC
- For Cloud SQL Auth Proxy: check the sidecar container logs
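Stale credentials are a common culprit: decode the secret to see the connection string the pods actually receive. The secret and key names below are assumptions — check your values file for the real ones:

```shell
# Print the DATABASE_URL as mounted into the pods (mask it before sharing logs)
kubectl get secret datafly-secrets -n datafly \
  -o jsonpath='{.data.DATABASE_URL}' | base64 -d; echo
```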
### Services can’t connect to Redis

```bash
kubectl run test --rm -it --image=redis:7-alpine -- \
  redis-cli -h <redis-host> -p 6379 --tls PING
```

**Resolution:**
- Verify `REDIS_URL` in the secret (including `rediss://` for TLS)
- Check security groups allow traffic on port 6379 (or 6380 for Azure)
- For ElastiCache with auth: include the auth token in the URL
- For Azure Cache: use port 6380 with TLS
## Event Pipeline Issues

### Events ingested but not processed

**Symptoms:** Events are accepted by the Ingestion Gateway (HTTP 200) but don’t appear in vendor dashboards.

**Diagnosis:**

```bash
# Check Event Processor logs
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=100

# Check Kafka consumer lag
# (If using Kafka UI or CLI)
kafka-consumer-groups --bootstrap-server <broker> \
  --describe --group event-processor
```

**Common causes:**
- Kafka consumer group not consuming: Check for authentication errors in Event Processor logs
- Processing errors: Check for transformation errors in logs — a malformed pipeline configuration can cause processing to fail
- Consumer lag: If lag is growing, the Event Processor may need more replicas or resources
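To separate ingestion problems from processing problems, peek at the topic the gateway writes to. The topic name `events.raw` is an assumption — use whatever your pipeline configuration names it:

```shell
# If nothing appears here, the gateway isn't producing — look upstream, not at the processor
kafka-console-consumer --bootstrap-server <broker> \
  --topic events.raw --from-beginning --max-messages 5
```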
### Events processed but not delivered

**Symptoms:** Events appear in the Kafka `events.processed` topic but vendors don’t receive them.

```bash
# Check Delivery Worker logs
kubectl logs -n datafly -l app.kubernetes.io/name=delivery-worker-webhook --tail=100

# Check the DLQ topic for failed deliveries
kafka-console-consumer --bootstrap-server <broker> \
  --topic events.dlq --from-beginning --max-messages 5
```

**Common causes:**
- Vendor API errors: Check Delivery Worker logs for HTTP status codes from vendor APIs
- Network egress blocked: Delivery Workers need outbound HTTPS to vendor APIs (port 443). Check network policies and NAT gateway
- Circuit breaker open: If a vendor API is consistently failing, the circuit breaker opens. Check the `datafly_delivery_circuit_state` metric
- DLQ overflow: Events failing after max retries are sent to the DLQ
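Egress is worth testing from an actual worker pod, since network policies match on pod labels. A sketch — the deployment name and vendor URL are placeholders, and it assumes the worker image ships `wget`:

```shell
# Test outbound HTTPS from inside a delivery worker
kubectl exec -n datafly deploy/delivery-worker-webhook -- \
  wget -qS --spider https://api.example-vendor.com 2>&1 | head -5
```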
### High Kafka consumer lag

**Symptoms:** The `datafly_processor_kafka_lag` metric is growing.

**Resolution:**

- Check Event Processor resource usage — it may need more CPU/memory
- Scale up replicas: increase `eventProcessor.replicas` or enable HPA
- Check for slow processing — transformation errors or complex pipelines can slow throughput
- Verify Kafka partition count >= consumer replica count
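Replicas beyond the partition count sit idle — a consumer group assigns at most one active consumer per partition. To check and scale (the topic and deployment names below are illustrative):

```shell
# Partition count caps effective parallelism for the consumer group
kafka-topics --bootstrap-server <broker> --describe --topic <topic>

# Scale the processor, keeping replicas <= partitions
kubectl scale deploy/event-processor -n datafly --replicas=4
```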
## Migration Issues

### Migration init container stuck

**Symptoms:** Management API pod shows `Init:0/1` for an extended period.

```bash
# Check init container logs
kubectl logs -n datafly <management-api-pod> -c migrate
```

**Common causes:**
- Database not reachable: Check connectivity (see above)
- Lock contention: Another migration is running. The init container uses a Redis lock to prevent concurrent migrations. Wait for the other migration to complete, or clear the lock:

  ```bash
  kubectl run redis-cli --rm -it --image=redis:7-alpine -- \
    redis-cli -h <redis-host> DEL migrate:lock
  ```

- Migration SQL error: Check the init container logs for SQL errors. You may need to manually fix the database state (see Migrations)
### Dirty migration state

If a migration failed partway through, the database is marked as “dirty”:

```bash
# Check migration state
kubectl exec -n datafly deploy/management-api -- \
  /app/migrate version

# Force to a specific version after manual fix
kubectl run migrate-fix --rm -it \
  --namespace datafly \
  --image=ghcr.io/datafly/management-api:v1.2.0 \
  --env="DATABASE_URL=postgresql://..." \
  -- /app/migrate force <version>
```

## TLS / Certificate Issues
### cert-manager not issuing certificates

```bash
# Check Certificate status
kubectl get certificate -n datafly
kubectl describe certificate datafly-tls -n datafly

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=50
```

**Common causes:**
- ClusterIssuer not found: Create the ClusterIssuer for Let’s Encrypt
- DNS challenge failed: For wildcard certificates, ensure DNS01 solver is configured
- Rate limited: Let’s Encrypt has rate limits — check for “too many certificates” errors
- Ingress annotation missing: Ensure the `cert-manager.io/cluster-issuer` annotation is set
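cert-manager records issuance progress in intermediate resources, so walking the chain usually pinpoints the failure (the ClusterIssuer name below is an example):

```shell
# Follow the chain: Certificate -> CertificateRequest -> Order -> Challenge
kubectl get certificaterequests,orders,challenges -n datafly
kubectl describe challenges -n datafly

# Confirm the issuer itself reports Ready
kubectl describe clusterissuer letsencrypt-prod
```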
### Mixed content or CORS errors

**Symptoms:** Browser console shows mixed content warnings or CORS errors.

**Resolution:**

- Ensure all ingress hosts use HTTPS (check TLS configuration)
- Verify the `CORS_ORIGINS` environment variable includes all allowed origins
- Check that the `COOKIE_DOMAIN` matches the ingestion domain
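A browser preflight can be reproduced with curl; the origin and endpoint below are placeholders. The response should echo the origin in `Access-Control-Allow-Origin`:

```shell
# Simulate a CORS preflight against the ingestion endpoint
curl -si -X OPTIONS https://ingest.example.com/v1/events \
  -H "Origin: https://app.example.com" \
  -H "Access-Control-Request-Method: POST" | grep -i access-control
```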
## Performance Issues

### High ingestion latency

**Symptoms:** `datafly_ingestion_request_duration_seconds` p99 > 500ms.

**Resolution:**
- Scale up Ingestion Gateway replicas or enable HPA
- Check Kafka producer latency — slow Kafka writes increase ingestion latency
- Verify node resources — CPU throttling causes latency spikes
- Check for large payloads — events > 1MB are rejected
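curl's timing variables help separate network overhead from server time; the endpoint is a placeholder:

```shell
# Break a request into phases to see where the time goes
curl -s -o /dev/null -w \
  'dns=%{time_namelookup} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  https://ingest.example.com/healthz
```

If `ttfb` dominates while `dns` and `tls` are small, the latency is inside the gateway or Kafka, not the network path.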
### High memory usage

**Symptoms:** Pods approaching memory limits, potential OOMKill.

```bash
kubectl top pods -n datafly
```

**Resolution:**
- Increase memory limits in the Helm values
- Check for memory leaks — monitor memory over time, not just point-in-time
- For Event Processor: reduce batch size if processing large events
- For Delivery Workers: check for stuck HTTP connections to slow vendor APIs
### Pods being OOMKilled

```bash
kubectl get events -n datafly --field-selector reason=OOMKilled
kubectl describe pod <pod-name> -n datafly | grep -A3 "Last State"
```

Increase memory limits in your values file:
```yaml
eventProcessor:
  resources:
    limits:
      memory: 2Gi  # Increase from default
```

## Useful Diagnostic Commands
```bash
# Overview of all Datafly resources
kubectl get all -n datafly

# Pod resource usage
kubectl top pods -n datafly

# Recent events (errors, warnings)
kubectl get events -n datafly --sort-by='.lastTimestamp' | tail -20

# Logs from all pods of a service
kubectl logs -n datafly -l app.kubernetes.io/name=ingestion-gateway --tail=50

# Exec into a pod for debugging
kubectl exec -it -n datafly deploy/management-api -- /bin/sh

# Check environment variables
kubectl exec -n datafly deploy/management-api -- env | sort

# Describe a pod for detailed status
kubectl describe pod <pod-name> -n datafly

# Check network policies
kubectl get networkpolicy -n datafly

# Check HPA status
kubectl get hpa -n datafly
kubectl describe hpa -n datafly
```

## Getting Help
If you can’t resolve an issue:

1. Collect diagnostics:

   ```bash
   kubectl get all -n datafly -o yaml > datafly-resources.yaml
   kubectl logs -n datafly -l app.kubernetes.io/part-of=datafly --tail=200 > datafly-logs.txt
   kubectl get events -n datafly > datafly-events.txt
   ```

2. Check the Datafly community forum for known issues
3. Contact Datafly support with the diagnostic output
## Next Steps
- Set up Observability to catch issues before they become outages
- Review Upgrades for safe upgrade procedures
- Configure Backup & DR for disaster recovery