# Troubleshooting
Common issues and their resolutions for Datafly Signal deployments.
## Installation Issues

### Preflight check fails
The preflight Job runs before installation and validates connectivity to required services. If it fails, Helm will abort the install.
**Diagnosis:**

```bash
# Check preflight Job logs
kubectl logs -n datafly job/datafly-preflight

# Check Job status
kubectl describe job/datafly-preflight -n datafly
```

**Common causes:**
| Error | Cause | Resolution |
|---|---|---|
| `kafka: connection refused` | Kafka not reachable from cluster | Check security groups, VPC peering, and broker endpoint |
| `redis: connection refused` | Redis not reachable from cluster | Check security groups, ensure Redis is in same VPC |
| `postgresql: connection refused` | Database not reachable | Check security groups, verify connection string |
| `kubernetes: version too old` | K8s version < 1.28 | Upgrade your cluster to 1.28+ |
| `dns: resolution failed` | Ingress hostname not resolvable | Create DNS records before installing, or skip DNS check |
To skip specific checks (not recommended for production):

```yaml
preflight:
  checks:
    dns: false       # Skip DNS check if records aren't created yet
    outbound: false  # Skip outbound check in air-gapped environments
```

### Pods stuck in Pending
**Diagnosis:**

```bash
kubectl describe pod <pod-name> -n datafly
```

**Common causes:**
- Insufficient resources: Nodes don’t have enough CPU/memory. Scale up your node pool or use smaller resource requests.
- No matching node: Node selector or affinity rules prevent scheduling. Check for taints and tolerations.
- PVC pending: If using persistent volumes, ensure the storage class exists.
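A few quick checks can usually identify which of these constraints is blocking scheduling; these are generic Kubernetes diagnostics, not Datafly-specific commands:

```shell
# Compare node capacity against what is already allocated
kubectl describe nodes | grep -A 8 "Allocated resources"

# List any taints that could be repelling the pod
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'

# If a PVC is pending, confirm its storage class exists
kubectl get pvc -n datafly
kubectl get storageclass
```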
### Pods stuck in ImagePullBackOff

```bash
kubectl describe pod <pod-name> -n datafly | grep -A5 "Events"
```

**Resolution:**
- Verify the image tag exists: `docker pull ghcr.io/datafly/ingestion-gateway:v1.2.0`
- Check `imagePullSecrets` in your values file
- Ensure network policies allow egress to the container registry
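If the images live in a private registry, a pull secret can be created and referenced from the values file; the secret name `datafly-regcred` below is just an example:

```shell
# Create a registry pull secret (substitute real credentials)
kubectl create secret docker-registry datafly-regcred \
  --namespace datafly \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<token>
```

Then reference the secret name under `imagePullSecrets` in your values file and upgrade the release.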
## Connectivity Issues

### Events not reaching the Ingestion Gateway

**Symptoms:** Browser console shows network errors when sending events. `curl` to the ingestion endpoint returns an error.

**Diagnosis:**

```bash
# Check the ingress resource
kubectl get ingress -n datafly
kubectl describe ingress -n datafly

# Check the ingestion gateway service
kubectl get svc -n datafly
kubectl get endpoints ingestion-gateway -n datafly

# Test from inside the cluster
kubectl run test --rm -it --image=curlimages/curl -- \
  curl -v http://ingestion-gateway:8080/healthz
```

**Common causes:**
| Symptom | Cause | Resolution |
|---|---|---|
| No ADDRESS on ingress | Load balancer not provisioned | Check ingress controller logs, cloud provider quotas |
| 502 Bad Gateway | Pods not ready | Check readiness probes, pod logs |
| 404 Not Found | Wrong path configuration | Verify `ingress.hosts[].paths` in values |
| Connection timeout | DNS not pointing to load balancer | Check DNS records, TTL may need time to propagate |
| TLS error | Certificate not issued | Check cert-manager logs, verify ClusterIssuer |
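When DNS is the suspect, compare the address the ingress actually received with what the public record resolves to; the hostname below is a placeholder:

```shell
# Address assigned to the ingress by the load balancer
kubectl get ingress -n datafly \
  -o jsonpath='{.items[0].status.loadBalancer.ingress}{"\n"}'

# What the public record resolves to — the two should match
dig +short ingest.example.com
```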
### Services can’t connect to Kafka

**Symptoms:** Event Processor or Delivery Workers log `connection refused` or `dial tcp: i/o timeout` for Kafka.

```bash
# Check pod logs
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=50

# Test connectivity from a pod
kubectl run test --rm -it --image=busybox -- \
  nc -zv <kafka-broker-host> 9092
```

**Resolution:**
- Verify the `KAFKA_BROKERS` environment variable is correct
- Check security group rules allow traffic from K8s nodes to Kafka brokers
- For MSK with TLS: ensure `KAFKA_SECURITY_PROTOCOL=TLS` or `SASL_SSL`
- For Event Hubs: ensure `KAFKA_SECURITY_PROTOCOL=SASL_SSL` and `KAFKA_SASL_MECHANISM=PLAIN`
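`nc` only proves the port is open; for TLS listeners it also helps to verify the handshake itself. A sketch using a throwaway pod — the image and broker host are placeholders:

```shell
# A completed handshake prints the broker's certificate details
kubectl run tls-test --rm -i --restart=Never --image=alpine/openssl --command -- \
  sh -c 'echo | openssl s_client -connect <kafka-broker-host>:9092 -brief'
```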
### Services can’t connect to PostgreSQL

```bash
# Test from inside the cluster
kubectl run test --rm -it --image=postgres:16 -- \
  psql "postgresql://datafly:password@db-host:5432/datafly?sslmode=require" -c "SELECT 1"
```

**Resolution:**
- Verify `DATABASE_URL` in the secret
- Check security groups allow traffic from K8s nodes on port 5432
- For Cloud SQL with private IP: ensure the GKE cluster is in the same VPC
- For Cloud SQL Auth Proxy: check the sidecar container logs
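Stale credentials are a common culprit: decode the secret to see the connection string the pods actually receive. The secret and key names below are assumptions — check your values file for the real ones:

```shell
# Print the DATABASE_URL as mounted into the pods (mask it before sharing logs)
kubectl get secret datafly-secrets -n datafly \
  -o jsonpath='{.data.DATABASE_URL}' | base64 -d; echo
```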
### Services can’t connect to Redis

```bash
kubectl run test --rm -it --image=redis:7-alpine -- \
  redis-cli -h <redis-host> -p 6379 --tls PING
```

**Resolution:**
- Verify `REDIS_URL` in the secret (including `rediss://` for TLS)
- Check security groups allow traffic on port 6379 (or 6380 for Azure)
- For ElastiCache with auth: include the auth token in the URL
- For Azure Cache: use port 6380 with TLS
## Event Pipeline Issues

### Events ingested but not processed

**Symptoms:** Events are accepted by the Ingestion Gateway (HTTP 200) but don’t appear in vendor dashboards.

**Diagnosis:**

```bash
# Check Event Processor logs
kubectl logs -n datafly -l app.kubernetes.io/name=event-processor --tail=100

# Check Kafka consumer lag
# (If using Kafka UI or CLI)
kafka-consumer-groups --bootstrap-server <broker> \
  --describe --group event-processor
```

**Common causes:**
- Kafka consumer group not consuming: Check for authentication errors in Event Processor logs
- Processing errors: Check for transformation errors in logs — a malformed pipeline configuration can cause processing to fail
- Consumer lag: If lag is growing, the Event Processor may need more replicas or resources
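To separate ingestion problems from processing problems, peek at the topic the gateway writes to. The topic name `events.raw` is an assumption — use whatever your pipeline configuration names it:

```shell
# If nothing appears here, the gateway isn't producing — look upstream, not at the processor
kafka-console-consumer --bootstrap-server <broker> \
  --topic events.raw --from-beginning --max-messages 5
```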
### Events processed but not delivered

**Symptoms:** Events appear in the Kafka `events.processed` topic but vendors don’t receive them.

```bash
# Check Delivery Worker logs
kubectl logs -n datafly -l app.kubernetes.io/name=delivery-worker-webhook --tail=100

# Check the DLQ topic for failed deliveries
kafka-console-consumer --bootstrap-server <broker> \
  --topic events.dlq --from-beginning --max-messages 5
```

**Common causes:**
- Vendor API errors: Check Delivery Worker logs for HTTP status codes from vendor APIs
- Network egress blocked: Delivery Workers need outbound HTTPS to vendor APIs (port 443). Check network policies and NAT gateway
- Circuit breaker open: If a vendor API is consistently failing, the circuit breaker opens. Check the `datafly_delivery_circuit_state` metric
- DLQ overflow: Events failing after max retries are sent to the DLQ
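Egress is worth testing from an actual worker pod, since network policies match on pod labels. A sketch — the deployment name and vendor URL are placeholders, and it assumes the worker image ships `wget`:

```shell
# Test outbound HTTPS from inside a delivery worker
kubectl exec -n datafly deploy/delivery-worker-webhook -- \
  wget -qS --spider https://api.example-vendor.com 2>&1 | head -5
```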
### High Kafka consumer lag

**Symptoms:** The `datafly_processor_kafka_lag` metric is growing.

**Resolution:**

- Check Event Processor resource usage — it may need more CPU/memory
- Scale up replicas: increase `eventProcessor.replicas` or enable HPA
- Check for slow processing — transformation errors or complex pipelines can slow throughput
- Verify Kafka partition count >= consumer replica count
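Replicas beyond the partition count sit idle — a consumer group assigns at most one active consumer per partition. To check and scale (the topic and deployment names below are illustrative):

```shell
# Partition count caps effective parallelism for the consumer group
kafka-topics --bootstrap-server <broker> --describe --topic <topic>

# Scale the processor, keeping replicas <= partitions
kubectl scale deploy/event-processor -n datafly --replicas=4
```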
## Migration Issues

### Migration init container stuck

**Symptoms:** Management API pod shows `Init:0/1` for an extended period.

```bash
# Check init container logs
kubectl logs -n datafly <management-api-pod> -c migrate
```

**Common causes:**
- Database not reachable: Check connectivity (see above)
- Lock contention: Another migration is running. The init container uses a Redis lock to prevent concurrent migrations. Wait for the other migration to complete, or clear the lock:

  ```bash
  kubectl run redis-cli --rm -it --image=redis:7-alpine -- \
    redis-cli -h <redis-host> DEL migrate:lock
  ```

- Migration SQL error: Check the init container logs for SQL errors. You may need to manually fix the database state (see Migrations)
### Dirty migration state

If a migration failed partway through, the database is marked as “dirty”:

```bash
# Check migration state
kubectl exec -n datafly deploy/management-api -- \
  /app/migrate version

# Force to a specific version after manual fix
kubectl run migrate-fix --rm -it \
  --namespace datafly \
  --image=ghcr.io/datafly/management-api:v1.2.0 \
  --env="DATABASE_URL=postgresql://..." \
  -- /app/migrate force <version>
```

## TLS / Certificate Issues
### cert-manager not issuing certificates

```bash
# Check Certificate status
kubectl get certificate -n datafly
kubectl describe certificate datafly-tls -n datafly

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=50
```

**Common causes:**
- ClusterIssuer not found: Create the ClusterIssuer for Let’s Encrypt
- DNS challenge failed: For wildcard certificates, ensure DNS01 solver is configured
- Rate limited: Let’s Encrypt has rate limits — check for “too many certificates” errors
- Ingress annotation missing: Ensure the `cert-manager.io/cluster-issuer` annotation is set
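cert-manager records issuance progress in intermediate resources, so walking the chain usually pinpoints the failure (the ClusterIssuer name below is an example):

```shell
# Follow the chain: Certificate -> CertificateRequest -> Order -> Challenge
kubectl get certificaterequests,orders,challenges -n datafly
kubectl describe challenges -n datafly

# Confirm the issuer itself reports Ready
kubectl describe clusterissuer letsencrypt-prod
```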
### Mixed content or CORS errors

**Symptoms:** Browser console shows mixed content warnings or CORS errors.

**Resolution:**

- Ensure all ingress hosts use HTTPS (check TLS configuration)
- Verify the `CORS_ORIGINS` environment variable includes all allowed origins
- Check that the `COOKIE_DOMAIN` matches the ingestion domain
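A browser preflight can be reproduced with curl; the origin and endpoint below are placeholders. The response should echo the origin in `Access-Control-Allow-Origin`:

```shell
# Simulate a CORS preflight against the ingestion endpoint
curl -si -X OPTIONS https://ingest.example.com/v1/events \
  -H "Origin: https://app.example.com" \
  -H "Access-Control-Request-Method: POST" | grep -i access-control
```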
## Performance Issues

### High ingestion latency

**Symptoms:** `datafly_ingestion_request_duration_seconds` p99 > 500ms.

**Resolution:**
- Scale up Ingestion Gateway replicas or enable HPA
- Check Kafka producer latency — slow Kafka writes increase ingestion latency
- Verify node resources — CPU throttling causes latency spikes
- Check for large payloads — events > 1MB are rejected
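curl's timing variables help separate network overhead from server time; the endpoint is a placeholder:

```shell
# Break a request into phases to see where the time goes
curl -s -o /dev/null -w \
  'dns=%{time_namelookup} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  https://ingest.example.com/healthz
```

If `ttfb` dominates while `dns` and `tls` are small, the latency is inside the gateway or Kafka, not the network path.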
### High memory usage

**Symptoms:** Pods approaching memory limits, potential OOMKill.

```bash
kubectl top pods -n datafly
```

**Resolution:**
- Increase memory limits in the Helm values
- Check for memory leaks — monitor memory over time, not just point-in-time
- For Event Processor: reduce batch size if processing large events
- For Delivery Workers: check for stuck HTTP connections to slow vendor APIs
### Pods being OOMKilled

```bash
kubectl get events -n datafly --field-selector reason=OOMKilled
kubectl describe pod <pod-name> -n datafly | grep -A3 "Last State"
```

Increase memory limits in your values file:
```yaml
eventProcessor:
  resources:
    limits:
      memory: 2Gi  # Increase from default
```

## Useful Diagnostic Commands
```bash
# Overview of all Datafly resources
kubectl get all -n datafly

# Pod resource usage
kubectl top pods -n datafly

# Recent events (errors, warnings)
kubectl get events -n datafly --sort-by='.lastTimestamp' | tail -20

# Logs from all pods of a service
kubectl logs -n datafly -l app.kubernetes.io/name=ingestion-gateway --tail=50

# Exec into a pod for debugging
kubectl exec -it -n datafly deploy/management-api -- /bin/sh

# Check environment variables
kubectl exec -n datafly deploy/management-api -- env | sort

# Describe a pod for detailed status
kubectl describe pod <pod-name> -n datafly

# Check network policies
kubectl get networkpolicy -n datafly

# Check HPA status
kubectl get hpa -n datafly
kubectl describe hpa -n datafly
```

## Getting Help
If you can’t resolve an issue:

1. Collect diagnostics:

   ```bash
   kubectl get all -n datafly -o yaml > datafly-resources.yaml
   kubectl logs -n datafly -l app.kubernetes.io/part-of=datafly --tail=200 > datafly-logs.txt
   kubectl get events -n datafly > datafly-events.txt
   ```

2. Check the Datafly community forum for known issues
3. Contact Datafly support with the diagnostic output
## Next Steps
- Set up Observability to catch issues before they become outages
- Review Upgrades for safe upgrade procedures
- Configure Backup & DR for disaster recovery