# Observability

Datafly Signal exposes Prometheus-compatible metrics on every service and ships optional Grafana dashboards and alerting rules. This guide covers setting up monitoring for your deployment.

## Metrics Overview

Every Datafly service exposes a `/metrics` endpoint on its service port with Prometheus-compatible metrics:
| Service | Port | Key Metrics |
|---|---|---|
| Ingestion Gateway | 8080 | Request rate, latency, error rate, payload size |
| Event Processor | 8081 | Events processed/sec, Kafka consumer lag, processing latency |
| Delivery Workers | 8082 | Deliveries/sec, vendor latency, retry count, DLQ rate |
| Identity Hub | 8082 | Lookups/sec, cache hit rate, merge operations |
| Management API | 8083 | API request rate, latency, authentication failures |
| Management UI | 3000 | Request rate (static assets) |
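Each endpoint returns the standard Prometheus text exposition format. As an illustration of its shape (the metric name matches the Custom Metrics section below, but the labels, value, and HELP text here are invented):

```
# HELP datafly_ingestion_events_total Events accepted by the Ingestion Gateway
# TYPE datafly_ingestion_events_total counter
datafly_ingestion_events_total{source="web",type="pageview"} 1027
```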
## Prerequisites

The Datafly observability features require the Prometheus Operator (or kube-prometheus-stack) installed in your cluster. Datafly does not deploy Prometheus or Grafana — it provides the ServiceMonitor, PrometheusRule, and dashboard resources that integrate with your existing monitoring stack.

### Install kube-prometheus-stack

If you don’t have a monitoring stack, install kube-prometheus-stack:
```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
```

Setting `serviceMonitorSelectorNilUsesHelmValues=false` lets Prometheus select ServiceMonitors that were not created by the kube-prometheus Helm release itself — including the ones Datafly deploys in its own namespace.
## Enabling Observability
Enable the observability features in your Helm values:
```yaml
observability:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s   # Scrape interval (default: 15s)
    labels: {}      # Additional labels for ServiceMonitor discovery
  prometheusRules:
    enabled: true
  grafanaDashboards:
    enabled: true
```

This deploys:
- ServiceMonitor resources — one per Datafly service, telling Prometheus where to scrape metrics
- PrometheusRule resources — alerting rules for common failure conditions
- ConfigMap — Grafana dashboard JSON, auto-discovered by Grafana’s sidecar
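The rendered ServiceMonitors follow the standard Prometheus Operator schema. A sketch of roughly what the Ingestion Gateway’s might look like (the resource name, namespace, and selector labels here are illustrative, not the chart’s actual output):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: datafly-ingestion-gateway    # illustrative name
  namespace: datafly                 # your install namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingestion-gateway   # illustrative label
  endpoints:
    - port: http       # named service port that serves /metrics
      path: /metrics
      interval: 15s
```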
## Alerting Rules

The following alerts are included when `prometheusRules.enabled: true`:

### Critical Alerts
| Alert | Condition | Severity |
|---|---|---|
| DataflyIngestionDown | Ingestion Gateway has 0 ready pods for 5 minutes | critical |
| DataflyProcessingDown | Event Processor has 0 ready pods for 5 minutes | critical |
| DataflyDeliveryDown | All Delivery Workers have 0 ready pods for 5 minutes | critical |
| DataflyHighErrorRate | Ingestion error rate > 5% for 10 minutes | critical |
| DataflyDLQBacklog | Dead letter queue depth > 10,000 for 30 minutes | critical |
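The shipped rules use the standard PrometheusRule schema. As a sketch of roughly what DataflyHighErrorRate evaluates — assuming the request-duration metric is a histogram, so its `_count` series carries per-status request counts; the exact expression in the chart may differ:

```yaml
groups:
  - name: datafly.critical
    rules:
      - alert: DataflyHighErrorRate
        expr: |
          sum(rate(datafly_ingestion_request_duration_seconds_count{status=~"5.."}[5m]))
            / sum(rate(datafly_ingestion_request_duration_seconds_count[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
```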
### Warning Alerts
| Alert | Condition | Severity |
|---|---|---|
| DataflyHighLatency | p99 ingestion latency > 500ms for 10 minutes | warning |
| DataflyKafkaLagHigh | Consumer lag > 50,000 for 15 minutes | warning |
| DataflyDeliveryRetryRate | Delivery retry rate > 10% for 10 minutes | warning |
| DataflyPodRestarting | Any Datafly pod restarted > 3 times in 15 minutes | warning |
| DataflyDiskPressure | Kafka broker disk usage > 80% | warning |
| DataflyMemoryHigh | Any pod using > 90% of memory limit for 10 minutes | warning |
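For instance, DataflyKafkaLagHigh corresponds to a query along these lines, built on the consumer-lag gauge from the Custom Metrics section (the aggregation in the shipped rule may differ):

```promql
max by (consumer_group, topic) (datafly_processor_kafka_lag) > 50000
```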
### Info Alerts
| Alert | Condition | Severity |
|---|---|---|
| DataflyMigrationRunning | Migration init container running > 5 minutes | info |
| DataflyScalingUp | HPA scaling up pods | info |
## Grafana Dashboards

When `grafanaDashboards.enabled: true`, a ConfigMap containing Grafana dashboard JSON is deployed. If Grafana is configured with the dashboard sidecar (the default in kube-prometheus-stack), dashboards are automatically imported.
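In kube-prometheus-stack, the sidecar discovers ConfigMaps carrying the `grafana_dashboard` label (the label key is configurable via the chart’s sidecar settings). The ConfigMap Datafly deploys is shaped roughly like this — the resource name and data key here are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datafly-grafana-dashboards   # illustrative name
  labels:
    grafana_dashboard: "1"           # default sidecar discovery label
data:
  datafly-overview.json: |
    { "title": "Datafly Signal Overview", ... }
```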
### Dashboard: Datafly Signal Overview
The main dashboard provides:
- Event Pipeline — ingestion rate, processing rate, delivery rate, end-to-end latency
- Service Health — pod status, restart count, resource usage per service
- Kafka — consumer lag per topic, partition distribution, throughput
- Delivery — success/failure rate per vendor, retry breakdown, DLQ depth
- Infrastructure — CPU, memory, network I/O per pod
### Dashboard: Datafly Delivery Detail
Per-vendor delivery metrics:
- Delivery latency histogram (p50, p90, p99)
- HTTP status code breakdown
- Retry attempts distribution
- Circuit breaker state
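The latency panels are built from the delivery-duration metric; assuming `datafly_delivery_duration_seconds` is a Prometheus histogram, a p99-per-vendor query looks like:

```promql
histogram_quantile(0.99,
  sum by (le, vendor) (rate(datafly_delivery_duration_seconds_bucket[5m])))
```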
## Custom Metrics

Datafly services expose the following custom Prometheus metrics that you can use in your own dashboards and alerts:

### Ingestion Gateway

```
# Event ingestion
datafly_ingestion_events_total{source, type}
datafly_ingestion_events_bytes_total{source}
datafly_ingestion_request_duration_seconds{method, path, status}

# Validation
datafly_ingestion_validation_errors_total{source, reason}
```

### Event Processor
```
# Processing pipeline
datafly_processor_events_total{org, pipeline, status}
datafly_processor_duration_seconds{org, pipeline}
datafly_processor_transformations_total{type, status}

# Kafka consumer
datafly_processor_kafka_lag{topic, partition, consumer_group}
datafly_processor_kafka_messages_total{topic}
```

### Delivery Workers
```
# Delivery
datafly_delivery_total{vendor, status}
datafly_delivery_duration_seconds{vendor}
datafly_delivery_retries_total{vendor}
datafly_delivery_dlq_total{vendor, reason}

# Circuit breaker
datafly_delivery_circuit_state{vendor}   # 0=closed, 1=half-open, 2=open
```

### Identity Hub
```
# Identity resolution
datafly_identity_lookups_total{status}
datafly_identity_merges_total
datafly_identity_cache_hits_total
datafly_identity_cache_misses_total
```

## Integration with Cloud Monitoring
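For example, the Identity Hub cache hit rate can be derived from the two cache counters:

```promql
rate(datafly_identity_cache_hits_total[5m])
  / (rate(datafly_identity_cache_hits_total[5m]) + rate(datafly_identity_cache_misses_total[5m]))
```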
### AWS CloudWatch

Use the CloudWatch Agent or Amazon Managed Service for Prometheus to forward metrics from your cluster to CloudWatch.
### GCP Cloud Monitoring

GKE clusters with Google Cloud Managed Service for Prometheus enabled can collect the same metrics; define PodMonitoring resources (or equivalent scrape configuration) targeting the Datafly services.
### Azure Monitor
Use Azure Monitor managed service for Prometheus with AKS to collect metrics without deploying your own Prometheus.
## Health Check Endpoints
All services expose health check endpoints used by Kubernetes probes:
| Endpoint | Purpose | Checks |
|---|---|---|
| /healthz | Liveness probe | Process is alive |
| /readyz | Readiness probe | Dependencies are reachable (DB, Kafka, Redis) |
| /metrics | Prometheus scrape | All metrics |
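These endpoints map onto standard Kubernetes probes; the chart’s pod specs wire them up roughly like this (the port is the service port from the table above; the timing values here are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080       # e.g. Ingestion Gateway
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
```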
## Next Steps
- Review Troubleshooting for resolving common monitoring alerts
- Configure Backup & DR for disaster recovery procedures
- See Upgrades for monitoring during upgrade rollouts