Observability

Datafly Signal exposes Prometheus-compatible metrics on every service and ships optional Grafana dashboards and alerting rules. This guide covers setting up monitoring for your deployment.

Metrics Overview

Every Datafly service exposes a /metrics endpoint on its service port with Prometheus-compatible metrics:

| Service           | Port | Key Metrics                                                  |
|-------------------|------|--------------------------------------------------------------|
| Ingestion Gateway | 8080 | Request rate, latency, error rate, payload size              |
| Event Processor   | 8081 | Events processed/sec, Kafka consumer lag, processing latency |
| Delivery Workers  | 8082 | Deliveries/sec, vendor latency, retry count, DLQ rate        |
| Identity Hub      | 8082 | Lookups/sec, cache hit rate, merge operations                |
| Management API    | 8083 | API request rate, latency, authentication failures           |
| Management UI     | 3000 | Request rate (static assets)                                 |
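
If you run a standalone Prometheus rather than the Operator, the endpoints above can also be scraped with a static config. A minimal sketch, assuming a `datafly` namespace and illustrative service names (both are assumptions, not guaranteed chart defaults):

```yaml
# prometheus.yml fragment: static scrape of two Datafly service ports.
# Service DNS names and the "datafly" namespace are illustrative.
scrape_configs:
  - job_name: datafly-ingestion
    metrics_path: /metrics
    static_configs:
      - targets: ["datafly-ingestion-gateway.datafly.svc:8080"]
  - job_name: datafly-processor
    metrics_path: /metrics
    static_configs:
      - targets: ["datafly-event-processor.datafly.svc:8081"]
```

With the Prometheus Operator installed, the ServiceMonitors described below replace this manual configuration.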

Prerequisites

Datafly's observability features require the Prometheus Operator (or kube-prometheus-stack) to be installed in your cluster. Datafly does not deploy Prometheus or Grafana itself; it ships ServiceMonitor, PrometheusRule, and dashboard resources that plug into your existing monitoring stack.

Install kube-prometheus-stack

If you don’t have a monitoring stack, install kube-prometheus-stack:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

Setting the two *SelectorNilUsesHelmValues flags to false lets Prometheus discover ServiceMonitors and PodMonitors from all namespaces, including the Datafly namespace.

Enabling Observability

Enable the observability features in your Helm values:

observability:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s          # Scrape interval (default: 15s)
    labels: {}             # Additional labels for ServiceMonitor discovery
  prometheusRules:
    enabled: true
  grafanaDashboards:
    enabled: true

This deploys:

  1. ServiceMonitor resources, one per Datafly service, telling Prometheus where to scrape metrics
  2. PrometheusRule resources with alerting rules for common failure conditions
  3. A ConfigMap containing Grafana dashboard JSON, auto-discovered by Grafana's sidecar
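
The rendered ServiceMonitors follow the usual Operator shape. As a sketch of what the Ingestion Gateway monitor might look like (the resource name, selector labels, and port name are assumptions based on common chart conventions, not verified chart output):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: datafly-ingestion-gateway        # illustrative name
  labels: {}                             # serviceMonitor.labels are merged here
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: datafly-ingestion-gateway  # assumed label
  endpoints:
    - port: http                         # assumed port name
      path: /metrics
      interval: 15s                      # from serviceMonitor.interval
```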

Alerting Rules

The following alerts are included when prometheusRules.enabled: true:

Critical Alerts

| Alert                  | Condition                                           | Severity |
|------------------------|-----------------------------------------------------|----------|
| DataflyIngestionDown   | Ingestion Gateway has 0 ready pods for 5 minutes    | critical |
| DataflyProcessingDown  | Event Processor has 0 ready pods for 5 minutes      | critical |
| DataflyDeliveryDown    | All Delivery Workers have 0 ready pods for 5 minutes| critical |
| DataflyHighErrorRate   | Ingestion error rate > 5% for 10 minutes            | critical |
| DataflyDLQBacklog      | Dead letter queue depth > 10,000 for 30 minutes     | critical |
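
As a reference for writing rules of your own, an alert like DataflyHighErrorRate could be expressed roughly as follows. This is a sketch, not the packaged rule: it assumes 5xx responses are recorded in the `status` label of the request-duration histogram documented under Custom Metrics.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: datafly-custom-alerts            # illustrative name
spec:
  groups:
    - name: datafly.ingestion
      rules:
        - alert: DataflyHighErrorRate
          expr: |
            sum(rate(datafly_ingestion_request_duration_seconds_count{status=~"5.."}[5m]))
              /
            sum(rate(datafly_ingestion_request_duration_seconds_count[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
```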

Warning Alerts

| Alert                     | Condition                                            | Severity |
|---------------------------|------------------------------------------------------|----------|
| DataflyHighLatency        | p99 ingestion latency > 500ms for 10 minutes         | warning  |
| DataflyKafkaLagHigh       | Consumer lag > 50,000 for 15 minutes                 | warning  |
| DataflyDeliveryRetryRate  | Delivery retry rate > 10% for 10 minutes             | warning  |
| DataflyPodRestarting      | Any Datafly pod restarted > 3 times in 15 minutes    | warning  |
| DataflyDiskPressure       | Kafka broker disk usage > 80%                        | warning  |
| DataflyMemoryHigh         | Any pod using > 90% of memory limit for 10 minutes   | warning  |
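
Similarly, DataflyKafkaLagHigh can be approximated from the consumer-lag gauge listed under Custom Metrics (again, a sketch; the packaged rule may use a different expression or label set):

```yaml
- alert: DataflyKafkaLagHigh
  expr: sum by (consumer_group, topic) (datafly_processor_kafka_lag) > 50000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Consumer lag above 50k on {{ $labels.topic }}"
```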

Info Alerts

| Alert                    | Condition                                   | Severity |
|--------------------------|---------------------------------------------|----------|
| DataflyMigrationRunning  | Migration init container running > 5 minutes| info     |
| DataflyScalingUp         | HPA scaling up pods                         | info     |

Grafana Dashboards

When grafanaDashboards.enabled: true, a ConfigMap containing Grafana dashboard JSON is deployed. If Grafana is configured with the dashboard sidecar (default in kube-prometheus-stack), dashboards are automatically imported.
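
The sidecar discovers dashboards by label. Conceptually, the ConfigMap Datafly ships looks like the sketch below; the ConfigMap name is illustrative, and the `grafana_dashboard` label key is the kube-prometheus-stack sidecar default:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datafly-dashboards        # illustrative name
  labels:
    grafana_dashboard: "1"        # default sidecar discovery label
data:
  datafly-overview.json: |
    { "title": "Datafly Signal Overview", "panels": [] }   # placeholder; real dashboard JSON goes here
```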

Dashboard: Datafly Signal Overview

The main dashboard provides:

  • Event Pipeline — ingestion rate, processing rate, delivery rate, end-to-end latency
  • Service Health — pod status, restart count, resource usage per service
  • Kafka — consumer lag per topic, partition distribution, throughput
  • Delivery — success/failure rate per vendor, retry breakdown, DLQ depth
  • Infrastructure — CPU, memory, network I/O per pod

Dashboard: Datafly Delivery Detail

Per-vendor delivery metrics:

  • Delivery latency histogram (p50, p90, p99)
  • HTTP status code breakdown
  • Retry attempts distribution
  • Circuit breaker state

Custom Metrics

Datafly services expose the following custom Prometheus metrics that you can use in your own dashboards and alerts:

Ingestion Gateway

# Event ingestion
datafly_ingestion_events_total{source, type}
datafly_ingestion_events_bytes_total{source}
datafly_ingestion_request_duration_seconds{method, path, status}

# Validation
datafly_ingestion_validation_errors_total{source, reason}
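
These counters compose naturally into recording rules, for example an overall events-per-second rate and a validation-error ratio. The rule names below are suggestions, not shipped defaults:

```yaml
groups:
  - name: datafly.ingestion.recording
    rules:
      - record: datafly:ingestion_events:rate5m
        expr: sum(rate(datafly_ingestion_events_total[5m]))
      - record: datafly:ingestion_validation_error_ratio:rate5m
        expr: |
          sum(rate(datafly_ingestion_validation_errors_total[5m]))
            /
          sum(rate(datafly_ingestion_events_total[5m]))
```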

Event Processor

# Processing pipeline
datafly_processor_events_total{org, pipeline, status}
datafly_processor_duration_seconds{org, pipeline}
datafly_processor_transformations_total{type, status}

# Kafka consumer
datafly_processor_kafka_lag{topic, partition, consumer_group}
datafly_processor_kafka_messages_total{topic}

Delivery Workers

# Delivery
datafly_delivery_total{vendor, status}
datafly_delivery_duration_seconds{vendor}
datafly_delivery_retries_total{vendor}
datafly_delivery_dlq_total{vendor, reason}

# Circuit breaker
datafly_delivery_circuit_state{vendor}  # 0=closed, 1=half-open, 2=open
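
The circuit-breaker gauge lends itself to a simple alert whenever any vendor's circuit opens. A sketch, not one of the packaged rules:

```yaml
- alert: DataflyCircuitOpen
  expr: datafly_delivery_circuit_state == 2   # 2 = open, per the gauge's encoding
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Delivery circuit open for vendor {{ $labels.vendor }}"
```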

Identity Hub

# Identity resolution
datafly_identity_lookups_total{status}
datafly_identity_merges_total
datafly_identity_cache_hits_total
datafly_identity_cache_misses_total
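
Cache effectiveness is the usual hits / (hits + misses) ratio, which can be captured as a recording rule (the rule name is a suggestion):

```yaml
- record: datafly:identity_cache_hit_ratio:rate5m
  expr: |
    sum(rate(datafly_identity_cache_hits_total[5m]))
      /
    (sum(rate(datafly_identity_cache_hits_total[5m]))
     + sum(rate(datafly_identity_cache_misses_total[5m])))
```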

Integration with Cloud Monitoring

AWS CloudWatch

Use the CloudWatch Agent or Amazon Managed Prometheus to forward metrics from your cluster to CloudWatch.

GCP Cloud Monitoring

GKE clusters with Google Cloud Managed Service for Prometheus enabled collect Prometheus metrics from workloads you target with PodMonitoring resources, without deploying your own Prometheus.

Azure Monitor

Use Azure Monitor managed service for Prometheus with AKS to collect metrics without deploying your own Prometheus.

Health Check Endpoints

All services expose health check endpoints used by Kubernetes probes:

| Endpoint  | Purpose           | Checks                                        |
|-----------|-------------------|-----------------------------------------------|
| /healthz  | Liveness probe    | Process is alive                              |
| /readyz   | Readiness probe   | Dependencies are reachable (DB, Kafka, Redis) |
| /metrics  | Prometheus scrape | All metrics                                   |
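
In a pod spec, these endpoints map onto standard Kubernetes probes. A sketch for the Ingestion Gateway, with illustrative port and timing values (not chart defaults):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```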

Next Steps

  • Review Troubleshooting for resolving common monitoring alerts
  • Configure Backup & DR for disaster recovery procedures
  • See Upgrades for monitoring during upgrade rollouts