Observability

Datafly Signal provides two layers of monitoring: a built-in alerting system that works with any deployment (including single-server setups), and Prometheus-compatible metrics that integrate with your existing monitoring stack (Prometheus, Grafana, Datadog, New Relic, etc.).

Built-in Monitoring & Alerting

Every Datafly Signal deployment includes a built-in monitoring and alerting system that requires no external tools. This is available from the Management UI under Observe > Monitoring and Observe > Alerts.

Monitoring Dashboard

The Monitoring page provides a real-time health overview of your deployment:

  • Summary cards — total events, success rate, error rate, vendor delivery rate, DLQ pending, active alerts
  • Event timeline — stacked area chart showing delivered, failed, errored, and filtered events over time
  • Integration delivery status — per-vendor delivery success rates with colour-coded health indicators
  • Pipeline health — per-pipeline success/error rates and DLQ counts
  • Event type breakdown — distribution of event types (page, track, identify, etc.)

Time ranges: 1 hour, 6 hours, 24 hours, 7 days, 30 days, or a custom date range. Results can be filtered by pipeline, and the dashboard auto-refreshes every 60 seconds.

Alert Rules

Alert rules evaluate every 60 seconds against your event statistics. When a threshold is breached, notifications are sent to configured channels.

Available metrics:

Metric                 Description                                               Example
delivery_success_rate  Ratio of successfully delivered events to total events    Alert when < 90%
vendor_success_rate    Per-integration delivery success rate                     Alert when Meta CAPI < 80%
error_rate             Ratio of errored events to total events                   Alert when > 5%
dlq_depth              Number of pending dead letter queue events                Alert when > 100
zero_traffic           Total event count in the window (zero = problem)          Alert when = 0 for 30 min
consent_filter_rate    Ratio of consent-filtered events to total                 Monitor for unexpected spikes

Operators: Less than, Greater than, Less than or equal, Greater than or equal, Equal to.

Scoping: Rules can be scoped to the entire organisation, a specific pipeline, a specific integration, or a vendor type.

Cooldowns: Each rule has a configurable cooldown period (default: 60 minutes) to prevent alert storms.

Default rules are created for every new organisation:

Rule                            Metric                 Condition          Severity
Delivery Success Rate Low       delivery_success_rate  < 90% over 15 min  Warning
Delivery Success Rate Critical  delivery_success_rate  < 50% over 5 min   Critical
DLQ Depth High                  dlq_depth              > 100 over 5 min   Warning
Zero Traffic                    zero_traffic           = 0 over 30 min    Critical
Error Rate High                 error_rate             > 5% over 15 min   Warning

Notification Channels

Alert notifications can be sent via:

Webhook — Send a JSON payload to any URL. Useful for integrating with PagerDuty, OpsGenie, custom systems, or internal tools.

{
  "alert_id": "abc-123",
  "rule_name": "Delivery Success Rate Low",
  "severity": "warning",
  "metric": "delivery_success_rate",
  "value": 0.85,
  "threshold": 0.9,
  "message": "Delivery Success Rate Low: 85.0% is below 90.0% threshold",
  "fired_at": "2026-04-06T14:30:00Z"
}
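
To test a receiver end to end, you can replay a trimmed version of this payload by hand. The URL below is a placeholder for your own endpoint, not a Datafly address:

# Replay a sample alert payload against a placeholder receiver URL
curl -X POST https://hooks.example.internal/datafly-alerts \
  -H "Content-Type: application/json" \
  -d '{"rule_name": "Delivery Success Rate Low", "severity": "warning", "metric": "delivery_success_rate", "value": 0.85, "threshold": 0.9}'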

Slack — Send formatted messages to a Slack channel via incoming webhook. Includes severity badges, metric details, and timestamps.

Email — Send alerts via a webhook-based email service (SendGrid, Resend, Postmark). Configure recipients and the email service webhook URL.

Alert History

All fired alerts are recorded with:

  • When the alert fired and (if applicable) when it resolved
  • The metric value at the time of firing
  • Which notification channels were notified
  • Acknowledgement status (who acknowledged and when)

Prometheus Metrics

Every Datafly service exposes a /metrics endpoint with Prometheus-compatible metrics. This is the universal interface — any monitoring tool that supports Prometheus (Grafana, Datadog, New Relic, Dynatrace, Splunk, CloudWatch, GCP Managed Prometheus) can scrape these endpoints.

Service            Port  Key Metrics
Ingestion Gateway  8080  Request rate, latency, error rate, payload size
Event Processor    8081  Events processed/sec, Kafka consumer lag, processing latency
Delivery Workers   8082  Deliveries/sec, vendor latency, retry count, DLQ rate
Identity Hub       8082  Lookups/sec, cache hit rate
Management API     8083  API request rate, latency, authentication failures
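
A quick way to confirm a service is exporting metrics is to curl its endpoint directly, shown here against the Ingestion Gateway and assuming local or port-forwarded access on the default port:

# List the Datafly metric families exposed by the Ingestion Gateway
curl -s http://localhost:8080/metrics | grep '^datafly_'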

Standard Metrics (all services)

datafly_info{service, go_version}                    # Build metadata (gauge, always 1)
datafly_uptime_seconds{service}                      # Time since process start
datafly_goroutines                                   # Current goroutine count
datafly_http_requests_total{service, handler, method, status_code}
datafly_http_request_duration_seconds{service, handler, method}
datafly_http_active_connections                       # In-flight connections

Pipeline Metrics

datafly_events_ingested_total{pipeline, type}        # Events received
datafly_events_processed_total{pipeline, status}     # Events processed (ok/filtered/error)
datafly_events_delivered_total{pipeline, vendor, integration_id, status}
datafly_delivery_latency_seconds{vendor}             # End-to-end delivery latency
datafly_delivery_vendor_latency_seconds{vendor}      # Vendor API response time
datafly_consent_filtered_total{pipeline, vendor, category}
datafly_dlq_depth{pipeline}                          # Pending DLQ events
datafly_dlq_events_total{pipeline, reason}           # DLQ events written
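
These counters translate directly into dashboard queries. For example, a delivery success rate similar to the built-in delivery_success_rate metric can be expressed in PromQL along these lines (the status="ok" label value is assumed from the conventions above and may differ in your build):

# Share of deliveries that succeeded over the last 15 minutes
sum(rate(datafly_events_delivered_total{status="ok"}[15m]))
  /
sum(rate(datafly_events_delivered_total[15m]))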

Infrastructure Metrics

# Kafka
datafly_kafka_messages_consumed_total{topic, group}
datafly_kafka_consumer_lag{topic, group}
datafly_kafka_consumer_errors_total{topic, group}
datafly_kafka_messages_produced_total{topic}

# PostgreSQL
datafly_pg_pool_total_conns{service}
datafly_pg_pool_idle_conns{service}
datafly_pg_pool_max_conns{service}

# Redis
datafly_redis_commands_total{service, command}
datafly_redis_command_duration_seconds{service, command}
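
The pool gauges make saturation easy to express. For instance, the utilisation checked by the DataflyPGPoolExhaustionCritical alert (described below) corresponds to a query of roughly this shape:

# PostgreSQL pool utilisation per service; the critical alert fires above 95%
datafly_pg_pool_total_conns / datafly_pg_pool_max_conns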

Kubernetes: Helm Observability

The Helm observability features require the Prometheus Operator (kube-prometheus-stack) installed in your cluster. Datafly does not deploy Prometheus or Grafana — it provides the CRDs that integrate with your existing monitoring stack.

Install kube-prometheus-stack

If you don’t have a monitoring stack:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
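
Before enabling the Datafly observability flags, confirm the Operator's CRDs are registered:

# ServiceMonitor, PodMonitor, and PrometheusRule all live in this API group
kubectl get crds | grep monitoring.coreos.com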

Enable Observability

observability:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s
    labels: {}
  prometheusRules:
    enabled: true
  grafanaDashboards:
    enabled: true

This deploys:

  1. ServiceMonitor CRDs — one per Datafly service, telling Prometheus where to scrape
  2. PrometheusRule CRDs — alerting rules for common failure conditions
  3. ConfigMap — Grafana dashboard JSON, auto-discovered by Grafana’s sidecar
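
Apply the values with a normal helm upgrade and verify the objects exist. Release, chart, and namespace names here are illustrative; substitute your own:

# Illustrative release/chart/namespace names
helm upgrade datafly datafly/datafly-signal -n datafly -f values.yaml
kubectl get servicemonitors,prometheusrules -n datafly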

Prometheus Alerting Rules

Critical Alerts

Alert                            Condition                                For
DataflyIngestionDown             Ingestion Gateway has 0 ready pods       2m
DataflyProcessingDown            Event Processor has 0 ready pods         2m
DataflyDeliveryDown              All Delivery Workers down                2m
DataflyIngestionErrorRate        >5% ingestion errors over 5 min          5m
DataflyDeliveryFailureRate       >10% delivery failures over 15 min       15m
DataflyDeliveryVendorDown        >50% failure rate for a specific vendor  5m
DataflyDLQDepthCritical          DLQ backlog >10,000 events               5m
DataflyPGPoolExhaustionCritical  PG pool >95% utilised                    2m

Warning Alerts

Alert                            Condition                       For
DataflyIngestionLatencyHigh      p95 latency >500ms              5m
DataflyKafkaConsumerLag          Consumer lag >10,000 messages   10m
DataflyKafkaConsumerLagCritical  Consumer lag >100,000 messages  5m
DataflyDeliveryRetryRate         >10% retry rate                 10m
DataflyDeliveryLatencyHigh       p95 vendor latency >5 seconds   5m
DataflyPodRestarts               >3 restarts in 10 minutes       0m
DataflyDLQDepthHigh              DLQ backlog >1,000 events       5m
DataflyRedisCommandLatency       p99 Redis latency >100ms        5m
DataflyMemoryHigh                Pod memory >90% of limit        10m

Info Alerts

Alert                    Condition                     For
DataflyMigrationRunning  Migration running >5 minutes  5m
DataflyScalingUp         HPA scaling up pods           0m
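
These rules ship as PrometheusRule resources. A representative (not verbatim) entry for the consumer-lag warning, reconstructed from the metric names and thresholds above, has this shape:

- alert: DataflyKafkaConsumerLag
  expr: max by (topic, group) (datafly_kafka_consumer_lag) > 10000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Kafka consumer lag above 10,000 messages"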

Grafana Dashboards

When grafanaDashboards.enabled: true, six dashboards are deployed:

  1. Datafly Signal Overview — event pipeline funnel, service health, Kafka lag, error rate trends
  2. Ingestion Gateway — RPS by pipeline, latency percentiles, error rate by status code
  3. Event Processor — throughput, Kafka lag by partition, transform duration, consent filtering
  4. Delivery Workers — per-vendor delivery rates, vendor API latency, retry/DLQ breakdown
  5. Infrastructure: Kafka — consumer lag, message throughput, producer/consumer errors
  6. Infrastructure: Database & Cache — PG pool utilisation, query latency, Redis commands/sec

Integration with External Monitoring

Any Prometheus-Compatible Tool

Datafly exposes standard Prometheus annotations on all pods:

prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"

Datadog, New Relic, Dynatrace, and Splunk all have built-in Prometheus scrapers that discover these annotations automatically.
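
If you run a plain Prometheus server rather than the Operator, the standard annotation-driven scrape job picks these pods up. This is the canonical Prometheus pattern, not Datafly-specific configuration:

scrape_configs:
  - job_name: datafly-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via the scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honour a custom metrics path if one is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__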

AWS CloudWatch

Use Amazon Managed Prometheus or the CloudWatch Agent to forward metrics from your cluster.

GCP Cloud Monitoring

GKE clusters with Managed Prometheus enabled automatically collect metrics from pods with Prometheus annotations.

Azure Monitor

Use Azure Monitor managed Prometheus with AKS.

Health Check Endpoints

All services expose:

Endpoint   Purpose            Checks
/healthz   Liveness probe     Process is alive
/readyz    Readiness probe    Dependencies reachable (DB, Kafka, Redis)
/versionz  Version info       Build version, git commit
/metrics   Prometheus scrape  All service metrics
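
Kubernetes liveness and readiness probes typically point at /healthz and /readyz; you can also exercise them by hand (host and port are illustrative):

curl -fsS http://localhost:8080/healthz   # non-zero exit if the process is unhealthy
curl -fsS http://localhost:8080/readyz    # non-zero exit if a dependency check fails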

Next Steps

  • Review Troubleshooting for resolving common monitoring alerts
  • Configure Backup & DR for disaster recovery procedures
  • See Upgrades for monitoring during upgrade rollouts