Observability
Datafly Signal provides two layers of observability: a built-in monitoring and alerting system that works with any deployment (including single-server setups), and Prometheus-compatible metrics that integrate with your existing monitoring stack (Prometheus, Grafana, Datadog, New Relic, etc.).
Built-in Monitoring & Alerting
Every Datafly Signal deployment includes a built-in monitoring and alerting system that requires no external tools. This is available from the Management UI under Observe > Monitoring and Observe > Alerts.
Monitoring Dashboard
The Monitoring page provides a real-time health overview of your deployment:
- Summary cards — total events, success rate, error rate, vendor delivery rate, DLQ pending, active alerts
- Event timeline — stacked area chart showing delivered, failed, errored, and filtered events over time
- Integration delivery status — per-vendor delivery success rates with colour-coded health indicators
- Pipeline health — per-pipeline success/error rates and DLQ counts
- Event type breakdown — distribution of event types (page, track, identify, etc.)
Time ranges: 1 hour, 6 hours, 24 hours, 7 days, 30 days, or a custom date range. The view can be filtered by pipeline and auto-refreshes every 60 seconds.
Alert Rules
Alert rules evaluate every 60 seconds against your event statistics. When a threshold is breached, notifications are sent to configured channels.
Available metrics:
| Metric | Description | Example |
|---|---|---|
| delivery_success_rate | Ratio of successfully delivered events to total events | Alert when < 90% |
| vendor_success_rate | Per-integration delivery success rate | Alert when Meta CAPI < 80% |
| error_rate | Ratio of errored events to total events | Alert when > 5% |
| dlq_depth | Number of pending dead letter queue events | Alert when > 100 |
| zero_traffic | Total event count in the window (zero = problem) | Alert when = 0 for 30 min |
| consent_filter_rate | Ratio of consent-filtered events to total | Monitor for unexpected spikes |
Operators: Less than, Greater than, Less than or equal, Greater than or equal, Equal to.
Scoping: Rules can be scoped to the entire organisation, a specific pipeline, a specific integration, or a vendor type.
Cooldowns: Each rule has a configurable cooldown period (default: 60 minutes) to prevent alert storms.
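As an illustration, a pipeline-scoped rule expressed as JSON might look like the following. The field names and the pipeline identifier are illustrative, not a documented schema; rules are created in the Management UI under Observe > Alerts.
```json
{
  "name": "DLQ Depth High (web-prod)",
  "metric": "dlq_depth",
  "operator": "greater_than",
  "threshold": 100,
  "window_minutes": 5,
  "scope": { "pipeline": "web-prod" },
  "severity": "warning",
  "cooldown_minutes": 60
}
```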
Default rules are created for every new organisation:
| Rule | Metric | Condition | Severity |
|---|---|---|---|
| Delivery Success Rate Low | delivery_success_rate | < 90% over 15 min | Warning |
| Delivery Success Rate Critical | delivery_success_rate | < 50% over 5 min | Critical |
| DLQ Depth High | dlq_depth | > 100 over 5 min | Warning |
| Zero Traffic | zero_traffic | = 0 over 30 min | Critical |
| Error Rate High | error_rate | > 5% over 15 min | Warning |
Notification Channels
Alert notifications can be sent via:
Webhook — Send a JSON payload to any URL. Useful for integrating with PagerDuty, OpsGenie, custom systems, or internal tools. Example payload:
```json
{
  "alert_id": "abc-123",
  "rule_name": "Delivery Success Rate Low",
  "severity": "warning",
  "metric": "delivery_success_rate",
  "value": 0.85,
  "threshold": 0.9,
  "message": "Delivery Success Rate Low: 85.0% is below 90.0% threshold",
  "fired_at": "2026-04-06T14:30:00Z"
}
```
Slack — Send formatted messages to a Slack channel via incoming webhook. Includes severity badges, metric details, and timestamps.
Email — Send alerts via a webhook-based email service (SendGrid, Resend, Postmark). Configure recipients and the email service webhook URL.
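As a sketch of the receiving end, a minimal Go service that accepts the webhook payload shown above and hands it to a paging system might look like this (the route, port, and forwarding step are placeholders):
```go
// Minimal sketch of a receiver for Datafly alert webhook notifications.
// Payload fields match the JSON example above; forwarding to PagerDuty
// or similar is represented by a log statement.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type dataflyAlert struct {
	AlertID   string  `json:"alert_id"`
	RuleName  string  `json:"rule_name"`
	Severity  string  `json:"severity"`
	Metric    string  `json:"metric"`
	Value     float64 `json:"value"`
	Threshold float64 `json:"threshold"`
	Message   string  `json:"message"`
	FiredAt   string  `json:"fired_at"`
}

func main() {
	// Route and port are arbitrary; point the Datafly webhook channel here.
	http.HandleFunc("/datafly/alerts", func(w http.ResponseWriter, r *http.Request) {
		var a dataflyAlert
		if err := json.NewDecoder(r.Body).Decode(&a); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		// Forward to your paging system here.
		log.Printf("[%s] %s: %s", a.Severity, a.RuleName, a.Message)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```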
Alert History
All fired alerts are recorded with:
- When the alert fired and (if applicable) when it resolved
- The metric value at the time of firing
- Which notification channels were notified
- Acknowledgement status (who acknowledged and when)
Prometheus Metrics
Every Datafly service exposes a /metrics endpoint with Prometheus-compatible metrics. This is the universal interface — any monitoring tool that supports Prometheus (Grafana, Datadog, New Relic, Dynatrace, Splunk, CloudWatch, GCP Managed Prometheus) can scrape these endpoints.
| Service | Port | Key Metrics |
|---|---|---|
| Ingestion Gateway | 8080 | Request rate, latency, error rate, payload size |
| Event Processor | 8081 | Events processed/sec, Kafka consumer lag, processing latency |
| Delivery Workers | 8082 | Deliveries/sec, vendor latency, retry count, DLQ rate |
| Identity Hub | 8082 | Lookups/sec, cache hit rate |
| Management API | 8083 | API request rate, latency, authentication failures |
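To spot-check an endpoint directly, for example the Ingestion Gateway on port 8080 (via kubectl port-forward or local access):
```sh
curl -s http://localhost:8080/metrics | grep '^datafly_'
```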
Standard Metrics (all services)
```
datafly_info{service, go_version} # Build metadata (gauge, always 1)
datafly_uptime_seconds{service} # Time since process start
datafly_goroutines # Current goroutine count
datafly_http_requests_total{service, handler, method, status_code}
datafly_http_request_duration_seconds{service, handler, method}
datafly_http_active_connections # In-flight connections
```
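Assuming datafly_http_request_duration_seconds is exported as a Prometheus histogram, p95 request latency per handler can be derived with a standard histogram_quantile query (the service label value is illustrative):
```promql
histogram_quantile(0.95,
  sum by (handler, le) (
    rate(datafly_http_request_duration_seconds_bucket{service="ingestion-gateway"}[5m])
  )
)
```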
Pipeline Metrics
```
datafly_events_ingested_total{pipeline, type} # Events received
datafly_events_processed_total{pipeline, status} # Events processed (ok/filtered/error)
datafly_events_delivered_total{pipeline, vendor, integration_id, status}
datafly_delivery_latency_seconds{vendor} # End-to-end delivery latency
datafly_delivery_vendor_latency_seconds{vendor} # Vendor API response time
datafly_consent_filtered_total{pipeline, vendor, category}
datafly_dlq_depth{pipeline} # Pending DLQ events
datafly_dlq_events_total{pipeline, reason} # DLQ events written
```
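These counters are the raw material for the built-in delivery_success_rate metric. A rough PromQL equivalent, assuming a status label value of "success" (the actual label values are not confirmed here):
```promql
sum(rate(datafly_events_delivered_total{status="success"}[15m]))
  / sum(rate(datafly_events_delivered_total[15m]))
```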
Infrastructure Metrics
```
# Kafka
datafly_kafka_messages_consumed_total{topic, group}
datafly_kafka_consumer_lag{topic, group}
datafly_kafka_consumer_errors_total{topic, group}
datafly_kafka_messages_produced_total{topic}
# PostgreSQL
datafly_pg_pool_total_conns{service}
datafly_pg_pool_idle_conns{service}
datafly_pg_pool_max_conns{service}
# Redis
datafly_redis_commands_total{service, command}
datafly_redis_command_duration_seconds{service, command}
```
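The pool gauges combine into the utilisation ratio that the DataflyPGPoolExhaustionCritical alert (described below) keys on:
```promql
datafly_pg_pool_total_conns / datafly_pg_pool_max_conns
```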
Kubernetes: Helm Observability
The Helm observability features require the Prometheus Operator (kube-prometheus-stack) to be installed in your cluster. Datafly does not deploy Prometheus or Grafana — it provides the CRDs that integrate with your existing monitoring stack.
Install kube-prometheus-stack
If you don’t have a monitoring stack:
```sh
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
```
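Before enabling the Datafly observability templates, you can confirm the Operator's CRDs are present:
```sh
kubectl get crd servicemonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com
```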
Enable Observability
```yaml
observability:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s
    labels: {}
  prometheusRules:
    enabled: true
  grafanaDashboards:
    enabled: true
```
This deploys:
- ServiceMonitor CRDs — one per Datafly service, telling Prometheus where to scrape
- PrometheusRule CRDs — alerting rules for common failure conditions
- ConfigMap — Grafana dashboard JSON, auto-discovered by Grafana’s sidecar
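Apply the values with a normal Helm upgrade. The release name, chart reference, and namespace below are placeholders; substitute your own:
```sh
helm upgrade datafly-signal datafly/datafly-signal \
  --namespace datafly \
  --values values.yaml
```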
Prometheus Alerting Rules
Critical Alerts
| Alert | Condition | For |
|---|---|---|
| DataflyIngestionDown | Ingestion Gateway has 0 ready pods | 2m |
| DataflyProcessingDown | Event Processor has 0 ready pods | 2m |
| DataflyDeliveryDown | All Delivery Workers down | 2m |
| DataflyIngestionErrorRate | >5% ingestion errors over 5 min | 5m |
| DataflyDeliveryFailureRate | >10% delivery failures over 15 min | 15m |
| DataflyDeliveryVendorDown | >50% failure rate for a specific vendor | 5m |
| DataflyDLQDepthCritical | DLQ backlog >10,000 events | 5m |
| DataflyPGPoolExhaustionCritical | PG pool >95% utilised | 2m |
Warning Alerts
| Alert | Condition | For |
|---|---|---|
| DataflyIngestionLatencyHigh | p95 latency >500ms | 5m |
| DataflyKafkaConsumerLag | Consumer lag >10,000 messages | 10m |
| DataflyKafkaConsumerLagCritical | Consumer lag >100,000 messages | 5m |
| DataflyDeliveryRetryRate | >10% retry rate | 10m |
| DataflyDeliveryLatencyHigh | p95 vendor latency >5 seconds | 5m |
| DataflyPodRestarts | >3 restarts in 10 minutes | 0m |
| DataflyDLQDepthHigh | DLQ backlog >1,000 events | 5m |
| DataflyRedisCommandLatency | p99 Redis latency >100ms | 5m |
| DataflyMemoryHigh | Pod memory >90% of limit | 10m |
Info Alerts
| Alert | Condition | For |
|---|---|---|
| DataflyMigrationRunning | Migration running >5 minutes | 5m |
| DataflyScalingUp | HPA scaling up pods | 0m |
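For reference, a standalone rule equivalent to DataflyDLQDepthHigh would look roughly like the following PrometheusRule. The chart generates these for you; the expression and labels here are a sketch, not the chart's exact output:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: datafly-dlq-example
spec:
  groups:
    - name: datafly.dlq
      rules:
        - alert: DataflyDLQDepthHigh
          expr: sum by (pipeline) (datafly_dlq_depth) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "DLQ backlog for pipeline {{ $labels.pipeline }} exceeds 1,000 events"
```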
Grafana Dashboards
When `grafanaDashboards.enabled: true` is set, six dashboards are deployed:
- Datafly Signal Overview — event pipeline funnel, service health, Kafka lag, error rate trends
- Ingestion Gateway — RPS by pipeline, latency percentiles, error rate by status code
- Event Processor — throughput, Kafka lag by partition, transform duration, consent filtering
- Delivery Workers — per-vendor delivery rates, vendor API latency, retry/DLQ breakdown
- Infrastructure: Kafka — consumer lag, message throughput, producer/consumer errors
- Infrastructure: Database & Cache — PG pool utilisation, query latency, Redis commands/sec
Integration with External Monitoring
Any Prometheus-Compatible Tool
Datafly exposes standard Prometheus annotations on all pods:
```yaml
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
```
Datadog, New Relic, Dynatrace, and Splunk all have built-in Prometheus scrapers that discover these annotations automatically.
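If you run a self-managed Prometheus without the Operator, the standard annotation-driven scrape configuration picks these pods up. A typical job (a sketch of the canonical Kubernetes relabelling pattern) looks like:
```yaml
- job_name: datafly-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Honour the prometheus.io/path annotation
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Honour the prometheus.io/port annotation
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: "$1:$2"
      target_label: __address__
```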
AWS CloudWatch
Use Amazon Managed Prometheus or the CloudWatch Agent to forward metrics from your cluster.
GCP Cloud Monitoring
GKE clusters with Managed Prometheus enabled automatically collect metrics from pods with Prometheus annotations.
Azure Monitor
Use Azure Monitor managed Prometheus with AKS.
Health Check Endpoints
All services expose:
| Endpoint | Purpose | Checks |
|---|---|---|
| /healthz | Liveness probe | Process is alive |
| /readyz | Readiness probe | Dependencies reachable (DB, Kafka, Redis) |
| /versionz | Version info | Build version, git commit |
| /metrics | Prometheus scrape | All service metrics |
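These endpoints slot directly into Kubernetes probe configuration. A minimal example for a pod serving on port 8080 (such as the Ingestion Gateway):
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
```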
Next Steps
- Review Troubleshooting for resolving common monitoring alerts
- Configure Backup & DR for disaster recovery procedures
- See Upgrades for monitoring during upgrade rollouts