Observability

Datafly Signal exposes Prometheus-compatible metrics on every service and ships optional Grafana dashboards and alerting rules. This guide covers setting up monitoring for your deployment.

Metrics Overview

Every Datafly service exposes a /metrics endpoint on its service port with Prometheus-compatible metrics:

| Service           | Port | Key Metrics                                                  |
|-------------------|------|--------------------------------------------------------------|
| Ingestion Gateway | 8080 | Request rate, latency, error rate, payload size              |
| Event Processor   | 8081 | Events processed/sec, Kafka consumer lag, processing latency |
| Delivery Workers  | 8082 | Deliveries/sec, vendor latency, retry count, DLQ rate        |
| Identity Hub      | 8082 | Lookups/sec, cache hit rate, merge operations                |
| Management API    | 8083 | API request rate, latency, authentication failures           |
| Management UI     | 3000 | Request rate (static assets)                                 |
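
If you run a standalone Prometheus rather than the Operator, the endpoints above can also be scraped with a static config. A minimal sketch, assuming a `datafly` namespace and illustrative service names (both are assumptions, not guaranteed chart defaults):

```yaml
# prometheus.yml fragment: static scrape of two Datafly service ports.
# Service DNS names and the "datafly" namespace are illustrative.
scrape_configs:
  - job_name: datafly-ingestion
    metrics_path: /metrics
    static_configs:
      - targets: ["datafly-ingestion-gateway.datafly.svc:8080"]
  - job_name: datafly-processor
    metrics_path: /metrics
    static_configs:
      - targets: ["datafly-event-processor.datafly.svc:8081"]
```

With the Prometheus Operator installed, the ServiceMonitors described below replace this manual configuration.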

Prerequisites

Datafly's observability features require the Prometheus Operator (or kube-prometheus-stack) to be installed in your cluster. Datafly does not deploy Prometheus or Grafana itself; it ships ServiceMonitor, PrometheusRule, and dashboard resources that plug into your existing monitoring stack.

Install kube-prometheus-stack

If you don’t have a monitoring stack, install kube-prometheus-stack:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

Setting the two *SelectorNilUsesHelmValues flags to false lets Prometheus discover ServiceMonitors and PodMonitors from all namespaces, including the Datafly namespace.

Enabling Observability

Enable the observability features in your Helm values:

observability:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s          # Scrape interval (default: 15s)
    labels: {}             # Additional labels for ServiceMonitor discovery
  prometheusRules:
    enabled: true
  grafanaDashboards:
    enabled: true

This deploys:

  1. ServiceMonitor resources, one per Datafly service, telling Prometheus where to scrape metrics
  2. PrometheusRule resources with alerting rules for common failure conditions
  3. A ConfigMap containing Grafana dashboard JSON, auto-discovered by Grafana's sidecar
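
The rendered ServiceMonitors follow the usual Operator shape. As a sketch of what the Ingestion Gateway monitor might look like (the resource name, selector labels, and port name are assumptions based on common chart conventions, not verified chart output):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: datafly-ingestion-gateway        # illustrative name
  labels: {}                             # serviceMonitor.labels are merged here
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: datafly-ingestion-gateway  # assumed label
  endpoints:
    - port: http                         # assumed port name
      path: /metrics
      interval: 15s                      # from serviceMonitor.interval
```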

Alerting Rules

The following alerts are included when prometheusRules.enabled: true:

Critical Alerts

| Alert                  | Condition                                           | Severity |
|------------------------|-----------------------------------------------------|----------|
| DataflyIngestionDown   | Ingestion Gateway has 0 ready pods for 5 minutes    | critical |
| DataflyProcessingDown  | Event Processor has 0 ready pods for 5 minutes      | critical |
| DataflyDeliveryDown    | All Delivery Workers have 0 ready pods for 5 minutes| critical |
| DataflyHighErrorRate   | Ingestion error rate > 5% for 10 minutes            | critical |
| DataflyDLQBacklog      | Dead letter queue depth > 10,000 for 30 minutes     | critical |
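
As a reference for writing rules of your own, an alert like DataflyHighErrorRate could be expressed roughly as follows. This is a sketch, not the packaged rule: it assumes 5xx responses are recorded in the `status` label of the request-duration histogram documented under Custom Metrics.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: datafly-custom-alerts            # illustrative name
spec:
  groups:
    - name: datafly.ingestion
      rules:
        - alert: DataflyHighErrorRate
          expr: |
            sum(rate(datafly_ingestion_request_duration_seconds_count{status=~"5.."}[5m]))
              /
            sum(rate(datafly_ingestion_request_duration_seconds_count[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
```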

Warning Alerts

| Alert                     | Condition                                            | Severity |
|---------------------------|------------------------------------------------------|----------|
| DataflyHighLatency        | p99 ingestion latency > 500ms for 10 minutes         | warning  |
| DataflyKafkaLagHigh       | Consumer lag > 50,000 for 15 minutes                 | warning  |
| DataflyDeliveryRetryRate  | Delivery retry rate > 10% for 10 minutes             | warning  |
| DataflyPodRestarting      | Any Datafly pod restarted > 3 times in 15 minutes    | warning  |
| DataflyDiskPressure       | Kafka broker disk usage > 80%                        | warning  |
| DataflyMemoryHigh         | Any pod using > 90% of memory limit for 10 minutes   | warning  |
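
Similarly, DataflyKafkaLagHigh can be approximated from the consumer-lag gauge listed under Custom Metrics (again, a sketch; the packaged rule may use a different expression or label set):

```yaml
- alert: DataflyKafkaLagHigh
  expr: sum by (consumer_group, topic) (datafly_processor_kafka_lag) > 50000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Consumer lag above 50k on {{ $labels.topic }}"
```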

Info Alerts

| Alert                    | Condition                                   | Severity |
|--------------------------|---------------------------------------------|----------|
| DataflyMigrationRunning  | Migration init container running > 5 minutes| info     |
| DataflyScalingUp         | HPA scaling up pods                         | info     |

Grafana Dashboards

When grafanaDashboards.enabled: true, a ConfigMap containing Grafana dashboard JSON is deployed. If Grafana is configured with the dashboard sidecar (default in kube-prometheus-stack), dashboards are automatically imported.
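
The sidecar discovers dashboards by label. Conceptually, the ConfigMap Datafly ships looks like the sketch below; the ConfigMap name is illustrative, and the `grafana_dashboard` label key is the kube-prometheus-stack sidecar default:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datafly-dashboards        # illustrative name
  labels:
    grafana_dashboard: "1"        # default sidecar discovery label
data:
  datafly-overview.json: |
    { "title": "Datafly Signal Overview", "panels": [] }   # placeholder; real dashboard JSON goes here
```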

Dashboard: Datafly Signal Overview

The main dashboard provides:

  • Event Pipeline — ingestion rate, processing rate, delivery rate, end-to-end latency
  • Service Health — pod status, restart count, resource usage per service
  • Kafka — consumer lag per topic, partition distribution, throughput
  • Delivery — success/failure rate per vendor, retry breakdown, DLQ depth
  • Infrastructure — CPU, memory, network I/O per pod

Dashboard: Datafly Delivery Detail

Per-vendor delivery metrics:

  • Delivery latency histogram (p50, p90, p99)
  • HTTP status code breakdown
  • Retry attempts distribution
  • Circuit breaker state

Custom Metrics

Datafly services expose the following custom Prometheus metrics that you can use in your own dashboards and alerts:

Ingestion Gateway

# Event ingestion
datafly_ingestion_events_total{source, type}
datafly_ingestion_events_bytes_total{source}
datafly_ingestion_request_duration_seconds{method, path, status}

# Validation
datafly_ingestion_validation_errors_total{source, reason}
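
These counters compose naturally into recording rules, for example an overall events-per-second rate and a validation-error ratio. The rule names below are suggestions, not shipped defaults:

```yaml
groups:
  - name: datafly.ingestion.recording
    rules:
      - record: datafly:ingestion_events:rate5m
        expr: sum(rate(datafly_ingestion_events_total[5m]))
      - record: datafly:ingestion_validation_error_ratio:rate5m
        expr: |
          sum(rate(datafly_ingestion_validation_errors_total[5m]))
            /
          sum(rate(datafly_ingestion_events_total[5m]))
```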

Event Processor

# Processing pipeline
datafly_processor_events_total{org, pipeline, status}
datafly_processor_duration_seconds{org, pipeline}
datafly_processor_transformations_total{type, status}

# Kafka consumer
datafly_processor_kafka_lag{topic, partition, consumer_group}
datafly_processor_kafka_messages_total{topic}

Delivery Workers

# Delivery
datafly_delivery_total{vendor, status}
datafly_delivery_duration_seconds{vendor}
datafly_delivery_retries_total{vendor}
datafly_delivery_dlq_total{vendor, reason}

# Circuit breaker
datafly_delivery_circuit_state{vendor}  # 0=closed, 1=half-open, 2=open
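
The circuit-breaker gauge lends itself to a simple alert whenever any vendor's circuit opens. A sketch, not one of the packaged rules:

```yaml
- alert: DataflyCircuitOpen
  expr: datafly_delivery_circuit_state == 2   # 2 = open, per the gauge's encoding
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Delivery circuit open for vendor {{ $labels.vendor }}"
```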

Identity Hub

# Identity resolution
datafly_identity_lookups_total{status}
datafly_identity_merges_total
datafly_identity_cache_hits_total
datafly_identity_cache_misses_total
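
Cache effectiveness is the usual hits / (hits + misses) ratio, which can be captured as a recording rule (the rule name is a suggestion):

```yaml
- record: datafly:identity_cache_hit_ratio:rate5m
  expr: |
    sum(rate(datafly_identity_cache_hits_total[5m]))
      /
    (sum(rate(datafly_identity_cache_hits_total[5m]))
     + sum(rate(datafly_identity_cache_misses_total[5m])))
```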

Integration with Cloud Monitoring

AWS CloudWatch

Use the CloudWatch Agent or Amazon Managed Prometheus to forward metrics from your cluster to CloudWatch.

GCP Cloud Monitoring

GKE clusters with Google Cloud Managed Service for Prometheus enabled collect Prometheus metrics from workloads you target with PodMonitoring resources, without deploying your own Prometheus.

Azure Monitor

Use Azure Monitor managed service for Prometheus with AKS to collect metrics without deploying your own Prometheus.

Health Check Endpoints

All services expose health check endpoints used by Kubernetes probes:

| Endpoint  | Purpose           | Checks                                        |
|-----------|-------------------|-----------------------------------------------|
| /healthz  | Liveness probe    | Process is alive                              |
| /readyz   | Readiness probe   | Dependencies are reachable (DB, Kafka, Redis) |
| /metrics  | Prometheus scrape | All metrics                                   |
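
In a pod spec, these endpoints map onto standard Kubernetes probes. A sketch for the Ingestion Gateway, with illustrative port and timing values (not chart defaults):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```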

Next Steps

  • Review Troubleshooting for resolving common monitoring alerts
  • Configure Backup & DR for disaster recovery procedures
  • See Upgrades for monitoring during upgrade rollouts