Architecture
This page explains how Datafly Signal’s components work together, how data flows through the system, and the design principles behind the platform.
Data Flow
```
            Customer's Subdomain
          (DNS A record → cluster)
                                         |
┌──────────────┐   HTTP POST    ┌────────┴────────┐
│              │ ─────────────► │    Ingestion    │
│   Browser    │                │     Gateway     │──► Set-Cookie: _dfid
│              │ ◄───────────── │   (port 8080)   │
│  Datafly.js  │  200 + cookie  │                 │
│ (under 8KB)  │                └────────┬────────┘
└──────────────┘                         │
                                Kafka: raw-events
                                         │
                                ┌────────┴────────┐
                                │ Event Processor │
                                │   (port 8081)   │
                                │                 │
                                │ ┌─────────────┐ │
                                │ │  Org Data   │ │   Layer 1: tenant-wide governance
                                │ │   Layer     │ │   (validation, PII, consent, enrichment)
                                │ └──────┬──────┘ │
                                │ ┌──────┴──────┐ │
                                │ │  Pipeline   │ │   Layer 2: per-vendor transformation
                                │ │   Engine    │ │   (field mapping, formatting, custom logic)
                                │ └─────────────┘ │
                                └────────┬────────┘
                                         │
                         Kafka: delivery-{integration_id}
                                         │
                                ┌────────┴────────┐
                                │    Delivery     │
                                │     Workers     │──► Google Analytics 4 (Measurement Protocol)
                                │   (port 8082)   │──► Meta / Facebook (Conversions API)
                                │                 │──► TikTok (Events API)
                                │   Per-vendor    │──► Google Ads (Enhanced Conversions)
                                │ formatters with │──► Pinterest, Snapchat, LinkedIn ...
                                │  retry + rate   │──► Custom Webhooks
                                │    limiting     │──► BigQuery, Snowflake, S3 ...
                                └─────────────────┘
```
Supporting services:

```
┌─────────────────┐    ┌──────────────────┐
│  Identity Hub   │    │  Management API  │
│   (port 8083)   │    │   (port 8084)    │
│                 │    │                  │
│  Cross-domain   │    │ REST + WebSocket │──► Management UI
│  identity via   │    │ Admin CRUD, RBAC │    (port 3000)
│   encrypted     │    │  Real-time event │
│     tokens      │    │     debugger     │
└─────────────────┘    └──────────────────┘
```

Components
Datafly.js (Client-Side Collector)
A lightweight JavaScript SDK (under 8KB gzipped) loaded on the customer’s website. It:
- Collects page views, custom events, and user identifications
- Captures ad click IDs from URL parameters (`gclid`, `fbclid`, `ttclid`, etc.)
- Generates or reads the `_dfid` anonymous ID
- Reads consent state from the customer's consent management platform
- Sends events via HTTP POST to the customer’s own subdomain endpoint
- Stores vendor IDs in IndexedDB and first-party cookies
Datafly.js replaces all client-side vendor tags. The customer loads a single script instead of Google Analytics, Meta Pixel, TikTok Pixel, and others.
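As a sketch of what a collector call assembles before POSTing, assuming hypothetical names (`buildTrackEvent`, a `/v1/events` path) that are not part of the actual SDK API:

```typescript
import { randomUUID } from "node:crypto"; // in the browser: crypto.randomUUID()

// Envelope fields follow the Event Schema section of this page.
interface EventEnvelope {
  anonymous_id: string;
  event: string;
  type: "page" | "track" | "identify" | "group";
  properties: Record<string, unknown>;
  timestamp: string;
  sent_at: string;
  message_id: string;
}

function buildTrackEvent(
  anonymousId: string,
  event: string,
  properties: Record<string, unknown>,
): EventEnvelope {
  const now = new Date().toISOString();
  return {
    anonymous_id: anonymousId,
    event,
    type: "track",
    properties,
    timestamp: now,           // client-side event time
    sent_at: now,             // used server-side for clock drift correction
    message_id: randomUUID(), // deduplication key
  };
}

// The SDK would then POST this to the first-party endpoint, e.g.:
// fetch("https://data.customer.com/v1/events", {
//   method: "POST",
//   credentials: "include", // so the _dfid cookie rides along
//   body: JSON.stringify(envelope),
// });
```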
Ingestion Gateway
The HTTP entry point for all events. Responsibilities:
- Receives events from Datafly.js and server-side API clients
- Validates the pipeline key (HMAC verification)
- Sets the `_dfid` anonymous ID cookie via a `Set-Cookie` header (ITP-exempt first-party cookie)
- Generates and sets vendor ID cookies (GA4 `_ga`, Meta `_fbp`, TikTok `_ttp`) via `Set-Cookie`
- Enriches events with IP geolocation (MaxMind GeoLite2)
- Publishes validated events to the `raw-events` Kafka topic
- Serves the Datafly.js collector file at configurable paths
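The pipeline-key HMAC check can be sketched as follows, assuming the key is `hex(HMAC-SHA256(secret, source_id))`; the actual key format and signed payload are not specified here:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative key derivation: one pipeline key per source, derived from a
// server-side secret. Not the documented scheme, just a plausible shape.
function derivePipelineKey(secret: string, sourceId: string): string {
  return createHmac("sha256", secret).update(sourceId).digest("hex");
}

function verifyPipelineKey(
  secret: string,
  sourceId: string,
  presentedKey: string,
): boolean {
  const expected = Buffer.from(derivePipelineKey(secret, sourceId), "hex");
  const presented = Buffer.from(presentedKey, "hex");
  if (presented.length !== expected.length) return false;
  // Constant-time comparison avoids leaking key bytes via timing.
  return timingSafeEqual(expected, presented);
}
```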
Event Processor
The core processing engine. Consumes events from `raw-events` and applies two layers of transformation.
Layer 1 — Org Data Layer (tenant-wide):
Runs on every event from every source before any pipeline-specific logic. Steps in order:
- Schema validation
- Field standardisation (normalise names across sources)
- Data type enforcement
- Value normalisation (country codes, currencies, emails)
- Data cleansing (trim, strip HTML, remove empties)
- PII classification and handling (hash, mask, or strip per policy)
- Global enrichments (vendor ID injection from Redis, GeoIP, user agent parsing, session stitching, identity resolution)
- Consent enforcement (verify consent state, strip non-consented data)
- Event fan-out (one-to-many splitting for retail media use cases)
- Field removal (strip fields that must never reach integrations)
- Audit trail (attach processing receipt)
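As one concrete instance of the PII step, a hash/mask/strip policy applied to an email field might look like this; the normalisation rules shown are assumptions, not the documented policy:

```typescript
import { createHash } from "node:crypto";

// Normalise first so the same address always hashes identically.
function normaliseEmail(email: string): string {
  return email.trim().toLowerCase();
}

function hashPii(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

// Apply the per-field policy: hash, mask, or strip (remove entirely).
function applyPiiPolicy(
  field: string,
  value: string,
  policy: "hash" | "mask" | "strip",
): string | null {
  switch (policy) {
    case "hash":
      return hashPii(field === "email" ? normaliseEmail(value) : value);
    case "mask":
      return value.replace(/./g, "*");
    case "strip":
      return null; // caller drops the field from the event
  }
}
```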
Layer 2 — Pipeline Transformation Engine (per-vendor):
Per-source, per-integration data shaping:
- Pipeline-global transformations
- Pipeline enrichments
- Per-integration field mapping
- Per-integration enrichments
- Per-integration PII handling (e.g., SHA-256 hashing for Meta/TikTok)
- Per-integration custom logic (sandboxed expressions)
- Output validation
After processing, events are published to per-integration `delivery-{integration_id}` Kafka topics.
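The per-integration field mapping step can be sketched as a config-driven projection; the dotted-path convention and the GA4-style target names in the usage below are illustrative, not the actual configuration format:

```typescript
// A mapping from canonical envelope paths to vendor parameter names.
type FieldMapping = Record<string, string>;

// Resolve a dotted path like "properties.price" against a nested object.
function getPath(obj: unknown, path: string): unknown {
  return path
    .split(".")
    .reduce<any>((o, key) => (o == null ? undefined : o[key]), obj);
}

// Project an event into a vendor-shaped payload, skipping absent fields.
function mapFields(
  event: Record<string, unknown>,
  mapping: FieldMapping,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [source, target] of Object.entries(mapping)) {
    const value = getPath(event, source);
    if (value !== undefined) out[target] = value;
  }
  return out;
}
```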
Delivery Workers
Per-vendor workers that consume from `delivery-*` topics and deliver events to vendor APIs server-to-server. Each vendor worker handles:
- API-specific payload formatting
- Authentication (API keys, OAuth tokens, etc.)
- Retry with exponential backoff
- Rate limiting per vendor API constraints
- Dead letter queue for persistent failures
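The retry policy can be sketched as exponential backoff with full jitter and a dead-letter cutoff; the base delay, cap, and attempt limit are illustrative values, not documented defaults:

```typescript
// Full-jitter exponential backoff: delay is uniform in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 60_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

const MAX_ATTEMPTS = 8;

// After the final attempt, the event goes to the dead letter queue
// with its error details, as described above.
function shouldDeadLetter(attempt: number): boolean {
  return attempt >= MAX_ATTEMPTS;
}
```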
Identity Hub
Handles cross-domain identity resolution. When a customer operates multiple domains, the Identity Hub enables identity stitching across them using encrypted tokens. The hub communicates with the browser via a hidden iframe on a shared domain.
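One way such tokens can be implemented is authenticated encryption. This sketch uses AES-256-GCM with an `iv.ciphertext.tag` base64url layout; both the cipher choice and the token layout are assumptions, not the hub's documented format:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Seal the anonymous ID so it can cross domains via the shared-domain
// iframe without being readable or forgeable in transit.
function sealToken(key: Buffer, anonymousId: string): string {
  const iv = randomBytes(12); // fresh nonce per token
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(anonymousId, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, ct, tag].map((b) => b.toString("base64url")).join(".");
}

// Open a token; throws if the ciphertext or tag was tampered with.
function openToken(key: Buffer, token: string): string {
  const [iv, ct, tag] = token.split(".").map((p) => Buffer.from(p, "base64url"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```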
Management API
REST and WebSocket API that powers the Management UI and provides programmatic access:
- Organisation, user, and RBAC management
- Source and integration CRUD
- Pipeline and transformation configuration
- Audit logging
- Real-time event debugger (WebSocket stream of live events)
Management UI
A Next.js 15 admin dashboard for configuring and monitoring the platform:
- Source management (pipeline keys, script settings, vendor selection)
- Integration configuration (vendor credentials, delivery settings)
- Pipeline builder (transformation rules, field mappings)
- Real-time event debugger (live WebSocket event stream)
- User and role management
Infrastructure Components
Apache Kafka
Kafka is the event streaming backbone. It decouples ingestion from processing and processing from delivery, enabling independent scaling of each stage.
Topics:
| Topic | Producer | Consumer | Purpose |
|---|---|---|---|
| `raw-events` | Ingestion Gateway | Event Processor | Unprocessed ingested events |
| `delivery-{integration_id}` | Event Processor | Delivery Workers | Transformed events ready for vendor delivery |
| `dead-letter-queue` | Delivery Workers | Monitoring / manual replay | Failed events with error details |
The Kafka abstraction layer in the shared library is designed to support a future swap to NATS JetStream for lighter deployments.
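A minimal sketch of what such a broker-agnostic interface could look like; the `EventProducer` name and `InMemoryProducer` test double are assumptions, not the shared library's actual API:

```typescript
// Hypothetical broker-agnostic producer: Kafka today, NATS JetStream
// later, without touching the services that publish.
interface EventProducer {
  publish(topic: string, key: string, payload: Uint8Array): Promise<void>;
}

// Topic naming convention from the table above.
function deliveryTopic(integrationId: string): string {
  return `delivery-${integrationId}`;
}

// In-memory stand-in, useful for testing publishing code.
class InMemoryProducer implements EventProducer {
  readonly topics = new Map<string, Uint8Array[]>();
  async publish(topic: string, _key: string, payload: Uint8Array): Promise<void> {
    const queue = this.topics.get(topic) ?? [];
    queue.push(payload);
    this.topics.set(topic, queue);
  }
}
```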
Redis
Redis serves multiple purposes across the platform:
| Usage | Key Pattern | TTL | Service |
|---|---|---|---|
| Vendor ID store | `vendor_ids:{anonymous_id}` | 400 days | Ingestion Gateway, Event Processor |
| Identity graph | `identity:{anonymous_id}` | 400 days | Event Processor, Identity Hub |
| Session data | `session:{anonymous_id}` | 30 minutes | Event Processor |
| Rate limiting | `ratelimit:{source_id}:{window}` | Per window | Ingestion Gateway |
| Config cache | `config:{type}:{id}` | 5 minutes | All services |
Redis is configured with `appendonly yes` for durability and `allkeys-lru` eviction for memory management.
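The key patterns and TTLs in the table translate directly into small helpers; `redisKeys` and `ttlSeconds` are illustrative names, not the platform's actual code:

```typescript
const DAY = 86_400; // seconds, as Redis EXPIRE expects

// Key builders matching the key patterns above.
const redisKeys = {
  vendorIds: (anonymousId: string) => `vendor_ids:${anonymousId}`,
  identity: (anonymousId: string) => `identity:${anonymousId}`,
  session: (anonymousId: string) => `session:${anonymousId}`,
  rateLimit: (sourceId: string, window: number) => `ratelimit:${sourceId}:${window}`,
  config: (type: string, id: string) => `config:${type}:${id}`,
};

// TTLs from the table, in seconds.
const ttlSeconds = {
  vendorIds: 400 * DAY, // 400-day vendor ID retention
  identity: 400 * DAY,
  session: 30 * 60,
  configCache: 5 * 60,
};
```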
PostgreSQL
PostgreSQL is the primary relational store for all configuration and operational data:
| Table | Purpose |
|---|---|
| `organisations` | Tenant accounts |
| `users` | Admin users with hashed passwords |
| `roles` | RBAC role definitions |
| `role_permissions` | Permission grants per role |
| `sources` | Data collection endpoints with pipeline keys and config |
| `integrations` | Vendor destinations with credentials and settings |
| `transformation_files` | YAML/JSON pipeline transformation configs |
| `org_data_layer` | Tenant-wide processing rules |
| `api_keys` | API key management |
| `audit_logs` | Change history for compliance |
| `environments` | Environment definitions (dev, staging, prod) |
Deployment Model
Single-Tenant Architecture
Each customer gets an isolated deployment. In Kubernetes, this means one namespace per customer with dedicated instances of every service. This provides:
- Complete data isolation — no shared databases, no shared Kafka topics
- Independent scaling — each customer’s event volume is handled by their own resources
- Regulatory compliance — data residency and sovereignty requirements met per-customer
- Independent upgrades — customers can be upgraded on different schedules
First-Party Data Collection
All data flows through the customer’s own subdomain. The customer creates a DNS A record (not a CNAME) pointing their data collection subdomain to the cluster’s IP address:
`data.customer.com → A record → cluster IP`

Using an A record instead of a CNAME is important for Safari ITP compliance. Safari treats CNAME-cloaked third-party domains differently from true first-party A records. With an A record, cookies set by the Ingestion Gateway are treated as genuine first-party cookies with no expiry restrictions.
This means:
- The browser sends events to `data.customer.com`, a domain the customer owns
- Cookies are set on the customer's domain via `Set-Cookie` headers
- No third-party domain appears in network requests
- Ad blockers see only first-party traffic
- The Datafly.js script is served from the same subdomain (configurable filename to avoid pattern-matching blockers)
Server-Set Cookies
The `_dfid` anonymous ID and vendor ID cookies (`_ga`, `_fbp`, `_ttp`, etc.) are set by the Ingestion Gateway via `Set-Cookie` response headers, not by JavaScript. This is significant because:
- Safari ITP limits JavaScript-set cookies to 7 days. Server-set cookies on a first-party A record domain have no such limitation.
- Cookie consistency — the server is the source of truth for identity, not the browser.
- Vendor IDs are stored server-side in Redis, not only in browser cookies. Even if cookies are cleared, the server retains the mapping via the `_dfid` anonymous ID.
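For illustration, a server-set identity cookie header might be built like this; the attribute choices (`Secure`, `SameSite=Lax`, no `HttpOnly` so the SDK can read the ID) and the 400-day lifetime mirroring the Redis vendor ID retention are assumptions:

```typescript
// Build a Set-Cookie header value for a server-set identity cookie.
function buildIdCookie(name: string, value: string, maxAgeDays = 400): string {
  const maxAge = maxAgeDays * 86_400; // Max-Age is in seconds
  return `${name}=${value}; Max-Age=${maxAge}; Path=/; Secure; SameSite=Lax`;
}

// e.g. buildIdCookie("_dfid", anonymousId) on the first response, plus
// buildIdCookie("_ga", ...) and buildIdCookie("_fbp", ...) as needed.
```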
Event Schema
Every event flowing through the system follows a common envelope:
```json
{
  "anonymous_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user-123",
  "event": "Product Purchased",
  "type": "track",
  "properties": {
    "product_id": "SKU-123",
    "price": 49.99,
    "currency": "USD"
  },
  "context": {
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 ...",
    "locale": "en-US",
    "page": {
      "url": "https://shop.example.com/checkout",
      "title": "Checkout",
      "referrer": "https://shop.example.com/cart"
    },
    "consent": {
      "analytics": true,
      "marketing": false
    },
    "vendor_ids": {
      "ga_client_id": "1234567890.1706000000",
      "fbp": "fb.1.1706000000.1234567890",
      "ttp": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
    }
  },
  "timestamp": "2026-02-25T10:30:00.000Z",
  "sent_at": "2026-02-25T10:30:00.050Z",
  "message_id": "msg-uuid-here"
}
```

| Field | Description |
|---|---|
| `anonymous_id` | Datafly-generated UUID, present on every event |
| `user_id` | Customer-set identifier, present after `identify()` |
| `event` | Event name (`page` for page views, custom name for `track`) |
| `type` | One of `page`, `track`, `identify`, `group` |
| `properties` | Freeform key-value data specific to the event |
| `context` | Automatically collected metadata (IP, UA, page, consent, vendor IDs) |
| `timestamp` | Client-side event timestamp (ISO 8601) |
| `sent_at` | When the collector sent the event (used for clock drift correction) |
| `message_id` | Unique ID for deduplication |
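The required-field rules implied by this table can be sketched as a first validation pass, with `message_id` driving deduplication; the field choices and the in-memory `Set` are illustrative stand-ins for the Event Processor's actual schema validation and state:

```typescript
// Fields assumed required on every envelope (user_id appears only after identify()).
const REQUIRED = ["anonymous_id", "event", "type", "timestamp", "message_id"] as const;
const TYPES = new Set(["page", "track", "identify", "group"]);

// Return a list of validation errors; empty means the envelope is acceptable.
function validateEnvelope(e: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of REQUIRED) {
    if (e[field] == null) errors.push(`missing ${field}`);
  }
  if (typeof e.type === "string" && !TYPES.has(e.type)) {
    errors.push(`unknown type ${e.type}`);
  }
  return errors;
}

// message_id deduplication: drop any envelope whose ID was already seen.
const seen = new Set<string>();
function isDuplicate(messageId: string): boolean {
  if (seen.has(messageId)) return true;
  seen.add(messageId);
  return false;
}
```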