Architecture

This page explains how Datafly Signal’s components work together, how data flows through the system, and the design principles behind the platform.

Data Flow

                                 Customer's Subdomain
                                (DNS A record → cluster)
                                         |
  ┌──────────┐    HTTP POST     ┌────────┴────────┐
  │          │ ───────────────► │   Ingestion     │
  │ Browser  │                  │   Gateway       │──► Set-Cookie: _dfid
  │          │ ◄─────────────── │   (port 8080)   │
  │ Datafly  │   200 + cookie   │                 │
  │ .js      │                  └────────┬────────┘
  │ (< 8KB)  │                           │
  └──────────┘                  Kafka: raw-events

                                ┌────────┴────────┐
                                │ Event Processor │
                                │ (port 8081)     │
                                │                 │
                                │ ┌─────────────┐ │
                                │ │ Org Data    │ │  Layer 1: tenant-wide governance
                                │ │ Layer       │ │  (validation, PII, consent, enrichment)
                                │ └──────┬──────┘ │
                                │ ┌──────┴──────┐ │
                                │ │ Pipeline    │ │  Layer 2: per-vendor transformation
                                │ │ Engine      │ │  (field mapping, formatting, custom logic)
                                │ └─────────────┘ │
                                └────────┬────────┘

                          Kafka: delivery-{integration_id}

                                ┌────────┴────────┐
                                │ Delivery        │
                                │ Workers         │──► Google Analytics 4 (Measurement Protocol)
                                │ (port 8082)     │──► Meta / Facebook (Conversions API)
                                │                 │──► TikTok (Events API)
                                │ Per-vendor      │──► Google Ads (Enhanced Conversions)
                                │ formatters with │──► Pinterest, Snapchat, LinkedIn ...
                                │ retry + rate    │──► Custom Webhooks
                                │ limiting        │──► BigQuery, Snowflake, S3 ...
                                └─────────────────┘

                Supporting services:
                ┌─────────────────┐  ┌──────────────────┐
                │ Identity Hub    │  │ Management API   │
                │ (port 8083)     │  │ (port 8084)      │
                │                 │  │                  │
                │ Cross-domain    │  │ REST + WebSocket │──► Management UI
                │ identity via    │  │ Admin CRUD, RBAC │    (port 3000)
                │ encrypted       │  │ Real-time event  │
                │ tokens          │  │ debugger         │
                └─────────────────┘  └──────────────────┘

Components

Datafly.js (Client-Side Collector)

A lightweight JavaScript SDK (under 8KB gzipped) loaded on the customer’s website. It:

  • Collects page views, custom events, and user identifications
  • Captures ad click IDs from URL parameters (gclid, fbclid, ttclid, etc.)
  • Generates or reads the _dfid anonymous ID
  • Reads consent state from the customer’s consent management platform
  • Sends events via HTTP POST to the customer’s own subdomain endpoint
  • Stores vendor IDs in IndexedDB and first-party cookies

Datafly.js replaces all client-side vendor tags. The customer loads a single script instead of Google Analytics, Meta Pixel, TikTok Pixel, and others.
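The ad click ID capture mentioned above can be sketched as a small pure function. This is illustrative only; the SDK's internal function names are not documented here, and the parameter list shows just the three IDs named in this doc.

```typescript
// Hypothetical sketch of click-ID capture from the landing page URL.
// The real SDK likely handles more parameters than listed here.
const CLICK_ID_PARAMS = ["gclid", "fbclid", "ttclid"] as const;

function captureClickIds(url: string): Record<string, string> {
  const ids: Record<string, string> = {};
  const params = new URL(url).searchParams;
  for (const name of CLICK_ID_PARAMS) {
    const value = params.get(name);
    if (value) ids[name] = value; // keep only parameters actually present
  }
  return ids;
}
```

Captured IDs would then travel in the event payload so the Delivery Workers can attribute conversions server-side.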

Ingestion Gateway

The HTTP entry point for all events. Responsibilities:

  • Receives events from Datafly.js and server-side API clients
  • Validates the pipeline key (HMAC verification)
  • Sets the _dfid anonymous ID cookie via Set-Cookie header (ITP-exempt first-party cookie)
  • Generates and sets vendor ID cookies (GA4 _ga, Meta _fbp, TikTok _ttp) via Set-Cookie
  • Enriches events with IP geolocation (MaxMind GeoLite2)
  • Publishes validated events to the raw-events Kafka topic
  • Serves the Datafly.js collector file at configurable paths
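The HMAC verification step might look like the following sketch. The exact signing scheme (what is signed, and how the signature is transported) is not specified in this document; this assumes the client sends a hex-encoded HMAC-SHA256 of the raw request body.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hedged sketch of pipeline-key verification. Assumes the signature is
// an HMAC-SHA256 of the raw body, hex-encoded; the real scheme may differ.
function verifyPipelineKey(rawBody: string, secret: string, signatureHex: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual requires equal lengths and avoids timing side channels
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

A constant-time comparison matters here because the gateway is internet-facing and the pipeline key is the only authentication on the ingest path.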

Event Processor

The core processing engine. Consumes events from raw-events and applies two layers of transformation.

Layer 1 — Org Data Layer (tenant-wide):

Runs on every event from every source before any pipeline-specific logic. Steps in order:

  1. Schema validation
  2. Field standardisation (normalise names across sources)
  3. Data type enforcement
  4. Value normalisation (country codes, currencies, emails)
  5. Data cleansing (trim, strip HTML, remove empties)
  6. PII classification and handling (hash, mask, or strip per policy)
  7. Global enrichments (vendor ID injection from Redis, GeoIP, user agent parsing, session stitching, identity resolution)
  8. Consent enforcement (verify consent state, strip non-consented data)
  9. Event fan-out (one-to-many splitting for retail media use cases)
  10. Field removal (strip fields that must never reach integrations)
  11. Audit trail (attach processing receipt)
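Steps 4 and 6 can be illustrated for an email field: normalise the value, then apply the tenant's PII policy. The policy names here are illustrative, not the platform's actual configuration vocabulary.

```typescript
import { createHash } from "node:crypto";

// Step 4: value normalisation for emails (trim + lowercase).
function normaliseEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Step 6: PII handling. "hash" | "mask" | "strip" are assumed policy
// names for illustration; real policies come from the Org Data Layer config.
function applyPiiPolicy(email: string, policy: "hash" | "mask" | "strip"): string | undefined {
  const normalised = normaliseEmail(email);
  switch (policy) {
    case "hash":
      return createHash("sha256").update(normalised).digest("hex");
    case "mask":
      return normalised.replace(/^(.).*(@.*)$/, "$1***$2");
    case "strip":
      return undefined; // field removed entirely
  }
}
```

Normalising before hashing is essential: vendors that match on hashed emails (see Layer 2, step 5) expect the hash of the lowercased, trimmed value, so "Jane@Example.COM" and "jane@example.com" must hash identically.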

Layer 2 — Pipeline Transformation Engine (per-vendor):

Per-source, per-integration data shaping:

  1. Pipeline-global transformations
  2. Pipeline enrichments
  3. Per-integration field mapping
  4. Per-integration enrichments
  5. Per-integration PII handling (e.g., SHA-256 hashing for Meta/TikTok)
  6. Per-integration custom logic (sandboxed expressions)
  7. Output validation

After processing, events are published to per-integration delivery-{integration_id} Kafka topics.
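Step 3 (per-integration field mapping) can be sketched as a path-based projection. The mapping shape below is an assumption for illustration; the actual transformation files are YAML/JSON configs whose format is not shown here.

```typescript
// Hypothetical mapping: vendor field name -> dot-path into the event envelope.
type FieldMapping = Record<string, string>;

// Resolve a dot-path like "properties.price" against a nested object.
function getPath(obj: any, path: string): unknown {
  return path.split(".").reduce((o, k) => (o == null ? undefined : o[k]), obj);
}

function mapFields(event: Record<string, unknown>, mapping: FieldMapping): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [vendorField, sourcePath] of Object.entries(mapping)) {
    const value = getPath(event, sourcePath);
    if (value !== undefined) out[vendorField] = value; // drop unmapped fields
  }
  return out;
}
```

For example, a GA4-style mapping might project `properties.price` onto a `value` field while leaving everything else behind, so each vendor only ever receives the fields its mapping names.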

Delivery Workers

Per-vendor workers that consume from delivery-* topics and deliver events to vendor APIs server-to-server. Each vendor worker handles:

  • API-specific payload formatting
  • Authentication (API keys, OAuth tokens, etc.)
  • Retry with exponential backoff
  • Rate limiting per vendor API constraints
  • Dead letter queue for persistent failures
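The retry schedule can be sketched as capped exponential backoff. The base delay and cap below are illustrative defaults, not values taken from the platform's configuration; a production worker would also add jitter and route to the dead letter queue after a maximum attempt count.

```typescript
// Capped exponential backoff: attempt 0 -> 1s, 1 -> 2s, 2 -> 4s ... up to 60s.
// base/cap values are assumptions for illustration.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```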

Identity Hub

Handles cross-domain identity resolution. When a customer operates multiple domains, the Identity Hub enables identity stitching across them using encrypted tokens. The hub communicates with the browser via a hidden iframe on a shared domain.

Management API

REST and WebSocket API that powers the Management UI and provides programmatic access:

  • Organisation, user, and RBAC management
  • Source and integration CRUD
  • Pipeline and transformation configuration
  • Audit logging
  • Real-time event debugger (WebSocket stream of live events)

Management UI

A Next.js 15 admin dashboard for configuring and monitoring the platform:

  • Source management (pipeline keys, script settings, vendor selection)
  • Integration configuration (vendor credentials, delivery settings)
  • Pipeline builder (transformation rules, field mappings)
  • Real-time event debugger (live WebSocket event stream)
  • User and role management

Infrastructure Components

Apache Kafka

Kafka is the event streaming backbone. It decouples ingestion from processing and processing from delivery, enabling independent scaling of each stage.

Topics:

| Topic | Producer | Consumer | Purpose |
| --- | --- | --- | --- |
| raw-events | Ingestion Gateway | Event Processor | Unprocessed ingested events |
| delivery-{integration_id} | Event Processor | Delivery Workers | Transformed events ready for vendor delivery |
| dead-letter-queue | Delivery Workers | Monitoring / manual replay | Failed events with error details |

The Kafka abstraction layer in the shared library is designed to support a future swap to NATS JetStream for lighter deployments.

Redis

Redis serves multiple purposes across the platform:

| Usage | Key Pattern | TTL | Service |
| --- | --- | --- | --- |
| Vendor ID store | vendor_ids:{anonymous_id} | 400 days | Ingestion Gateway, Event Processor |
| Identity graph | identity:{anonymous_id} | 400 days | Event Processor, Identity Hub |
| Session data | session:{anonymous_id} | 30 minutes | Event Processor |
| Rate limiting | ratelimit:{source_id}:{window} | Per window | Ingestion Gateway |
| Config cache | config:{type}:{id} | 5 minutes | All services |

Redis is configured with appendonly yes for durability and allkeys-lru eviction for memory management.
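The key patterns and TTLs can be captured in small builder helpers, which keeps every service constructing identical keys. The helper names below are illustrative; only the key patterns and TTLs come from this document.

```typescript
// Hypothetical key builders following the documented patterns. TTLs in seconds.
const DAY = 86_400;

const redisKeys = {
  vendorIds: (anonymousId: string) => ({ key: `vendor_ids:${anonymousId}`, ttlSeconds: 400 * DAY }),
  identity: (anonymousId: string) => ({ key: `identity:${anonymousId}`, ttlSeconds: 400 * DAY }),
  session: (anonymousId: string) => ({ key: `session:${anonymousId}`, ttlSeconds: 30 * 60 }),
  rateLimit: (sourceId: string, windowStart: number) => ({ key: `ratelimit:${sourceId}:${windowStart}` }),
};
```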

PostgreSQL

PostgreSQL is the primary relational store for all configuration and operational data:

| Table | Purpose |
| --- | --- |
| organisations | Tenant accounts |
| users | Admin users with hashed passwords |
| roles | RBAC role definitions |
| role_permissions | Permission grants per role |
| sources | Data collection endpoints with pipeline keys and config |
| integrations | Vendor destinations with credentials and settings |
| transformation_files | YAML/JSON pipeline transformation configs |
| org_data_layer | Tenant-wide processing rules |
| api_keys | API key management |
| audit_logs | Change history for compliance |
| environments | Environment definitions (dev, staging, prod) |

Deployment Model

Single-Tenant Architecture

Each customer gets an isolated deployment. In Kubernetes, this means one namespace per customer with dedicated instances of every service. This provides:

  • Complete data isolation — no shared databases, no shared Kafka topics
  • Independent scaling — each customer’s event volume is handled by their own resources
  • Regulatory compliance — data residency and sovereignty requirements met per-customer
  • Independent upgrades — customers can be upgraded on different schedules

First-Party Data Collection

All data flows through the customer’s own subdomain. The customer creates a DNS A record (not a CNAME) pointing their data collection subdomain to the cluster’s IP address:

data.customer.com  →  A record  →  cluster IP

Using an A record instead of a CNAME is important for Safari ITP compliance. Safari treats CNAME-cloaked third-party domains differently from true first-party A records. With an A record, cookies set by the Ingestion Gateway are treated as genuine first-party cookies with no expiry restrictions.

This means:

  • The browser sends events to data.customer.com — a domain the customer owns
  • Cookies are set on the customer’s domain via Set-Cookie headers
  • No third-party domain appears in network requests
  • Ad blockers see only first-party traffic
  • The Datafly.js script is served from the same subdomain (configurable filename to avoid pattern-matching blockers)

Server-Set Cookies

The _dfid anonymous ID and vendor ID cookies (_ga, _fbp, _ttp, etc.) are set by the Ingestion Gateway via Set-Cookie response headers, not by JavaScript. This is significant because:

  • Safari ITP limits JavaScript-set cookies to 7 days. Server-set cookies on a first-party A record domain have no such limitation.
  • Cookie consistency — the server is the source of truth for identity, not the browser.
  • Vendor IDs are stored server-side in Redis, not only in browser cookies. Even if cookies are cleared, the server retains the mapping via the _dfid anonymous ID.
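A server-set _dfid cookie might be constructed like this. The attribute choices (a 400-day Max-Age, Secure, SameSite=Lax) are assumptions for illustration; the document does not specify the exact attributes the gateway emits.

```typescript
// Illustrative Set-Cookie header for the _dfid anonymous ID.
// 400 days matches the practical lifetime cap modern browsers apply.
function dfidSetCookie(anonymousId: string, domain: string): string {
  const maxAge = 400 * 24 * 60 * 60; // 400 days in seconds
  return [
    `_dfid=${anonymousId}`,
    `Domain=${domain}`,
    "Path=/",
    `Max-Age=${maxAge}`,
    "Secure",
    "SameSite=Lax",
  ].join("; ");
}
```

Because the header comes from the server on a true first-party A-record domain, Safari ITP's 7-day cap on JavaScript-set cookies does not apply.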

Event Schema

Every event flowing through the system follows a common envelope:

{
  "anonymous_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user-123",
  "event": "Product Purchased",
  "type": "track",
  "properties": {
    "product_id": "SKU-123",
    "price": 49.99,
    "currency": "USD"
  },
  "context": {
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 ...",
    "locale": "en-US",
    "page": {
      "url": "https://shop.example.com/checkout",
      "title": "Checkout",
      "referrer": "https://shop.example.com/cart"
    },
    "consent": {
      "analytics": true,
      "marketing": false
    },
    "vendor_ids": {
      "ga_client_id": "1234567890.1706000000",
      "fbp": "fb.1.1706000000.1234567890",
      "ttp": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
    }
  },
  "timestamp": "2026-02-25T10:30:00.000Z",
  "sent_at": "2026-02-25T10:30:00.050Z",
  "message_id": "msg-uuid-here"
}

| Field | Description |
| --- | --- |
| anonymous_id | Datafly-generated UUID, present on every event |
| user_id | Customer-set identifier, present after identify() |
| event | Event name (page for page views, custom name for track) |
| type | One of page, track, identify, group |
| properties | Freeform key-value data specific to the event |
| context | Automatically collected metadata (IP, UA, page, consent, vendor IDs) |
| timestamp | Client-side event timestamp (ISO 8601) |
| sent_at | When the collector sent the event (used for clock drift correction) |
| message_id | Unique ID for deduplication |
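A minimal envelope check for the required fields above might look like this. This is a sketch, not the platform's actual schema validator (which runs as step 1 of the Org Data Layer).

```typescript
// Illustrative check of the common envelope's required fields.
function isValidEnvelope(e: any): boolean {
  return (
    typeof e?.anonymous_id === "string" &&
    typeof e?.type === "string" &&
    ["page", "track", "identify", "group"].includes(e.type) &&
    typeof e?.message_id === "string" &&
    typeof e?.timestamp === "string" &&
    !Number.isNaN(Date.parse(e.timestamp)) // must be a parseable ISO 8601 timestamp
  );
}
```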