Architecture

This page explains how Datafly Signal’s components work together, how data flows through the system, and the design principles behind the platform.

Data Flow

                                 Customer's Subdomain
                                (DNS A record → cluster)
                                         |
  ┌──────────┐    HTTP POST     ┌────────┴────────┐
  │          │ ───────────────► │   Ingestion     │
  │ Browser  │                  │   Gateway       │──► Set-Cookie: _dfid
  │          │ ◄─────────────── │   (port 8080)   │
  │ Datafly  │   200 + cookie   │                 │
  │ .js      │                  └────────┬────────┘
  │ (< 8KB)  │                           │
  └──────────┘                  Kafka: raw-events

                                ┌────────┴────────┐
                                │ Event Processor │
                                │ (port 8081)     │
                                │                 │
                                │ ┌─────────────┐ │
                                │ │ Org Data    │ │  Layer 1: tenant-wide governance
                                │ │ Layer       │ │  (validation, PII, consent, enrichment)
                                │ └──────┬──────┘ │
                                │ ┌──────┴──────┐ │
                                │ │ Pipeline    │ │  Layer 2: per-vendor transformation
                                │ │ Engine      │ │  (field mapping, formatting, custom logic)
                                │ └─────────────┘ │
                                └────────┬────────┘

                          Kafka: delivery-{integration_id}

                                ┌────────┴────────┐
                                │ Delivery        │
                                │ Workers         │──► Google Analytics 4 (Measurement Protocol)
                                │ (port 8082)     │──► Meta / Facebook (Conversions API)
                                │                 │──► TikTok (Events API)
                                │ Per-vendor      │──► Google Ads (Enhanced Conversions)
                                │ formatters with │──► Pinterest, Snapchat, LinkedIn ...
                                │ retry + rate    │──► Custom Webhooks
                                │ limiting        │──► BigQuery, Snowflake, S3 ...
                                └─────────────────┘

                Supporting services:
                ┌─────────────────┐  ┌──────────────────┐
                │ Identity Hub    │  │ Management API   │
                │ (port 8083)     │  │ (port 8084)      │
                │                 │  │                  │
                │ Cross-domain    │  │ REST + WebSocket │──► Management UI
                │ identity via    │  │ Admin CRUD, RBAC │    (port 3000)
                │ encrypted       │  │ Real-time event  │
                │ tokens          │  │ debugger         │
                └─────────────────┘  └──────────────────┘

Components

Datafly.js (Client-Side Collector)

A lightweight JavaScript SDK (under 8KB gzipped) loaded on the customer’s website. It:

  • Collects page views, custom events, and user identifications
  • Captures ad click IDs from URL parameters (gclid, fbclid, ttclid, etc.)
  • Generates or reads the _dfid anonymous ID
  • Reads consent state from the customer’s consent management platform
  • Sends events via HTTP POST to the customer’s own subdomain endpoint
  • Stores vendor IDs in IndexedDB and first-party cookies

Datafly.js replaces all client-side vendor tags. The customer loads a single script instead of Google Analytics, Meta Pixel, TikTok Pixel, and others.
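The ad click ID capture mentioned above can be sketched as a small pure function. This is illustrative only; the SDK's internal function names are not documented here, and the parameter list shows just the three IDs named in this doc.

```typescript
// Hypothetical sketch of click-ID capture from the landing page URL.
// The real SDK likely handles more parameters than listed here.
const CLICK_ID_PARAMS = ["gclid", "fbclid", "ttclid"] as const;

function captureClickIds(url: string): Record<string, string> {
  const ids: Record<string, string> = {};
  const params = new URL(url).searchParams;
  for (const name of CLICK_ID_PARAMS) {
    const value = params.get(name);
    if (value) ids[name] = value; // keep only parameters actually present
  }
  return ids;
}
```

Captured IDs would then travel in the event payload so the Delivery Workers can attribute conversions server-side.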

Ingestion Gateway

The HTTP entry point for all events. Responsibilities:

  • Receives events from Datafly.js and server-side API clients
  • Validates the pipeline key (HMAC verification)
  • Sets the _dfid anonymous ID cookie via Set-Cookie header (ITP-exempt first-party cookie)
  • Generates and sets vendor ID cookies (GA4 _ga, Meta _fbp, TikTok _ttp) via Set-Cookie
  • Enriches events with IP geolocation (MaxMind GeoLite2)
  • Publishes validated events to the raw-events Kafka topic
  • Serves the Datafly.js collector file at configurable paths
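The HMAC verification step might look like the following sketch. The exact signing scheme (what is signed, and how the signature is transported) is not specified in this document; this assumes the client sends a hex-encoded HMAC-SHA256 of the raw request body.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hedged sketch of pipeline-key verification. Assumes the signature is
// an HMAC-SHA256 of the raw body, hex-encoded; the real scheme may differ.
function verifyPipelineKey(rawBody: string, secret: string, signatureHex: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual requires equal lengths and avoids timing side channels
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

A constant-time comparison matters here because the gateway is internet-facing and the pipeline key is the only authentication on the ingest path.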

Event Processor

The core processing engine. Consumes events from raw-events and applies two layers of transformation.

Layer 1 — Org Data Layer (tenant-wide):

Runs on every event from every source before any pipeline-specific logic. Steps in order:

  1. Schema validation
  2. Field standardisation (normalise names across sources)
  3. Data type enforcement
  4. Value normalisation (country codes, currencies, emails)
  5. Data cleansing (trim, strip HTML, remove empties)
  6. PII classification and handling (hash, mask, or strip per policy)
  7. Global enrichments (vendor ID injection from Redis, GeoIP, user agent parsing, session stitching, identity resolution)
  8. Consent enforcement (verify consent state, strip non-consented data)
  9. Event fan-out (one-to-many splitting for retail media use cases)
  10. Field removal (strip fields that must never reach integrations)
  11. Audit trail (attach processing receipt)
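Steps 4 and 6 can be illustrated for an email field: normalise the value, then apply the tenant's PII policy. The policy names here are illustrative, not the platform's actual configuration vocabulary.

```typescript
import { createHash } from "node:crypto";

// Step 4: value normalisation for emails (trim + lowercase).
function normaliseEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Step 6: PII handling. "hash" | "mask" | "strip" are assumed policy
// names for illustration; real policies come from the Org Data Layer config.
function applyPiiPolicy(email: string, policy: "hash" | "mask" | "strip"): string | undefined {
  const normalised = normaliseEmail(email);
  switch (policy) {
    case "hash":
      return createHash("sha256").update(normalised).digest("hex");
    case "mask":
      return normalised.replace(/^(.).*(@.*)$/, "$1***$2");
    case "strip":
      return undefined; // field removed entirely
  }
}
```

Normalising before hashing is essential: vendors that match on hashed emails (see Layer 2, step 5) expect the hash of the lowercased, trimmed value, so "Jane@Example.COM" and "jane@example.com" must hash identically.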

Layer 2 — Pipeline Transformation Engine (per-vendor):

Per-source, per-integration data shaping:

  1. Pipeline-global transformations
  2. Pipeline enrichments
  3. Per-integration field mapping
  4. Per-integration enrichments
  5. Per-integration PII handling (e.g., SHA-256 hashing for Meta/TikTok)
  6. Per-integration custom logic (sandboxed expressions)
  7. Output validation

After processing, events are published to per-integration delivery-{integration_id} Kafka topics.
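Step 3 (per-integration field mapping) can be sketched as a path-based projection. The mapping shape below is an assumption for illustration; the actual transformation files are YAML/JSON configs whose format is not shown here.

```typescript
// Hypothetical mapping: vendor field name -> dot-path into the event envelope.
type FieldMapping = Record<string, string>;

// Resolve a dot-path like "properties.price" against a nested object.
function getPath(obj: any, path: string): unknown {
  return path.split(".").reduce((o, k) => (o == null ? undefined : o[k]), obj);
}

function mapFields(event: Record<string, unknown>, mapping: FieldMapping): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [vendorField, sourcePath] of Object.entries(mapping)) {
    const value = getPath(event, sourcePath);
    if (value !== undefined) out[vendorField] = value; // drop unmapped fields
  }
  return out;
}
```

For example, a GA4-style mapping might project `properties.price` onto a `value` field while leaving everything else behind, so each vendor only ever receives the fields its mapping names.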

Delivery Workers

Per-vendor workers that consume from delivery-* topics and deliver events to vendor APIs server-to-server. Each vendor worker handles:

  • API-specific payload formatting
  • Authentication (API keys, OAuth tokens, etc.)
  • Retry with exponential backoff
  • Rate limiting per vendor API constraints
  • Dead letter queue for persistent failures
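The retry schedule can be sketched as capped exponential backoff. The base delay and cap below are illustrative defaults, not values taken from the platform's configuration; a production worker would also add jitter and route to the dead letter queue after a maximum attempt count.

```typescript
// Capped exponential backoff: attempt 0 -> 1s, 1 -> 2s, 2 -> 4s ... up to 60s.
// base/cap values are assumptions for illustration.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```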

Identity Hub

Handles cross-domain identity resolution. When a customer operates multiple domains, the Identity Hub enables identity stitching across them using encrypted tokens. The hub communicates with the browser via a hidden iframe on a shared domain.

Management API

REST and WebSocket API that powers the Management UI and provides programmatic access:

  • Organisation, user, and RBAC management
  • Source and integration CRUD
  • Pipeline and transformation configuration
  • Audit logging
  • Real-time event debugger (WebSocket stream of live events)

Management UI

A Next.js 15 admin dashboard for configuring and monitoring the platform:

  • Source management (pipeline keys, script settings, vendor selection)
  • Integration configuration (vendor credentials, delivery settings)
  • Pipeline builder (transformation rules, field mappings)
  • Real-time event debugger (live WebSocket event stream)
  • User and role management

Infrastructure Components

Apache Kafka

Kafka is the event streaming backbone. It decouples ingestion from processing and processing from delivery, enabling independent scaling of each stage.

Topics:

| Topic | Producer | Consumer | Purpose |
| --- | --- | --- | --- |
| raw-events | Ingestion Gateway | Event Processor | Unprocessed ingested events |
| delivery-{integration_id} | Event Processor | Delivery Workers | Transformed events ready for vendor delivery |
| dead-letter-queue | Delivery Workers | Monitoring / manual replay | Failed events with error details |

The Kafka abstraction layer in the shared library is designed to support a future swap to NATS JetStream for lighter deployments.

Redis

Redis serves multiple purposes across the platform:

| Usage | Key Pattern | TTL | Service |
| --- | --- | --- | --- |
| Vendor ID store | vendor_ids:{anonymous_id} | 400 days | Ingestion Gateway, Event Processor |
| Identity graph | identity:{anonymous_id} | 400 days | Event Processor, Identity Hub |
| Session data | session:{anonymous_id} | 30 minutes | Event Processor |
| Rate limiting | ratelimit:{source_id}:{window} | Per window | Ingestion Gateway |
| Config cache | config:{type}:{id} | 5 minutes | All services |

Redis is configured with appendonly yes for durability and allkeys-lru eviction for memory management.
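The key patterns and TTLs can be captured in small builder helpers, which keeps every service constructing identical keys. The helper names below are illustrative; only the key patterns and TTLs come from this document.

```typescript
// Hypothetical key builders following the documented patterns. TTLs in seconds.
const DAY = 86_400;

const redisKeys = {
  vendorIds: (anonymousId: string) => ({ key: `vendor_ids:${anonymousId}`, ttlSeconds: 400 * DAY }),
  identity: (anonymousId: string) => ({ key: `identity:${anonymousId}`, ttlSeconds: 400 * DAY }),
  session: (anonymousId: string) => ({ key: `session:${anonymousId}`, ttlSeconds: 30 * 60 }),
  rateLimit: (sourceId: string, windowStart: number) => ({ key: `ratelimit:${sourceId}:${windowStart}` }),
};
```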

PostgreSQL

PostgreSQL is the primary relational store for all configuration and operational data:

| Table | Purpose |
| --- | --- |
| organisations | Tenant accounts |
| users | Admin users with hashed passwords |
| roles | RBAC role definitions |
| role_permissions | Permission grants per role |
| sources | Data collection endpoints with pipeline keys and config |
| integrations | Vendor destinations with credentials and settings |
| transformation_files | YAML/JSON pipeline transformation configs |
| org_data_layer | Tenant-wide processing rules |
| api_keys | API key management |
| audit_logs | Change history for compliance |
| environments | Environment definitions (dev, staging, prod) |

Deployment Model

Single-Tenant Architecture

Each customer gets an isolated deployment. In Kubernetes, this means one namespace per customer with dedicated instances of every service. This provides:

  • Complete data isolation — no shared databases, no shared Kafka topics
  • Independent scaling — each customer’s event volume is handled by their own resources
  • Regulatory compliance — data residency and sovereignty requirements met per-customer
  • Independent upgrades — customers can be upgraded on different schedules

First-Party Data Collection

All data flows through the customer’s own subdomain. The customer creates a DNS A record (not a CNAME) pointing their data collection subdomain to the cluster’s IP address:

data.customer.com  →  A record  →  cluster IP

Using an A record instead of a CNAME is important for Safari ITP compliance. Safari treats CNAME-cloaked third-party domains differently from true first-party A records. With an A record, cookies set by the Ingestion Gateway are treated as genuine first-party cookies with no expiry restrictions.

This means:

  • The browser sends events to data.customer.com — a domain the customer owns
  • Cookies are set on the customer’s domain via Set-Cookie headers
  • No third-party domain appears in network requests
  • Ad blockers see only first-party traffic
  • The Datafly.js script is served from the same subdomain (configurable filename to avoid pattern-matching blockers)

Server-Set Cookies

The _dfid anonymous ID and vendor ID cookies (_ga, _fbp, _ttp, etc.) are set by the Ingestion Gateway via Set-Cookie response headers, not by JavaScript. This is significant because:

  • Safari ITP limits JavaScript-set cookies to 7 days. Server-set cookies on a first-party A record domain have no such limitation.
  • Cookie consistency — the server is the source of truth for identity, not the browser.
  • Vendor IDs are stored server-side in Redis, not only in browser cookies. Even if cookies are cleared, the server retains the mapping via the _dfid anonymous ID.
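A server-set _dfid cookie might be constructed like this. The attribute choices (a 400-day Max-Age, Secure, SameSite=Lax) are assumptions for illustration; the document does not specify the exact attributes the gateway emits.

```typescript
// Illustrative Set-Cookie header for the _dfid anonymous ID.
// 400 days matches the practical lifetime cap modern browsers apply.
function dfidSetCookie(anonymousId: string, domain: string): string {
  const maxAge = 400 * 24 * 60 * 60; // 400 days in seconds
  return [
    `_dfid=${anonymousId}`,
    `Domain=${domain}`,
    "Path=/",
    `Max-Age=${maxAge}`,
    "Secure",
    "SameSite=Lax",
  ].join("; ");
}
```

Because the header comes from the server on a true first-party A-record domain, Safari ITP's 7-day cap on JavaScript-set cookies does not apply.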

Event Schema

Every event flowing through the system follows a common envelope:

{
  "anonymous_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user-123",
  "event": "Product Purchased",
  "type": "track",
  "properties": {
    "product_id": "SKU-123",
    "price": 49.99,
    "currency": "USD"
  },
  "context": {
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 ...",
    "locale": "en-US",
    "page": {
      "url": "https://shop.example.com/checkout",
      "title": "Checkout",
      "referrer": "https://shop.example.com/cart"
    },
    "consent": {
      "analytics": true,
      "marketing": false
    },
    "vendor_ids": {
      "ga_client_id": "1234567890.1706000000",
      "fbp": "fb.1.1706000000.1234567890",
      "ttp": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
    }
  },
  "timestamp": "2026-02-25T10:30:00.000Z",
  "sent_at": "2026-02-25T10:30:00.050Z",
  "message_id": "msg-uuid-here"
}

| Field | Description |
| --- | --- |
| anonymous_id | Datafly-generated UUID, present on every event |
| user_id | Customer-set identifier, present after identify() |
| event | Event name (page for page views, custom name for track) |
| type | One of page, track, identify, group |
| properties | Freeform key-value data specific to the event |
| context | Automatically collected metadata (IP, UA, page, consent, vendor IDs) |
| timestamp | Client-side event timestamp (ISO 8601) |
| sent_at | When the collector sent the event (used for clock drift correction) |
| message_id | Unique ID for deduplication |
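A minimal envelope check for the required fields above might look like this. This is a sketch, not the platform's actual schema validator (which runs as step 1 of the Org Data Layer).

```typescript
// Illustrative check of the common envelope's required fields.
function isValidEnvelope(e: any): boolean {
  return (
    typeof e?.anonymous_id === "string" &&
    typeof e?.type === "string" &&
    ["page", "track", "identify", "group"].includes(e.type) &&
    typeof e?.message_id === "string" &&
    typeof e?.timestamp === "string" &&
    !Number.isNaN(Date.parse(e.timestamp)) // must be a parseable ISO 8601 timestamp
  );
}
```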