Organisation Data Layer

The Organisation Data Layer is the first processing stage applied to every event. It runs tenant-wide — the same rules apply regardless of which integrations the event is destined for. The layer consists of 11 steps executed in sequence.

raw event → Schema Validation → Consent → Bot Filtering → PII
  → Geolocation → Device Parsing → Sessions → Identity
  → Deduplication → Custom JS → Routing → enriched event

Step 1: Schema Validation

Validates that the incoming event conforms to the expected structure. Events must have:

  • A valid type field (track, page, identify, group)
  • Required fields for the event type (e.g. event name for track events)
  • Valid data types for all fields
  • Properly formed timestamps (ISO 8601)

Events that fail validation are sent to a dead-letter topic (dlq-events) with the validation error attached. They are visible in the Management UI’s event debugger.

{
  "error": "schema_validation_failed",
  "details": "Field 'type' is required",
  "original_event": { "..." : "..." }
}
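The checks above can be sketched roughly as follows. This is a minimal illustration, not the production validator; the error strings and helper name are illustrative.

```javascript
// Minimal sketch of the structural checks described above.
const VALID_TYPES = new Set(["track", "page", "identify", "group"]);

function validateEvent(event) {
  const errors = [];
  if (!VALID_TYPES.has(event.type)) {
    errors.push("Field 'type' is required and must be one of: track, page, identify, group");
  }
  if (event.type === "track" && typeof event.event !== "string") {
    errors.push("track events require an 'event' name");
  }
  if (event.timestamp && Number.isNaN(Date.parse(event.timestamp))) {
    errors.push("'timestamp' must be an ISO 8601 string");
  }
  return errors; // empty array means the event passes
}
```

An event returning a non-empty error list would be published to the dead-letter topic with those errors attached.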

Step 2: Consent Enforcement

Checks the event’s consent state against the organisation’s consent configuration. Each event can carry consent signals in its context:

{
  "context": {
    "consent": {
      "analytics": true,
      "marketing": false,
      "functional": true
    }
  }
}

The consent enforcer evaluates this against the consent requirements for each integration category:

Action | Description
Allow  | Consent granted for the required categories; event proceeds
Drop   | Consent not granted; event is discarded entirely
Strip  | Event proceeds but PII fields are removed

If no consent data is present on the event, the organisation’s default consent policy applies. This is configurable in the Management UI under Organisation Settings.

Integrations are tagged with consent categories (e.g. Google Analytics = analytics, Meta = marketing). An event is only routed to an integration if the user has consented to that integration’s category.
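The category check can be sketched as below. This shows only the Allow/Drop branch (Strip is omitted for brevity), and the default-policy shape is an assumption for illustration.

```javascript
// Sketch: an event proceeds to an integration only if every consent
// category that integration requires has been granted.
function consentAction(event, requiredCategories, defaultPolicy = {}) {
  // Fall back to the organisation's default policy when the event
  // carries no consent signals.
  const consent = event.context?.consent ?? defaultPolicy;
  const granted = requiredCategories.every((c) => consent[c] === true);
  return granted ? "allow" : "drop";
}
```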

Step 3: Bot and Spam Filtering

Filters out non-human traffic using multiple signals:

Signal                  | Method
User-agent patterns     | Match against a curated list of known bot user agents
IP reputation           | Check against known datacenter IP ranges and threat intelligence feeds
Behavioural heuristics  | Detect anomalous patterns (e.g. impossibly fast event sequences)
Missing browser signals | Flag requests missing standard browser headers

Detected bot events are dropped by default. You can optionally route them to a separate topic (bot-events) for analysis by enabling bot logging in the organisation settings.
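The first signal — user-agent pattern matching — reduces to something like the sketch below. The sample patterns are illustrative; the curated production list is far larger.

```javascript
// Illustrative user-agent check; the real curated list is much larger.
const BOT_UA_PATTERNS = [/bot/i, /crawler/i, /spider/i, /headless/i];

function isKnownBotUserAgent(userAgent) {
  return BOT_UA_PATTERNS.some((re) => re.test(userAgent ?? ""));
}
```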

Step 4: PII Detection and Handling

Scans event properties and traits for personally identifiable information. The PII detector supports four handling modes, configurable per field or per pattern:

Mode         | Behaviour
pass-through | PII is sent as-is to downstream integrations
hash         | PII is replaced with a SHA-256 hash (useful for identity matching without exposing raw values)
redact       | PII is replaced with a placeholder (e.g. [REDACTED])
drop         | The entire field is removed from the event

Auto-Detection Patterns

The PII detector automatically identifies:

  • Email addresses (regex pattern matching)
  • Phone numbers (international format detection)
  • IP addresses (IPv4 and IPv6)
  • Credit card numbers (Luhn algorithm validation)
  • Social Security Numbers / national IDs (format matching)
  • Names in known PII fields (firstName, lastName, name, fullName)
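The credit-card check is worth spelling out: a candidate digit string is only treated as card PII if it passes the Luhn checksum, which filters out most random digit runs. A standard implementation:

```javascript
// Luhn checksum: double every second digit from the right, subtract 9
// from any result above 9, and require the total to be divisible by 10.
function passesLuhn(digits) {
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // '0' is char code 48
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return digits.length >= 13 && sum % 10 === 0;
}
```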

Configuration

PII handling is configured at the organisation level:

pii_policy:
  default_action: hash
  rules:
    - pattern: "email"
      fields: ["traits.email", "properties.email", "properties.user_email"]
      action: hash
    - pattern: "phone"
      fields: ["traits.phone", "properties.phone"]
      action: redact
    - pattern: "ip_address"
      fields: ["context.ip"]
      action: pass-through
    - pattern: "credit_card"
      action: drop
⚠️ PII handling at this layer applies to all integrations. Individual integrations may have additional PII requirements configured in their pipeline. For example, Meta CAPI requires hashed email and phone — the pipeline will hash these even if the org policy is pass-through.

Step 5: IP Geolocation Enrichment

Resolves the client IP address to geographic data using a local MaxMind GeoIP2 database. The enrichment adds a geo object to the event context:

{
  "context": {
    "geo": {
      "country": "US",
      "countryCode": "US",
      "region": "California",
      "regionCode": "CA",
      "city": "San Francisco",
      "postalCode": "94102",
      "latitude": 37.7749,
      "longitude": -122.4194,
      "timezone": "America/Los_Angeles"
    }
  }
}

The GeoIP database is updated weekly via an automated download. No external API calls are made during event processing.

Step 6: Device and Browser Parsing

Parses the userAgent string into structured device, browser, and operating system data:

{
  "context": {
    "device": {
      "type": "desktop",
      "brand": "Apple",
      "model": "Macintosh",
      "browser": "Chrome",
      "browserVersion": "121.0.6167.85",
      "os": "macOS",
      "osVersion": "14.3",
      "engine": "Blink",
      "engineVersion": "121.0.6167.85"
    }
  }
}

Device type classification:

Type    | Description
desktop | Desktop or laptop computer
mobile  | Mobile phone
tablet  | Tablet device
tv      | Smart TV or streaming device
bot     | Known bot (also flagged in step 3)
unknown | Unrecognised user agent
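A rough sense of how the classification works, as a sketch only — production parsers rely on large maintained rule sets, not a handful of regexes like these:

```javascript
// Illustrative classification into the device types above; the patterns
// are samples, not the real rule set.
function classifyDeviceType(userAgent) {
  const ua = userAgent ?? "";
  if (/bot|crawler|spider/i.test(ua)) return "bot";
  if (/smart-?tv|appletv|roku/i.test(ua)) return "tv";
  if (/ipad|tablet/i.test(ua)) return "tablet";
  if (/mobi|iphone|android/i.test(ua)) return "mobile";
  if (/windows|macintosh|x11/i.test(ua)) return "desktop";
  return "unknown";
}
```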

Step 7: Session Stitching

Groups events into logical sessions using Redis-backed state. A session is defined by:

  • Same anonymous ID or user ID
  • 30-minute inactivity timeout (configurable per organisation)
  • Midnight boundary — a new session starts at midnight in the user’s timezone

Each event is annotated with session data:

{
  "context": {
    "session": {
      "id": "sess_abc123",
      "eventIndex": 5,
      "startedAt": "2026-02-25T14:00:00.000Z",
      "isNew": false
    }
  }
}

The session stitcher also computes session-level metrics that downstream integrations may need, such as session_engaged for GA4.
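The two boundary rules — inactivity timeout and local midnight — combine into a single new-session decision, sketched here (the function name and signature are illustrative):

```javascript
// A session ends after 30 minutes of inactivity, or when midnight
// passes in the user's timezone.
const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // configurable per organisation

function startsNewSession(lastSeenMs, nowMs, timezone) {
  if (nowMs - lastSeenMs > SESSION_TIMEOUT_MS) return true;
  // "en-CA" formats as YYYY-MM-DD, giving a comparable local calendar day.
  const day = (ms) =>
    new Intl.DateTimeFormat("en-CA", { timeZone: timezone }).format(ms);
  return day(lastSeenMs) !== day(nowMs); // crossed midnight locally
}
```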

Step 8: Identity Resolution

Attaches known identifiers to the event by looking up the anonymous ID and user ID in the Identity Hub (via Redis cache):

{
  "vendorIds": {
    "ga4_client_id": "1234567890.1740000000",
    "fbp": "fb.1.1740000000.987654321",
    "ttp": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  },
  "clickIds": {
    "gclid": "CjwKCAiA...",
    "fbclid": "IwAR3..."
  }
}

Identity resolution:

  1. Looks up the _dfid anonymous ID in the identity graph
  2. Retrieves all associated vendor IDs and click IDs
  3. Merges with any identifiers already present on the event
  4. If a userId is present, ensures the identity graph links it to the anonymous ID
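Step 3 of this sequence can be sketched as a shallow merge. The tie-break here — identifiers already on the event win over cached graph values — is an assumption, since the text does not specify it:

```javascript
// Merge graph-stored identifiers into the event. Assumes identifiers
// already present on the event take precedence over cached values.
function mergeIdentifiers(event, graphEntry) {
  return {
    ...event,
    vendorIds: { ...graphEntry.vendorIds, ...event.vendorIds },
    clickIds: { ...graphEntry.clickIds, ...event.clickIds },
  };
}
```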

See the Identity section for details on the identity graph.

Step 9: Event Deduplication

Checks whether the event has already been processed using the messageId field as an idempotency key. The deduplication window is stored in Redis with a configurable TTL (default: 24 hours).

Scenario                       | Result
New messageId                  | Event is processed; messageId is stored in Redis
Duplicate messageId within TTL | Event is dropped silently
No messageId                   | Event is always processed (no deduplication)

The Datafly.js SDK automatically generates a UUID messageId for every event. For server-side events, you should generate and include your own messageId if you want deduplication protection.
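The decision table above amounts to the logic below. In production this is a single Redis `SET messageId NX EX <ttl>` call; an in-memory Map stands in for Redis here so the sketch is self-contained:

```javascript
// In-memory stand-in for the Redis dedup check; the decision logic is
// the same as the SET ... NX EX call used in production.
const DEDUP_TTL_MS = 24 * 60 * 60 * 1000; // default: 24 hours
const seen = new Map(); // messageId -> expiry timestamp (ms)

function shouldProcess(messageId, nowMs = Date.now()) {
  if (!messageId) return true; // no messageId: no deduplication
  const expiry = seen.get(messageId);
  if (expiry !== undefined && expiry > nowMs) return false; // duplicate
  seen.set(messageId, nowMs + DEDUP_TTL_MS);
  return true;
}
```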

Step 10: Custom JavaScript Execution

Executes organisation-level JavaScript code in a sandboxed V8 isolate. This allows organisations to apply custom business logic that cannot be expressed through configuration alone.

// Example: normalise event names to snake_case
function process(event) {
  if (event.event) {
    event.event = event.event
      .toLowerCase()
      .replace(/\s+/g, "_")
      .replace(/[^a-z0-9_]/g, "");
  }
  return event;
}

Custom code at this layer runs once per event, before pipeline-level transformations. See Custom Code for the full API reference and security model.

Step 11: Event Routing

Determines which integrations should receive this event based on:

  1. Source-to-integration connections — which integrations are wired to the event’s source
  2. Integration-level event filters — per-integration rules (e.g. “only send track events”, “only send events named Purchase”)
  3. Consent state — integrations whose consent category was not granted are excluded

The routing step produces a list of integration IDs. The event is then passed to the Pipeline Transformation Engine once per routed integration.

{
  "_routing": [
    "integration_ga4_001",
    "integration_meta_001",
    "integration_bigquery_001"
  ]
}

If the routing list is empty (no integrations match), the event is still stored for debugging purposes but is not published to any delivery topic.
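The three routing checks compose naturally as successive filters. The integration object shape below (sourceIds, eventFilter, consentCategories) is illustrative, not the actual internal schema:

```javascript
// Sketch of the three routing checks: source wiring, per-integration
// event filters, and consent category exclusion.
function routeEvent(event, integrations, grantedCategories) {
  return integrations
    .filter((i) => i.sourceIds.includes(event.sourceId))
    .filter((i) => !i.eventFilter || i.eventFilter(event))
    .filter((i) => i.consentCategories.every((c) => grantedCategories.includes(c)))
    .map((i) => i.id);
}
```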