Organisation Data Layer

The Organisation Data Layer is the first processing stage applied to every event. It runs tenant-wide — the same rules apply regardless of which integrations the event is destined for. The layer consists of 11 steps executed in sequence.

raw event → Schema Validation → Consent → Bot Filtering → PII
  → Geolocation → Device Parsing → Sessions → Identity
  → Deduplication → Custom JS → Routing → enriched event

Step 1: Schema Validation

Validates that the incoming event conforms to the expected structure. Events must have:

  • A valid type field (track, page, identify, group)
  • Required fields for the event type (e.g. event name for track events)
  • Valid data types for all fields
  • Properly formed timestamps (ISO 8601)

Events that fail validation are sent to a dead-letter topic (dlq-events) with the validation error attached. They are visible in the Management UI’s event debugger.

{
  "error": "schema_validation_failed",
  "details": "Field 'type' is required",
  "original_event": { "..." : "..." }
}
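The checks above can be sketched roughly as follows. This is a minimal illustration, not the production validator; the error strings and helper name are illustrative.

```javascript
// Minimal sketch of the structural checks described above.
const VALID_TYPES = new Set(["track", "page", "identify", "group"]);

function validateEvent(event) {
  const errors = [];
  if (!VALID_TYPES.has(event.type)) {
    errors.push("Field 'type' is required and must be one of: track, page, identify, group");
  }
  if (event.type === "track" && typeof event.event !== "string") {
    errors.push("track events require an 'event' name");
  }
  if (event.timestamp && Number.isNaN(Date.parse(event.timestamp))) {
    errors.push("'timestamp' must be an ISO 8601 string");
  }
  return errors; // empty array means the event passes
}
```

An event returning a non-empty error list would be published to the dead-letter topic with those errors attached.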

Step 2: Consent Enforcement

Checks the event’s consent state against the organisation’s consent configuration. Each event can carry consent signals in its context:

{
  "context": {
    "consent": {
      "analytics": true,
      "marketing": false,
      "functional": true
    }
  }
}

The consent enforcer evaluates this against the consent requirements for each integration category:

Action | Description
Allow  | Consent granted for the required categories; event proceeds
Drop   | Consent not granted; event is discarded entirely
Strip  | Event proceeds but PII fields are removed

If no consent data is present on the event, the organisation’s default consent policy applies. This is configurable in the Management UI under Organisation Settings.

Integrations are tagged with consent categories (e.g. Google Analytics = analytics, Meta = marketing). An event is only routed to an integration if the user has consented to that integration’s category.
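The category check can be sketched as below. This shows only the Allow/Drop branch (Strip is omitted for brevity), and the default-policy shape is an assumption for illustration.

```javascript
// Sketch: an event proceeds to an integration only if every consent
// category that integration requires has been granted.
function consentAction(event, requiredCategories, defaultPolicy = {}) {
  // Fall back to the organisation's default policy when the event
  // carries no consent signals.
  const consent = event.context?.consent ?? defaultPolicy;
  const granted = requiredCategories.every((c) => consent[c] === true);
  return granted ? "allow" : "drop";
}
```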

Step 3: Bot and Spam Filtering

Filters out non-human traffic using multiple signals:

Signal                  | Method
User-agent patterns     | Match against a curated list of known bot user agents
IP reputation           | Check against known datacenter IP ranges and threat intelligence feeds
Behavioural heuristics  | Detect anomalous patterns (e.g. impossibly fast event sequences)
Missing browser signals | Flag requests missing standard browser headers

Detected bot events are dropped by default. You can optionally route them to a separate topic (bot-events) for analysis by enabling bot logging in the organisation settings.
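The first signal — user-agent pattern matching — reduces to something like the sketch below. The sample patterns are illustrative; the curated production list is far larger.

```javascript
// Illustrative user-agent check; the real curated list is much larger.
const BOT_UA_PATTERNS = [/bot/i, /crawler/i, /spider/i, /headless/i];

function isKnownBotUserAgent(userAgent) {
  return BOT_UA_PATTERNS.some((re) => re.test(userAgent ?? ""));
}
```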

Step 4: PII Detection and Handling

Scans event properties and traits for personally identifiable information. The PII detector supports four handling modes, configurable per field or per pattern:

Mode         | Behaviour
pass-through | PII is sent as-is to downstream integrations
hash         | PII is replaced with a SHA-256 hash (useful for identity matching without exposing raw values)
redact       | PII is replaced with a placeholder (e.g. [REDACTED])
drop         | The entire field is removed from the event

Auto-Detection Patterns

The PII detector automatically identifies:

  • Email addresses (regex pattern matching)
  • Phone numbers (international format detection)
  • IP addresses (IPv4 and IPv6)
  • Credit card numbers (Luhn algorithm validation)
  • Social Security Numbers / national IDs (format matching)
  • Names in known PII fields (firstName, lastName, name, fullName)
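The credit-card check is worth spelling out: a candidate digit string is only treated as card PII if it passes the Luhn checksum, which filters out most random digit runs. A standard implementation:

```javascript
// Luhn checksum: double every second digit from the right, subtract 9
// from any result above 9, and require the total to be divisible by 10.
function passesLuhn(digits) {
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // '0' is char code 48
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return digits.length >= 13 && sum % 10 === 0;
}
```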

Configuration

PII handling is configured at the organisation level:

pii_policy:
  default_action: hash
  rules:
    - pattern: "email"
      fields: ["traits.email", "properties.email", "properties.user_email"]
      action: hash
    - pattern: "phone"
      fields: ["traits.phone", "properties.phone"]
      action: redact
    - pattern: "ip_address"
      fields: ["context.ip"]
      action: pass-through
    - pattern: "credit_card"
      action: drop
⚠️ PII handling at this layer applies to all integrations. Individual integrations may have additional PII requirements configured in their pipeline. For example, Meta CAPI requires hashed email and phone — the pipeline will hash these even if the org policy is pass-through.

Step 5: IP Geolocation Enrichment

Resolves the client IP address to geographic data using a local MaxMind GeoIP2 database. The enrichment adds a geo object to the event context:

{
  "context": {
    "geo": {
      "country": "US",
      "countryCode": "US",
      "region": "California",
      "regionCode": "CA",
      "city": "San Francisco",
      "postalCode": "94102",
      "latitude": 37.7749,
      "longitude": -122.4194,
      "timezone": "America/Los_Angeles"
    }
  }
}

The GeoIP database is updated weekly via an automated download. No external API calls are made during event processing.

Step 6: Device and Browser Parsing

Parses the userAgent string into structured device, browser, and operating system data:

{
  "context": {
    "device": {
      "type": "desktop",
      "brand": "Apple",
      "model": "Macintosh",
      "browser": "Chrome",
      "browserVersion": "121.0.6167.85",
      "os": "macOS",
      "osVersion": "14.3",
      "engine": "Blink",
      "engineVersion": "121.0.6167.85"
    }
  }
}

Device type classification:

Type    | Description
desktop | Desktop or laptop computer
mobile  | Mobile phone
tablet  | Tablet device
tv      | Smart TV or streaming device
bot     | Known bot (also flagged in step 3)
unknown | Unrecognised user agent
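A rough sense of how the classification works, as a sketch only — production parsers rely on large maintained rule sets, not a handful of regexes like these:

```javascript
// Illustrative classification into the device types above; the patterns
// are samples, not the real rule set.
function classifyDeviceType(userAgent) {
  const ua = userAgent ?? "";
  if (/bot|crawler|spider/i.test(ua)) return "bot";
  if (/smart-?tv|appletv|roku/i.test(ua)) return "tv";
  if (/ipad|tablet/i.test(ua)) return "tablet";
  if (/mobi|iphone|android/i.test(ua)) return "mobile";
  if (/windows|macintosh|x11/i.test(ua)) return "desktop";
  return "unknown";
}
```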

Step 7: Session Stitching

Groups events into logical sessions using Redis-backed state. A session is defined by:

  • Same anonymous ID or user ID
  • 30-minute inactivity timeout (configurable per organisation)
  • Midnight boundary — a new session starts at midnight in the user’s timezone

Each event is annotated with session data:

{
  "context": {
    "session": {
      "id": "sess_abc123",
      "eventIndex": 5,
      "startedAt": "2026-02-25T14:00:00.000Z",
      "isNew": false
    }
  }
}

The session stitcher also computes session-level metrics that downstream integrations may need, such as session_engaged for GA4.
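The two boundary rules — inactivity timeout and local midnight — combine into a single new-session decision, sketched here (the function name and signature are illustrative):

```javascript
// A session ends after 30 minutes of inactivity, or when midnight
// passes in the user's timezone.
const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // configurable per organisation

function startsNewSession(lastSeenMs, nowMs, timezone) {
  if (nowMs - lastSeenMs > SESSION_TIMEOUT_MS) return true;
  // "en-CA" formats as YYYY-MM-DD, giving a comparable local calendar day.
  const day = (ms) =>
    new Intl.DateTimeFormat("en-CA", { timeZone: timezone }).format(ms);
  return day(lastSeenMs) !== day(nowMs); // crossed midnight locally
}
```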

Step 8: Identity Resolution

Attaches known identifiers to the event by looking up the anonymous ID and user ID in the Identity Hub (via Redis cache):

{
  "vendorIds": {
    "ga4_client_id": "1234567890.1740000000",
    "fbp": "fb.1.1740000000.987654321",
    "ttp": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  },
  "clickIds": {
    "gclid": "CjwKCAiA...",
    "fbclid": "IwAR3..."
  }
}

Identity resolution:

  1. Looks up the _dfid anonymous ID in the identity graph
  2. Retrieves all associated vendor IDs and click IDs
  3. Merges with any identifiers already present on the event
  4. If a userId is present, ensures the identity graph links it to the anonymous ID
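Step 3 of this sequence can be sketched as a shallow merge. The tie-break here — identifiers already on the event win over cached graph values — is an assumption, since the text does not specify it:

```javascript
// Merge graph-stored identifiers into the event. Assumes identifiers
// already present on the event take precedence over cached values.
function mergeIdentifiers(event, graphEntry) {
  return {
    ...event,
    vendorIds: { ...graphEntry.vendorIds, ...event.vendorIds },
    clickIds: { ...graphEntry.clickIds, ...event.clickIds },
  };
}
```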

See the Identity section for details on the identity graph.

Step 9: Event Deduplication

Checks whether the event has already been processed using the messageId field as an idempotency key. The deduplication window is stored in Redis with a configurable TTL (default: 24 hours).

Scenario                       | Result
New messageId                  | Event is processed; messageId is stored in Redis
Duplicate messageId within TTL | Event is dropped silently
No messageId                   | Event is always processed (no deduplication)

The Datafly.js SDK automatically generates a UUID messageId for every event. For server-side events, you should generate and include your own messageId if you want deduplication protection.
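The decision table above amounts to the logic below. In production this is a single Redis `SET messageId NX EX <ttl>` call; an in-memory Map stands in for Redis here so the sketch is self-contained:

```javascript
// In-memory stand-in for the Redis dedup check; the decision logic is
// the same as the SET ... NX EX call used in production.
const DEDUP_TTL_MS = 24 * 60 * 60 * 1000; // default: 24 hours
const seen = new Map(); // messageId -> expiry timestamp (ms)

function shouldProcess(messageId, nowMs = Date.now()) {
  if (!messageId) return true; // no messageId: no deduplication
  const expiry = seen.get(messageId);
  if (expiry !== undefined && expiry > nowMs) return false; // duplicate
  seen.set(messageId, nowMs + DEDUP_TTL_MS);
  return true;
}
```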

Step 10: Custom JavaScript Execution

Executes organisation-level JavaScript code in a sandboxed V8 isolate. This allows organisations to apply custom business logic that cannot be expressed through configuration alone.

// Example: normalise event names to snake_case
function process(event) {
  if (event.event) {
    event.event = event.event
      .toLowerCase()
      .replace(/\s+/g, "_")
      .replace(/[^a-z0-9_]/g, "");
  }
  return event;
}

Custom code at this layer runs once per event, before pipeline-level transformations. See Custom Code for the full API reference and security model.

Step 11: Event Routing

Determines which integrations should receive this event based on:

  1. Source-to-integration connections — which integrations are wired to the event’s source
  2. Integration-level event filters — per-integration rules (e.g. “only send track events”, “only send events named Purchase”)
  3. Consent state — integrations whose consent category was not granted are excluded

The routing step produces a list of integration IDs. The event is then passed to the Pipeline Transformation Engine once per routed integration.

{
  "_routing": [
    "integration_ga4_001",
    "integration_meta_001",
    "integration_bigquery_001"
  ]
}

If the routing list is empty (no integrations match), the event is still stored for debugging purposes but is not published to any delivery topic.
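The three routing checks compose naturally as successive filters. The integration object shape below (sourceIds, eventFilter, consentCategories) is illustrative, not the actual internal schema:

```javascript
// Sketch of the three routing checks: source wiring, per-integration
// event filters, and consent category exclusion.
function routeEvent(event, integrations, grantedCategories) {
  return integrations
    .filter((i) => i.sourceIds.includes(event.sourceId))
    .filter((i) => !i.eventFilter || i.eventFilter(event))
    .filter((i) => i.consentCategories.every((c) => grantedCategories.includes(c)))
    .map((i) => i.id);
}
```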