Organisation Data Layer
The Organisation Data Layer is the first processing stage applied to every event. It runs tenant-wide — the same rules apply regardless of which integrations the event is destined for. The layer consists of 11 steps executed in sequence.
raw event → Schema Validation → Consent → Bot Filtering → PII
→ Geolocation → Device Parsing → Sessions → Identity
→ Deduplication → Custom JS → Routing → enriched event
Step 1: Schema Validation
Validates that the incoming event conforms to the expected structure. Events must have:
- A valid type field (track, page, identify, group)
- Required fields for the event type (e.g. event name for track events)
- Valid data types for all fields
- Properly formed timestamps (ISO 8601)
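The checks above could be sketched roughly as follows. This is a minimal illustration, not the platform's actual validator: the function names `validateEvent` and `fail`, the `timestamp` field name, the rule that only track events need an `event` field, and the ISO 8601 regular expression are all assumptions made for the example. Data-type checks for other fields are omitted.

```javascript
// Hypothetical validator; names and rules beyond the list above are assumptions.
const EVENT_TYPES = new Set(["track", "page", "identify", "group"]);

// Accepts e.g. 2026-02-25T14:00:00.000Z or 2026-02-25T14:00:00+01:00.
const ISO_8601 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

// Builds a failure result whose error payload follows the dead-letter
// format documented in this section.
function fail(details, originalEvent) {
  return {
    valid: false,
    error: {
      error: "schema_validation_failed",
      details,
      original_event: originalEvent,
    },
  };
}

function validateEvent(event) {
  if (typeof event !== "object" || event === null) {
    return fail("Event must be an object", event);
  }
  if (event.type === undefined) {
    return fail("Field 'type' is required", event);
  }
  if (!EVENT_TYPES.has(event.type)) {
    return fail("Field 'type' has an unsupported value", event);
  }
  // Assumption: track events carry their name in the 'event' field.
  if (event.type === "track" && typeof event.event !== "string") {
    return fail("Field 'event' is required for track events", event);
  }
  // Only validate the timestamp when one is supplied.
  if (event.timestamp !== undefined) {
    const ts = event.timestamp;
    const ok =
      typeof ts === "string" && ISO_8601.test(ts) && !Number.isNaN(Date.parse(ts));
    if (!ok) return fail("Field 'timestamp' must be ISO 8601", event);
  }
  return { valid: true };
}
```

Returning a result object rather than throwing keeps the failing event available to whatever publishes it onward.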
Events that fail validation are sent to a dead-letter topic (dlq-events) with the validation error attached. They are visible in the Management UI’s event debugger.
{
"error": "schema_validation_failed",
"details": "Field 'type' is required",
"original_event": { "..." : "..." }
}
Step 2: Consent Enforcement
Checks the event’s consent state against the organisation’s consent configuration. Each event can carry consent signals in its context:
{
"context": {
"consent": {
"analytics": true,
"marketing": false,
"functional": true
}
}
}
The consent enforcer evaluates this against the consent requirements for each integration category:
| Action | Description |
|---|---|
| Allow | Consent granted for the required categories; event proceeds |
| Drop | Consent not granted; event is discarded entirely |
| Strip | Event proceeds but PII fields are removed |
If no consent data is present on the event, the organisation’s default consent policy applies. This is configurable in the Management UI under Organisation Settings.
Integrations are tagged with consent categories (e.g. Google Analytics = analytics, Meta = marketing). An event is only routed to an integration if the user has consented to that integration’s category.
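One way to picture the per-category check is the sketch below. It is illustrative only: the helper names `effectiveConsent` and `consentAction`, the shape of the default policy object, and the `onDenied` parameter are assumptions, and the source does not say how the platform chooses between Drop and Strip.

```javascript
// Hypothetical helpers; not the platform's real API.

// Uses the consent carried on the event, falling back to the
// organisation's default policy when the event has none.
function effectiveConsent(event, defaultPolicy) {
  return event.context?.consent ?? defaultPolicy ?? {};
}

// Returns "allow", "drop" or "strip" for one integration category.
// onDenied selects the action when consent is missing or false; how the
// real enforcer picks between "drop" and "strip" is an assumption here.
function consentAction(event, category, defaultPolicy, onDenied = "drop") {
  const consent = effectiveConsent(event, defaultPolicy);
  return consent[category] === true ? "allow" : onDenied;
}

// Example: the consent object shown earlier in this step.
const exampleEvent = {
  context: { consent: { analytics: true, marketing: false, functional: true } },
};
const analyticsAction = consentAction(exampleEvent, "analytics", {});
const marketingAction = consentAction(exampleEvent, "marketing", {});
```

With the consent object from this step, an analytics-tagged integration is allowed and a marketing-tagged one is not.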
Step 3: Bot and Spam Filtering
Filters out non-human traffic using multiple signals:
| Signal | Method |
|---|---|
| User-agent patterns | Match against a curated list of known bot user agents |
| IP reputation | Check against known datacenter IP ranges and threat intelligence feeds |
| Behavioural heuristics | Detect anomalous patterns (e.g. impossibly fast event sequences) |
| Missing browser signals | Flag requests missing standard browser headers |
Detected bot events are dropped by default. You can optionally route them to a separate topic (bot-events) for analysis by enabling bot logging in the organisation settings.
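As a rough illustration, two of the signals above (user-agent patterns and missing browser headers) could be combined as shown below. The pattern list, the set of expected headers, the `detectBot` name, and the rule that any single signal marks the request as a bot are all assumptions for the example. IP reputation and behavioural heuristics are left out because they need external data or state.

```javascript
// Illustrative patterns only; a curated production list would be far larger.
const BOT_UA_PATTERNS = [/bot\b/i, /crawler/i, /spider/i, /headlesschrome/i];

// Headers an ordinary browser is assumed to send (lower-cased names).
const EXPECTED_HEADERS = ["user-agent", "accept", "accept-language"];

// Returns the signals that fired. Treating any signal as "bot" is an
// assumption; a real filter may weight or combine signals differently.
function detectBot(request) {
  const headers = request.headers || {};
  const userAgent = headers["user-agent"] || "";
  const signals = [];

  if (BOT_UA_PATTERNS.some((pattern) => pattern.test(userAgent))) {
    signals.push("user_agent");
  }
  if (EXPECTED_HEADERS.some((name) => !headers[name])) {
    signals.push("missing_headers");
  }
  return { isBot: signals.length > 0, signals };
}
```

Returning the list of signals rather than a bare boolean makes the decision easier to inspect when bot events are logged for analysis.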
Step 4: PII Detection and Handling
Scans event properties and traits for personally identifiable information. The PII detector supports four handling modes, configurable per field or per pattern:
| Mode | Behaviour |
|---|---|
| pass-through | PII is sent as-is to downstream integrations |
| hash | PII is replaced with a SHA-256 hash (useful for identity matching without exposing raw values) |
| redact | PII is replaced with a placeholder (e.g. [REDACTED]) |
| drop | The entire field is removed from the event |
Auto-Detection Patterns
The PII detector automatically identifies:
- Email addresses (regex pattern matching)
- Phone numbers (international format detection)
- IP addresses (IPv4 and IPv6)
- Credit card numbers (Luhn algorithm validation)
- Social Security Numbers / national IDs (format matching)
- Names in known PII fields (firstName, lastName, name, fullName)
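To make two of these detectors concrete, the sketch below implements the standard Luhn checksum and a deliberately simple email pattern. The function names `passesLuhn` and `detectPiiType`, the 13 to 19 digit length bound, and the email regular expression are assumptions for illustration; the detector's real patterns are not documented here.

```javascript
// Luhn checksum: from the rightmost digit, double every second digit,
// subtract 9 from any result above 9, and require the total to end in 0.
function passesLuhn(candidate) {
  const digits = String(candidate).replace(/[\s-]/g, "");
  // Assumption: card numbers are 13 to 19 digits long.
  if (!/^\d{13,19}$/.test(digits)) return false;

  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let digit = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return sum % 10 === 0;
}

// Deliberately loose email shape: something@something.something
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Returns "email", "credit_card" or null for a single string value.
function detectPiiType(value) {
  if (typeof value !== "string") return null;
  if (EMAIL_PATTERN.test(value)) return "email";
  if (passesLuhn(value)) return "credit_card";
  return null;
}
```

The checksum filters out most digit strings that merely look like card numbers, such as order IDs, which a length-only pattern would flag.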
Configuration
PII handling is configured at the organisation level:
pii_policy:
  default_action: hash
  rules:
    - pattern: "email"
      fields: ["traits.email", "properties.email", "properties.user_email"]
      action: hash
    - pattern: "phone"
      fields: ["traits.phone", "properties.phone"]
      action: redact
    - pattern: "ip_address"
      fields: ["context.ip"]
      action: pass-through
    - pattern: "credit_card"
      action: drop
PII handling at this layer applies to all integrations. Individual integrations may have additional PII requirements configured in their pipeline. For example, Meta CAPI requires hashed email and phone; the pipeline will hash these even if the org policy is pass-through.
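The four handling modes can be sketched as a function that applies one action to one dotted field path, as used in the `fields` lists above. The name `applyPiiAction` is hypothetical, the event is mutated in place for brevity, and values are hashed without any normalisation (trimming or lower-casing), which a real implementation may well do first.

```javascript
const crypto = require("crypto");

// Applies a single PII action to a dotted path such as "traits.email".
// Paths that do not exist on the event are left alone.
function applyPiiAction(event, fieldPath, action) {
  const keys = fieldPath.split(".");
  const last = keys.pop();

  // Walk down to the object that holds the target field.
  let parent = event;
  for (const key of keys) {
    if (parent === null || typeof parent !== "object" || !(key in parent)) {
      return event;
    }
    parent = parent[key];
  }
  if (parent === null || typeof parent !== "object" || !(last in parent)) {
    return event;
  }

  switch (action) {
    case "pass-through":
      break; // value is left untouched
    case "hash":
      // Hex-encoded SHA-256 of the value as a string.
      parent[last] = crypto
        .createHash("sha256")
        .update(String(parent[last]))
        .digest("hex");
      break;
    case "redact":
      parent[last] = "[REDACTED]";
      break;
    case "drop":
      delete parent[last];
      break;
    default:
      throw new Error(`Unknown PII action: ${action}`);
  }
  return event;
}
```

Because the same input always produces the same hash, hashed values can still be matched against each other downstream, which is the identity-matching use the table mentions.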
Step 5: IP Geolocation Enrichment
Resolves the client IP address to geographic data using a local MaxMind GeoIP2 database. The enrichment adds a geo object to the event context:
{
"context": {
"geo": {
"country": "US",
"countryCode": "US",
"region": "California",
"regionCode": "CA",
"city": "San Francisco",
"postalCode": "94102",
"latitude": 37.7749,
"longitude": -122.4194,
"timezone": "America/Los_Angeles"
}
}
}
The GeoIP database is updated weekly via an automated download. No external API calls are made during event processing.
Step 6: Device and Browser Parsing
Parses the userAgent string into structured device, browser, and operating system data:
{
"context": {
"device": {
"type": "desktop",
"brand": "Apple",
"model": "Macintosh",
"browser": "Chrome",
"browserVersion": "121.0.6167.85",
"os": "macOS",
"osVersion": "14.3",
"engine": "Blink",
"engineVersion": "121.0.6167.85"
}
}
}
Device type classification:
| Type | Description |
|---|---|
| desktop | Desktop or laptop computer |
| mobile | Mobile phone |
| tablet | Tablet device |
| tv | Smart TV or streaming device |
| bot | Known bot (also flagged in step 3) |
| unknown | Unrecognised user agent |
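A heavily simplified version of the device type classification might look like the sketch below. The `classifyDevice` name and every pattern in it are assumptions; real user-agent parsing relies on a much larger, maintained rule set and also extracts brand, model, browser, and OS, which this example does not attempt.

```javascript
// Ordered rules: the first matching entry wins, so more specific types
// come before the generic desktop patterns. Patterns are simplified.
const DEVICE_RULES = [
  ["bot", /bot\b|crawler|spider/i],
  ["tv", /smart-?tv|appletv|googletv|roku|crkey/i],
  // Android without a later "Mobile" token is treated as a tablet.
  ["tablet", /ipad|tablet|android(?!.*mobile)/i],
  ["mobile", /mobi|iphone|ipod|android/i],
  ["desktop", /windows nt|macintosh|x11|linux/i],
];

function classifyDevice(userAgent) {
  if (!userAgent) return "unknown";
  for (const [type, pattern] of DEVICE_RULES) {
    if (pattern.test(userAgent)) return type;
  }
  return "unknown";
}
```

Rule order matters here: many smart TV and Android user agents also contain "Linux", so the desktop rule has to come last.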
Step 7: Session Stitching
Groups events into logical sessions using Redis-backed state. A session is defined by:
- Same anonymous ID or user ID
- 30-minute inactivity timeout (configurable per organisation)
- Midnight boundary — a new session starts at midnight in the user’s timezone
Each event is annotated with session data:
{
"context": {
"session": {
"id": "sess_abc123",
"eventIndex": 5,
"startedAt": "2026-02-25T14:00:00.000Z",
"isNew": false
}
}
}
The session stitcher also computes session-level metrics that downstream integrations may need, such as session_engaged for GA4.
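The session rules above can be sketched with an in-memory `Map` standing in for Redis. Several details are assumptions made for the example: the `stitch` function name, keying on `userId` before `anonymousId`, reading the user's timezone from a hypothetical `context.timezone` field (defaulting to UTC), a zero-based `eventIndex`, and the session ID format.

```javascript
const { randomUUID } = require("crypto");

const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // 30-minute inactivity timeout

// Calendar date of a timestamp in the given IANA timezone, used only to
// check whether two events fall on the same local day.
function localDate(ms, timeZone) {
  return new Intl.DateTimeFormat("en-CA", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).format(new Date(ms));
}

// store: Map from identity key to session state (stand-in for Redis).
function stitch(event, store) {
  const key = event.userId || event.anonymousId; // assumption: userId wins
  const now = Date.parse(event.timestamp);
  const timeZone = event.context?.timezone || "UTC"; // hypothetical field
  const prev = store.get(key);

  // A new session starts on first sight, after the inactivity timeout,
  // or when the local calendar date has changed since the last event.
  const isNew =
    !prev ||
    now - prev.lastSeen > SESSION_TIMEOUT_MS ||
    localDate(now, timeZone) !== localDate(prev.lastSeen, timeZone);

  const state = isNew
    ? { id: `sess_${randomUUID()}`, startedAt: event.timestamp, eventIndex: 0, lastSeen: now }
    : { ...prev, eventIndex: prev.eventIndex + 1, lastSeen: now };
  store.set(key, state);

  const { id, eventIndex, startedAt } = state;
  event.context = { ...event.context, session: { id, eventIndex, startedAt, isNew } };
  return event;
}
```

Two events ten minutes apart share a session, while two events fifteen minutes apart on either side of local midnight do not.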
Step 8: Identity Resolution
Attaches known identifiers to the event by looking up the anonymous ID and user ID in the Identity Hub (via Redis cache):
{
"vendorIds": {
"ga4_client_id": "1234567890.1740000000",
"fbp": "fb.1.1740000000.987654321",
"ttp": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
},
"clickIds": {
"gclid": "CjwKCAiA...",
"fbclid": "IwAR3..."
}
}
Identity resolution:
- Looks up the _dfid anonymous ID in the identity graph
- Retrieves all associated vendor IDs and click IDs
- Merges with any identifiers already present on the event
- If a userId is present, ensures the identity graph links it to the anonymous ID
See the Identity section for details on the identity graph.
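The lookup-and-merge steps listed above might be sketched as follows, with a `Map` standing in for the identity graph. The `resolveIdentity` name, the `anonymousId` field holding the `_dfid` value, the profile shape, and the rule that identifiers already on the event take precedence over stored ones are all assumptions; the source does not specify merge precedence.

```javascript
// identityGraph: Map from anonymous ID to a profile of the form
// { vendorIds: {}, clickIds: {}, userIds: Set } (hypothetical shape).
function resolveIdentity(event, identityGraph) {
  const anonymousId = event.anonymousId;
  const profile = identityGraph.get(anonymousId) || {
    vendorIds: {},
    clickIds: {},
    userIds: new Set(),
  };

  // Assumption: values already on the event override stored values.
  event.vendorIds = { ...profile.vendorIds, ...event.vendorIds };
  event.clickIds = { ...profile.clickIds, ...event.clickIds };

  // Link the known user to this anonymous ID in the graph.
  if (event.userId) profile.userIds.add(event.userId);
  identityGraph.set(anonymousId, profile);

  return event;
}
```

Spreading the stored identifiers first and the event's own identifiers second is what gives the event precedence in this sketch; reversing the order would make the graph authoritative instead.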
Step 9: Event Deduplication
Checks whether the event has already been processed using the messageId field as an idempotency key. The deduplication window is stored in Redis with a configurable TTL (default: 24 hours).
| Scenario | Result |
|---|---|
| New messageId | Event is processed; messageId is stored in Redis |
| Duplicate messageId within TTL | Event is dropped silently |
| No messageId | Event is always processed (no deduplication) |
The Datafly.js SDK automatically generates a UUID messageId for every event. For server-side events, you should generate and include your own messageId if you want deduplication protection.
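The three scenarios in the table can be modelled with a small in-memory store. This is a sketch under stated assumptions: the `DedupStore` class is hypothetical and uses a `Map` of expiry times in place of Redis. Against Redis itself, a similar effect is commonly achieved with a single `SET` using the `NX` and `EX` options, though the source does not say which commands the platform uses.

```javascript
const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // default window: 24 hours

class DedupStore {
  constructor(ttlMs = DEFAULT_TTL_MS) {
    this.ttlMs = ttlMs;
    this.expiry = new Map(); // messageId -> expiry time in ms
  }

  // Returns true when the event should be processed and false when it
  // is a duplicate seen within the TTL window.
  shouldProcess(messageId, now = Date.now()) {
    // No messageId: always process, nothing is recorded.
    if (!messageId) return true;

    const expiresAt = this.expiry.get(messageId);
    if (expiresAt !== undefined && expiresAt > now) return false;

    // First sighting, or the earlier record has expired.
    this.expiry.set(messageId, now + this.ttlMs);
    return true;
  }
}
```

Passing `now` as a parameter keeps the window logic testable without waiting for real time to pass.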
Step 10: Custom JavaScript Execution
Executes organisation-level JavaScript code in a sandboxed V8 isolate. This allows organisations to apply custom business logic that cannot be expressed through configuration alone.
// Example: normalise event names to snake_case
function process(event) {
if (event.event) {
event.event = event.event
.toLowerCase()
.replace(/\s+/g, "_")
.replace(/[^a-z0-9_]/g, "");
}
return event;
}
Custom code at this layer runs once per event, before pipeline-level transformations. See Custom Code for the full API reference and security model.
Step 11: Event Routing
Determines which integrations should receive this event based on:
- Source-to-integration connections — which integrations are wired to the event’s source
- Integration-level event filters — per-integration rules (e.g. “only send track events”, “only send events named Purchase”)
- Consent state — integrations whose consent category was not granted are excluded
The routing step produces a list of integration IDs. The event is then passed to the Pipeline Transformation Engine once per routed integration.
{
"_routing": [
"integration_ga4_001",
"integration_meta_001",
"integration_bigquery_001"
]
}
If the routing list is empty (no integrations match), the event is still stored for debugging purposes but is not published to any delivery topic.
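The three routing criteria can be read as successive filters over the organisation's integrations. The sketch below assumes a hypothetical integration shape (`id`, `sources`, `consentCategory`, and an optional `filter` predicate), a `sourceId` field on the event, and a plain consent object; none of these names come from the platform's actual configuration.

```javascript
// Returns the IDs of the integrations that should receive the event.
function routeEvent(event, integrations, consent) {
  return integrations
    // 1. Source-to-integration connections
    .filter((integration) => integration.sources.includes(event.sourceId))
    // 2. Integration-level event filters (no filter means accept all)
    .filter((integration) => !integration.filter || integration.filter(event))
    // 3. Consent state for the integration's category
    .filter((integration) => consent[integration.consentCategory] === true)
    .map((integration) => integration.id);
}

// Example configuration (IDs reuse the sample above; the rest is made up).
const exampleIntegrations = [
  { id: "integration_ga4_001", sources: ["web"], consentCategory: "analytics" },
  {
    id: "integration_meta_001",
    sources: ["web"],
    consentCategory: "marketing",
    filter: (e) => e.type === "track" && e.event === "Purchase",
  },
  { id: "integration_bigquery_001", sources: ["server"], consentCategory: "analytics" },
];

const exampleRouting = routeEvent(
  { sourceId: "web", type: "track", event: "Purchase" },
  exampleIntegrations,
  { analytics: true, marketing: false }
);
```

In this example only `integration_ga4_001` is routed: the Meta integration is excluded by consent and the BigQuery integration is not connected to the `web` source.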