Bot Filtering

Bot filtering detects and removes non-human traffic before events reach your integrations. It runs as Step 3 of the Organisation Data Layer, scoring each event across multiple detection layers and either passing, flagging, or dropping it based on configurable thresholds.

All detection happens server-side — no changes to the Datafly.js collector are required.

How scoring works

Each event is scored from 0.0 (clean) to 1.0 (bot). The score is built by combining signals from three detection layers. Once the final score is computed, the event is classified into one of three actions:

Score range | Action | Behaviour
--- | --- | ---
Below flag threshold (default 0.3) | Pass | Event proceeds normally through the pipeline
Between flag and block threshold (default 0.7) | Flag | Event is tagged with $is_bot = true but still delivered to integrations
Above block threshold | Drop | Event is removed entirely and counted as bot_filtered

Flagged events are delivered to your integrations with a $is_bot property set to true. This lets you filter them out in your analytics tools if needed, without losing the data entirely.

Detection layers

Layer 1: IAB Pattern Matching

Matches the event’s user-agent string against 160+ known bot patterns sourced from the IAB/ABC International Spiders & Bots List. This includes:

  • Search engine crawlers (Googlebot, Bingbot, etc.)
  • SEO tools (Ahrefs, SEMrush, Moz, etc.)
  • Monitoring services (Pingdom, UptimeRobot, etc.)
  • Known scrapers and headless browsers

Scoring: A direct match against a known bot pattern scores 1.0 — the event is dropped immediately regardless of thresholds.

You can add your own custom patterns under Settings → Bot Filtering → Detection Layers → IAB Patterns. Custom patterns support wildcards (e.g. my-internal-bot*).
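Wildcard patterns like my-internal-bot* behave like shell-style globs, which can be sketched with the standard fnmatch module (the pattern list, user-agent string, and case-insensitive matching are illustrative assumptions):

```python
from fnmatch import fnmatch

def matches_custom_pattern(user_agent: str, patterns: list[str]) -> bool:
    """True if the user-agent matches any custom bot pattern.

    fnmatch gives shell-style wildcards: * matches any run of characters.
    Matching is lowercased here (an assumption), since UA casing varies.
    """
    ua = user_agent.lower()
    return any(fnmatch(ua, p.lower()) for p in patterns)

matches_custom_pattern("My-Internal-Bot/2.1", ["my-internal-bot*"])  # True
```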

Layer 2: Signal Validation

Checks for the presence of browser signals that real users produce but bots typically lack:

Signal | Score if missing
--- | ---
Screen dimensions (context.screen) | +0.3
Timezone (context.timezone) | +0.1
Locale / language (context.locale) | +0.1
User-agent string | +0.3

A headless bot with no screen dimensions, no timezone, and no locale would score 0.5 — enough to be flagged but not dropped at default thresholds.

Each signal check can be individually enabled or disabled.
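The per-signal weights combine additively, so the headless-bot example above works out like this (weights are from the table; the field names and function are an illustrative sketch):

```python
# Score added when each signal is missing (weights from the table above)
SIGNAL_WEIGHTS = {
    "screen": 0.3,      # context.screen
    "timezone": 0.1,    # context.timezone
    "locale": 0.1,      # context.locale
    "user_agent": 0.3,  # user-agent string
}

def signal_validation_score(context: dict) -> float:
    """Sum the weights of every signal the event failed to provide."""
    return round(sum(w for sig, w in SIGNAL_WEIGHTS.items()
                     if not context.get(sig)), 2)

# Headless bot: UA present, but no screen, timezone, or locale -> 0.5
signal_validation_score({"user_agent": "Mozilla/5.0"})  # 0.5
```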

Layer 3: IP Classification

Classifies the source IP address against known infrastructure ranges:

Classification | Score
--- | ---
Datacenter IP (AWS, GCP, Azure, etc.) | +0.4
Tor exit node | +0.5
Spamhaus-listed IP | +0.8

Datacenter IPs are common for automated tools and scrapers. A datacenter IP combined with missing browser signals can push the score to the default block threshold or beyond.

⚠️ IP classification scores are additive with other layers. An event from a datacenter IP (0.4) that is also missing screen dimensions (0.3) would score 0.7 — right at the default block threshold.
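Putting the layers together, the additive combination works out as below (the layer values come from the tables on this page; the cap at 1.0 is an assumption based only on the stated 0.0–1.0 score range):

```python
def combined_score(layer_scores: list[float]) -> float:
    """Add the per-layer scores, capping at 1.0 (the maximum bot score).

    The cap is an assumption: the page states only that scores range
    from 0.0 to 1.0 and that layer scores are additive.
    """
    return min(round(sum(layer_scores), 2), 1.0)

# Datacenter IP (+0.4) plus missing screen dimensions (+0.3) -> 0.7,
# right at the default block threshold
combined_score([0.4, 0.3])  # 0.7
```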

Configuration

Bot filtering is configured per organisation under Settings → Bot Filtering in the Management UI.

Master toggle

Enable or disable bot filtering entirely. When disabled, all events pass through unscored.

Thresholds

Setting | Default | Description
--- | --- | ---
Flag threshold | 0.3 | Events scoring above this are tagged $is_bot = true but still delivered
Block threshold | 0.7 | Events scoring above this are dropped

Both values must be between 0.0 and 1.0, and the flag threshold must be lower than the block threshold.
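A configuration check for these constraints might look like this (the function is illustrative, not part of Datafly's API):

```python
def validate_thresholds(flag: float, block: float) -> None:
    """Raise ValueError unless 0.0 <= flag < block <= 1.0."""
    if not (0.0 <= flag <= 1.0 and 0.0 <= block <= 1.0):
        raise ValueError("thresholds must be between 0.0 and 1.0")
    if flag >= block:
        raise ValueError("flag threshold must be lower than block threshold")

validate_thresholds(0.3, 0.7)  # the defaults: valid
```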

Detection layers

Each detection layer can be independently enabled or disabled:

  • IAB Patterns — user-agent pattern matching against known bots
  • Signal Validation — checks for screen dimensions, timezone, locale, and user-agent
  • IP Classification — datacenter, Tor, and Spamhaus IP detection

Behaviour

Setting | Default | Description
--- | --- | ---
Log blocked events | Off | When enabled, dropped bot events are published to a bot-events Kafka topic for analysis
Server-side bypass | On | Server-to-server events authenticated with HMAC skip bot detection entirely

Server-side bypass should normally be left enabled. Server-to-server events (e.g. from your backend via the Server-Side API) are authenticated and trusted — applying bot detection to them would produce false positives since they lack browser signals by design.
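As a rough illustration of HMAC authentication for server-to-server events, a sender might sign the request body like this (the signing scheme, secret handling, and verification flow are assumptions for illustration, not Datafly's documented protocol):

```python
import hashlib
import hmac

def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the request body (illustrative scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_body(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison against the presented signature."""
    return hmac.compare_digest(sign_body(secret, body), signature)

sig = sign_body(b"shared-secret", b'{"event": "purchase"}')
verify_body(b"shared-secret", b'{"event": "purchase"}', sig)  # True
```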

Allowlists

You can exempt specific traffic from bot detection:

  • Allowlisted User Agents — user-agent strings or wildcard patterns (e.g. Datadog Synthetics*) that should bypass bot filtering
  • Allowlisted IP Ranges — CIDR notation (e.g. 10.0.0.0/8, 192.168.1.0/24) for trusted internal or monitoring IPs
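CIDR allowlists like the examples above can be checked with the standard ipaddress module; a minimal sketch (ranges and addresses are illustrative):

```python
import ipaddress

def ip_allowlisted(ip: str, cidrs: list[str]) -> bool:
    """True if the source IP falls inside any allowlisted CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

ip_allowlisted("10.1.2.3", ["10.0.0.0/8", "192.168.1.0/24"])  # True
```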

API

Bot filtering can also be managed via the Management API:

# Get current config
curl -H "Authorization: Bearer $TOKEN" \
  https://your-api.example.com/v1/admin/org/bot-filtering
 
# Update config
curl -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "flag_threshold": 0.3,
    "block_threshold": 0.7,
    "log_blocked_events": false,
    "server_side_bypass": true,
    "layers": {
      "iab_patterns": { "enabled": true, "custom_patterns": [] },
      "signal_validation": {
        "enabled": true,
        "require_screen_dimensions": true,
        "require_timezone": true,
        "require_user_agent": true
      },
      "ip_classification": { "enabled": true }
    },
    "allowlisted_user_agents": [],
    "allowlisted_cidrs": []
  }' \
  https://your-api.example.com/v1/admin/org/bot-filtering

Monitoring bot traffic

Once bot filtering is enabled, you can monitor detected bot traffic in two places:

  1. Observe → Bot Traffic — dedicated page showing bot events over time, bot rate percentage, and the top event names being filtered
  2. Observe → Analytics — the main analytics page includes bot_filtered as a status in the event breakdown

Bot event logging

When Log blocked events is enabled, dropped bot events are published to a bot-events Kafka topic. This allows you to:

  • Attach a data warehouse integration (e.g. BigQuery, Snowflake) to analyse bot patterns
  • Build custom alerting on bot traffic spikes
  • Audit what’s being filtered

The bot events retain all original event data plus the computed bot score and detection details.

Bot event logging uses Datafly’s existing delivery infrastructure. To store bot events in a warehouse, create a pipeline that consumes from the bot-events topic and routes to your preferred storage integration.