Bot Filtering
Bot filtering detects and removes non-human traffic before events reach your integrations. It runs as Step 3 of the Organisation Data Layer, scoring each event across multiple detection layers and either passing, flagging, or dropping it based on configurable thresholds.
All detection happens server-side — no changes to the Datafly.js collector are required.
How scoring works
Each event is scored from 0.0 (clean) to 1.0 (bot). The score is built by combining signals from three detection layers. Once the final score is computed, the event is classified into one of three actions:
| Score range | Action | Behaviour |
|---|---|---|
| Below flag threshold (default 0.3) | Pass | Event proceeds normally through the pipeline |
| Between flag and block threshold (default 0.7) | Flag | Event is tagged with $is_bot = true but still delivered to integrations |
| Above block threshold | Drop | Event is removed entirely and counted as bot_filtered |
Flagged events are delivered to your integrations with a $is_bot property set to true. This lets you filter them out in your analytics tools if needed, without losing the data entirely.
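The decision rule above can be sketched as a small function. The thresholds match the documented defaults; the function name, event shape, and exact handling at the threshold boundaries are illustrative assumptions (the real pipeline runs server-side and is not exposed as code):

```python
def classify(event: dict, score: float,
             flag_threshold: float = 0.3,
             block_threshold: float = 0.7):
    """Pass, flag, or drop a scored event.

    Returns the (possibly tagged) event, or None when it is dropped.
    Behaviour at exactly a threshold is an assumption, not documented.
    """
    if score > block_threshold:
        return None  # dropped and counted as bot_filtered
    if score > flag_threshold:
        # Flagged: tagged with $is_bot but still delivered to integrations
        return {**event, "$is_bot": True}
    return event  # clean: proceeds normally through the pipeline
```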
Detection layers
Layer 1: IAB Pattern Matching
Matches the event’s user-agent string against 160+ known bot patterns sourced from the IAB/ABC International Spiders & Bots List. This includes:
- Search engine crawlers (Googlebot, Bingbot, etc.)
- SEO tools (Ahrefs, SEMrush, Moz, etc.)
- Monitoring services (Pingdom, UptimeRobot, etc.)
- Known scrapers and headless browsers
Scoring: A direct match against a known bot pattern scores 1.0 — the event is dropped immediately regardless of thresholds.
You can add your own custom patterns under Settings → Bot Filtering → Detection Layers → IAB Patterns. Custom patterns support wildcards (e.g. my-internal-bot*).
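Wildcard patterns of the my-internal-bot* kind behave like shell-style globs. A minimal sketch using Python's fnmatch — the glob semantics and case-insensitivity are assumptions here, not Datafly's documented matching rules:

```python
from fnmatch import fnmatch

def matches_custom_pattern(user_agent: str, patterns: list) -> bool:
    """Return True if the user-agent matches any custom bot pattern.

    Shell-style wildcards ('*', '?'); case-insensitive comparison is
    an assumption about the matching semantics.
    """
    ua = user_agent.lower()
    return any(fnmatch(ua, pattern.lower()) for pattern in patterns)
```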
Layer 2: Signal Validation
Checks for the presence of browser signals that real users produce but bots typically lack:
| Signal | Score if missing |
|---|---|
| Screen dimensions (context.screen) | +0.3 |
| Timezone (context.timezone) | +0.1 |
| Locale / language (context.locale) | +0.1 |
| User-agent string | +0.3 |
A headless bot with no screen dimensions, no timezone, and no locale would score 0.5 — enough to be flagged but not dropped at default thresholds.
Each signal check can be individually enabled or disabled.
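The per-signal penalties combine additively. A sketch of the layer's scoring — the weights come from the table above, while the helper and the context key names are illustrative:

```python
# Penalty applied when a signal is missing, per the table above.
SIGNAL_PENALTIES = {
    "screen": 0.3,      # context.screen
    "timezone": 0.1,    # context.timezone
    "locale": 0.1,      # context.locale
    "user_agent": 0.3,  # user-agent string
}

def signal_validation_score(context: dict) -> float:
    """Sum the penalties for every signal absent from the event context."""
    return sum(penalty for signal, penalty in SIGNAL_PENALTIES.items()
               if not context.get(signal))
```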
Layer 3: IP Classification
Classifies the source IP address against known infrastructure ranges:
| Classification | Score |
|---|---|
| Datacenter IP (AWS, GCP, Azure, etc.) | +0.4 |
| Tor exit node | +0.5 |
| Spamhaus-listed IP | +0.8 |
Datacenter IPs are common for automated tools and scrapers. IP classification scores are additive with the other layers, so an event from a datacenter IP (0.4) that is also missing screen dimensions (0.3) scores 0.7 — right at the default block threshold.
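The additive combination can be pictured as a one-liner; capping the total at 1.0 is an assumption for illustration:

```python
def combined_score(layer_scores) -> float:
    """Add the per-layer scores; the 1.0 cap is an assumption."""
    return min(sum(layer_scores), 1.0)
```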
Configuration
Bot filtering is configured per organisation under Settings → Bot Filtering in the Management UI.
Master toggle
Enable or disable bot filtering entirely. When disabled, all events pass through unscored.
Thresholds
| Setting | Default | Description |
|---|---|---|
| Flag threshold | 0.3 | Events scoring above this (but not above the block threshold) are tagged $is_bot = true and still delivered |
| Block threshold | 0.7 | Events scoring above this are dropped |
Both values must be between 0.0 and 1.0, and the flag threshold must be lower than the block threshold.
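Those constraints are easy to check before submitting a config update. A sketch (the authoritative validation happens server-side):

```python
def thresholds_valid(flag_threshold: float, block_threshold: float) -> bool:
    """Both thresholds in [0.0, 1.0], flag strictly below block."""
    return (0.0 <= flag_threshold <= 1.0
            and 0.0 <= block_threshold <= 1.0
            and flag_threshold < block_threshold)
```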
Detection layers
Each detection layer can be independently enabled or disabled:
- IAB Patterns — user-agent pattern matching against known bots
- Signal Validation — checks for screen dimensions, timezone, locale, and user-agent
- IP Classification — datacenter, Tor, and Spamhaus IP detection
Behaviour
| Setting | Default | Description |
|---|---|---|
| Log blocked events | Off | When enabled, dropped bot events are published to a bot-events Kafka topic for analysis |
| Server-side bypass | On | Server-to-server events authenticated with HMAC skip bot detection entirely |
Server-side bypass should normally be left enabled. Server-to-server events (e.g. from your backend via the Server-Side API) are authenticated and trusted — applying bot detection to them would produce false positives since they lack browser signals by design.
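Conceptually, the bypass is an authentication check that runs before scoring. In the sketch below, the X-Signature header name and the HMAC-SHA256-over-body scheme are illustrative assumptions, not Datafly's documented scheme — consult the Server-Side API docs for the real one:

```python
import hashlib
import hmac

def should_bypass_detection(headers: dict, body: bytes, secret: bytes) -> bool:
    """Skip bot detection for HMAC-authenticated server-to-server events.

    Header name and signing scheme are assumptions for illustration.
    """
    signature = headers.get("X-Signature")
    if not signature:
        return False  # browser traffic: run bot detection as usual
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the signature via timing
    return hmac.compare_digest(signature, expected)
```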
Allowlists
You can exempt specific traffic from bot detection:
- Allowlisted User Agents — user-agent strings or wildcard patterns (e.g. Datadog Synthetics*) that should bypass bot filtering
- Allowlisted IP Ranges — CIDR notation (e.g. 10.0.0.0/8, 192.168.1.0/24) for trusted internal or monitoring IPs
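CIDR allowlisting maps directly onto the standard library's ipaddress module; a minimal sketch (the function name is illustrative):

```python
from ipaddress import ip_address, ip_network

def ip_allowlisted(ip: str, cidrs: list) -> bool:
    """True if the source IP falls inside any allowlisted CIDR range."""
    addr = ip_address(ip)
    return any(addr in ip_network(cidr) for cidr in cidrs)
```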
API
Bot filtering can also be managed via the Management API:
```
# Get current config
curl -H "Authorization: Bearer $TOKEN" \
  https://your-api.example.com/v1/admin/org/bot-filtering

# Update config
curl -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "flag_threshold": 0.3,
    "block_threshold": 0.7,
    "log_blocked_events": false,
    "server_side_bypass": true,
    "layers": {
      "iab_patterns": { "enabled": true, "custom_patterns": [] },
      "signal_validation": {
        "enabled": true,
        "require_screen_dimensions": true,
        "require_timezone": true,
        "require_user_agent": true
      },
      "ip_classification": { "enabled": true }
    },
    "allowlisted_user_agents": [],
    "allowlisted_cidrs": []
  }' \
  https://your-api.example.com/v1/admin/org/bot-filtering
```

Monitoring bot traffic
Once bot filtering is enabled, you can monitor detected bot traffic in two places:
- Observe → Bot Traffic — dedicated page showing bot events over time, bot rate percentage, and the top event names being filtered
- Observe → Analytics — the main analytics page includes bot_filtered as a status in the event breakdown
Bot event logging
When Log blocked events is enabled, dropped bot events are published to a bot-events Kafka topic. This allows you to:
- Attach a data warehouse integration (e.g. BigQuery, Snowflake) to analyse bot patterns
- Build custom alerting on bot traffic spikes
- Audit what’s being filtered
The bot events retain all original event data plus the computed bot score and detection details.
Bot event logging uses Datafly’s existing delivery infrastructure. To store bot events in a warehouse, create a pipeline that consumes from the bot-events topic and routes to your preferred storage integration.
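A minimal consumer for that pipeline might look like the sketch below. The kafka-python client, broker address, group id, and record field names (event_name, bot_score) are all assumptions about the topic's schema:

```python
import json

def decode_bot_event(raw: bytes) -> dict:
    """Decode one record from the bot-events topic.

    The payload schema (original event fields plus the computed bot
    score and detection details) is an assumption for illustration.
    """
    return json.loads(raw.decode("utf-8"))

if __name__ == "__main__":
    # Requires the third-party kafka-python package and a reachable broker.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "bot-events",
        bootstrap_servers="localhost:9092",  # assumption
        group_id="bot-audit",                # assumption
    )
    for message in consumer:
        event = decode_bot_event(message.value)
        print(event.get("event_name"), event.get("bot_score"))
```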