
Scaling for High Volume

Signal scales via configuration — the same code and Docker images handle 1,000 events/sec and 2,000,000+ events/sec. The difference is environment variable settings and infrastructure sizing.

Throughput Tiers

| Tier | Partitions | Concurrency | Delivery Mode | Max Events/sec |
|------|------------|-------------|---------------|----------------|
| Default | 3 | 3 | Single topic, 10 workers | 100,000–150,000 |
| Medium | 12 | 12 | Single topic, 50 workers | 500,000 |
| Large | 48 | 48 | Per-vendor topics, 100 workers each | 2,000,000+ |
| Maximum | 96 | 96 | Per-vendor topics, 200 workers each | 5,000,000+ |

The default configuration handles most customers. Only move to a larger tier when you’re consistently above 50,000 events/sec.

Scaling Environment Variables

These variables control throughput. All are optional — defaults are safe for small-to-medium deployments.

Kafka & Processing

| Variable | Default | Description |
|----------|---------|-------------|
| KAFKA_NUM_PARTITIONS | 3 | Number of partitions per Kafka topic. Set in the infrastructure compose file. More partitions = more concurrent consumers. |
| CONSUMER_CONCURRENCY | 3 | Number of concurrent message handlers in the event processor. Should match the partition count. |
| DELIVERY_CONCURRENCY | 10 | Number of concurrent delivery goroutines in the delivery workers. |
| REDIS_POOL_SIZE | 100 | Redis connection pool size. Increase for high-concurrency deployments. |
| DB_MAX_CONNS | 50 | PostgreSQL max connection pool size. |
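
As a sketch, a Medium-tier deployment (12 partitions, 50 delivery workers) might set these in its compose file. The service names and the pool-size values here are illustrative, not prescribed:

```yaml
services:
  kafka:
    environment:
      KAFKA_NUM_PARTITIONS: "12"
  event-processor:
    environment:
      CONSUMER_CONCURRENCY: "12"   # match the partition count
      REDIS_POOL_SIZE: "200"
      DB_MAX_CONNS: "100"
  delivery-worker:
    environment:
      DELIVERY_CONCURRENCY: "50"
```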

Per-Vendor Delivery Topics

| Variable | Default | Description |
|----------|---------|-------------|
| DELIVERY_TOPIC_MODE | single | Set to per_vendor to publish delivery events to separate topics per vendor type (e.g. delivery-ga4, delivery-meta_capi). |
| VENDOR_TYPE | (empty) | When DELIVERY_TOPIC_MODE=per_vendor, set this on each delivery worker instance to consume from a specific vendor topic (e.g. ga4, meta_capi, tiktok). |

When DELIVERY_TOPIC_MODE=per_vendor, deploy one delivery worker instance per vendor type. Each instance sets VENDOR_TYPE to its vendor. This allows independent scaling — a slow vendor (rate-limited API) doesn’t block other vendors.
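
A per-vendor deployment could look like the following compose sketch. The image name and service names are illustrative assumptions; only the two environment variables come from the table above:

```yaml
services:
  delivery-ga4:
    image: signal/delivery-worker   # illustrative image name
    environment:
      DELIVERY_TOPIC_MODE: per_vendor
      VENDOR_TYPE: ga4              # consumes delivery-ga4 only
  delivery-meta:
    image: signal/delivery-worker
    environment:
      DELIVERY_TOPIC_MODE: per_vendor
      VENDOR_TYPE: meta_capi        # consumes delivery-meta_capi only
```

Each service can then be given its own replica count, so a rate-limited vendor backs up only its own topic.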

Bot Pre-Filter

| Variable | Default | Description |
|----------|---------|-------------|
| BOT_PREFILTER | false | Set to true to run bot detection in a separate service before the main event processor. |

When enabled, deploy the botfilter binary (from event-processor/cmd/botfilter/) as a separate service. It reads from raw-events, drops bots, and publishes clean events to filtered-events. The main event processor then consumes from filtered-events instead of raw-events.

This is beneficial when bot traffic exceeds 30% of total volume — the event processor then handles only real events, reducing its CPU load and downstream Kafka throughput.
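
A compose sketch of that topology might look like this. The build path is taken from the text above; which services read BOT_PREFILTER and the exact wiring are assumptions:

```yaml
services:
  botfilter:
    build: ./event-processor/cmd/botfilter
    # Reads raw-events, drops bots, publishes clean events to filtered-events.
  event-processor:
    environment:
      BOT_PREFILTER: "true"   # assumed to switch its input topic to filtered-events
```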

Adaptive Delivery

The delivery workers automatically manage vendor rate limits. No configuration needed.

| Vendor | Rate Limit | Batch Size |
|--------|------------|------------|
| GA4 Measurement Protocol | 1,000/sec | 25 events/request |
| Meta Conversions API | 2,000/sec | 1,000 events/request |
| TikTok Events API | 500/sec | 100 events/request |
| Amplitude | 1,000/sec | 2,000 events/request |
| Google BigQuery | 50,000/sec | 10,000 rows/request |
| Webhooks | Unlimited | 1 event/request |

Under normal traffic, events are delivered immediately (real-time). When the send rate approaches 80% of a vendor’s limit, events are automatically buffered and sent at the vendor’s maximum safe rate. This prevents 429 (Too Many Requests) errors and wasted retries.
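
The buffering behaviour can be sketched as a token bucket with a soft threshold at 80% of the vendor limit. This is a minimal illustration of the idea, not Signal's actual implementation; the type and function names are invented:

```go
package main

import (
	"fmt"
	"time"
)

// vendorLimiter sketches adaptive pacing: a token bucket refilled at the
// vendor's rate limit, with a 20% reserve so bursts never reach the hard
// ceiling. Below the reserve, callers buffer instead of sending.
type vendorLimiter struct {
	limitPerSec float64   // vendor rate limit (tokens/sec)
	tokens      float64   // current tokens
	last        time.Time // last refill time
}

func newVendorLimiter(limit float64) *vendorLimiter {
	return &vendorLimiter{limitPerSec: limit, tokens: limit, last: time.Now()}
}

// allow refills the bucket and reports whether an event may be sent
// immediately; false means the caller should buffer the event and drain
// it later at the vendor's safe rate.
func (v *vendorLimiter) allow() bool {
	now := time.Now()
	v.tokens += now.Sub(v.last).Seconds() * v.limitPerSec
	if v.tokens > v.limitPerSec {
		v.tokens = v.limitPerSec
	}
	v.last = now
	// Keep a 20% reserve: once tokens fall below it, switch to buffering.
	if v.tokens < 0.2*v.limitPerSec {
		return false
	}
	v.tokens--
	return true
}

func main() {
	lim := newVendorLimiter(1000) // e.g. GA4: 1,000/sec
	sent, buffered := 0, 0
	for i := 0; i < 2000; i++ { // a sudden burst of 2,000 events
		if lim.allow() {
			sent++
		} else {
			buffered++
		}
	}
	fmt.Println("sent immediately:", sent, "buffered:", buffered)
}
```

In the burst above, roughly the first 80% of the bucket is sent immediately and the remainder is buffered, which is exactly the cutover the text describes.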

GKE Sizing Guide

Recommended infrastructure sizing by customer volume:

| Size | Events/month | Kafka | PostgreSQL | Redis | Processors | Est. GCP Cost |
|------|--------------|-------|------------|-------|------------|---------------|
| Small | Up to 1M | Single broker, 1 partition | Cloud SQL Basic (1 vCPU) | Memorystore 1GB | 1 replica | ~$150/mo |
| Medium | 1–10M | Single broker, 3 partitions | Cloud SQL Standard (2 vCPU) | Memorystore 2GB | 2 replicas | ~$400/mo |
| Large | 10–100M | 3-broker cluster, 12 partitions | Cloud SQL HA (4 vCPU) | Memorystore 4GB | 6 replicas | ~$1,200/mo |
| Enterprise | 100M+ | 5-broker cluster, 48 partitions | Cloud SQL HA (8 vCPU) | Redis Cluster 8GB | 24 replicas | ~$3,500/mo |

Cost Optimisation

Spot Instances

Event processor, delivery workers, and ingestion gateway are stateless — they restart cleanly and are ideal for GKE spot (preemptible) nodes. This saves 60–80% on compute costs.

Only Kafka, PostgreSQL, and Redis need stable on-demand instances.
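
On GKE, pinning a stateless Deployment to spot nodes takes a node selector plus a toleration for the spot taint. A minimal pod-spec fragment:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
```

Apply this only to the event processor, delivery workers, and ingestion gateway; leave the stateful services on on-demand node pools.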

Autoscaling

Use Kubernetes HPA (Horizontal Pod Autoscaler) or KEDA for Kafka lag-based scaling:

| Service | Min Replicas | Max Replicas | Scale Metric |
|---------|--------------|--------------|--------------|
| Ingestion Gateway | 2 | 20 | CPU > 60% |
| Event Processor | 1 | 48 | Kafka consumer lag > 1000 |
| Delivery Workers | 1 | 48 | Kafka consumer lag > 1000 |
| Management API | 1 | 3 | CPU > 70% |

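For lag-based scaling, a KEDA ScaledObject for the event processor might look like this (the consumer group name and broker address are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-processor
spec:
  scaleTargetRef:
    name: event-processor       # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 48
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: event-processor   # assumed group name
        topic: raw-events
        lagThreshold: "1000"
```

KEDA scales the Deployment up whenever lag per partition exceeds the threshold, and back down to the minimum when the backlog clears.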
Kafka Tiered Storage

For high-volume deployments, move Kafka data older than 24 hours to cloud object storage (GCS/S3):

  • Hot tier (0–24h): SSD — fast reads for real-time consumers
  • Cold tier (1–7d): Object storage — cheap storage for replay/recovery

Confluent for Kubernetes supports this natively. For open-source Kafka 3.6+, use remote storage plugins.
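
With open-source Kafka 3.6+ (KIP-405 tiered storage), the split above maps to broker and topic configuration roughly as follows; a remote storage plugin for GCS/S3 must also be configured, which is omitted here:

```properties
# Broker: enable the tiered storage subsystem.
remote.log.storage.system.enable=true

# Per topic: keep 24h locally on SSD, 7 days total (remainder in object storage).
# Applied via kafka-configs --alter --topic <topic> --add-config ...
remote.storage.enable=true
local.retention.ms=86400000
retention.ms=604800000
```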

Monitoring

Key metrics to watch:

| Metric | Warning Threshold | Action |
|--------|-------------------|--------|
| Kafka consumer lag (event-processor) | > 5,000 messages | Scale up event processor replicas or increase concurrency |
| Kafka consumer lag (delivery-workers) | > 10,000 messages | Scale up delivery workers or check vendor API health |
| Redis memory usage | > 80% of max | Increase Redis instance size or check for TTL issues |
| Event processor CPU | > 70% sustained | Scale up replicas |
| Delivery worker 429 responses | > 1% of requests | Check vendor rate limits, enable adaptive buffering |
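
The lag threshold can be wired into a Prometheus alert. This sketch assumes the kafka-exporter metric kafka_consumergroup_lag and an event-processor consumer group name; adjust both to your setup:

```yaml
groups:
  - name: signal-scaling
    rules:
      - alert: EventProcessorLagHigh
        expr: sum(kafka_consumergroup_lag{consumergroup="event-processor"}) > 5000
        for: 5m
        annotations:
          summary: "Scale up event processor replicas or increase CONSUMER_CONCURRENCY"
```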

Signal’s startup retry logic handles infrastructure restarts gracefully — services wait up to 20 seconds for Redis and PostgreSQL to become available before giving up. No manual intervention needed for rolling restarts.