Scaling for High Volume
Signal scales via configuration — the same code and Docker images handle 1,000 events/sec and 2,000,000+ events/sec. The difference is environment variable settings and infrastructure sizing.
Throughput Tiers
| Tier | Partitions | Concurrency | Delivery Mode | Max Events/sec |
|---|---|---|---|---|
| Default | 3 | 3 | Single topic, 10 workers | 100,000–150,000 |
| Medium | 12 | 12 | Single topic, 50 workers | 500,000 |
| Large | 48 | 48 | Per-vendor topics, 100 workers each | 2,000,000+ |
| Maximum | 96 | 96 | Per-vendor topics, 200 workers each | 5,000,000+ |
The default configuration handles most customers. Only increase when you’re consistently above 50,000 events/sec.
Scaling Environment Variables
These variables control throughput. All are optional — defaults are safe for small-to-medium deployments.
Kafka & Processing
| Variable | Default | Description |
|---|---|---|
| KAFKA_NUM_PARTITIONS | 3 | Number of partitions per Kafka topic. Set in infrastructure compose. More partitions = more concurrent consumers. |
| CONSUMER_CONCURRENCY | 3 | Number of concurrent message handlers in the event processor. Should match partition count. |
| DELIVERY_CONCURRENCY | 10 | Number of concurrent delivery goroutines in the delivery workers. |
| REDIS_POOL_SIZE | 100 | Redis connection pool size. Increase for high-concurrency deployments. |
| DB_MAX_CONNS | 50 | PostgreSQL max connection pool size. |
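As an illustration, a Large-tier deployment expressed as compose environment overrides might look like the following. The service names are assumptions for the sketch; only the variable names and values come from the tables in this section.

```yaml
# Hypothetical compose override for the Large tier (48 partitions,
# 48 consumers, 100 delivery workers per instance).
services:
  event-processor:
    environment:
      KAFKA_NUM_PARTITIONS: "48"
      CONSUMER_CONCURRENCY: "48"
      REDIS_POOL_SIZE: "200"
      DB_MAX_CONNS: "100"
  delivery-worker:
    environment:
      DELIVERY_CONCURRENCY: "100"
```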
Per-Vendor Delivery Topics
| Variable | Default | Description |
|---|---|---|
| DELIVERY_TOPIC_MODE | single | Set to per_vendor to publish delivery events to separate topics per vendor type (e.g. delivery-ga4, delivery-meta_capi). |
| VENDOR_TYPE | (empty) | When DELIVERY_TOPIC_MODE=per_vendor, set this on each delivery worker instance to consume from a specific vendor topic (e.g. ga4, meta_capi, tiktok). |
When DELIVERY_TOPIC_MODE=per_vendor, deploy one delivery worker instance per vendor type. Each instance sets VENDOR_TYPE to its vendor. This allows independent scaling — a slow vendor (rate-limited API) doesn’t block other vendors.
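A per-vendor deployment can be sketched in compose as one worker service per vendor. The service and image names below are illustrative; the environment variables are the ones documented above.

```yaml
# Illustrative per-vendor delivery workers (image name is an assumption).
services:
  delivery-worker-ga4:
    image: signal/delivery-worker:latest
    environment:
      DELIVERY_TOPIC_MODE: per_vendor
      VENDOR_TYPE: ga4          # consumes delivery-ga4
  delivery-worker-meta:
    image: signal/delivery-worker:latest
    environment:
      DELIVERY_TOPIC_MODE: per_vendor
      VENDOR_TYPE: meta_capi    # consumes delivery-meta_capi
```

Each service can then be given its own replica count, so a rate-limited vendor scales independently of the rest.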
Bot Pre-Filter
| Variable | Default | Description |
|---|---|---|
| BOT_PREFILTER | false | Set to true to run bot detection in a separate service before the main event processor. |
When enabled, deploy the botfilter binary (from event-processor/cmd/botfilter/) as a separate service. It reads from raw-events, drops bots, and publishes clean events to filtered-events. The main event processor then consumes from filtered-events instead of raw-events.
This is beneficial when bot traffic exceeds 30% of total volume: the event processor then handles only real events, reducing its CPU usage and the Kafka traffic it must consume.
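A compose sketch of the pre-filter topology, assuming hypothetical service and image names (the topic names and the BOT_PREFILTER variable are from the text above):

```yaml
# Illustrative bot pre-filter deployment.
services:
  botfilter:
    image: signal/botfilter:latest   # built from event-processor/cmd/botfilter/
    # reads raw-events, drops bots, publishes to filtered-events
  event-processor:
    environment:
      BOT_PREFILTER: "true"          # consume from filtered-events instead of raw-events
```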
Adaptive Delivery
The delivery workers automatically manage vendor rate limits. No configuration needed.
| Vendor | Rate Limit | Batch Size |
|---|---|---|
| GA4 Measurement Protocol | 1,000/sec | 25 events/request |
| Meta Conversions API | 2,000/sec | 1,000 events/request |
| TikTok Events API | 500/sec | 100 events/request |
| Amplitude | 1,000/sec | 2,000 events/request |
| Google BigQuery | 50,000/sec | 10,000 rows/request |
| Webhooks | Unlimited | 1 event/request |
Under normal traffic, events are delivered immediately (real-time). When the send rate approaches 80% of a vendor’s limit, events are automatically buffered and sent at the vendor’s maximum safe rate. This prevents 429 (Too Many Requests) errors and wasted retries.
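The buffering behaviour can be sketched with a sliding-window rate check: deliver immediately while the recent send rate is below 80% of the vendor limit, otherwise queue the event for a paced drain. This is a minimal language-agnostic sketch, not Signal's actual implementation; the class and method names are invented for illustration.

```python
import time
from collections import deque

class AdaptiveSender:
    """Buffers events once the send rate nears a vendor's limit.

    Hypothetical sketch: rate_limit is events/sec, threshold is the
    fraction of the limit at which buffering begins (80%, as above).
    A real worker would also drain self.buffer in the background at
    the vendor's maximum safe rate.
    """

    def __init__(self, rate_limit, threshold=0.8):
        self.rate_limit = rate_limit
        self.threshold = threshold
        self.sent_timestamps = deque()  # send times within the last second
        self.buffer = deque()           # events awaiting paced delivery

    def _current_rate(self, now):
        # Discard timestamps older than one second; the rest is the rate.
        while self.sent_timestamps and now - self.sent_timestamps[0] > 1.0:
            self.sent_timestamps.popleft()
        return len(self.sent_timestamps)

    def submit(self, event, now=None):
        """Return 'sent' if delivered immediately, 'buffered' otherwise."""
        now = time.monotonic() if now is None else now
        if self._current_rate(now) < self.rate_limit * self.threshold:
            self.sent_timestamps.append(now)
            return "sent"
        self.buffer.append(event)
        return "buffered"
```

With a limit of 10/sec, the ninth event inside the same second crosses the 80% threshold and is buffered instead of sent.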
GKE Sizing Guide
Recommended infrastructure sizing by customer volume:
| Size | Events/month | Kafka | PostgreSQL | Redis | Processors | Est. GCP Cost |
|---|---|---|---|---|---|---|
| Small | Up to 1M | Single broker, 1 partition | Cloud SQL Basic (1 vCPU) | Memorystore 1GB | 1 replica | ~$150/mo |
| Medium | 1–10M | Single broker, 3 partitions | Cloud SQL Standard (2 vCPU) | Memorystore 2GB | 2 replicas | ~$400/mo |
| Large | 10–100M | 3-broker cluster, 12 partitions | Cloud SQL HA (4 vCPU) | Memorystore 4GB | 6 replicas | ~$1,200/mo |
| Enterprise | 100M+ | 5-broker cluster, 48 partitions | Cloud SQL HA (8 vCPU) | Redis Cluster 8GB | 24 replicas | ~$3,500/mo |
Cost Optimisation
Spot Instances
Event processor, delivery workers, and ingestion gateway are stateless — they restart cleanly and are ideal for GKE spot (preemptible) nodes. This saves 60–80% on compute costs.
Only Kafka, PostgreSQL, and Redis need stable on-demand instances.
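To pin the stateless services onto spot capacity, the workload manifests need the GKE spot node selector and toleration. A sketch (the Deployment it belongs to is assumed; the label and taint key are the ones GKE typically uses for spot node pools):

```yaml
# Fragment of a Deployment pod template targeting GKE spot nodes.
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
```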
Autoscaling
Use Kubernetes HPA (Horizontal Pod Autoscaler) or KEDA for Kafka lag-based scaling:
| Service | Min Replicas | Max Replicas | Scale Metric |
|---|---|---|---|
| Ingestion Gateway | 2 | 20 | CPU > 60% |
| Event Processor | 1 | 48 | Kafka consumer lag > 1000 |
| Delivery Workers | 1 | 48 | Kafka consumer lag > 1000 |
| Management API | 1 | 3 | CPU > 70% |
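For the lag-based rows, a KEDA ScaledObject expresses the policy directly. The broker address, consumer group, and Deployment name below are assumptions; the topic, replica bounds, and lag threshold come from the tables in this section.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-processor
spec:
  scaleTargetRef:
    name: event-processor       # target Deployment (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 48
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092     # assumed broker address
        consumerGroup: event-processor   # assumed consumer group id
        topic: raw-events
        lagThreshold: "1000"
```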
Kafka Tiered Storage
For high-volume deployments, move Kafka data older than 24 hours to cloud object storage (GCS/S3):
- Hot tier (0–24h): SSD — fast reads for real-time consumers
- Cold tier (1–7d): Object storage — cheap storage for replay/recovery
Confluent for Kubernetes supports this natively. For open-source Kafka 3.6+, use remote storage plugins.
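For open-source Kafka 3.6+, the hot/cold split above maps onto the KIP-405 tiered storage settings: enable remote storage on the broker (with a storage plugin configured for GCS/S3), then cap local retention at 24 hours while keeping 7 days overall. A sketch of the relevant properties:

```properties
# Broker config (requires a remote storage plugin to be configured):
remote.log.storage.system.enable=true

# Per-topic config: 24h on local SSD, 7d total including object storage.
remote.storage.enable=true
local.retention.ms=86400000
retention.ms=604800000
```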
Monitoring
Key metrics to watch:
| Metric | Warning Threshold | Action |
|---|---|---|
| Kafka consumer lag (event-processor) | > 5,000 messages | Scale up event processor replicas or increase concurrency |
| Kafka consumer lag (delivery-workers) | > 10,000 messages | Scale up delivery workers or check vendor API health |
| Redis memory usage | > 80% of max | Increase Redis instance size or check for TTL issues |
| Event processor CPU | > 70% sustained | Scale up replicas |
| Delivery worker 429 responses | > 1% of requests | Check vendor rate limits, enable adaptive buffering |
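The lag thresholds above translate directly into alert rules. A Prometheus sketch, assuming kafka-exporter is deployed (its lag metric is kafka_consumergroup_lag; the consumer group label value is an assumption):

```yaml
groups:
  - name: signal-scaling
    rules:
      - alert: EventProcessorLagHigh
        expr: sum(kafka_consumergroup_lag{consumergroup="event-processor"}) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Event processor consumer lag above 5,000 messages
```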
Signal’s startup retry logic handles infrastructure restarts gracefully — services wait up to 20 seconds for Redis and PostgreSQL to become available before giving up. No manual intervention needed for rolling restarts.