Kinesis Streams in Your Snowplow Pipeline

Edwin Mejias  
Edited

Help Center › Data Streaming › Pipeline Architecture

Every Snowplow BDP deployment on AWS includes five Kinesis Data Streams that carry events through different stages of the pipeline. This article explains what each stream does, how data retention works, and how autoscaling is managed on your behalf.

Contents

Pipeline overview

collector_payloads_stream

bad_1_stream

bad_2_stream

enriched_stream

incomplete_stream

Data retention

Autoscaling

Stream naming


Pipeline overview

Each stream has a specific role in the pipeline. Three carry events that need attention when something goes wrong (raw payloads, collector failures, enrichment failures), one carries your good data downstream to destinations, and one is an optional feature stream for loading failed events into your warehouse.

Stream Stage What it carries
collector_payloads_stream Pre-enrichment Raw payloads received by the collector, before any processing
bad_1_stream Enrich failures (all types) All bad events Enrich generates: deserialization failures, protocol violations, schema validation failures, enrichment errors, and oversized payloads from the Collector
bad_2_stream Loader failures Events the bad-events loaders (S3 Loader Bad, ES Loader Bad) could not write to their destination
enriched_stream Good events Successfully enriched events, ready for your configured destinations
incomplete_stream Failed Events in Warehouse Failed events routed to your warehouse (optional feature, off by default)

collector_payloads_stream

The collector writes every incoming event payload to this stream the moment it is received, before any decoding or validation occurs. The Enrich application reads from this stream, processes each event, and routes the result to either the good or bad streams downstream.

Events here are in a compressed raw binary format and are not yet associated with any schema. You would not read from this stream directly in normal operations, but it is the foundation that everything else builds on.

ℹ️ Note

This stream is the entry point to your pipeline. If it becomes undersized for your event volume, events will back up at the collector before they can be processed. Sustained collector latency increases are often a sign that this stream needs more capacity. Contact Snowplow Support if you are seeing this pattern.

Property Value
Stream suffix -collector-payloads
Written by Collector
Read by Enrich
Data format Thrift-serialized collector payloads (compressed)

bad_1_stream

Despite the name, the Collector writes to this stream only in one case: when a payload exceeds the Kinesis record size limit, it lands here directly as a size violation. For everything else, the Collector passes incoming payloads to the collector-payloads stream regardless of content. The primary writer is the Enrich application that writes here at two stages: before events are parsed (deserialization failures, protocol violations, adapter failures) and after parsing when individual events fail schema validation or enrichment processing.

Failure types that land here include malformed HTTP requests, invalid payload encoding, unsupported webhook adapter formats, tracker protocol violations, schema validation failures, enrichment errors, and payloads too large for the Kinesis record limit. Deserialization failures are caught at the batch level, a single bad-1 record can represent an entire batch of Snowplow events. Schema validation and enrichment failures are caught per-event after the payload is parsed, so each record represents a single event. A small, steady count is normal, bots and vulnerability scanners routinely generate this traffic. A sudden spike usually means a tracker misconfiguration, a schema issue, or an integration sending requests in an unexpected format.

ℹ️ Note

Events in this stream have not gone through enrichment. They cannot be replayed into your good data pipeline without fixing the underlying sending issue and re-sending from your application. Because the failure happens at the raw payload level, there is no schema-valid event to recover.

Property Value
Stream suffix -bad-1
Written by Collector (size violations only), Enrich
Read by Every component that process bad data, except Failed Events Loaders
Data format BadRow JSON

bad_2_stream

This stream captures write failures that occur downstream of enrichment, not enrichment failures themselves. When the loaders that consume bad-1 (the S3 Loader Bad, ES Loader Bad) are unable to deliver an event to their configured destination, that failed write lands here. This keeps the failure visible without losing the event data.

Common causes of bad-2 activity include misconfigured destination permissions, transient infrastructure issues affecting the loader's target, or resource limits on the destination. Under normal conditions this stream receives very little traffic. Sustained bad-2 activity points to a problem with the downstream destination rather than your tracking or schema definitions.

Each record includes the original bad event alongside a failure message describing why the loader could not deliver it.

💡 Tip

Schema validation and enrichment failures land in bad-1, not in this stream. If you need schema-valid failed events loaded into your warehouse for analysis alongside your good data, the Failed Events in Warehouse feature (using the incomplete_stream) can do this without any manual intervention. This can be setup in your Snowplow Console.

Property Value
Stream suffix -bad-2
Written by S3 Loader Bad, ES Loader Bad
Read by Debugging / alerting only
Data format BadRow JSON

enriched_stream

This is the main output stream of your Snowplow pipeline. The Enrich application writes every successfully processed event here in TSV format. Your configured destinations read from this stream: warehouse loaders and any custom consumers you have set up.

Each record contains the full enriched event, including all standard Snowplow fields and any custom context or self-describing event data. If this stream becomes undersized or starts lagging, events will not reach your warehouse on time. Throughput metrics on this stream are the most direct indicator of your pipeline's end-to-end health.

ℹ️ Note

This stream uses the AWS-managed default KMS key (alias/aws/kinesis) for encryption at rest by default. If your security requirements need a customer-managed key, contact Snowplow Support.

Property Value
Stream suffix -enriched
Written by Enrich
Read by Every component that process good data and custom consumers
Data format Snowplow enriched event (TSV)

incomplete_stream

This stream is part of the Failed Events in The Warehouse feature. When the feature is enabled, Enrich writes events here if they fail enrichment or schema validation. Those events are picked up by dedicated warehouse loaders and delivered to the configured destinations.

This stream is disabled by default. It receives no traffic unless the feature has been explicitly enabled for your pipeline. It is enabled as part of a Failed Events Loaders setup.

ℹ️ Note

The incomplete_stream is an additional, parallel destination alongside bad_1_stream  not a replacement for it. When the feature is enabled, bad events are written to both bad_1_stream and incomplete_stream at the same time. The bad_1_stream continues to receive all failures for monitoring. The incomplete_stream carries only the enrichment and schema validations failures sub-set, in TSV format, so a loader can deliver them to your warehouse along with additional columns that contain failure reason information. See How to Deploy Failed Events in Warehouse for the full setup guide. Events can be manipulated and reinserted into the Good tables if you need to do so, all with simple SQL at the data warehouses.

Property Value
Stream suffix -incomplete
Written by Enrich (only when feature is enabled)
Read by Failed Events warehouse loaders
Enabled by default No, unless Failed Events loaders are setup.
Data format Snowplow enriched event (TSV)

Data retention

All five streams default to a 7-day (168-hour) retention window. Records older than the retention period are automatically removed from the stream by AWS. This applies independently to each stream.

Retention can be extended up to 365 days (8,760 hours) if your use case requires a longer window for replay or audit purposes. Extended retention incurs additional charges from AWS that are separate from your Snowplow subscription cost.

Setting Value
Default retention 7 days (168 hours)
Minimum retention 24 hours
Maximum retention 365 days (8,760 hours)
To request a change Contact Snowplow Support with the stream name and desired retention period

⚠️ Warning

Retention changes take effect immediately on the stream. Reducing retention will permanently remove records older than the new window. If you plan to require events from that period later, do not reduce retention until the replay is complete.


Autoscaling

Each stream is provisioned with autoscaling enabled. Snowplow manages this on your behalf. When event throughput rises toward the current shard capacity, additional shards are added automatically. When throughput falls and stays below a lower threshold for a sustained period, shards are removed. A cooldown period between scale-in events prevents rapid oscillation during variable traffic patterns.

Your pipeline's initial shard count and scaling thresholds are set at deployment based on your expected event volume. Most customers do not need to adjust these settings during normal operations.

For planned high-traffic events such as product launches, major marketing campaigns, or seasonal peaks, let your Snowplow contact know in advance. Reviewing stream capacity ahead of time is faster than responding to throughput alerts once traffic has already arrived.

Autoscaling moves between a set min and max shards parameters that need to be manually adjusted by Snowplow depending on traffic expectations (1 - 64 by default).

💡 Tip

If you are seeing pipeline latency or throughput alerts, check the enriched_stream first. Sustained high throughput on that stream without corresponding increases on the bad streams confirms the traffic is legitimate volume, not a spike in bad data.


Stream naming

All Kinesis stream names follow this pattern:

Pattern
snowplow-{your-pipeline-identifier}-{stream-suffix}

Your pipeline identifier is derived from the stack key Snowplow uses for your deployment. You can find all five stream names in the AWS Console under Kinesis › Data Streams by filtering for snowplow.

Stream Suffix
collector_payloads_stream -collector-payloads
bad_1_stream -bad-1
bad_2_stream -bad-2
enriched_stream -enriched
incomplete_stream -incomplete