AWS

Learn about Snowplow's S3 buckets

Pavol Kutaj  

This article describes a feature available to Snowplow BDP customers on Private Managed Cloud (PMC).

The Snowplow pipeline uses several S3 buckets as temporary storage locations during processing, and also to store events using different formats or levels of enrichment.

Depending on your specific pipeline configuration, you may be able to reduce reliance on S3 storage. For pipelines using our RDB Loader, some buckets are optional, and even more become optional for customers using our newer Streaming Loader solution.

All Snowplow S3 buckets are prefixed with com-acme-111111111111-0-prod1-, where com-acme is your customer tag and 111111111111 is your AWS account ID.
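
To see these buckets at a glance, a minimal boto3 sketch such as the one below lists every bucket in the account that shares the prefix. The prefix value is a placeholder; substitute your own customer tag, account ID, and environment.

```python
# Minimal sketch: list the Snowplow buckets in the account by their common prefix.
# The prefix below uses placeholder values; substitute your own.
import boto3

PREFIX = "com-acme-111111111111-0-prod1-"  # placeholder customer tag / account ID / environment

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    if bucket["Name"].startswith(PREFIX):
        print(bucket["Name"])
```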

 

S3 buckets

This section provides an overview of each S3 bucket maintained by Snowplow.

 

kinesis-hocons

  • contains configurations related to the real-time pipeline
  • recommended lifecycle rule: none

 

kinesis-s3-bad

  • bad data archive
  • helps to keep an eye on incoming bad data and spot any that needs to be reprocessed
  • you can set a lifecycle rule, but data that has expired can no longer be recovered, and a recovery process could take a few weeks, so size the retention period accordingly
  • recommended lifecycle rule: 30-day retention
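
If you apply the recommended rule yourself, a minimal boto3 sketch along these lines sets a 30-day expiration on the whole bucket. The bucket name below is a placeholder, and the rule ID is arbitrary.

```python
# Minimal sketch of the recommended 30-day retention rule, applied with boto3.
# The bucket name is a placeholder; adjust it to your own kinesis-s3-bad bucket.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="com-acme-111111111111-0-prod1-kinesis-s3-bad",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-bad-rows-after-30-days",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```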

 

kinesis-s3-enriched

  • (usually) a temp location for enriched data produced by the real-time pipeline
  • the files in it are regularly moved to a different bucket for processing by the batch job (see the batch-archive bucket below)
  • recommended lifecycle rule: none
Data example:
<bucket>/prod1/2022-10-31-160401-*.gz

 

kinesis-s3-raw

  • archive of raw (before enrichment) data
  • reflects the actual collected data
  • essentially optional, as enriched and shredded versions of the same data live in other buckets
  • some customers even opt to turn off accumulating this data completely and rely only on the enriched and/or shredded data as well as the Redshift/Snowflake database
  • thus it is essentially "duplicated" data, just in a different format (Thrift)
  • the reasons are historical:
    • the enrichment process used to run as part of a batch job
    • currently, collected data is enriched in the real-time section of the pipeline
    • the raw loader/storage was therefore designed as a fallback for re-enrichment and as a safety measure in case of upstream issues
    • none of this is actively used
  • if there is a need to keep all un-enriched historical data, it's a good idea to lifecycle it to Glacier as a cheaper storage solution (see the sketch below)
  • recommended lifecycle rule: 7-day retention
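
For customers who do need to keep the un-enriched history, a Glacier transition rule is an alternative to plain expiration. The sketch below is only an illustration; the bucket name and the 7-day transition are placeholder assumptions, not a Snowplow-managed configuration.

```python
# Minimal sketch of a rule that transitions raw data to Glacier instead of
# deleting it, for the case where un-enriched historical data must be kept.
# Bucket name and timing are placeholders, not Snowplow defaults.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="com-acme-111111111111-0-prod1-kinesis-s3-raw",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-glacier",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```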

 

batch-archive

  • all historical data divided into
    • /enriched folder
    • /shredded folder (this may be named after your destination)
  • source: com-acme-...-kinesis-s3-enriched bucket
  • initially, the data is moved to this bucket into the /<env1>/enriched folder so that it can be
    1. processed by the batch transformer
    2. consumed by tools like Athena or custom consumers
  • there is a dedicated folder for each batch/run, which makes querying and consumption much easier than working with one huge folder, as would be the case with the kinesis-s3-enriched bucket (see above)
  • once processed by the batch transformer, the transformed data is loaded into the <env1>/shredded folder
  • recommended lifecycle rule: none
Folder structure example:
.../enriched/good/run=2022-10-31-16-01-31/2022-10-31-160401-*.gz
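
As an illustration of how the per-run layout simplifies consumption, a custom consumer can read a single run folder with a simple prefix listing. The bucket name, environment, and run prefix below are placeholders following the layout shown above.

```python
# Minimal sketch: list the objects of one batch-archive run folder by prefix.
# Bucket name, environment, and run are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
prefix = "prod1/enriched/good/run=2022-10-31-16-01-31/"  # placeholder run folder

for page in paginator.paginate(
    Bucket="com-acme-111111111111-0-prod1-batch-archive",  # placeholder
    Prefix=prefix,
):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```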

Notify Snowplow before updating lifecycle rules on the batch-archive bucket, as we must update the internal since_timestamp for the transformer. Otherwise, there is a high risk that transforming and loading the data into the destination fails completely or suffers high latency.

 

batch-output

  • a temporary location for data that is currently being processed
  • the files in it are regularly moved to a different bucket for processing by the batch job
  • recommended lifecycle rule: none

 

batch-processing

  • currently used for logs only
  • recommended lifecycle rule: 30- to 90-day retention

 

hosted-assets

  • typically unused (if it exists at all); we use snowplow-hosted-assets-us-east-1 instead
  • recommended lifecycle rule: none

 

iglu-jsonpaths

  • created as part of the new pipeline for the storage of JSONPaths if automigration is not activated
  • in practice, the bucket from the old pipeline, snowplow-com-acme-igl-jsonpaths, is the one actually being used
  • recommended lifecycle rule: none

 

iglu-schemas

  • also used alongside your new Iglu servers
  • recommended lifecycle rule: none

 

FAQ

Below are frequently asked questions about Snowplow's S3 buckets.

 

Can I set up lifecycle policies for my buckets?

Yes. Snowplow does not manage lifecycle rules for these S3 buckets, so you are fully empowered to configure lifecycle policies for your buckets yourself. We have made what we believe to be best-practice recommendations for each bucket in the section above.
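
If you want to check what, if anything, is already configured on a bucket before adding your own rules, a minimal boto3 sketch like the following works; the bucket name is a placeholder.

```python
# Minimal sketch: inspect the lifecycle rules currently set on a bucket, if any.
# The bucket name is a placeholder.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "com-acme-111111111111-0-prod1-kinesis-s3-bad"  # placeholder

try:
    config = s3.get_bucket_lifecycle_configuration(Bucket=bucket)
    for rule in config["Rules"]:
        print(rule["ID"], rule["Status"])
except ClientError as err:
    if err.response["Error"]["Code"] == "NoSuchLifecycleConfiguration":
        print("No lifecycle rules configured yet")
    else:
        raise
```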

 

Do you support intelligent tiering?

Yes, we support Amazon S3 Intelligent-Tiering. If you're interested, please contact our friendly support team.

 

I've found buckets prefixed with sp-. What are they?

Buckets starting with the sp- prefix are pre-Terraform buckets created by our old deployment tool. If you're no longer using them, they can be safely deleted or archived.
