AWS

Learn about Snowplow's S3 buckets

Pavol Kutaj  

This article describes a feature available to Snowplow BDP customers on Private Managed Cloud (PMC).

The Snowplow pipeline uses several S3 buckets as temporary storage locations during processing, and also to store events using different formats or levels of enrichment.

Depending on your specific pipeline configuration, you may be able to reduce reliance on S3 storage. For pipelines using our RDB Loader, some buckets are optional, and even more become optional for customers using our newer Streaming Loader solution.

All Snowplow S3 buckets are prefixed with com-acme-111111111111-0-prod1-, where com-acme is your customer tag and 111111111111 is your AWS account ID.
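
To see these buckets at a glance, a minimal boto3 sketch such as the one below lists every bucket in the account that shares the prefix. The prefix value is a placeholder; substitute your own customer tag, account ID, and environment.

```python
# Minimal sketch: list the Snowplow buckets in the account by their common prefix.
# The prefix below uses placeholder values; substitute your own.
import boto3

PREFIX = "com-acme-111111111111-0-prod1-"  # placeholder customer tag / account ID / environment

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    if bucket["Name"].startswith(PREFIX):
        print(bucket["Name"])
```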

 

S3 buckets

This section provides an overview of each S3 bucket maintained by Snowplow.

 

kinesis-hocons

  • contains configurations related to the real-time pipeline
  • recommended lifecycle rule: none

 

kinesis-s3-bad

  • bad data archive
  • helps to keep an eye on incoming bad data and spot any that needs to be reprocessed
  • you can set a lifecycle rule, but data that has expired can no longer be recovered, and a recovery process could take a few weeks, so size the retention period accordingly
  • recommended lifecycle rule: 30-day retention
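
If you apply the recommended rule yourself, a minimal boto3 sketch along these lines sets a 30-day expiration on the whole bucket. The bucket name below is a placeholder, and the rule ID is arbitrary.

```python
# Minimal sketch of the recommended 30-day retention rule, applied with boto3.
# The bucket name is a placeholder; adjust it to your own kinesis-s3-bad bucket.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="com-acme-111111111111-0-prod1-kinesis-s3-bad",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-bad-rows-after-30-days",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```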

 

kinesis-s3-enriched

  • (usually) a temp location for enriched data produced by the real-time pipeline
  • the files in it are regularly moved to a different bucket for processing by the batch job (see the batch-archive bucket below)
  • recommended lifecycle rule: none
Data example:
<bucket>/prod1/2022-10-31-160401-*.gz

 

kinesis-s3-raw

  • archive of raw (before enrichment) data
  • reflects the actual collected data
  • essentially optional, as enriched and shredded versions of the same data live in other buckets
  • some customers even opt to turn off accumulating this data completely and rely only on the enriched and/or shredded data as well as the Redshift/Snowflake database
  • thus it is essentially "duplicated" data, just in a different format (Thrift)
  • the reasons are historical:
    • the enrichment process used to run as part of a batch job
    • currently, collected data is enriched in the real-time section of the pipeline
    • the raw loader/storage was therefore designed as a fallback for re-enrichment and as a safety measure in case of upstream issues
    • none of this is actively used
  • if there is a need to keep all un-enriched historical data, it's a good idea to lifecycle it to Glacier as a cheaper storage solution (see the sketch below)
  • recommended lifecycle rule: 7-day retention
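
For customers who do need to keep the un-enriched history, a Glacier transition rule is an alternative to plain expiration. The sketch below is only an illustration; the bucket name and the 7-day transition are placeholder assumptions, not a Snowplow-managed configuration.

```python
# Minimal sketch of a rule that transitions raw data to Glacier instead of
# deleting it, for the case where un-enriched historical data must be kept.
# Bucket name and timing are placeholders, not Snowplow defaults.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="com-acme-111111111111-0-prod1-kinesis-s3-raw",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-glacier",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```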

 

batch-archive

  • all historical data divided into
    • /enriched folder
    • /shredded folder (this may be named after your destination)
  • source: com-acme-...-kinesis-s3-enriched bucket
  • initially, the data is moved to this bucket into the /<env1>/enriched folder so that it can be
    1. processed by the batch transformer
    2. consumed by tools like Athena or custom consumers
  • there is a dedicated folder for each batch/run, which makes querying and consumption much easier than working with one huge folder, as would be the case with the kinesis-s3-enriched bucket (see above)
  • once processed by the batch transformer, the transformed data is loaded into the <env1>/shredded folder
  • recommended lifecycle rule: none
Folder structure example:
.../enriched/good/run=2022-10-31-16-01-31/2022-10-31-160401-*.gz
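
As an illustration of how the per-run layout simplifies consumption, a custom consumer can read a single run folder with a simple prefix listing. The bucket name, environment, and run prefix below are placeholders following the layout shown above.

```python
# Minimal sketch: list the objects of one batch-archive run folder by prefix.
# Bucket name, environment, and run are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
prefix = "prod1/enriched/good/run=2022-10-31-16-01-31/"  # placeholder run folder

for page in paginator.paginate(
    Bucket="com-acme-111111111111-0-prod1-batch-archive",  # placeholder
    Prefix=prefix,
):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```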

Notify Snowplow before updating lifecycle rules on the batch-archive bucket, as we must update the internal since_timestamp for the transformer. Otherwise, there is a high risk that transforming and loading the data into the destination fails completely or suffers high latency.

 

batch-output

  • a temporary location for data that is currently being processed
  • the files in it are regularly moved to a different bucket for processing by the batch job
  • recommended lifecycle rule: none

 

batch-processing

  • currently used for logs only
  • recommended lifecycle rule: 30- to 90-day retention

 

hosted-assets

  • typically unused (if it exists at all); we use snowplow-hosted-assets-us-east-1 instead
  • recommended lifecycle rule: none

 

iglu-jsonpaths

  • created as part of the new pipeline for the storage of JSONPaths if automigration is not activated
  • in practice, the bucket from the old pipeline, snowplow-com-acme-igl-jsonpaths, is the one actually being used
  • recommended lifecycle rule: none

 

iglu-schemas

  • also used alongside your new Iglu servers
  • recommended lifecycle rule: none

 

FAQ

Below are frequently asked questions about Snowplow's S3 buckets.

 

Can I set up lifecycle policies for my buckets?

Yes. Snowplow does not manage lifecycle rules for these S3 buckets, so you are fully empowered to configure lifecycle policies for your buckets yourself. We have made what we believe to be best-practice recommendations for each bucket in the section above.
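
If you want to check what, if anything, is already configured on a bucket before adding your own rules, a minimal boto3 sketch like the following works; the bucket name is a placeholder.

```python
# Minimal sketch: inspect the lifecycle rules currently set on a bucket, if any.
# The bucket name is a placeholder.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "com-acme-111111111111-0-prod1-kinesis-s3-bad"  # placeholder

try:
    config = s3.get_bucket_lifecycle_configuration(Bucket=bucket)
    for rule in config["Rules"]:
        print(rule["ID"], rule["Status"])
except ClientError as err:
    if err.response["Error"]["Code"] == "NoSuchLifecycleConfiguration":
        print("No lifecycle rules configured yet")
    else:
        raise
```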

 

Do you support intelligent tiering?

Yes, we support Amazon S3 Intelligent-Tiering. If you're interested, please contact our friendly support team.

 

I've found buckets prefixed with sp-. What are they?

Buckets starting with the sp- prefix are pre-Terraform buckets created by our old deployment tool. If you're no longer using them, they can be safely deleted or archived.
