This reference lists the S3 buckets in use, together with the recommended lifecycle rules for cost optimization.
Generally speaking, Snowplow is responsible for ensuring that data is successfully received by all components and delivered to all destinations (streams, S3, storage targets: Elasticsearch, Redshift, Snowflake). It is then up to our clients to decide how long the data should be kept and where.
Given this, we don't configure lifecycle rules; however, you are free to set up any lifecycle rules that are sensible according to your organization's internal policies and use cases (a minimal example is sketched below).
For the rest of this article, `com-acme` is your customer tag and `111111111111` is your AWS account ID.
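As an illustration only, here is a minimal sketch of applying a retention (expiration) rule with boto3. The bucket name, rule ID, and the 30-day window are placeholders taken from the recommendations below; you may equally manage this via Terraform or the S3 console.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder: the bad-data bucket, for which a 30-day retention is recommended below.
bucket = "com-acme-111111111111-0-prod1-kinesis-s3-bad"

s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "30-day-retention",
                "Filter": {"Prefix": ""},    # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # delete objects older than 30 days
                # Also clean up leftovers from interrupted multipart uploads.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```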
We also support Amazon S3 Intelligent-Tiering. If you're interested, please contact support.
In case you find any buckets starting with an `sp-` prefix: these are pre-Terraform buckets created by our old deployment tool for the pre-Terraform pipelines. The buckets are still there because we never delete customer data ourselves; clients were informed about this change at the time of the migration. If the `sp-` buckets are no longer used by the customer, they can be deleted or archived. We recommend a sanity check for any custom job that may still reference the bucket, and the customer should also be made aware of any such dependency.
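One quick way to check whether anything still writes to such a bucket is to look at the timestamp of its most recently modified object. A minimal sketch, where the `sp-` bucket name is a hypothetical placeholder:

```python
import boto3

s3 = boto3.client("s3")
bucket = "sp-com-acme-kinesis-s3-bad"  # hypothetical placeholder for a pre-Terraform bucket

# Find the most recent write across the whole bucket; if it is months old,
# it is unlikely that any job still delivers data here.
latest = None
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if latest is None or obj["LastModified"] > latest:
            latest = obj["LastModified"]

print(f"Most recent object in {bucket}: {latest}")
```

Note that this only surfaces writers; custom jobs that merely read from the bucket won't show up here, which is why it is worth confirming the dependency on the customer side as well.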
com-acme-111111111111-0-prod1-kinesis-hocons
- contains configuration files for the real-time pipeline
- recommended lifecycle rule: none
com-acme-111111111111-0-prod1-kinesis-s3-bad
- bad data archive
- a lifecycle rule can be set, but any data you want to recover must still be in the bucket when the recovery runs
- helps to keep an eye on incoming bad data and spot anything that needs to be reprocessed (see the sketch after this list)
- a recovery process could take a few weeks, so don't set the retention too short
- recommended lifecycle rule: 30-day retention
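To keep an eye on incoming bad data, one option is to periodically measure how much has arrived recently and watch for unusual spikes. A minimal sketch, assuming boto3 and the bucket naming used throughout this article:

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
bucket = "com-acme-111111111111-0-prod1-kinesis-s3-bad"

# Sum up the bad data written in the last 24 hours; a sudden jump usually
# means something upstream started failing validation and may need reprocessing.
cutoff = datetime.now(timezone.utc) - timedelta(days=1)
total_files, total_bytes = 0, 0

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            total_files += 1
            total_bytes += obj["Size"]

print(f"Bad rows in the last 24h: {total_files} files, {total_bytes / 1e6:.1f} MB")
```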
com-acme-111111111111-0-prod1-kinesis-s3-enriched
- (usually) a temporary location for enriched data produced by the real-time pipeline
- the files in it are regularly moved to a different bucket for processing by the batch job (the `archive` bucket, see below)
- recommended lifecycle rule: none
Data example: `<bucket>/prod1/2022-10-31-160401-*.gz`
com-acme-111111111111-0-prod1-kinesis-s3-raw
- archive of raw (before enrichment) data
- reflects the actual collected data
- essentially optional, as enriched and shredded versions of the same data live in other buckets
- some clients even opt to turn off accumulating this data completely and rely solely on the enriched and/or shredded data as well as the Redshift/Snowflake database
- thus it is essentially "duplicated" data, just in a different format (Thrift)
- the reasons are historical:
  - the enrichment process used to be part of a batch job
  - currently, collected data is enriched in the real-time section of the pipeline
  - the loader/storage was therefore designed as a fallback, a safety measure allowing re-enrichment should there be any upstream issues
  - none of this is actively used anymore
- if there is a need to keep all un-enriched historical data, it's a good idea to lifecycle it to Glacier as a cheaper storage option (see the sketch after this list)
- recommended lifecycle rule: 7-day retention
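As an illustration of the Glacier option, here is a minimal sketch using boto3; the 30-day delay before the transition and the rule ID are placeholders.

```python
import boto3

s3 = boto3.client("s3")
bucket = "com-acme-111111111111-0-prod1-kinesis-s3-raw"

s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-glacier",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                # Instead of expiring the raw data, move it to Glacier after 30 days.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```

Keep in mind that restoring objects from Glacier is not instantaneous, so factor the restore time into any re-enrichment plans.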
com-acme-111111111111-0-prod1-batch-archive
- all historical data, divided into:
  - an `/enriched` folder
  - a `/shredded` folder (this may be named after your destination)
- source: the `com-acme-...-kinesis-s3-enriched` bucket
- initially, the data is moved into the `/<env1>/enriched` folder of this bucket, with the aim to be:
  - processed by the batch transformer
  - consumed by tools like Athena or custom consumers
- there is a dedicated folder for each batch/run, which makes querying/consumption much easier than having to work with one huge folder, as would be the case with the `kinesis-s3-enriched` bucket (see above); see also the sketch after the folder structure example below
- once processed by the batch transformer, the transformed data is loaded into the `<env1>/shredded` folder
- recommended lifecycle rule: none
Folder structure example: `.../enriched/good/run=2022-10-31-16-01-31/2022-10-31-160401-*.gz`
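Because each run sits in its own `run=...` folder, downstream consumers can address a single batch by prefix instead of scanning the whole bucket. A minimal sketch that lists the available runs; the `prod1/enriched/good/` prefix is an assumption based on the folder structure example above.

```python
import boto3

s3 = boto3.client("s3")
bucket = "com-acme-111111111111-0-prod1-batch-archive"

# Use '/' as the delimiter so S3 returns one CommonPrefix per run folder
# rather than every individual file.
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket=bucket, Prefix="prod1/enriched/good/", Delimiter="/")

runs = [p["Prefix"] for page in pages for p in page.get("CommonPrefixes", [])]
print(f"{len(runs)} runs found, latest: {runs[-1] if runs else 'none'}")
```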
Notify Snowplow before updating lifecycle rules on the `batch-archive` bucket, as we must update the internal `since_timestamp` for the transformer. Otherwise, there is a high risk that transforming and loading the data into the destination fails completely or suffers from high latency.
com-acme-111111111111-0-prod1-batch-output
- temporary location for data that is currently being processed
- the files in it are regularly moved to a different bucket for processing by the batch job
- recommended lifecycle rule: none
com-acme-111111111111-0-prod1-batch-processing
- currently used for logs only
- recommended lifecycle rule: 30- to 90-day retention
com-acme-111111111111-0-prod1-hosted-assets
- we use `snowplow-hosted-assets-us-east-1` instead (if any hosted assets are used at all)
- recommended lifecycle rule: none
com-acme-111111111111-0-prod1-iglu-jsonpaths
- created as part of the new pipeline to store JSONPaths files when automigration is not activated
- in actuality, the bucket being used is the one from the old pipeline, `snowplow-com-acme-igl-jsonpaths`
- recommended lifecycle rule: none
com-acme-111111111111-0-prod1-iglu-schemas
- used alongside your new Iglu servers as well
- recommended lifecycle rule: none