S3 Buckets and Recommended Lifecycle Rules

Pavol Kutaj
Snowplow Team

The aim of this reference is to provide a list of utilized S3 buckets together with the recommended lifecycle rules for cost optimization.

Generally speaking, Snowplow is responsible for ensuring that data is successfully received by all components and delivered to all destinations (streams, S3, and storage targets: Elasticsearch, Redshift, Snowflake). It is then up to our clients to decide how long the data should be kept and where.

Given this, we don't configure lifecycle rules ourselves; however, you are free to set up any lifecycle rules that make sense for your organization's internal policies and use cases.

For the rest of this article, com-acme is your customer tag and 111111111111 is your AWS account ID.

 

We also support Amazon S3 Intelligent-Tiering. If you're interested, please contact support.

In case you find any buckets starting with an sp- prefix: these are pre-terraform buckets created by our old deployment tool for the pre-terraform pipelines. The buckets are still there because we never delete customer data ourselves; clients were informed about the change at the time of the migration. If sp- buckets are not used by the customer, they can be deleted or archived. We recommend sanity-checking whether any custom job is still running that references the bucket, and the customers themselves should be aware of any such dependency, too.
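If you want a quick overview of leftover sp- buckets before deciding anything, a minimal sketch using boto3 (boto3 is our assumption here; any tooling with s3:ListAllMyBuckets permissions will do) could look like this:

import boto3

# List all buckets in the account and print the legacy pre-terraform ones.
# Assumes AWS credentials for the account are available in the environment.
s3 = boto3.client("s3")

legacy_buckets = [
    bucket["Name"]
    for bucket in s3.list_buckets()["Buckets"]
    if bucket["Name"].startswith("sp-")
]

for name in legacy_buckets:
    print(f"legacy pre-terraform bucket: {name}")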

 

com-acme-111111111111-0-prod1-kinesis-hocons

  • contains real-time pipeline related configuration (HOCON) files
  • recommended lifecycle rule: none

 

com-acme-111111111111-0-prod1-kinesis-s3-bad

  • bad data archive
  • a lifecycle rule can be set, but keep in mind that bad data can only be recovered while it is still in the bucket
  • it helps to keep an eye on incoming bad data so you can spot anything that needs to be reprocessed
  • a recovery process could take a few weeks, so factor that in when choosing the retention period
  • recommended lifecycle rule: 30-day retention (see the sketch below)
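For reference, a minimal sketch of a 30-day expiration rule applied with boto3 (boto3 and the rule ID are our assumptions; the same rule can equally be set via the S3 console or Terraform):

import boto3

# Expire objects in the bad data archive after 30 days.
# The rule ID is illustrative; adjust it to your own naming conventions.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="com-acme-111111111111-0-prod1-kinesis-s3-bad",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-bad-data-after-30-days",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)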

 

com-acme-111111111111-0-prod1-kinesis-s3-enriched

  • a (usually) temporary location for enriched data produced by the real-time pipeline
  • the files in it are regularly moved to a different bucket (the archive bucket, see below) for processing by the batch job
  • recommended lifecycle rule: none
Data example:
<bucket>/prod1/2022-10-31-160401-*.gz

 

com-acme-111111111111-0-prod1-kinesis-s3-raw

  • archive of raw (pre-enrichment) data
  • reflects the actual collected data
  • is essentially optional, as enriched and shredded versions of the same data live in other buckets
  • some clients even opt to turn off accumulating this data completely and rely solely on the enriched and/or shredded data as well as the Redshift/Snowflake database
  • thus it is essentially "duplicated" data, just in a different format (Thrift)
  • the reasons are historical:
    • the enrichment process used to be part of a batch job
    • currently, collected data is enriched in the real-time section of the pipeline
    • the loader/storage was therefore kept as a fallback for re-enrichment, a safety measure in case of any upstream issues
    • none of this is actively used
  • if there is a need to keep all un-enriched historical data, it's a good idea to lifecycle it to Glacier as a cheaper storage option (see the sketch below)
  • recommended lifecycle rule: 7-day retention
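If you do need to keep all un-enriched historical data, a hedged sketch of the Glacier option (instead of plain 7-day expiration) could look like the rule below; the 7-day transition delay and the rule ID are assumptions, pick whatever fits your policy:

import boto3

# Transition raw (un-enriched) data to Glacier instead of deleting it.
# The 7-day transition delay and the rule ID are assumptions, not Snowplow defaults.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="com-acme-111111111111-0-prod1-kinesis-s3-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-glacier",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
            }
        ]
    },
)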

 

com-acme-111111111111-0-prod1-batch-archive

  • all historical data divided into
    • /enriched folder
    • /shredded folder (this may be named after your destination)
  • source: com-acme-...-kinesis-s3-enriched bucket
  • initially, the data is moved into this bucket's /<env1>/enriched folder so that it can be
    1. processed by the batch transformer
    2. consumed by tools like Athena or custom consumers
  • there is a dedicated folder for each batch/run, which makes querying/consumption much easier than working with one huge folder, as would be the case with the kinesis-s3-enriched bucket (see above and the listing sketch below)
  • once processed by the batch transformer, the transformed data is loaded into the <env1>/shredded folder
  • recommended lifecycle rule: none
Folder structure example:
.../enriched/good/run=2022-10-31-16-01-31/2022-10-31-160401-*.gz
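To illustrate how the per-run folders simplify consumption, here is a minimal boto3 sketch that lists the run folders; boto3 and the environment name prod1 are assumptions, while the prefix follows the folder structure example above:

import boto3

# List the per-run "folders" under the enriched/good prefix.
# ENV is an assumption; replace it with your own <env1> value.
ENV = "prod1"

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="com-acme-111111111111-0-prod1-batch-archive",
    Prefix=f"{ENV}/enriched/good/",
    Delimiter="/",
)

# Only the first page of results is shown; paginate for a full listing.
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # e.g. prod1/enriched/good/run=2022-10-31-16-01-31/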

Notify Snowplow before updating lifecycle rules on the batch-archive bucket, as we must update the internal since_timestamp for the transformer. Otherwise, there is a high risk that transforming and loading the data into the destination fails completely or suffers from high latency.

 

com-acme-111111111111-0-prod1-batch-output

  • temporary location for data currently being processed
  • the files in it are regularly moved to a different bucket for processing by the batch job
  • recommended lifecycle rule: none

 

com-acme-111111111111-0-prod1-batch-processing

  • currently used for logs only
  • recommended lifecycle rule: 30- to 90-day retention

 

com-acme-111111111111-0-prod1-hosted-assets

  • we use snowplow-hosted-assets-us-east-1 instead (if any hosted assets are needed)
  • recommended lifecycle rule: none

 

com-acme-111111111111-0-prod1-iglu-jsonpaths

  • created as part of the new pipeline to store JSONPaths files when automigration is not activated
  • in actuality, the bucket from the old pipeline, snowplow-com-acme-igl-jsonpaths, is still being used
  • recommended lifecycle rule: none

 

com-acme-111111111111-0-prod1-iglu-schemas

  • also used alongside your new Iglu servers
  • recommended lifecycle rule: none
