This playbook describes how to configure an S3 bucket and Glue catalog so that Snowplow can load events into your Iceberg data lake.
Iceberg on S3
Step 1: Create an S3 bucket where Snowplow will create the data lake and load the events.
The bucket can be created in any AWS sub-account other than the one into which Snowplow is deploying the rest of the pipeline.
AWS documentation on how to create a bucket
Example:
aws s3api create-bucket \
--bucket=<BUCKET_NAME> \
--region=<REGION> \
--create-bucket-configuration='{
"LocationConstraint": "<REGION>"
}'
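One caveat with the example above: S3 rejects an explicit LocationConstraint of us-east-1, so in that region the --create-bucket-configuration flag has to be left out. The following is a minimal sketch of a wrapper that handles both cases. The function names bucket_config_json and create_lake_bucket are hypothetical, not part of any Snowplow or AWS tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the JSON for --create-bucket-configuration,
# or print nothing for us-east-1, where S3 rejects an explicit LocationConstraint.
bucket_config_json() {
  region="$1"
  if [ "$region" != "us-east-1" ]; then
    printf '{"LocationConstraint": "%s"}' "$region"
  fi
}

# Hypothetical wrapper around the create-bucket command shown above.
create_lake_bucket() {
  bucket="$1"
  region="$2"
  config="$(bucket_config_json "$region")"
  if [ -n "$config" ]; then
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --create-bucket-configuration "$config"
  else
    aws s3api create-bucket --bucket "$bucket" --region "$region"
  fi
}

# Usage (requires AWS credentials for the account that will own the bucket):
#   create_lake_bucket <BUCKET_NAME> <REGION>
```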
Step 2: Create a Glue database which Snowplow will use as the Iceberg catalog.
You will connect to this database when you query events in the data lake. The database must be in the same region and AWS sub-account as the S3 bucket you created in the previous step.
AWS documentation on how to create a Glue database
Example:
aws glue create-database \
--region=<REGION> \
--database-input='{
"Name": "<DATABASE_NAME>",
"Description": "Snowplow events"
}'
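Since the database and the bucket have to share a region, it may be worth confirming where the bucket actually lives. The sketch below assumes the aws CLI is configured for the account that owns the bucket. The function names are hypothetical. S3 reports a null LocationConstraint for buckets in us-east-1, which the CLI's text output prints as "None", so the helper maps that value back to a region name.

```shell
#!/bin/sh
# Map the LocationConstraint reported by S3 to a region name.
# A null constraint (printed as "None" by --output text) means us-east-1.
normalize_bucket_region() {
  case "$1" in
    None|null|"") echo "us-east-1" ;;
    *) echo "$1" ;;
  esac
}

# Hypothetical check: succeed only if the bucket is in the expected region.
bucket_in_region() {
  bucket="$1"
  expected="$2"
  raw="$(aws s3api get-bucket-location --bucket "$bucket" \
    --query LocationConstraint --output text)" || return 1
  [ "$(normalize_bucket_region "$raw")" = "$expected" ]
}

# Usage (requires AWS credentials):
#   bucket_in_region <BUCKET_NAME> <REGION> && echo "regions match"
```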
Step 3: Create a role which Snowplow can assume when loading data into your S3 bucket and Glue database.
Example using the aws command line tool:
aws iam create-role \
--role-name=snowplow-lake-loader-prod \
--description="Used by Snowplow to load events into iceberg" \
--assume-role-policy-document='{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"AWS": "<ACCOUNT_ID>"
},
"Condition": {
"StringEquals": {
"sts:ExternalId": "<EXTERNAL_ID>"
}
},
"Action": "sts:AssumeRole"
}
]
}'
Replace <ACCOUNT_ID> with the ID of the AWS account into which Snowplow is deploying your pipeline.
Replace <EXTERNAL_ID> with an ID provided to you by Snowplow.
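The ARN of this role is needed later when you hand details over to Snowplow. As a rough sketch, it can either be read back from IAM or assembled from the standard IAM role ARN layout. Both function names below are hypothetical, and the account ID passed to iam_role_arn is assumed to be the account in which you created the role, not the Snowplow account named in the trust policy.

```shell
#!/bin/sh
# Assemble an IAM role ARN from the owning account ID and the role name.
iam_role_arn() {
  printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}

# Read the ARN back from IAM (requires AWS credentials).
lookup_role_arn() {
  aws iam get-role --role-name "$1" --query 'Role.Arn' --output text
}

# Usage:
#   lookup_role_arn snowplow-lake-loader-prod
#   iam_role_arn <YOUR_ACCOUNT_ID> snowplow-lake-loader-prod
```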
Step 4: Grant the new IAM role permission to write events to your bucket and to manage the events table in your Glue database.
Example using the aws command line tool:
aws iam put-role-policy \
--role-name=snowplow-lake-loader-prod \
--policy-name=snowplow-lake-loader-policy \
--policy-document='{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:DeleteObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::<BUCKET_NAME>/*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::<BUCKET_NAME>"
},
{
"Effect": "Allow",
"Action": [
"glue:CreateTable",
"glue:GetTable",
"glue:UpdateTable"
],
"Resource": [
"arn:aws:glue:<REGION>:<ACCOUNT_ID>:catalog",
"arn:aws:glue:<REGION>:<ACCOUNT_ID>:database/<DATABASE_NAME>",
"arn:aws:glue:<REGION>:<ACCOUNT_ID>:table/<DATABASE_NAME>/events"
]
}
]
}'
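Replace <BUCKET_NAME> with the bucket you created in step 1, and <DATABASE_NAME> with the database you created in step 2. Replace <REGION> and <ACCOUNT_ID> in the Glue ARNs with the region and the ID of the AWS account that contain the Glue catalog. Note that this account may differ from the <ACCOUNT_ID> used in the trust policy in step 3. The --role-name refers to the role you created in step 3.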
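The three Glue resource ARNs in the policy share a common prefix, so they are easy to mistype when filled in by hand. The sketch below, using a hypothetical glue_resource_arns helper, prints them from the region, the account ID of the Glue catalog, and the database name. The table name events is taken from the policy above.

```shell
#!/bin/sh
# Print the catalog, database, and table ARNs used in the policy, one per line.
# Arguments: region, account ID that owns the Glue catalog, database name.
glue_resource_arns() {
  region="$1"
  account="$2"
  database="$3"
  prefix="arn:aws:glue:${region}:${account}"
  printf '%s\n' \
    "${prefix}:catalog" \
    "${prefix}:database/${database}" \
    "${prefix}:table/${database}/events"
}

# Usage:
#   glue_resource_arns <REGION> <ACCOUNT_ID> <DATABASE_NAME>
```

After applying the policy, aws iam get-role-policy with the same --role-name and --policy-name returns the stored document, which can be compared against this output.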
Step 5: Provide details to Snowplow.
The following is required:
- The name of the S3 bucket where we should load events.
- The AWS account ID of the Glue catalog. This is a 12-digit number, such as 012345678901.
- The AWS region of the S3 bucket and Glue catalog, e.g. us-east-1.
- (Optional) The name of a sub-directory inside the bucket where we should create the Iceberg table. If not specified, the sub-directory will be called "events".
- The ARN of the role you created in step 3.
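Before sending these details, a quick sanity check can catch a mistyped account ID, for example one that lost its leading zero. The following is a sketch only. The function names is_account_id and current_account_id are hypothetical, and current_account_id assumes the aws CLI is logged in to the account that holds the Glue catalog.

```shell
#!/bin/sh
# Succeed only if the argument is exactly 12 decimal digits.
is_account_id() {
  case "$1" in
    ""|*[!0-9]*) return 1 ;;
  esac
  [ "${#1}" -eq 12 ]
}

# Print the account ID of the credentials currently in use (requires AWS credentials).
current_account_id() {
  aws sts get-caller-identity --query Account --output text
}

# Usage:
#   is_account_id "$(current_account_id)" && echo "account ID looks valid"
```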