How to load data into Delta with an S3 bucket

Mariia Khusainova  

This playbook describes how to configure an S3 bucket so that Snowplow can load events into your Delta data lake.

Delta on S3

Step 1: Create an S3 bucket where Snowplow will create the data lake and load the events.

The bucket can be created in any of your AWS sub-accounts, but not in the same account into which Snowplow is deploying the rest of the pipeline.

AWS documentation on how to create a bucket

Example:

aws s3api create-bucket \
--bucket=<BUCKET_NAME> \
--region=<REGION> \
--create-bucket-configuration='{
"LocationConstraint": "<REGION>"
}'

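Before running the command above, it can save a round trip to check the candidate name against S3's bucket naming rules locally. The sketch below is a hypothetical pre-check (the bucket name shown is an example value, not one you must use); it assumes a POSIX shell with grep. Note also that for the us-east-1 region, AWS requires the --create-bucket-configuration flag to be omitted entirely.

```shell
# Hypothetical pre-check: S3 bucket names must be 3-63 characters of
# lowercase letters, digits, dots, or hyphens, and must start and end
# with a letter or digit. BUCKET_NAME below is an example value.
BUCKET_NAME="snowplow-delta-lake-prod"
if printf '%s' "$BUCKET_NAME" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'; then
  echo "bucket name looks valid"
else
  echo "bucket name violates S3 naming rules"
fi
```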
Step 2: Create a role with permissions to access the bucket

This is the IAM role that Snowplow will assume when loading data into your S3 bucket. This is needed because we assume the bucket is created in a different sub-account from the one where Snowplow is deploying the rest of your pipeline.

AWS documentation on how to create a role that delegates permissions to another principal.

For example, to do this using the aws command line tool:

aws iam create-role \
--role-name=snowplow-lake-loader-prod \
--description="Used by Snowplow to load events into delta" \
--assume-role-policy-document='{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"AWS": "<ACCOUNT_ID>"
},
"Condition": {
"StringEquals": {
"sts:ExternalId": "<EXTERNAL_ID>"
}
},
"Action": "sts:AssumeRole"
}
]
}'

Replace <ACCOUNT_ID> with the ID of the AWS account into which Snowplow is deploying your pipeline.

Replace <EXTERNAL_ID> with an ID provided to you by Snowplow.
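A malformed trust policy is a common cause of create-role failures, so it can help to validate the JSON locally before submitting it. This sketch assumes python3 is available; the account ID and external ID shown are placeholder sample values standing in for the real ones described above.

```shell
# Sketch: validate the trust policy JSON locally before passing it to
# `aws iam create-role`. The account ID and external ID below are
# placeholder sample values.
TRUST_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": { "AWS": "123456789012" },
      "Condition": { "StringEquals": { "sts:ExternalId": "example-external-id" } },
      "Action": "sts:AssumeRole"
    }
  ]
}'
printf '%s' "$TRUST_POLICY" | python3 -m json.tool > /dev/null \
  && echo "trust policy JSON is well-formed"
```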

Step 3: Grant bucket access to the role

This step ensures that the new IAM role has permissions to write events to your new bucket.

AWS documentation on how to attach an access policy to a role.

For example, if you choose to use the aws command line tool:

aws iam put-role-policy \
--role-name=snowplow-lake-loader-prod \
--policy-name=snowplow-lake-loader-policy \
--policy-document='{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:DeleteObject",
"s3:PutObject",
"s3:AbortMultipartUpload"
],
"Resource": "arn:aws:s3:::<BUCKET_NAME>/*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:ListBucketMultipartUploads"
],
"Resource": "arn:aws:s3:::<BUCKET_NAME>"
}
]
}'

Replace <BUCKET_NAME> with the bucket you created in step 1. The --role-name refers to the role you created in step 2.
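Once the role and policy are in place, you can optionally smoke-test the setup from the account Snowplow deploys into. The sketch below assumes the aws CLI and python3 are installed; the role and bucket names are the placeholders from the steps above, and the credentials JSON shown is a hardcoded sample of the `aws sts assume-role` response shape, not real output.

```shell
# Optional smoke test (sketch). Step 1: assume the role using the
# external ID provided by Snowplow (run this yourself; it prints JSON):
#
#   aws sts assume-role \
#     --role-arn arn:aws:iam::<ACCOUNT_ID>:role/snowplow-lake-loader-prod \
#     --role-session-name smoke-test \
#     --external-id <EXTERNAL_ID>
#
# Step 2: extract the temporary credentials from the response.
# CREDS below is a hardcoded sample of the response shape only.
CREDS='{"Credentials":{"AccessKeyId":"ASIAEXAMPLE","SecretAccessKey":"...","SessionToken":"..."}}'
export AWS_ACCESS_KEY_ID=$(printf '%s' "$CREDS" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["Credentials"]["AccessKeyId"])')
# Export SecretAccessKey and SessionToken the same way, then confirm
# write access with an upload and delete:
#
#   echo test | aws s3 cp - s3://<BUCKET_NAME>/smoke-test.txt
#   aws s3 rm s3://<BUCKET_NAME>/smoke-test.txt
echo "$AWS_ACCESS_KEY_ID"
```

A successful upload and delete confirms the role carries the permissions granted in step 3.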

Step 4: Provide details to Snowplow

We require the following details from you:

  • The name of the S3 bucket where we should load events
  • The AWS region of the S3 bucket, e.g. us-east-1
  • The name of a sub-directory inside the bucket where we should create the events table. If not specified, the sub-directory will be called "events".
  • The ARN of the role you created in step 2.

For loading into Delta Lake from GCP, see this article.