How to set up and load to Databricks Deltalake

Alec Moloney  

We support Databricks Deltalake as a destination with our RDB Loader. You can set this up for token-based authentication using our self-service flow in Snowplow Console.

 

Prerequisites

Before starting you'll need:

  • Access to your Databricks workspace console, with permissions to create a cluster and execute SQL.
  • If setting up on GCP, access to the project where your Snowplow pipeline is deployed, with permissions to create a service account.

 

1. Setting up a GCP service account

Proceed straight to Step 2 if your pipeline isn't set up on GCP.

Databricks requires a service account with access to Google Cloud Storage (GCS) so that it can load data into the tables. Details can be found in How to set up Google Cloud service account. The service account needs to be supplied to Snowplow so that we can grant it access to the GCS bucket that contains the data to be loaded into Databricks.

There is no need for you to grant access to any buckets. Snowplow only requires the service account ID, which is in the form of an email address. We don't need the key JSON that is generated, but you will need it to configure Spark on the cluster later.
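
If you manage your GCP resources with Terraform, the sketch below shows one way to create such a service account with the Google provider; the resource names, account_id and project value are illustrative placeholders, not values taken from your pipeline:

resource "google_service_account" "databricks_loader" {
  project      = "<project_id>"
  account_id   = "snowplow-databricks-loader"
  display_name = "Snowplow RDB Loader for Databricks"
}

# The key JSON is only needed by you, for the Spark configuration in step 2.
# It is available (base64-encoded) as google_service_account_key.databricks_loader.private_key;
# keep your Terraform state secure.
resource "google_service_account_key" "databricks_loader" {
  service_account_id = google_service_account.databricks_loader.name
}

# This email address is the service account ID to share with Snowplow.
output "service_account_email" {
  value = google_service_account.databricks_loader.email
}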

 

2. Create a cluster

We describe two ways of creating this cluster: you can follow the steps for manual setup via Databricks console outlined below or use the suggested Terraform code.

 

Setup via Databricks console

Create a new cluster, following the Databricks documentation, with the following settings:

  • Single node cluster
  • The runtime version must be 13.0 or greater
  • If your event volume is greater than 1 billion events per month, please check the box to enable Photon Acceleration
  • The smallest node types are adequate for all event volumes, e.g.
    • m4.large on AWS
    • n2-highmem-2 on GCP
    • DS3 v2 on Azure
    • or the equivalent available node type if you enable photon acceleration
  • Auto-terminate after 30 minutes

Scale up if appropriate for your volumes and workloads. 

Check the Databricks pricing calculator for the cost of the different cluster configurations. Note that Photon acceleration is approximately twice as expensive as a standard cluster, but it has a big impact on the speed of loading Snowplow events. At high event volumes, it might be required to prevent the Loader from becoming a bottleneck.

We don't recommend using a SQL endpoint for loading, because this compute is more expensive than a single node general purpose cluster.

 

Setup via Terraform

Here's an example cluster configuration using the Databricks Terraform provider:

data "databricks_node_type" "smallest" {
  local_disk = true
  category   = "General Purpose"
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "single_node" {
  cluster_name            = "DatabricksLoaderCluster"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 30
  num_workers             = 0

  spark_conf = {
    "spark.databricks.cluster.profile" : "singleNode"
    "spark.master" : "local[*, 4]"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}

Mandatory configuration settings on GCP

The following settings must be defined in the cluster's Spark configuration:

spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>

Replace the <> placeholders with the values of those exact field names from your key JSON file (see step 1).
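
If you created the cluster with Terraform, the same settings can be added to the cluster's spark_conf map. The sketch below is an illustration only: it assumes the key JSON from step 1 is saved locally as service-account-key.json (a hypothetical filename) and reads the required fields with jsondecode. Be aware that the private key will then be present in your Terraform state, so treat the state as sensitive.

locals {
  # Key JSON generated in step 1 (hypothetical local filename).
  databricks_sa_key = jsondecode(file("${path.module}/service-account-key.json"))

  # Merge this map into spark_conf of the databricks_cluster resource above.
  gcp_spark_conf = {
    "spark.hadoop.google.cloud.auth.service.account.enable"  = "true"
    "spark.hadoop.fs.gs.auth.service.account.email"          = local.databricks_sa_key.client_email
    "spark.hadoop.fs.gs.project.id"                          = local.databricks_sa_key.project_id
    "spark.hadoop.fs.gs.auth.service.account.private.key"    = local.databricks_sa_key.private_key
    "spark.hadoop.fs.gs.auth.service.account.private.key.id" = local.databricks_sa_key.private_key_id
  }
}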

Optional configuration settings on AWS

If you use AWS Glue Data Catalog as your metastore, follow these Databricks instructions for the relevant Spark configurations. You will need to set spark.databricks.hive.metastore.glueCatalog.enabled true and spark.hadoop.hive.metastore.glue.catalogid <aws-account-id-for-glue-catalog> in the Spark configuration.

You can configure your cluster with an instance profile if it needs extra permissions to access resources, for example if the S3 bucket holding the delta lake is in a different AWS account.
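
As a Terraform sketch, these optional settings slot into the cluster resource from the earlier example; it is shown here as a full variant of that resource, with placeholder values for the Glue catalog account ID and the instance profile ARN:

resource "databricks_cluster" "single_node" {
  cluster_name            = "DatabricksLoaderCluster"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 30
  num_workers             = 0

  spark_conf = {
    "spark.databricks.cluster.profile"                    = "singleNode"
    "spark.master"                                        = "local[*, 4]"
    # Only needed if AWS Glue Data Catalog is your metastore.
    "spark.databricks.hive.metastore.glueCatalog.enabled" = "true"
    "spark.hadoop.hive.metastore.glue.catalogid"          = "<aws-account-id-for-glue-catalog>"
  }

  # Only needed if the cluster requires extra AWS permissions,
  # for example when the delta lake bucket is in a different AWS account.
  aws_attributes {
    instance_profile_arn = "arn:aws:iam::<aws-account-id>:instance-profile/<profile-name>"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}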

Other optional cluster settings

You might want to configure cluster-level permissions by following the Databricks instructions on cluster access control. Note that Snowplow's RDB Loader must be able to restart the cluster if it is terminated.

If the Databricks account has IP safelisting enabled, Snowplow needs to know so that we can supply the two NAT IP/Elastic IP addresses from which your AWS/GCP account will connect to Databricks.

 

3. Note the JDBC connection details for the cluster

  • In the Databricks UI, click on "Compute" in the sidebar
  • Click on the RDB Loader cluster and navigate to "Advanced options"
  • Click on the "JDBC/ODBC" tab
  • On the tab, you will find the following values, which map to the values that Snowplow requires:

Field                  JDBC/ODBC tab value
deltalake_host         Server Hostname
deltalake_http_path    HTTP Path
deltalake_port         Port
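
If you created the cluster with Terraform, two of these values can be derived without the UI: deltalake_host is your workspace hostname (the host part of the workspace URL), and the HTTP path for an all-purpose cluster normally follows the format sql/protocolv1/o/<workspace-id>/<cluster-id>. The outputs below are a sketch based on that assumption, with <workspace-id> as a placeholder; always verify the values against the JDBC/ODBC tab.

output "deltalake_http_path" {
  # Assumed format for all-purpose clusters; <workspace-id> is a placeholder.
  value = "sql/protocolv1/o/<workspace-id>/${databricks_cluster.single_node.id}"
}

output "deltalake_port" {
  # The JDBC/ODBC port is typically 443.
  value = 443
}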

 

4. Create access token for the RDB Loader

Note: The access token must not have a specified lifetime. Otherwise, RDB Loader will stop working when the token expires.

  • Navigate to the user settings in your Databricks workspace
    • For Databricks hosted on AWS, the "Settings" link is in the lower left corner of the side panel
    • For Databricks hosted on Azure, "User Settings" is an option in the drop-down menu in the top right corner
  • Go to the "Access Tokens" tab
  • Click the "Generate New Token" button
  • Optionally enter a description (comment). Leave the expiration period empty
  • Click the "Generate" button
  • Copy the generated token and store it in a secure location
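
Alternatively, if you manage Databricks with Terraform, the provider's databricks_token resource can create the token. A minimal sketch, with an illustrative comment and lifetime_seconds deliberately omitted so that the token never expires:

resource "databricks_token" "rdb_loader" {
  # Description shown in the Databricks UI (illustrative).
  comment = "Snowplow RDB Loader"
  # No lifetime_seconds set, so the token does not expire.
}

output "rdb_loader_token" {
  value     = databricks_token.rdb_loader.token_value
  sensitive = true
}

Note that the token value is stored in your Terraform state, so treat the state file as a secret.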

 

5. Create schema (database) 

Note: You can change the name of the schema to be used (the default is snowplow)

-- Uncomment if you want to use a custom Unity catalog
-- and replace with your own value.
-- USE CATALOG <custom_unity_catalog>;

CREATE SCHEMA IF NOT EXISTS snowplow
-- Uncomment if you want tables created by Snowplow to be
-- located in a non-default bucket or directory.
-- LOCATION 's3://<custom_location>/'
;
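
If you use Unity Catalog and manage Databricks with Terraform, a databricks_schema resource is one possible alternative to the SQL above. The catalog name and storage root below are placeholders, and storage_root is only needed if you want a non-default location:

resource "databricks_schema" "snowplow" {
  # Placeholder: the Unity Catalog to create the schema in.
  catalog_name = "<custom_unity_catalog>"
  name         = "snowplow"

  # Optional, equivalent to the LOCATION clause above (placeholder bucket).
  # storage_root = "s3://<custom_location>/"
}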

 

6. Sharing all the details back with Snowplow 

We require the following destination details:

  • The service account ID (GCP only)
  • The Deltalake host, http_path and port
  • The access token
  • The schema name
  • The catalog name, if you are using a custom Unity catalog
 

AWS

Please submit the destination details at the Snowplow BDP Console / Setup Databricks page

GCP

Please submit the details via the Snowplow BDP Console using secure messaging.