How to set up and load to Databricks Deltalake

Alec Moloney  

We support Databricks Deltalake as a destination with our RDB Loader. You can set this up for token-based authentication using our self-service flow in Snowplow Console.

 

Prerequisites

Before starting you'll need:

  • Access to your Databricks workspace console, with permissions to create a cluster and execute SQL.
  • If setting up on GCP, access to the project where your Snowplow pipeline is deployed, with permissions to create a service account.

 

1. Setting up a GCP service account

Proceed straight to Step 2 if your pipeline isn't set up on GCP.

Databricks requires a service account with access to Google Cloud Storage (GCS) so that it can load data into the tables. Details can be found in How to set up Google Cloud service account. The service account needs to be supplied to Snowplow so that we can grant it access to the GCS bucket that contains the data to be loaded into Databricks.

There is no need for you to grant access to any buckets. Snowplow only requires the service account ID, which is in the form of an email address. We don't need the key JSON that is generated, but you will need it to configure Spark on the cluster later.
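
If you manage your GCP resources with Terraform, the sketch below shows one way to create such a service account with the Google provider; the resource names, account_id and project value are illustrative placeholders, not values taken from your pipeline:

resource "google_service_account" "databricks_loader" {
  project      = "<project_id>"
  account_id   = "snowplow-databricks-loader"
  display_name = "Snowplow RDB Loader for Databricks"
}

# The key JSON is only needed by you, for the Spark configuration in step 2.
# It is available (base64-encoded) as google_service_account_key.databricks_loader.private_key;
# keep your Terraform state secure.
resource "google_service_account_key" "databricks_loader" {
  service_account_id = google_service_account.databricks_loader.name
}

# This email address is the service account ID to share with Snowplow.
output "service_account_email" {
  value = google_service_account.databricks_loader.email
}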

 

2. Create a cluster

We describe two ways of creating this cluster: you can follow the steps for manual setup via Databricks console outlined below or use the suggested Terraform code.

 

Setup via Databricks console

Create a new cluster, following the Databricks documentation, with the following settings:

  • Single node cluster
  • The runtime version must be 13.0 or greater
  • If your event volume is greater than 1 billion events per month, please check the box to enable Photon Acceleration
  • The smallest node types are adequate for all event volumes, e.g.
    • m4.large on AWS
    • n2-highmem-2 on GCP
    • DS3 v2 on Azure
    • or the equivalent available node type if you enable photon acceleration
  • Auto-terminate after 30 minutes

Scale up if appropriate for your volumes and workloads. 

Check the Databricks pricing calculator for the cost of the different cluster configurations. Note that Photon acceleration is approximately twice as expensive as a standard cluster, but it has a big impact on the speed of loading Snowplow events. At high event volumes, it might be required to prevent the Loader from becoming a bottleneck.

We don't recommend using a SQL endpoint for loading, because this compute is more expensive than a single node general purpose cluster.

 

Setup via Terraform

Here's an example cluster configuration using the Databricks Terraform provider:

data "databricks_node_type" "smallest" {
  local_disk = true
  category   = "General Purpose"
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "single_node" {
  cluster_name            = "DatabricksLoaderCluster"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 30
  num_workers             = 0

  spark_conf = {
    "spark.databricks.cluster.profile" : "singleNode"
    "spark.master" : "local[*, 4]"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}

Mandatory configuration settings on GCP

The following settings must be defined in the cluster's Spark configuration:

spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>

Replace the <> placeholders with the values of those exact field names from your key JSON file (see step 1).
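
If you created the cluster with Terraform, the same settings can be added to the cluster's spark_conf map. The sketch below is an illustration only: it assumes the key JSON from step 1 is saved locally as service-account-key.json (a hypothetical filename) and reads the required fields with jsondecode. Be aware that the private key will then be present in your Terraform state, so treat the state as sensitive.

locals {
  # Key JSON generated in step 1 (hypothetical local filename).
  databricks_sa_key = jsondecode(file("${path.module}/service-account-key.json"))

  # Merge this map into spark_conf of the databricks_cluster resource above.
  gcp_spark_conf = {
    "spark.hadoop.google.cloud.auth.service.account.enable"  = "true"
    "spark.hadoop.fs.gs.auth.service.account.email"          = local.databricks_sa_key.client_email
    "spark.hadoop.fs.gs.project.id"                          = local.databricks_sa_key.project_id
    "spark.hadoop.fs.gs.auth.service.account.private.key"    = local.databricks_sa_key.private_key
    "spark.hadoop.fs.gs.auth.service.account.private.key.id" = local.databricks_sa_key.private_key_id
  }
}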

Optional configuration settings on AWS

If you use AWS Glue Data Catalog as your metastore, follow these Databricks instructions for the relevant Spark configurations. You will need to set spark.databricks.hive.metastore.glueCatalog.enabled true and spark.hadoop.hive.metastore.glue.catalogid <aws-account-id-for-glue-catalog> in the Spark configuration.

You can configure your cluster with an instance profile if it needs extra permissions to access resources, for example if the S3 bucket holding the delta lake is in a different AWS account.
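
As a Terraform sketch, these optional settings slot into the cluster resource from the earlier example; it is shown here as a full variant of that resource, with placeholder values for the Glue catalog account ID and the instance profile ARN:

resource "databricks_cluster" "single_node" {
  cluster_name            = "DatabricksLoaderCluster"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 30
  num_workers             = 0

  spark_conf = {
    "spark.databricks.cluster.profile"                    = "singleNode"
    "spark.master"                                        = "local[*, 4]"
    # Only needed if AWS Glue Data Catalog is your metastore.
    "spark.databricks.hive.metastore.glueCatalog.enabled" = "true"
    "spark.hadoop.hive.metastore.glue.catalogid"          = "<aws-account-id-for-glue-catalog>"
  }

  # Only needed if the cluster requires extra AWS permissions,
  # for example when the delta lake bucket is in a different AWS account.
  aws_attributes {
    instance_profile_arn = "arn:aws:iam::<aws-account-id>:instance-profile/<profile-name>"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}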

Other optional cluster settings

You might want to configure cluster-level permissions by following the Databricks instructions on cluster access control. Note that Snowplow's RDB Loader must be able to restart the cluster if it is terminated.

If the Databricks account has IP safelisting enabled, Snowplow needs to know so that we can supply the two NAT IP/Elastic IP addresses from which your AWS/GCP account will connect to Databricks.

 

3. Note the JDBC connection details for the cluster

  • In the Databricks UI, click on "Compute" in the sidebar
  • Click on the RDB Loader cluster and navigate to "Advanced options"
  • Click on the "JDBC/ODBC" tab
  • On the tab, you will find the following values, which map to the values that Snowplow requires:

Field                  JDBC/ODBC tab value
deltalake_host         Server Hostname
deltalake_http_path    HTTP Path
deltalake_port         Port
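
If you created the cluster with Terraform, two of these values can be derived without the UI: deltalake_host is your workspace hostname (the host part of the workspace URL), and the HTTP path for an all-purpose cluster normally follows the format sql/protocolv1/o/<workspace-id>/<cluster-id>. The outputs below are a sketch based on that assumption, with <workspace-id> as a placeholder; always verify the values against the JDBC/ODBC tab.

output "deltalake_http_path" {
  # Assumed format for all-purpose clusters; <workspace-id> is a placeholder.
  value = "sql/protocolv1/o/<workspace-id>/${databricks_cluster.single_node.id}"
}

output "deltalake_port" {
  # The JDBC/ODBC port is typically 443.
  value = 443
}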

 

4. Create access token for the RDB Loader

Note: The access token must not have a specified lifetime. Otherwise, RDB Loader will stop working when the token expires.

  • Navigate to the user settings in your Databricks workspace
    • For Databricks hosted on AWS, the "Settings" link is in the lower left corner of the side panel
    • For Databricks hosted on Azure, "User Settings" is an option in the drop-down menu in the top right corner
  • Go to the "Access Tokens" tab
  • Click the "Generate New Token" button
  • Optionally enter a description (comment). Leave the expiration period empty
  • Click the "Generate" button
  • Copy the generated token and store it in a secure location
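
Alternatively, if you manage Databricks with Terraform, the provider's databricks_token resource can create the token. A minimal sketch, with an illustrative comment and lifetime_seconds deliberately omitted so that the token never expires:

resource "databricks_token" "rdb_loader" {
  # Description shown in the Databricks UI (illustrative).
  comment = "Snowplow RDB Loader"
  # No lifetime_seconds set, so the token does not expire.
}

output "rdb_loader_token" {
  value     = databricks_token.rdb_loader.token_value
  sensitive = true
}

Note that the token value is stored in your Terraform state, so treat the state file as a secret.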

 

5. Create schema (database) 

Note: You can change the name of the schema to be used (the default is snowplow)

-- Uncomment if you want to use a custom Unity catalog
-- and replace with your own value.
-- USE CATALOG <custom_unity_catalog>;

CREATE SCHEMA IF NOT EXISTS snowplow
-- Uncomment if you want tables created by Snowplow to be
-- located in a non-default bucket or directory.
-- LOCATION 's3://<custom_location>/'
;
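
If you use Unity Catalog and manage Databricks with Terraform, a databricks_schema resource is one possible alternative to the SQL above. The catalog name and storage root below are placeholders, and storage_root is only needed if you want a non-default location:

resource "databricks_schema" "snowplow" {
  # Placeholder: the Unity Catalog to create the schema in.
  catalog_name = "<custom_unity_catalog>"
  name         = "snowplow"

  # Optional, equivalent to the LOCATION clause above (placeholder bucket).
  # storage_root = "s3://<custom_location>/"
}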

 

6. Sharing all the details back with Snowplow 

We require the following destination details:

  • The service account ID (GCP only)
  • The Deltalake host, http_path and port
  • The access token
  • The schema name
  • The catalog name, if you are using a custom Unity catalog
 

AWS

Please submit the destination details at the Snowplow BDP Console / Setup Databricks page

GCP

Please submit the details via the Snowplow BDP Console using secure messaging.