Tips for setting up dbt Models with Iceberg

Kyla Oyamot

Before you Start

Currently, Snowplow does not support setting up dbt models using Lake Loaders, and we are by no means experts at Spark configuration. Help with any debugging or setup-related issues will likely be out of scope according to our Statement of Support.

Overview

Running Snowplow dbt packages with Iceberg on Amazon EMR requires careful setup of:

  • The EMR cluster and software configuration
  • Spark and Iceberg session properties
  • dbt profiles
  • S3 storage for Iceberg tables

It is critical that these configurations align exactly, particularly the Spark catalog name and the S3 storage paths.

EMR and Software Requirements

  • EMR Version: 7.4.0 or later recommended
  • Spark Version: 3.5.2
  • Hive Version: 3.1.3
  • Hadoop Version: 3.4.0
  • Iceberg JARs: Ensure iceberg-spark3-runtime.jar is installed
  • dbt Versions:
    • dbt-core==1.9.2
    • dbt-spark==1.9.1
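
As a rough sketch, a matching cluster can be launched with the AWS CLI and the pinned dbt versions installed with pip. The cluster name, instance type and count, and the iceberg-config.json file name below are placeholders; adjust them (along with IAM roles and networking) for your environment.

# Launch an EMR 7.4.0 cluster with Spark and Hive, applying the Iceberg configuration
aws emr create-cluster \
  --name "snowplow-dbt-iceberg" \
  --release-label emr-7.4.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://iceberg-config.json

# On the machine that runs dbt, install the pinned versions listed above
pip install "dbt-core==1.9.2" "dbt-spark[PyHive]==1.9.1"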

Spark Configuration for Iceberg

Set the following properties on your EMR cluster:

[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
      "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
      "spark.sql.catalog.spark_catalog.warehouse": "s3://your-bucket-name/",
      "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    }
  }
]

Important:

  • The catalog must be named spark_catalog. Renaming it (e.g., to glue_catalog) will break Snowplow models.
  • The warehouse must point to the specific S3 bucket where Iceberg tables will be stored.
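
To confirm the catalog wiring before involving dbt, you can run a quick check with spark-sql on the EMR primary node. This is only a sanity-check sketch; the iceberg_check database and smoke_test table are throwaway names.

# Create and describe a throwaway Iceberg table in spark_catalog
spark-sql -e "
  CREATE DATABASE IF NOT EXISTS iceberg_check;
  CREATE TABLE IF NOT EXISTS spark_catalog.iceberg_check.smoke_test (id bigint) USING iceberg;
  DESCRIBE TABLE EXTENDED spark_catalog.iceberg_check.smoke_test;
"

The DESCRIBE TABLE EXTENDED output should report an Iceberg provider and a table location under your warehouse bucket.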

dbt profiles.yml Configuration

Example configuration:

your-profile-name:
  outputs:
    dev:
      type: spark
      method: thrift
      host: <emr-master-dns>
      port: 10001
      schema: your_schema
      threads: 4
      session_properties:
        spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkCatalog
        spark.sql.catalog.spark_catalog.type: hive
        spark.sql.catalog.spark_catalog.warehouse: "s3://your-bucket-name/"
        spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  target: dev

Key points:

  • Use method: thrift.
  • Connect to port 10001 where Spark Thrift Server is running.
  • Include Iceberg session properties.
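
Before running dbt, it can help to confirm the Thrift endpoint and session properties from outside dbt, for example with beeline on the EMR primary node (the hostname below is a placeholder):

# Connect to the Spark Thrift Server and check the catalog configuration
beeline -u "jdbc:hive2://<emr-master-dns>:10001/default" \
  -e "SET spark.sql.catalog.spark_catalog; SHOW DATABASES;"

If SET comes back undefined or SHOW DATABASES fails, revisit the Spark configuration before moving on to dbt.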

Deployment Checklist

  • Launch EMR with correct Spark/Iceberg configurations
  • Start Thrift server:
    sudo /usr/lib/spark/sbin/start-thriftserver.sh
    
  • Connect to the cluster using dbt and Thrift
  • Run dbt debug to confirm target.type = spark
  • Run dbt run to deploy models (example command sequence below)
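
A typical sequence, assuming the profiles.yml above and a project that uses one of the Snowplow packages (snowplow_unified is shown as an example), might look like this:

dbt deps      # install the Snowplow package declared in packages.yml
dbt debug     # verify the Thrift connection; target.type should be spark
dbt run --select snowplow_unified    # or whichever Snowplow package/models you use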

Common Pitfalls

  • Catalog name not set to spark_catalog
  • Warehouse path missing or incorrect
  • Iceberg runtime JARs not installed properly
  • Old or incorrect snowplow__start_date (see the override example below)
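
For the last point, snowplow__start_date is normally set in the vars block of dbt_project.yml, but as a quick check you can override it at run time; the date below is just an example:

# Override the start date for a single run
dbt run --vars '{"snowplow__start_date": "2025-01-01"}'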

Special Note: Databricks tblproperties

Some Snowplow dbt models include the following tblproperties in their compiled SQL:

tblproperties ('delta.autoOptimize.optimizeWrite' = 'true', 'delta.autoOptimize.autoCompact' = 'true')

These properties are intended for Databricks environments to optimize Delta Lake tables. When running on EMR with Iceberg, they may appear but are generally harmless.

  • Spark on EMR usually ignores these properties.
  • Warnings may appear, but they do not impact successful table creation.
  • There is no need to remove or modify these properties for EMR deployments.

If table creation errors persist, advanced users can try removing the tblproperties section from the compiled SQL and re-running manually as a troubleshooting step.
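
If you go down that route, one way to locate and strip the clause is sketched below; the model file path is a placeholder, and the sed pattern assumes the clause sits on a single line.

# Find compiled models that contain the Databricks-only clause
grep -rl "tblproperties" target/compiled/

# Strip the clause from one compiled file before re-running it manually
sed -i "s/tblproperties *([^)]*)//" target/compiled/path/to/model.sql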