Before You Start
Currently, Snowplow does not support setting up dbt models using the Lake Loaders, and we are by no means experts at Spark configuration. Help with any debugging or setup-related issues will likely be out of scope according to our Statement of Support.
Overview
Running Snowplow dbt packages with Iceberg on Amazon EMR requires careful setup of:
- The EMR cluster and software configuration
- Spark and Iceberg session properties
- dbt profiles
- S3 storage for Iceberg tables
It is critical that these configurations align exactly, particularly the Spark catalog name and the storage paths.
EMR and Software Requirements
- EMR Version: 7.4.0 or later recommended
- Spark Version: 3.5.2
- Hive Version: 3.1.3
- Hadoop Version: 3.4.0
- Iceberg JARs: Ensure iceberg-spark3-runtime.jar is installed
- dbt Versions:
  - dbt-core==1.9.2
  - dbt-spark==1.9.1
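For reference, one way to pin these versions is in a fresh Python virtual environment. This is a sketch rather than a Snowplow requirement; the environment name is arbitrary, and the PyHive extra is what dbt-spark uses for the thrift connection method described later:

```bash
# Create an isolated environment and pin the dbt versions listed above.
# The PyHive extra provides the thrift connection method used below.
python3 -m venv dbt-env
source dbt-env/bin/activate
pip install "dbt-core==1.9.2" "dbt-spark[PyHive]==1.9.1"
```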
Spark Configuration for Iceberg
Set the following properties on your EMR cluster:
[ { "Classification": "iceberg-defaults", "Properties": { "iceberg.enabled": "true" } }, { "Classification": "spark-defaults", "Properties": { "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar", "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkCatalog", "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog", "spark.sql.catalog.spark_catalog.warehouse": "s3://your-bucket-name/", "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" } } ]
Important:
- The catalog must be named `spark_catalog`. Renaming it (e.g., to `glue_catalog`) will break Snowplow models.
- The warehouse must point to the specific S3 bucket where Iceberg tables will be stored.
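One way to apply this configuration is at cluster creation time. The sketch below assumes the JSON above is saved locally as configurations.json, that the default EMR roles already exist, and uses placeholder instance and key-pair settings:

```bash
# Launch an EMR 7.4.0 cluster with Spark and Hive, applying the Iceberg
# configuration saved locally as configurations.json. Instance sizes, key pair,
# and cluster name are placeholders -- adjust for your environment.
aws emr create-cluster \
  --name "snowplow-dbt-iceberg" \
  --release-label emr-7.4.0 \
  --applications Name=Spark Name=Hive \
  --configurations file://configurations.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=your-key-pair
```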
dbt profiles.yml Configuration
Example configuration:
```yaml
your-profile-name:
  outputs:
    dev:
      type: spark
      method: thrift
      host: <emr-master-dns>
      port: 10001
      schema: your_schema
      threads: 4
      session_properties:
        spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkCatalog
        spark.sql.catalog.spark_catalog.type: hive
        spark.sql.catalog.spark_catalog.warehouse: "s3://your-bucket-name/"
        spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  target: dev
```
Key points:
- Use `method: thrift`.
- Connect to port 10001, where the Spark Thrift Server is running (a connectivity check is shown below).
- Include the Iceberg session properties.
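Before involving dbt, it can help to confirm that the Thrift Server is reachable. One option is beeline, which ships with Hive on EMR; the hostname below is a placeholder:

```bash
# Connect to the Spark Thrift Server over JDBC and run a trivial query.
# Replace <emr-master-dns> with your cluster's master node DNS name.
beeline -u "jdbc:hive2://<emr-master-dns>:10001" -e "show databases;"
```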
Deployment Checklist
- Launch EMR with the correct Spark/Iceberg configurations
- Start the Thrift Server: `sudo /usr/lib/spark/sbin/start-thriftserver.sh`
- Connect to the cluster using dbt and Thrift
- Run `dbt debug` to confirm `target.type = spark`
- Run `dbt run` to deploy the models (see the sketch below)
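Tied together, the checklist roughly corresponds to the commands below. This is a sketch that assumes your dbt project and profiles.yml are already in place and that the profile's target is named dev:

```bash
# On the EMR master node: start the Spark Thrift Server (listens on port 10001).
sudo /usr/lib/spark/sbin/start-thriftserver.sh

# From the machine running dbt: verify the connection and target type,
# then build the Snowplow models.
dbt debug --target dev
dbt run --target dev
```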
Common Pitfalls
- Catalog name not set to `spark_catalog` (a quick check is shown below)
- Warehouse path missing or incorrect
- Iceberg runtime JARs not installed properly
- Old or incorrect `snowplow__start_date`
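To rule out the catalog and warehouse pitfalls quickly, you can echo the relevant session properties through the Thrift Server (for example via beeline). This is a sketch; the expected values mirror the configuration shown earlier and should match your own bucket:

```sql
-- Each SET statement echoes the current value of the property in Spark SQL.
SET spark.sql.catalog.spark_catalog;            -- expect: org.apache.iceberg.spark.SparkCatalog
SET spark.sql.catalog.spark_catalog.warehouse;  -- expect: s3://your-bucket-name/
SET spark.sql.extensions;                       -- expect: ...IcebergSparkSessionExtensions
```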
Special Note: Databricks tblproperties
Some Snowplow dbt models include the following `tblproperties` in their compiled SQL:

```sql
tblproperties ('delta.autoOptimize.optimizeWrite' = 'true', 'delta.autoOptimize.autoCompact' = 'true')
```
These properties are intended for Databricks environments to optimize Delta Lake tables. When running on EMR with Iceberg, they may appear but are generally harmless.
- Spark on EMR usually ignores these properties.
- Warnings may appear, but they do not impact successful table creation.
- There is no need to remove or modify these properties for EMR deployments.
If table creation errors persist, advanced users can try removing the `tblproperties` section from the compiled SQL and re-running it manually as a troubleshooting step.
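Purely as an illustration (the schema, table name, and query below are hypothetical, and the real compiled SQL depends on the model and package version), the manual edit amounts to dropping the clause:

```sql
-- Hypothetical compiled statement (simplified), including the Databricks-specific clause:
create table your_schema.snowplow_example_model
using iceberg
tblproperties ('delta.autoOptimize.optimizeWrite' = 'true', 'delta.autoOptimize.autoCompact' = 'true')
as select 1 as example_column;

-- The same statement with the tblproperties clause removed, for a manual re-run
-- (run this instead of, not after, the statement above):
create table your_schema.snowplow_example_model
using iceberg
as select 1 as example_column;
```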