This guide provides SQL queries to first check for duplicates, and then remove them if needed, achieving a proper deduplication on the supported warehouses.
Follow these steps to identify and remove duplicate events from the
atomic.events
table in Snowflake.
SELECT FROM snowplow.atomic.events a WHERE EXISTS ( SELECT 1 FROM ( SELECT event_id, load_tstamp FROM snowplow.atomic.events WHERE collector_tstamp timestamp 'YYYY-MM-DD' QUALIFY ROW_NUMBER() OVER ( PARTITION BY event_id ORDER BY load_tstamp ) 1 OR event_id IS NULL ) b WHERE a.event_id = b.event_id AND a.load_tstamp = b.load_tstamp ) AND a.collector_tstamp timestamp 'YYYY-MM-DD';
DELETE FROM snowplow.atomic.events a WHERE EXISTS ( SELECT 1 FROM ( SELECT event_id, load_tstamp FROM snowplow.atomic.events WHERE collector_tstamp timestamp 'YYYY-MM-DD' QUALIFY ROW_NUMBER() OVER ( PARTITION BY event_id ORDER BY load_tstamp ) 1 OR event_id IS NULL ) b WHERE a.event_id = b.event_id AND a.load_tstamp = b.load_tstamp ) AND a.collector_tstamp timestamp 'YYYY-MM-DD';
Databricks does not support the QUALIFY
clause. Use a CTE
and ROW_NUMBER()
instead.
WITH ranked_events AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_tstamp) AS rn FROM snowplow.atomic.events WHERE collector_tstamp TIMESTAMP('YYYY-MM-DD') ), duplicates AS ( SELECT event_id, load_tstamp FROM ranked_events WHERE rn 1 OR event_id IS NULL ) SELECT FROM snowplow.atomic.events WHERE (event_id, load_tstamp) IN ( SELECT event_id, load_tstamp FROM duplicates ) AND collector_tstamp TIMESTAMP('YYYY-MM-DD');
WITH ranked_events AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_tstamp) AS rn FROM snowplow.atomic.events WHERE collector_tstamp TIMESTAMP('YYYY-MM-DD') ), duplicates AS ( SELECT event_id, load_tstamp FROM ranked_events WHERE rn 1 OR event_id IS NULL ) DELETE FROM snowplow.atomic.events WHERE (event_id, load_tstamp) IN ( SELECT event_id, load_tstamp FROM duplicates ) AND collector_tstamp TIMESTAMP('YYYY-MM-DD');
BigQuery supports QUALIFY
and uses STRUCT
for
tuple-based deletions.
SELECT FROM `project.dataset.atomic.events` WHERE STRUCT(event_id, load_tstamp) IN ( SELECT STRUCT(event_id, load_tstamp) FROM ( SELECT event_id, load_tstamp FROM `project.dataset.atomic.events` WHERE collector_tstamp TIMESTAMP('YYYY-MM-DD') QUALIFY ROW_NUMBER() OVER ( PARTITION BY event_id ORDER BY load_tstamp ) 1 OR event_id IS NULL ) ) AND collector_tstamp TIMESTAMP('YYYY-MM-DD');
DELETE FROM `project.dataset.atomic.events` WHERE STRUCT(event_id, load_tstamp) IN ( SELECT STRUCT(event_id, load_tstamp) FROM ( SELECT event_id, load_tstamp FROM `project.dataset.atomic.events` WHERE collector_tstamp TIMESTAMP('YYYY-MM-DD') QUALIFY ROW_NUMBER() OVER ( PARTITION BY event_id ORDER BY load_tstamp ) 1 OR event_id IS NULL ) ) AND collector_tstamp TIMESTAMP('YYYY-MM-DD');
Notes:
- It is a good idea to take a backup before performing deletions.
- Adjust the date filter (
collector_tstamp > 'YYYY-MM-DD'
) based on when your duplicates took place. - Always confirm the data to remove.
- You might have to adjust downstream modeled data accordingly.