How to Identify and Remove Duplicated Events

Edwin Mejias  
Edited

This guide provides SQL queries to first check for duplicates, and then remove them if needed, achieving a proper deduplication on the supported warehouses.

 

Follow these steps to identify and remove duplicate events from the atomic.events table in Snowflake.

SELECT FROM snowplow.atomic.events a
WHERE EXISTS (
    SELECT 1
    FROM (
        SELECT event_id, load_tstamp
        FROM snowplow.atomic.events
        WHERE collector_tstamp  timestamp 'YYYY-MM-DD'
        QUALIFY ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY load_tstamp
        )  1 OR event_id IS NULL
    ) b
    WHERE a.event_id = b.event_id AND a.load_tstamp = b.load_tstamp
)
AND a.collector_tstamp  timestamp 'YYYY-MM-DD';
DELETE FROM snowplow.atomic.events a
WHERE EXISTS (
    SELECT 1
    FROM (
        SELECT event_id, load_tstamp
        FROM snowplow.atomic.events
        WHERE collector_tstamp  timestamp 'YYYY-MM-DD'
        QUALIFY ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY load_tstamp
        )  1 OR event_id IS NULL
    ) b
    WHERE a.event_id = b.event_id AND a.load_tstamp = b.load_tstamp
)
AND a.collector_tstamp  timestamp 'YYYY-MM-DD';

Databricks does not support the QUALIFY clause. Use a CTE and ROW_NUMBER() instead.

WITH ranked_events AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_tstamp) AS rn
  FROM snowplow.atomic.events
  WHERE collector_tstamp  TIMESTAMP('YYYY-MM-DD')
),
duplicates AS (
  SELECT event_id, load_tstamp
  FROM ranked_events
  WHERE rn  1 OR event_id IS NULL
)
SELECT FROM snowplow.atomic.events
WHERE (event_id, load_tstamp) IN (
  SELECT event_id, load_tstamp FROM duplicates
)
AND collector_tstamp  TIMESTAMP('YYYY-MM-DD');
WITH ranked_events AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_tstamp) AS rn
  FROM snowplow.atomic.events
  WHERE collector_tstamp  TIMESTAMP('YYYY-MM-DD')
),
duplicates AS (
  SELECT event_id, load_tstamp
  FROM ranked_events
  WHERE rn  1 OR event_id IS NULL
)
DELETE FROM snowplow.atomic.events
WHERE (event_id, load_tstamp) IN (
  SELECT event_id, load_tstamp FROM duplicates
)
AND collector_tstamp  TIMESTAMP('YYYY-MM-DD');

BigQuery supports QUALIFY and uses STRUCT for tuple-based deletions.

SELECT FROM `project.dataset.atomic.events`
WHERE STRUCT(event_id, load_tstamp) IN (
  SELECT STRUCT(event_id, load_tstamp)
  FROM (
    SELECT event_id, load_tstamp
    FROM `project.dataset.atomic.events`
    WHERE collector_tstamp  TIMESTAMP('YYYY-MM-DD')
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY load_tstamp
    )  1 OR event_id IS NULL
  )
)
AND collector_tstamp  TIMESTAMP('YYYY-MM-DD');
DELETE FROM `project.dataset.atomic.events`
WHERE STRUCT(event_id, load_tstamp) IN (
  SELECT STRUCT(event_id, load_tstamp)
  FROM (
    SELECT event_id, load_tstamp
    FROM `project.dataset.atomic.events`
    WHERE collector_tstamp  TIMESTAMP('YYYY-MM-DD')
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY load_tstamp
    )  1 OR event_id IS NULL
  )
)
AND collector_tstamp  TIMESTAMP('YYYY-MM-DD');

 

Notes:

  • It is a good idea to take a backup before performing deletions.
  • Adjust the date filter (collector_tstamp > 'YYYY-MM-DD') based on when your duplicates took place.
  • Always confirm the data to remove.
  • You might have to adjust downstream modeled data accordingly.

 

deduplication