How to optimise performance of the Unified Digital Model

Alec Moloney
Snowplow Team

Depending on your data volumes, you may need to adjust the run frequency and configuration of the Unified Digital Model to optimise its performance.

 

Configuration variables

Adjusting the configuration variables below may help with optimisation.
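
For reference, variables like these are normally set under the package's entry in the vars block of your dbt_project.yml. The sketch below simply restates the defaults listed in this article. The snowplow_unified package key and the snowplow__ prefix on each variable name are assumptions based on the usual naming convention of the Snowplow dbt packages, so check them against the package version you have installed.

```yaml
# dbt_project.yml (fragment)
# Assumption: variables are scoped under the snowplow_unified package key
# and carry the snowplow__ prefix. Values are the defaults from this article.
vars:
  snowplow_unified:
    snowplow__lookback_window_hours: 6
    snowplow__max_session_days: 3
    snowplow__days_late_allowed: 3
    snowplow__upsert_lookback_days: 30
    snowplow__session_lookback_days: 730
    snowplow__backfill_limit_days: 30
```

Changing one value at a time and comparing run times between changes makes it easier to see which variable actually affects performance in your pipeline.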

 

lookback_window_hours

The number of hours to look back before the latest event processed, to account for late-arriving data that comes in out of order. Read more in Late Loaded Events on our docs.

Default: 6

Implications:

Partitioned on collector: If your data pipeline is stable, this can be decreased to improve performance, but you risk missing events when spikes cause events to be loaded late.

Partitioned on loader: This can safely be reduced to 0-1, because late-loaded events carry the timestamp at which they were actually loaded and so are not missed.
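
As an illustration of the loader-partitioned case, a minimal sketch might look like the following. The snowplow_unified key and the snowplow__ prefix are assumed naming rather than something stated in this article, and the value 1 is only an example taken from the 0-1 range above.

```yaml
# dbt_project.yml (fragment)
# Assumption: the events table is partitioned on the loader timestamp,
# so a short lookback is enough to pick up late-loaded events.
vars:
  snowplow_unified:
    snowplow__lookback_window_hours: 1  # example value; default is 6
```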

 

max_session_days

The maximum allowed session length in days. For a session exceeding this length, events after the limit are no longer processed. This exists to reduce the lengthy table scans caused by long sessions, which are usually the result of bots. Read more in Quarantine Table on our docs.

Default: 3

Implications: If your sessions are generally shorter, reducing this value can optimise performance, but setting it too low may cut off valid sessions prematurely.

 

days_late_allowed

The maximum allowed number of days between an event being created and it being sent to the collector. This exists to reduce the lengthy table scans that can occur as a result of late-arriving data. Read more in Late Sent Events on our docs.

Default: 3

Implications: This allowance is necessary for apps that generate events while offline. Reducing it may improve performance but risks dropping late-sent events.

 

upsert_lookback_days

The number of days to look back over the incremental derived tables during the upsert. Where performance is not a concern, this should be set to as long a value as possible. Too short a period can result in duplicates. Read more in Optimize Upserts on our docs.

Default: 30

Implications: Lowering it may improve performance but increases the risk of data loss, because late-arriving data can fall outside the window and reduce integrity. Disabling the lookback worsens performance, so it is not recommended.

 

session_lookback_days

The number of days to which the scan on the snowplow_unified_base_sessions_lifecycle_manifest table is limited. This exists to improve the performance of the model when there are a lot of sessions. It should be set to as large a number as practical.

Default: 730

Implications: n/a

 

backfill_limit_days

The maximum number of days of new data to be processed since the latest event processed. Please refer to the incremental logic section for more details. Read more in Sessionization on our docs.

Default: 30

Implications: This only affects backfill processes, not regular runs, so tuning it is useful for managing data loads during historical updates.
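
To make the trade-off concrete: the limit caps how much new data one run processes, so the number of runs needed to work through a history is roughly the history length divided by the limit. The sketch below assumes a hypothetical 365-day backfill and the same assumed snowplow_unified key and snowplow__ prefix. At the default of 30 that is roughly 13 runs, while a value of 90 brings it down to about 5 larger runs.

```yaml
# dbt_project.yml (fragment)
# Assumption: a one-off historical backfill of about 365 days.
# A larger limit means fewer runs, but each run scans more data.
vars:
  snowplow_unified:
    snowplow__backfill_limit_days: 90  # example value; default is 30
```

Once the backfill has caught up, the value can be set back to its default, since it does not affect regular runs.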
