Incremental Transforms

You can configure a transform to rebuild only new and updated rows rather than the whole table, which can help you save on time and compute as your data grows.

By default, transforms fully rebuild your table on every run by running a CREATE TABLE statement using your provided SQL. However, sometimes transform tables can become too large to fully rebuild on every run, or take a long time to run. In this case, you may want to configure the transform to be incremental. Incremental transforms can build on existing tables, only adding new and updated rows. This can save time and compute as your data grows and changes.

Prerequisites

Incremental Transforms is an advanced feature. Build a regular Transform to learn the basics before attempting this.

Making your transform incremental

To configure the transform to be incremental, click on the Edit icon to open the advanced options modal, go to the Incremental Settings tab, then toggle incremental mode on as shown below:

Setting an update key

The update key specifies what column to use when identifying existing rows that need to be updated. If the update key is not set, every row produced by your transform SQL will be appended to your transform table. With an update key, rows which have the same value for that key in the target table as a row produced by your transform SQL will be updated. The values for the final row will be taken from the row produced by your transform SQL.

You can choose an update key from any of the columns in your transform table. The update key should be unique within your table, so that the rows can be matched successfully for updating.

For example, say you had a transform that built a users table with SQL like this:

SELECT user_id, email, updated_at 
FROM raw_users

You could then make this transform incremental like this:

SELECT user_id, email, updated_at 
FROM raw_users
{% if incremental %}
WHERE updated_at > (
SELECT MAX(updated_at)
FROM mozart.users
)
{% endif %}

You would then choose user_id as your update key. When running in incremental mode, this transform would select only user records that have been updated more recently than the records in your existing transform table. By setting the user_id update key, the recently updated records would update existing rows that have the same user_id as the selected rows, rather than creating duplicates. Any record with an user_id that was not already in the table would be appended.

Auto-rebuild mode

If auto-rebuild mode is enabled, when your transform runs and produces a table with columns which do not match your target table, your transform will run again in full rebuild (non-incremental) mode, recreating your transform table with the new columns. If auto-rebuild mode is disabled, your transform run will fail and you will be notified. You can then manually rebuild the table, or modify your SQL so that it matches the columns of your target table.

(We think most users will want this enabled 90% of the time. If rebuilding your transform will be a particularly expensive operation or potentially lose data that cannot be recomputed by rerunning the new version of this transform, you should toggle this off.)

Writing incremental SQL

To decide which rows are updated when an incremental transform runs, you can use a small incremental macro. This macro will instruct Mozart which part of your SQL query to include on an incremental transform run. Usually, you will use the macro to wrap a SQL WHERE clause. This clause should limit your SQL to produce only rows which will be updates. When the clause wrapped in the macro is removed, your SQL should produce the entire transform dataset.

For example, say you have a transform that constructs a table app_events, built from a raw events table. To start, you may have this SQL:

 
SELECT event_time, event_id, user_id, event_type 
FROM raw_events

If this table becomes too large to build in full, you could make the transform incremental with this SQL:

SELECT event_time, event_id, user_id, event_type 
FROM raw_events
{% if incremental %}
WHERE event_time > (
SELECT MAX(event_time)
FROM mozart.app_events
)
{% endif %}

Here, you have added a WHERE clause that only selects newer events than you already have in your transform. The incremental macro ensures that this WHERE clause is included when the transform runs in incremental mode. 

Testing your SQL

When your transform is in incremental mode, you will see two run buttons on the Transform tab below the SQL editor:

The Run Rebuild SQL button will evaluate the incremental macro to false, so the wrapped section will be removed from the final SQL. Use this button to make sure your SQL produces the full dataset correctly when running in rebuild mode. With the transform SQL above, this button would run the following SQL:

SELECT event_time, event_id, user_id, event_type 
FROM raw_events

The Run Incremental button will evaluate the incremental macro to true, meaning the wrapped section will be present in the final SQL. Use this button to make sure your SQL produces the correct rows for updates when running in incremental mode. With the transform SQL above, this button would run the following SQL:

SELECT event_time, event_id, user_id, event_type 
FROM raw_events
WHERE event_time > (
SELECT MAX(event_time)
FROM mozart.app_events
)

How incremental transforms run

The first time your transform runs, it will always run in non-incremental mode, since it needs to first create the target table. Your transform will run in incremental mode if all of the following are true:

  • Your transform has been set to incremental
  • The transform table exists.
  • Your transform SQL produces a table with the same columns as your existing transform table

The behavior when your transform SQL produces a columns that does not match your existing transform table will depend on whether you have enabled auto-rebuild mode (see below).

Debugging

To debug an incremental transform, you may use the transform runs table pictured here. This table will indicate the whether the run was in incremental mode. In incremental mode, the table will also provide the number of rows inserted or updated via the Inserted and Updated columns. If the transform is run in a non-incremental mode, Row Count column will be updated but Inserts and Updates will be blank. Note that if the columns change and auto-rebuild toggle is turned on, the transform will run in rebuild mode. Rebuilds may take a while on a large table; therefore, we suggest turning off the auto-rebuild toggle while you are debugging.

If you need to know when a row was updated by the incremental transform it is useful to put current_timestamp() as "increment_time" in the select clause of your query. 

Best Practices

  • If you change the formula your transform uses to calculate a value, you may want to manually run the transform in Rebuild mode to make sure every row in the table has that value calculated in the same way.