How to Migrate a Project to Incremental Beta

Step-by-step guide to migrating an entire Profiles project from non-incremental to incremental mode.

This tutorial covers the systematic workflow: auditing your project, configuring inputs, converting features, eliminating input vars, and validating the results.

For converting individual features, see How to Make Features Incremental.

Overview

An incremental Profiles project processes only new and changed data on each run, rather than reprocessing all historical data. This reduces compute cost, shortens run times, and keeps features fresher.

Migration follows a specific order of operations:

  1. Audit your project
  2. Verify and mark append-only inputs
  3. Handle ephemeral SQL models
  4. Convert entity vars (add merge clauses)
  5. Eliminate input vars
  6. Validate with --rebase_incremental

Prerequisites

Before starting migration, make sure that:

1: Audit your project

Before making any changes, inventory your project to understand what needs to migrate.

Identify all components

Review your project files and list:

| Resource | File name | Description |
| --- | --- | --- |
| Inputs | `inputs.yaml` | All input sources and their table types |
| SQL models | `sql_models.yaml` | All SQL template models, noting which are ephemeral |
| Entity vars | `profiles.yaml` | All entity vars, their aggregation functions, and `from` sources |
| Input vars | `profiles.yaml` | All input vars (must be eliminated or converted) |

Classify inputs

For each input, determine whether it is append-only or mutable:

| Input type | Append-only | Examples |
| --- | --- | --- |
| Event streams (facts) | Yes | Page views, purchases, clicks, logins |
| Slowly-changing dimensions | No | User profiles, account settings, subscription status |
| Snapshot tables | No | Daily balance snapshots, inventory levels |

Check for input vars

Input vars scan all input data on every run and cannot be made incremental. If your project has input vars, you must convert them (see Step 5).

2: Verify and mark append-only inputs

For a model to participate in incremental processing, its upstream inputs must be append-only. This step ensures you correctly identify and mark them.

Verify append-only behavior

Use these three checks to verify an input is truly append-only:

1. Business semantics check

Ask: Is the underlying source a fact table (events) or a slowly-changing dimension?

  • Fact tables (events, transactions, clicks) are typically append-only — rows represent immutable events
  • Dimension tables (user profiles, account settings) are not append-only — rows are updated in place

2. Warehouse verification query

Run this query against your warehouse to check for duplicate or updated rows:

SELECT entity_id_column, COUNT(*)
FROM your_input_table
GROUP BY entity_id_column
HAVING COUNT(*) > 1;

If this returns rows, inspect whether duplicates represent distinct events (append-only) or updates to the same record (not append-only).
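If the table carries audit columns, a complementary check is to look for rows modified after creation, which indicates in-place updates. This is a sketch assuming the table has `created_at` and `updated_at` columns (hypothetical names — substitute your own):

```sql
-- Rows whose updated_at differs from created_at were modified in place,
-- which means the table is NOT append-only.
-- Column names created_at / updated_at are hypothetical.
SELECT COUNT(*) AS updated_rows
FROM your_input_table
WHERE updated_at > created_at;
```

A non-zero `updated_rows` count means the source mutates existing records and should not be marked append-only.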

3. Downstream usage inspection

Check how the input is used in SQL models and entity vars. If a model selects the “latest” attribute value (such as current subscription status or country), that suggests the input receives updates and is NOT append-only.

warning

When in doubt about whether an input is append-only, treat it as not append-only.

Incorrectly marking a mutable input as append-only leads to incorrect incremental results that are difficult to diagnose.

Mark append-only inputs

For each verified append-only input, add the contract:

inputs:
  - name: events
    table: PROD.ANALYTICS.EVENTS
    contract:
      is_append_only: true
      is_event_stream: true
    occurred_at_col: event_timestamp

Some key fields in the contract are listed below:

| Field | Description |
| --- | --- |
| `is_append_only` | Enables delta model generation for incremental processing |
| `is_event_stream` | Indicates this is an event stream |
| `occurred_at_col` | The timestamp column used for time-range filtering. Required for incremental inputs. |

3: Handle ephemeral SQL models

Ephemeral SQL models defined on top of append-only inputs can also participate in incremental processing.

Identify eligible ephemeral models

Look for SQL models with output_type: ephemeral that:

  • Read from append-only inputs
  • Only filter, rename, or transform columns (no aggregations or joins with mutable sources)
  • Produce output that is itself append-only

Mark ephemeral models as append-only

models:
  - name: purchase_events
    model_type: sql_template
    model_spec:
      occurred_at_col: timestamp
      materialization:
        output_type: ephemeral
      single_sql: |
        {% with events = this.DeRef("inputs/events") %}
          SELECT * FROM {{ events }}
          WHERE event_name = 'purchase'
        {% endwith %}        
      contract:
        is_append_only: true
        is_event_stream: true
warning
Only mark an ephemeral model as is_append_only: true if all of its upstream inputs are also append-only and the model’s SQL only filters or transforms without introducing mutations (no joins with mutable dimension tables, no aggregations that change with updates).

4: Convert entity vars

Add merge clauses to entity vars to enable incremental computation. The merge clause defines how to combine previous feature values with new calculations.

Choose the right pattern

Use the Incremental Features decision guide to determine which pattern applies:

| Pattern | When to use | Reference |
| --- | --- | --- |
| Simple aggregation | SUM, MIN, MAX, COUNT | Simple Aggregations |
| Compound aggregation | AVG, WEIGHTED_AVG, MIN_BY/MAX_BY, arrays, booleans | Compound Aggregations |
| Incremental SQL model | Stateful processing, rolling windows, multi-step logic | Incremental SQL Models |

Quick reference: Common migrations

Simple aggregations: Add a matching merge:

# Before
- entity_var:
    name: first_purchase_time
    select: min(timestamp)
    from: inputs/orders

# After
- entity_var:
    name: first_purchase_time
    select: MIN({{orders.timestamp}})
    merge: MIN({{rowset.first_purchase_time}})
    from: inputs/orders
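Note that COUNT is the one simple aggregation whose merge is not the same function: counting the per-batch partial counts would be wrong, so the merge sums them instead. A minimal sketch, assuming a hypothetical orders input and feature name:

```yaml
# COUNT in select pairs with SUM in merge: the merge adds up
# per-batch partial counts rather than re-counting rows.
# total_orders and inputs/orders are hypothetical names.
- entity_var:
    name: total_orders
    select: COUNT(*)
    merge: SUM({{rowset.total_orders}})
    from: inputs/orders
```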

Window functions: Replace with MAX_BY/MIN_BY:

# Before (non-incremental window function)
- entity_var:
    name: last_login_platform
    select: last_value(platform)
    from: inputs/logins
    window:
      order_by:
        - timestamp

# After (incremental with MAX_BY + helper)
- entity_var:
    name: last_login_platform
    select: MAX_BY({{logins.platform}}, {{logins.timestamp}})
    merge: MAX_BY({{rowset.last_login_platform}}, {{rowset.last_login_platform_by_param}})
    from: inputs/logins

- entity_var:
    name: last_login_platform_by_param
    select: MAX({{logins.timestamp}})
    merge: MAX({{rowset.last_login_platform_by_param}})
    from: inputs/logins
    is_feature: false

See Compound Aggregations for the full MIN_BY/MAX_BY pattern and _by_param helper documentation.

When merge is not needed

Not every entity var requires a merge clause:

  • Derived-only vars that reference other entity vars (no from clause) are recomputed from their incrementally maintained dependencies:

    - entity_var:
        name: days_since_last_seen
        select: "DATEDIFF(day, {{user.last_seen}}, {{ end_time_sql }})"
        description: Days since the user was last seen
    
  • Vars from window-bounded incremental tables that read from an incremental SQL model with a sliding window. The model already maintains state incrementally, and the var recomputes from the bounded window each run:

    - entity_var:
        name: active_days_in_past_7_days
        select: count(distinct active_day)
        from: models/recent_active_days
        where: active_day >= DATEADD(day, -7, {{ end_time_sql }})
    

5: Eliminate input vars

Input vars scan all data on every run and block incremental processing. You must convert or replace every input var before the project can run incrementally.

Decision tree

For each input var, follow this decision tree:

Input Var
+-- Can this logic be expressed as an entity var?
|   +-- Yes: Convert to entity_var with merge clause
|   |   (common for window functions like LAST_VALUE, FIRST_VALUE)
|   +-- No
|       +-- Can the input var be replaced by a SQL model?
|       |   +-- Yes: Create a SQL model (possibly incremental) and
|       |   |   reference it from entity vars
|       |   +-- No
|       |       +-- Create an incremental SQL model with full
|       |           DeRef pattern

Example: Convert input var to entity var

This is the most common conversion. Window functions like LAST_VALUE and FIRST_VALUE can typically be replaced with MAX_BY and MIN_BY:

# Before (input var with window function)
- input_var:
    name: last_known_city
    select: last_value(city)
    from: inputs/events
    window:
      partition_by:
        - user_id
      order_by:
        - timestamp
    where: city IS NOT NULL

# After (entity var with MAX_BY)
- entity_var:
    name: last_known_city
    select: MAX_BY({{events.city}}, {{events.timestamp}})
    merge: MAX_BY({{rowset.last_known_city}}, {{rowset.last_known_city_by_param}})
    from: inputs/events
    where: {{events.city}} IS NOT NULL

- entity_var:
    name: last_known_city_by_param
    select: MAX({{events.timestamp}})
    merge: MAX({{rowset.last_known_city_by_param}})
    from: inputs/events
    where: {{events.city}} IS NOT NULL
    is_feature: false

Example: Replace input var with incremental SQL model

When input var logic is too complex for an entity var (multi-step transformations, complex joins), create an incremental SQL model:

# In sql_models.yaml
models:
  - name: recent_active_days
    model_type: sql_template
    model_spec:
      occurred_at_col: timestamp
      materialization:
        output_type: table
      single_sql: |
        {%- set lastThis = this.DeRef(pre_existing=true, dependency="optional", checkpoint_name="baseline") -%}
        {% set events_delta = this.DeRef("inputs/events/incr_delta_tbl", dependency="coercive", baseline_name="baseline", prereqs=[lastThis]) %}
        {% set events_full = this.DeRef("inputs/events", prereqs=[events_delta.Except()]) %}

        SELECT user_id, active_day
        FROM (
          {%- if lastThis.IsEnabled() %}
            SELECT user_id, active_day FROM {{ lastThis }}
            UNION ALL
            SELECT DISTINCT user_id, CAST(timestamp AS DATE) AS active_day
            FROM {{ events_delta }}
          {%- else %}
            SELECT DISTINCT user_id, CAST(timestamp AS DATE) AS active_day
            FROM {{ events_full }}
          {%- endif %}
        )
        WHERE active_day >= DATEADD('day', -90, {{ end_time_sql }})        
      ids:
        - select: user_id
          type: user_id
          entity: user
      contract:
        is_append_only: false

Then reference the model from an entity var:

- entity_var:
    name: active_days_last_30
    select: COUNT(DISTINCT active_day)
    from: models/recent_active_days
    where: active_day >= DATEADD('day', -30, {{ end_time_sql }})

6: Validate with --rebase_incremental

After completing the migration, validate that incremental results match full refresh results.

Validation workflow

  1. Run a full refresh to establish a baseline:

    pb run
    
  2. Run again to trigger incremental processing (if new data has arrived):

    pb run
    
  3. Run with --rebase_incremental to force a full rebuild and compare:

    pb run --rebase_incremental
    
  4. Compare results: Verify that incremental and full refresh outputs match by comparing row counts, aggregated values, and spot-checking individual entities.
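One way to compare the incremental output from step 2 against the rebased output from step 3 is a full outer join on the entity ID. The table and column names below are hypothetical placeholders for your actual output tables and features:

```sql
-- Any rows returned indicate a mismatch between the incremental and
-- full-refresh results for this feature.
-- incremental_output / rebased_output / first_purchase_time are placeholders.
SELECT COALESCE(i.user_id, r.user_id) AS user_id,
       i.first_purchase_time AS incremental_value,
       r.first_purchase_time AS rebased_value
FROM incremental_output i
FULL OUTER JOIN rebased_output r
  ON i.user_id = r.user_id
WHERE i.first_purchase_time IS DISTINCT FROM r.first_purchase_time;
```

`IS DISTINCT FROM` gives null-safe comparison on warehouses that support it (e.g. Snowflake, Postgres); otherwise compare with explicit `IS NULL` handling.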

Checklist

  • Row counts: Entity var tables should have the same number of rows
  • Value consistency: Feature values should match between incremental and full refresh runs
  • Performance: Incremental runs should be faster than the full refresh baseline (after the first run)
  • No errors: Check logs for warnings or errors related to incremental processing
warning
The first run after migration always performs a full refresh to establish the baseline checkpoint. Performance improvements are visible starting from the second run.

Migration checklist

This section contains a detailed migration checklist to convert your project to incremental mode.

Pre-migration

  • Inventory all inputs, SQL models, entity vars, and input vars
  • Classify each input as append-only or mutable
  • Verify append-only inputs with warehouse queries
  • Identify all input vars that need conversion

During migration

  • Mark all append-only inputs with is_append_only: true and occurred_at_col
  • Mark eligible ephemeral SQL models with is_append_only: true
  • Add merge clauses to all entity vars that read from append-only sources
  • Replace window functions with MAX_BY/MIN_BY patterns
  • Convert or replace all input vars
  • Create incremental SQL models for complex patterns
  • Use {{ end_time_sql }} instead of CURRENT_DATE or CURRENT_TIMESTAMP() in rolling-window filters
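The last item above can be illustrated with a minimal before/after sketch (the `where` filter and column name are hypothetical):

```yaml
# Before: non-deterministic, the window shifts with the wall clock
# on every run, so re-runs produce different results
where: active_day >= DATEADD(day, -7, CURRENT_DATE)

# After: anchored to the run's end time, so re-runs are reproducible
where: active_day >= DATEADD(day, -7, {{ end_time_sql }})
```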

Post-migration

  • Run pb run to establish baseline
  • Run pb run again to verify incremental processing
  • Run pb run --rebase_incremental and compare results
  • Verify performance improvement on subsequent runs
  • Remove any unused input vars or deprecated models

Common pitfalls

| Pitfall | Description | Solution |
| --- | --- | --- |
| Missing merge clause | Entity vars without `merge` cannot run incrementally | Add an appropriate `merge` clause matching the `select` aggregation |
| Incorrect merge function | Using COUNT instead of SUM for count merges | For `COUNT(...)` in `select`, use `SUM({{rowset.feature}})` in `merge` |
| Missing `_by_param` helper | Using MAX_BY/MIN_BY without the ordering companion var | Always define a `_by_param` helper var with `is_feature: false` |
| Mutable input marked append-only | Dimension table incorrectly marked as `is_append_only: true` | Verify append-only behavior before marking; dimension tables should NOT be append-only |
| Using CURRENT_DATE in filters | Non-deterministic results across re-runs | Use `{{ end_time_sql }}` instead |
| Input vars still present | Input vars block incremental processing | Convert to entity vars or incremental SQL models |
| Non-composable merge function | Using AVG directly in `merge` | Decompose into SUM/COUNT components via Compound Aggregations |
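As a sketch of the fix for a non-composable merge function, AVG can be decomposed into an incrementally maintained SUM and COUNT plus a derived-only ratio. All names below (`inputs/orders`, `amount`, the var names) are hypothetical:

```yaml
# Helper vars maintained incrementally (hypothetical names)
- entity_var:
    name: order_amount_sum
    select: SUM({{orders.amount}})
    merge: SUM({{rowset.order_amount_sum}})
    from: inputs/orders
    is_feature: false

- entity_var:
    name: order_count
    select: COUNT(*)
    merge: SUM({{rowset.order_count}})
    from: inputs/orders
    is_feature: false

# Derived-only var: recomputed from the helpers each run, no merge needed
- entity_var:
    name: avg_order_value
    select: "{{user.order_amount_sum}} / NULLIF({{user.order_count}}, 0)"
```

`NULLIF` guards against division by zero for entities with no orders in scope.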
