How to Make Features Incremental (Beta)

This guide walks you through converting features to incremental mode in your Profiles project.

Prerequisites

Before making a feature incremental, verify that it can support incremental computation:

warning

Not all features can be incremental.

Features that require scanning all historical data (such as median or percentile calculations) cannot use incremental computation. For such cases, consider using Timegrains and bundling.
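
For example, a median feature like the following must rescan the full history on every run, so it cannot take a merge: clause (a hypothetical illustration; MEDIAN is shown in Snowflake-style SQL):

- entity_var:
    name: median_order_value
    select: MEDIAN({{orders.value}})
    from: inputs/orders
    description: Median order value (needs all historical orders, so it cannot be incremental)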

Step 1: Configure input contracts

Before making features incremental, you must configure every input source and SQL model that your features read from with the is_append_only contract. This applies to both inputs and SQL models, including ephemeral SQL models.

Input example

Add the following to each input in your inputs.yaml:

inputs:
  - name: orders
    table: PROD.ANALYTICS.ORDERS
    contract:
      is_append_only: true
      is_event_stream: true
    occurred_at_col: order_timestamp

SQL model example

You can also create append-only SQL models that filter or transform inputs. Add this to your sql_models.yaml:

models:
  - name: purchase_events
    model_type: sql_template
    model_spec:
      occurred_at_col: timestamp
      materialization:
        output_type: table
      single_sql: |
        {% with events = this.DeRef("inputs/events") %}
          SELECT * FROM {{ events }}
          WHERE event_name = 'purchase'
        {% endwith %}        
      contract:
        is_append_only: true
        is_event_stream: true

This SQL model filters events to only purchases and is marked as is_append_only: true, making it usable as an incremental source.

warning
All inputs and SQL models used by incremental features must set is_append_only: true in their contract; this includes ephemeral SQL models. Without this configuration, the project throws an error when you try to run incremental features.
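
To make a SQL model ephemeral rather than a table, swap the materialization block in its model_spec. A minimal sketch, assuming the purchase_events model above and that your Profiles version supports the ephemeral output type:

      materialization:
        output_type: ephemeral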

Complete example: Simple incremental feature

The following example shows a complete setup for a simple incremental feature that tracks total purchase value:

Input configuration (inputs.yaml):

inputs:
  - name: orders
    table: PROD.ANALYTICS.ORDERS
    contract:
      is_append_only: true
      is_event_stream: true
    occurred_at_col: order_timestamp

Entity var (profiles.yaml):

- entity_var:
    name: total_purchases
    select: SUM({{orders.value}})
    merge: SUM(rowset.total_purchases)
    from: inputs/orders
    description: Total purchase value across all orders

This example shows:

  • Input configured with contract: is_append_only: true
  • Entity var using SUM aggregation with merge: SUM(rowset.total_purchases)
  • Incremental computation handled automatically by Profiles

Step 2: Identify the feature pattern

Determine which incremental pattern your feature matches: a simple aggregator, component features (for averages), merge_where for complex merges, or an incremental source.

See Incremental Features for a pattern selection guide and detailed examples.

Step 3: Add the merge property

Add a merge: property to your entity_var definition in your profiles.yaml file. This property defines how to combine previous feature values with new data.

Simple aggregators

For MIN, MAX, SUM, COUNT, and similar functions:

- entity_var:
    name: first_event_time
    select: MIN({{events.timestamp}})
    merge: MIN(rowset.first_event_time)
    from: inputs/events
    description: Earliest timestamp of any event

Average using component features

For AVG, break it into sum and count components. Define three separate entity_var entries at the same level in your profiles.yaml:

- entity_var:
    name: sum_of_scores
    select: SUM({{game_events.score}})
    merge: SUM(rowset.sum_of_scores)
    from: inputs/game_events
    is_feature: false
    description: Sum of scores (used for average calculation)

- entity_var:
    name: count_of_scores
    select: COUNT({{game_events.score}})
    merge: SUM(rowset.count_of_scores)
    from: inputs/game_events
    is_feature: false
    description: Count of scores (used for average calculation)

- entity_var:
    name: avg_score
    select: "CASE WHEN {{user.count_of_scores}} > 0 THEN {{user.sum_of_scores}} / {{user.count_of_scores}} ELSE 0 END"
    description: Average score across all games

info

Features referencing other features

When a feature references another feature (like {{user.sum_of_scores}}), it doesn’t need a merge: clause if it’s just performing a calculation on incrementally maintained features. The referenced features (sum_of_scores and count_of_scores) maintain their incremental state, and the calculation (avg_score) is computed on-the-fly from those values.

Use incremental sources

You can also make features incremental by pointing to an incremental source (such as a delta model or incremental input window). Even if the entity_var SQL runs a full aggregation, the scan volume stays small because the source is already delta-restricted:

- entity_var:
    name: recent_purchases
    select: SUM({{orders.value}})
    from: inputs/orders/incr_delta_tbl
    description: Purchases from recent data only

Step 4: Define merge logic

The merge: clause must match the aggregation function in select:. Common patterns:

Select function     Merge function
MIN(...)            MIN(rowset.feature_name)
MAX(...)            MAX(rowset.feature_name)
SUM(...)            SUM(rowset.feature_name)
COUNT(...)          SUM(rowset.feature_name)
ARRAY_AGG(...)      ARRAY_UNION_AGG(rowset.feature_name)

For complete merge syntax, see Simple Aggregations and Compound Aggregations.
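
For example, a list-valued feature pairs ARRAY_AGG in select: with ARRAY_UNION_AGG in merge:. A sketch with hypothetical names, assuming a product_id column on the orders input:

- entity_var:
    name: purchased_product_ids
    select: ARRAY_AGG({{orders.product_id}})
    merge: ARRAY_UNION_AGG(rowset.purchased_product_ids)
    from: inputs/orders
    description: Product IDs purchased across all orders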

Which merge function should I use?

The merge function must match your aggregation type. For COUNT, always use SUM in the merge property (summing counts, not counting counts). See the table above for the correct merge function for each aggregation type.

Step 5: Use merge_where when needed

For complex merges, use merge_where: to restrict merging to only changed rows. Define separate entity_var entries at the same level in your profiles.yaml:

- entity_var:
    name: timestamp
    select: MAX({{game_events.timestamp}})
    merge: MAX(rowset.timestamp)
    from: inputs/game_events
    where: "{{game_events.event_type}} = 'BREAK'"
    is_feature: false
    description: Timestamp of the most recent break event (used for score lookup)

- entity_var:
    name: score_at_last_break
    select: any_value({{game_events.score}})
    where: "{{game_events.timestamp}} = {{user.timestamp}}"
    merge: any_value(rowset.score_at_last_break)
    merge_where: "rowset.timestamp = {{user.timestamp}}"
    from: inputs/game_events
    description: Score at the most recent break event

The timestamp entity var is marked with is_feature: false since it’s an intermediate calculation used by score_at_last_break.

tip
Mark helper entity vars with is_feature: false when they’re intermediate calculations used by other features. This prevents them from appearing as features in your output while still allowing other entity vars to reference them.

Step 6: Test your implementation

  1. Run pb run to verify the feature computes correctly
  2. Compare incremental results with a full refresh to ensure accuracy
  3. Monitor run times to verify improvement

warning
The first run performs a full refresh to establish the baseline checkpoint. Subsequent runs use incremental computation.

Verifying incremental behavior

To verify your feature is running incrementally:

  • Check run times: Incremental runs should be faster than full refresh runs (after the first run)
  • Compare results: Run with incremental mode and compare results with a full refresh to ensure they match
  • Monitor checkpoint usage: Verify that checkpoints are being created and used in subsequent runs

Troubleshooting

Issue: Feature values are incorrect
Solution:
  • Verify that your merge: clause correctly matches the aggregation logic
  • Ensure that your input sources are append-only or have reliable timestamps
  • Check that you are not missing edge cases in merge logic
  • Compare incremental results with a full refresh to identify discrepancies

Issue: Performance not improving
Solution:
  • Verify that your input sources are properly configured for incremental processing
  • Check that delta models are being used correctly
  • Ensure your data warehouse supports efficient timestamp filtering
  • Monitor checkpoint creation and usage

Issue: Merge errors
Solution:
  • Verify that all aggregators in select: have corresponding merge logic
  • Check that merge functions match the aggregation type
  • Ensure merge_where: conditions are correct if used
  • Review Simple Aggregations and Compound Aggregations for correct syntax

Issue: The first run is slow
Solution: This is expected. The first run always performs a full refresh to establish the baseline checkpoint. Subsequent runs use incremental computation and should be faster.

Questions? Contact us by Email or on Slack