Feature table

Step-by-step tutorial on how to create a feature table.

Once you have run identity stitching to unify your users' identities across platforms, you can compute and maintain the required features/traits for each identified user in a feature table.

This guide walks you through using a PB project to create output tables in your warehouse for a feature table model.

Prerequisites

  • Familiarize yourself with:
    • A basic Profile Builder project by following the Profile Builder CLI steps.
    • Structure of a Profile Builder project and the parameters used in different files.
    • Identity stitching model as the feature table reuses its output to extract the required features/traits.

Sample project

This sample project uses the output of an identity stitching model as an input to create a feature table. The following sections describe how to define your PB project files:

Project detail

The pb_project.yaml file defines the project details such as the name, schema version, connection name, and the entities that represent different identifiers.

You can define all the identifiers from different input sources that you want to stitch together under a single user_main_id:

warning
You need to add main_id to the list only if you have defined main_id_type: main_id in the ID stitcher buildspec.
# Project name
name: sample_id_stitching
# Project's yaml schema version
schema_version: 42
# Warehouse connection
connection: test
# Allow inputs without timestamps
include_untimed: true
# Folder containing models
model_folders:
  - models
# Entities in this project and their ids.
entities:
  - name: user
    id_stitcher: models/sample_id_stitcher # name of ID stitcher model, prefixed with relative path of models folder where it's defined
    id_types:
      - main_id # You need to add ``main_id`` to the list only if you have defined ``main_id_type: main_id`` in the id stitcher buildspec.
      - user_id # one of the identifier from your data source.
      - email
id_types:
  - name: main_id
  - name: user_id
    filters: # Multiple filters like exclude, include can be added using value or regex match.
    - type: exclude
      value: ""
  - name: email
    filters: # An include filter example to consider values matching provided regex. Note that this will automatically be anchored by sql to include the beginning and ending regex special chars (^ and $).
    - type: include
      regex: ".+@.+" # Automatically anchored, so equivalent to "^.+@.+$"
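As the inline comments above note, you can chain multiple filters on a single id_type. A hypothetical variant of the email entry that both excludes empty values and includes only values matching a regex might look like:

```yaml
id_types:
  - name: email
    filters: # a value must pass all filters, applied in order
      - type: exclude
        value: ""        # drop empty strings
      - type: include
        regex: ".+@.+"   # keep only values that look like an email address
```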

Input

The input.yaml file includes the input table references and the corresponding SQL for the above-mentioned entities:

inputs:
- name: rsIdentifies
  contract: # constraints that a model adheres to
    is_optional: false
    is_event_stream: true
    with_entity_ids:
      - user
    with_columns:
      - name: timestamp
      - name: user_id
      - name: anonymous_id
      - name: email
  app_defaults:
    table: rudder_events_production.web.identifies # one of the WH table RudderStack generates when processing identify or track events.
    occurred_at_col: timestamp
    ids:
      - select: "user_id" # kind of identity sql to pick this column from above table.
        type: user_id
        entity: user # as defined in project file
        to_default_stitcher: true
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
        to_default_stitcher: true
      - select: "lower(email)" # can use sql.
        type: email
        entity: user
        to_default_stitcher: true
- name: rsTracks
  contract:
    is_optional: false
    is_event_stream: true
    with_entity_ids:
      - user
    with_columns:
      - name: timestamp
      - name: user_id
      - name: anonymous_id
  app_defaults:
    table: rudder_events_production.web.tracks # another table in WH maintained by RudderStack processing track events.
    occurred_at_col: timestamp
    ids:
      - select: "user_id"
        type: user_id
        entity: user
        to_default_stitcher: true
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
        to_default_stitcher: true

Model

The Profiles feature table model lets you define the specific features/traits you want to compute from the data scattered across your warehouse tables.

A sample profiles.yaml file specifying a feature table model (user_profile):

models:
  - name: user_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      vars:
        - entity_var:
            name: first_seen
            select: min(timestamp::date)
            from: inputs/rsTracks
            where: properties_country is not null and properties_country != ''
        - entity_var:
            name: last_seen
            select: max(timestamp::date)
            from: inputs/rsTracks
        - entity_var:
            name: user_lifespan
            select: last_seen - first_seen
            description: Life Time Value of a customer
        - entity_var:
            name: days_active
            select: count(distinct timestamp::date)
            from: inputs/rsTracks
            description: No. of days a customer was active
        - entity_var:
            name: campaign_source
            default: "'organic'"
        - entity_var:
            name: user_rank
            default: -1
        - entity_var:
            name: campaign_source_first_touch
            select: first_value(context_campaign_source)
            window:
                order_by:
                    - timestamp asc
                partition_by:
                    - main_id
            from: inputs/rsIdentifies
            where: context_campaign_source is not null and context_campaign_source != ''
        - input_var:
            name: num_c_rank_num_b_partition
            select: rank()
            from: inputs/tbl_c
            default: -1
            window:
              partition_by:
                - '{{tbl_c}}.num_b'
              order_by:
                - '{{tbl_c}}.num_c asc'
            where: '{{tbl_c}}.num_b >= 10'
        - entity_var:
            name: min_num_c_rank_num_b_partition
            select: min(num_c_rank_num_b_partition)
            from: inputs/tbl_c
      features:
        - user_lifespan
        - days_active
        - min_num_c_rank_num_b_partition
Model specification fields

Field | Data type | Description
validity_time | Time | Specifies how long the generated feature table remains valid, relative to its timestamp. For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with validity_time: 24h is considered potentially valid and usable for any run requests that do not require precise timestamps between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC. Once the validity expires, scheduling takes care of generating a new table. Examples: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days.
entity_key | String | Specifies the relevant entity from your input.yaml file.
vars | List | Specifies variables using entity_var and input_var.
features | List | Specifies the list of entity_var names that should act as features.

entity_var

The entity_var field provides inputs for the feature table model. By default, this variable stores data temporarily; to store its data permanently, list its name as a feature under the features key.

Field | Data type | Description
name | String | Name that uniquely identifies the entity_var.
select | String | Column name/value you want to select from the table. This defines the actual value stored in the variable. You can use simple SQL expressions or select an entity_var declared above it in the same model file. It must be an aggregate operation that returns a unique value for a given main_id, for example: min(timestamp), count(*), sum(amount). This holds true even when an optional window function is used: first_value() and last_value() are valid, while rank(), row_number(), etc. are not and give unpredictable results.
from | List | Reference to the source table to fetch data from. You can refer either to another model in the same YAML file or to a table specified in the input YAML.
where | String | Any filters you want to apply on the input table before selecting a value. These must be valid SQL and consider the data types of the table.
default | String | Default value in case no data matches the filter. Enclose string default values in single quotes wrapped in double quotes (for example, "'organic'") to avoid SQL failures; non-string values need no quotes.
description | String | Textual description of the entity_var.
window | Object | Specifies the window function. Window functions in SQL usually have both partition_by and order_by properties, but for entity_var, partition_by is set to main_id by default and adding partition_by manually is not supported. If you need partitioning on other columns too, use input_var, which supports partition_by on arbitrary and multiple columns.
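To illustrate the quoting rule for default values, here is a minimal pair of entity_var definitions drawn from the sample model above: one with a string default (single quotes wrapped in double quotes) and one with a numeric default (no quotes):

```yaml
vars:
  - entity_var:
      name: campaign_source
      default: "'organic'"   # string default: single quotes inside double quotes
  - entity_var:
      name: user_rank
      default: -1            # numeric default: no quotes needed
```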

input_var

The syntax of input_var is similar to that of entity_var. The only difference is that each value is associated with a row of the specified input rather than a row of the feature table. While you can think of an entity_var as adding a helper column to the feature table, an input_var adds a helper column to the input.

Field | Data type | Description
name | String | Name to store the retrieved data.
select | String | Data to be stored under the name.
from | List | Reference to the source table to fetch data from.
where | String | (Optional) Conditions for fetching data.
default | String | (Optional) Default value for any entity whose calculated value would otherwise be NULL.
description | String | (Optional) Textual description.
window | Object | (Optional) Window over which the value is calculated.

window

Field | Data type | Description
partition_by | List | (Optional) List of SQL expressions to partition the data by.
order_by | List | (Optional) List of SQL expressions to order the data by.

In the window option of an input_var, main_id is not added by default; the partition can be any arbitrary list of columns from the input table. So, if a feature should be partitioned by main_id, you must add it to the partition_by key explicitly.
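For example, a hypothetical input_var that ranks each row of an input table per user would need main_id listed explicitly under partition_by, since it is not included by default (the name and ordering column here are illustrative):

```yaml
- input_var:
    name: event_rank_per_user   # hypothetical helper column on the input table
    select: rank()
    from: inputs/rsTracks
    window:
      partition_by:
        - main_id               # must be added explicitly for per-user partitioning
      order_by:
        - timestamp asc
```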

Output

After running the project, you can view the generated material tables in your Snowflake:

  1. Log in to your Snowflake console.
  2. Click Worksheets from the top navigation bar.
  3. In the left sidebar, click Database and the corresponding Schema to view the list of all tables. You can hover over a table to see the full table name along with its creation date and time.
  4. Write a SQL query like select * from <table_name> and execute it to see the results:
Snowflake console

Window functions

A window function operates on a window (group) of related rows, performing calculations on a subset of table rows that are connected to the current row in some way. Unlike a regular function, a window function can access more than just the current row of the query result.

The window function returns one output row for each input row. The values returned are calculated by using values from the sets of rows in that window. A window is defined using a window specification, and is based on three main concepts:

  • Window partitioning, which forms the groups of rows (PARTITION BY clause)
  • Window ordering, which defines an order or sequence of rows within each partition (ORDER BY clause)
  • Window frames, which are defined relative to each row to further restrict the set of rows (ROWS specification). It is also known as the frame clause.

Snowflake does not require users to define cumulative or sliding frames, and uses ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING as the default cumulative window frame. However, you can override this by defining the frame manually.

Example of using the frame_clause:

- entity_var:
    name: first_num_b_order_num_b
    select: first_value(tbl_c.num_b) # Specify frame clause as aggregate window function is used
    from: inputs/tbl_c
    default: -1
    where: tbl_c.num_b >= 10
    window:
        order_by:
        - tbl_c.num_b desc
        frame_clause: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- entity_var:
    name: first_num_b_order_num_b_rank
    select: rank() # DO NOT specify frame clause as ranking window function is used
    window:
        partition_by:
        - first_num_b_order_num_b > 0
        order_by:
        - first_num_b_order_num_b asc

Questions? Contact us by email or on Slack