Profiles

Create unified customer profiles with RudderStack Profiles.

This guide walks you through RudderStack Profiles, which lets you build unified customer profiles from all the cross-platform data in your warehouse.

Introduction

Companies gather user data across various digital touchpoints like websites, mobile apps, enterprise systems such as CRMs, marketing platforms, and more. Users are identified with multiple identifiers across their product journey, such as their email address, phone number, device ID, anonymous ID, and account usernames. A single user often accumulates several of these identifiers over time, for example, by switching devices or using both work and personal email addresses.

All of this data is typically scattered across multiple tables within a warehouse. Profiles is RudderStack’s solution to tie all user data from these tables into a unified view, giving you a 360-degree view of the user - often called Customer 360.

success
Customer 360 is best represented as a table with one row per user, with multiple features or attributes computed for that user.
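For intuition, querying such a table at the end of this guide might look like the following sketch. The table and column names are the ones this guide sets up later; treat it as illustrative rather than exact generated SQL:

-- One row per user, with computed attributes as columns
SELECT main_id,
       n_active_days_total,
       orders_completed_in_past_365_days,
       first_seen_date
FROM ecommerce_output.user_profile;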

10-step guide to using Profiles

To understand how Profiles works, you will implement a use case by leveraging a sample dataset based on RudderStack’s ecommerce tracking plan. This data is available in Snowflake, which you can access by following the steps in this guide.

You will create a simple Profiles project with the following capabilities:

  • ID stitch anonymous_id, user_id, and email so that the user activity before signup can be tied to their post-signup activity. Also, connect different user IDs if they share the same email.
  • Create the following features:
Feature | Description
n_active_days_total | Total number of distinct days on which the user was active in the app.
orders_completed_in_past_365_days | Number of orders completed in the past 365 days.
first_seen_date | The date of the first user activity, including their time on the app as an anonymous user.
campaign_source_first_touch | The first-touch campaign source, captured from the campaign data that RudderStack tracks automatically.
avg_session_length_in_sec_last_7_days | Average length, in seconds, of the user's sessions in the past week.
info
These are some basic features to demonstrate how you can use Profiles. You can add many more features with complex logic to explore its full capabilities.

The 10 steps are divided into the following sections:

Part 1: Setup

1. Install the Profile Builder tool

Install the Profile Builder (pb) tool by running the following command:

pip3 install profiles-rudderstack
warning
If you get a Requirement already satisfied message, uninstall the Profile Builder tool and then retry the installation. Alternatively, upgrade it in place by running the command pip3 install profiles-rudderstack -U.

2. Set up a warehouse connection

Next, set up a connection to the RudderStack-provided warehouse containing the test dataset for this use case. Go to your terminal, run pb init connection, and follow the prompts to set up your connection.

A sample connection for a Snowflake account is as follows:

Enter Connection Name: test
Enter target:  (default:dev):  # Just press enter, leaving it to default
Enter account: vmskhef-customer_sample_data
Enter warehouse: wh_guide
Enter dbname: profiles_sample
Enter schema: ecommerce_output
Enter user: rudder_profiles
Enter password: <password>
Enter role: rudder_profiles_role
Append to /Users/<user_name>/.pb/siteconfig.yaml? [y/N]
yes
warning
  • When prompted for a connection name in the CLI, enter test.
  • We will share the password with you separately.
  • You will see the last line only if you have already set up Profiles earlier. Enter yes in this case.
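For reference, completing the prompts appends a connection entry to siteconfig.yaml that looks roughly like the following. This is a sketch based on the values above; the exact layout may vary across pb versions:

connections:
  test:
    target: dev
    outputs:
      dev:
        type: snowflake
        account: vmskhef-customer_sample_data
        warehouse: wh_guide
        dbname: profiles_sample
        schema: ecommerce_output
        user: rudder_profiles
        password: <password>
        role: rudder_profiles_role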

3. Create a sample project

Running the following command automatically creates a folder called HelloPbProject and a sample project in it:

pb init pb-project

To choose another name for the project folder like ProfilesDemo, run the following command:

pb init pb-project -o ProfilesDemo

Your project folder will have the following structure. These files contain some default prefilled values that you need to modify (explained in the next sections).

Project structure
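For reference, the generated layout typically looks like this (a sketch; the exact file list may vary with the pb version):

HelloPbProject
├── pb_project.yaml
└── models
    ├── inputs.yaml
    └── profiles.yaml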

Part 2: Setting up ID stitching

4. Modify project entities

Go to the pb_project.yaml file in your project folder. This file contains the project name, schema version, project entities, ID types, etc. Under entities, change the id_stitcher name from models/domain_profile_id_stitcher to models/<your_org_name>_user_id_stitcher (for example, models/rudderstack_user_id_stitcher; this guide uses models/user_id_stitcher). This name determines the name of the table created in the warehouse.

In the pb_project.yaml file, the entities and id_types fields play an essential role. All features are computed at the entity level. A user is the most common entity type; other examples include organizations, companies, products, and plans.

Also, an entity can have multiple identifiers like email, phone number, user ID, etc. These identifiers are listed under id_types.

Entities and ID types



In the above example, both id_types lists contain the same values. The first instance, nested under entities, lists the identifiers that belong to the entity, while the second, top-level instance contains the identifier definitions themselves.
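Putting it together, the relevant section of pb_project.yaml after the rename might look like this sketch (the identifier list matches this guide's use case):

entities:
  - name: user
    id_stitcher: models/user_id_stitcher
    id_types:
      - user_id
      - anonymous_id
      - email
id_types:
  - name: user_id
  - name: anonymous_id
  - name: email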

5. Connect to your input tables

Check the inputs.yaml file shown below (some lines are hidden). It outlines how the model obtains values from the various tables. Point the identifies input at the test dataset by replacing rudder_events_production.web.identifies with profiles_sample.ecommerce.identifies. Similarly, point the tracks input at the test dataset by replacing rudder_events_production.web.tracks with profiles_sample.ecommerce.tracks.

info
identifies and tracks are the test tables in the dataset that you set up in Step 2.
info
Leave the rest of the file unchanged. However, you can add any new identifier for ID stitching or any new tables to the project, as required.
Inputs yaml file
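For reference, the updated identifies entry might look like the following sketch. The table value is the one set above; the ids mappings are illustrative of how each identifier column maps to an ID type:

inputs:
  - name: rsIdentifies
    table: profiles_sample.ecommerce.identifies
    occurred_at_col: timestamp
    ids:
      - select: user_id
        type: user_id
        entity: user
      - select: anonymous_id
        type: anonymous_id
        entity: user
      - select: email
        type: email
        entity: user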

6. Label ID stitcher table

Next, go to the profiles.yaml file where the models are defined. You will find two models here - id_stitcher and feature_table_model.

For ID stitching, retain the first model (model_type: id_stitcher) and comment out the entire second model (model_type: feature_table_model). Also, change the model’s name from domain_profile_id_stitcher to <your_org_name>_user_id_stitcher - the same name used in Step 4.

warning
In the name field, it is important to use the same name set in Step 4. This links the model spec in profiles.yaml to the id_stitcher reference in the pb_project.yaml file.
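After these changes, the active model in profiles.yaml should match the id_stitcher spec below (the same spec appears in the full file in Step 9):

models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      main_id_type: main_id
      edge_sources:
        - inputs/rsIdentifies
        - inputs/rsTracks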

7. Generate ID stitcher table

Go to your terminal and run the following command:

pb run

The first execution takes a couple of minutes. Subsequent runs are much faster, as the pb tool picks up only the newly added rows.

Once completed, two things happen:

  • A SQL file gets written in the output/dev/seq_no folder.
  • This SQL is run on the warehouse and a table gets created. Also, the table name is shown in the terminal. An example output is shown below:
2023/01/27 16:19:24 Material Material_user_id_stitcher_27c35415_6 ran in 18.862061136s.

In the above example, the table name is Material_user_id_stitcher_27c35415_6, where:

  • Material is a common prefix.
  • user_id_stitcher is the name defined in Step 4 and Step 6.
  • 27c35415_6 is a suffix generated by the Profile Builder tool. You can ignore this suffix for now.

8. Verify ID stitcher results

Go to the warehouse using the credentials defined in Step 2 to see the actual table. Run the following query after replacing <table_name> with the table name from your CLI output:

SELECT * FROM PROFILES_SAMPLE.ECOMMERCE_OUTPUT.<table_name> ORDER BY main_id, other_id ASC

You will see the following result:

Snowflake table highlighting ID stitching

You can see that five distinct MAIN_ID values are created, each representing an actual user. Also, user4 and user5 have the same MAIN_ID as they share the same email, highlighting successful ID stitching.
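To double-check the stitching, you can also count the distinct users directly (replace <table_name> as before):

SELECT COUNT(DISTINCT main_id) AS distinct_users
FROM PROFILES_SAMPLE.ECOMMERCE_OUTPUT.<table_name>; -- returns 5 for this dataset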

At this point, the ID stitching is complete.

Part 3: Creating feature tables

9. Define features

To create a feature table, you need to modify the feature table model (model_type: feature_table_model) in profiles.yaml.

Uncomment the lines you commented out in Step 6. Then, modify the name from domain_profile to <your_org_name>_user_profile (this guide uses user_profile, without the prefix). Similar to user_id_stitcher, this is the name of the output table created in the warehouse.

Also, delete everything under the vars key, as you will write your own feature definitions for the model, as outlined in the use case above.

info
The Feature definitions section contains details on each feature definition and the associated logic.

Once you define all features, the profiles.yaml file should look something like this:

# This is a sample file, for detailed reference see: https://rudderlabs.github.io/pywht/

models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      main_id_type: main_id
      edge_sources:
        - inputs/rsIdentifies
        - inputs/rsTracks

# Remove the section below, if you don't want to generate a feature table

  - name: user_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      vars:
        - entity_var:
            name: n_active_days_total # Alias for the temp feature
            select: count(distinct(timestamp::date)) # The SQL function to first convert timestamp to date, then count distinct dates
            from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
            description: Total number of distinct days on which the user was active in the app.

        - entity_var:
            name: orders_completed_in_past_365_days # Alias for the temp feature
            select: count(*) # The SQL function to count all rows from the tracks table where event = 'order_completed' (see below); if there's an order_id field, you can also use count(distinct order_id) to prevent duplicate records for the same order
            from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
            where: event = 'order_completed' and datediff(day, date(timestamp), current_date()) <= 365
            description: Number of orders completed in the past 365 days.

        - entity_var:
            name: first_seen_tracks # Alias for the temp feature
            select: min(timestamp::date) # The SQL function to get the earliest date from the tracks table
            from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
            description: First seen date from the tracks table.

        - entity_var: # The prev block is repeated with identifies table as source
            name: first_seen_identifies
            select: min(timestamp::date)
            from: inputs/rsIdentifies
            description: First seen date from the identifies table.

        - entity_var: # Once the min dates from both tracks and identifies are defined, we pick the earlier of the two here
            # The prev two are temp features used to derive this feature.
            # In the profiles.yaml file, we can also see an attribute called features;
            # We declare this name as a feature, so it doesn't get deleted and is stored in the final feature table
            name: first_seen_date # This is the name with which the feature gets stored
            select: to_date(least(first_seen_tracks, first_seen_identifies)) # It uses the names defined in the above entity_vars and performs a SQL operation to get the earliest of the two

        - entity_var:
            name: campaign_source_first_touch # Name of the feature
            select: first_value(context_campaign_source) # Window functions are supported in entity_vars, as long as the value is unique for a given user ID (ex: first_value, min, max, etc. - unlike rank, row_number, etc.)
            window:
              # All window functions in entity_vars are partitioned by main_id by default. The window can take only order_by as a parameter
              order_by:
                # In this example, we take the oldest occurrence of the campaign source, so we sort the rows by timestamp in ascending order
                - timestamp asc
            from: inputs/rsIdentifies
            where: context_campaign_source is not null and context_campaign_source != '' # We can filter rows with a where condition. In this case, we ignore all rows where the campaign source is null or blank - to ensure we ignore organic visits

        - input_var:
            name: session_start_time
            select: min(timestamp) # Keep the full timestamp (not just the date) so the session length in seconds is accurate
            window:
              partition_by:
                - session_id # Partitioning on session_id (or any key other than main_id) is not possible in an entity_var
                - main_id # input_vars do not assume main_id is the partition key of the table, so it must be listed in partition_by if required
            from: inputs/rsTracks

        - input_var:
            name: session_end_time
            select: max(timestamp) # Latest event timestamp in the session
            window:
              partition_by:
                - session_id
                - main_id
            from: inputs/rsTracks

        - input_var:
            name: session_length
            select: datediff(second, session_start_time, session_end_time)
            from: inputs/rsTracks

        - entity_var:
            # The final feature: average of the per-session lengths computed above
            name: avg_session_length_in_sec_last_7_days
            select: avg(session_length)
            from: inputs/rsTracks
            where: datediff(day, date(session_start_time), current_date()) <= 7

      features:
        - n_active_days_total   
        - orders_completed_in_past_365_days
        - first_seen_date # Ensures that feature is stored in the final table (unlike first_seen_tracks or first_seen_identifies)
        - campaign_source_first_touch
        - avg_session_length_in_sec_last_7_days

10. Generate and verify feature table

Go to your terminal and run the following command:

pb run

In a few minutes, you will see output similar to Step 7. However, you will get two SQL files this time: one for the ID stitcher and the other for the feature table. You will also see two tables created in the warehouse.

info
The two SQL files are created in a new folder and are present for debugging purposes. You can check the queries to understand what is happening behind the scenes.

The CLI output will look something like the following:

2023/01/27 21:10:39 Material Material_user_id_stitcher_27c35415_7 ran in 20.173565021s
2023/01/27 21:11:11 Material Material_user_profile_ac394282_7 ran in 26.993571473s

This output confirms successful completion of the job. You can now verify the tables in the warehouse.

The feature table output is shown below:

Feature table

In the above image, you can see that five rows are created, one per main_id. Each row contains all five features with their computed values, plus one more column, valid_at, indicating when the features were computed.

How it works

The pb run command creates a new table for each of the models in the project. The Profile Builder also creates a snapshot of the latest run under just the names given for the models - user_id_stitcher and user_profile in this case (or <your_org_name>_user_id_stitcher and <your_org_name>_user_profile, based on the org name prefix you chose). This makes the names predictable for use in any downstream application.
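For example, a downstream application can query the snapshot by its stable name instead of the suffixed Material table:

SELECT * FROM PROFILES_SAMPLE.ECOMMERCE_OUTPUT.USER_PROFILE;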

Feature definitions

info
You will write each feature in an entity_var field, which can be considered a declarative form of SQL for defining the model's features.
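For intuition, an entity_var roughly compiles to a per-user aggregation over the ID-stitched input. A sketch of the SQL that n_active_days_total corresponds to (not the exact generated query; tracks_with_main_id stands for the tracks input joined with the ID stitcher output):

SELECT main_id,
       COUNT(DISTINCT timestamp::date) AS n_active_days_total
FROM tracks_with_main_id
GROUP BY main_id;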

You can define the features in profiles.yaml as follows:

  • n_active_days_total: This is a count of the distinct days on which the user was active in the app. The presence of a tracks event is a good indicator of activity, as every user visit triggers some events. This feature is defined as follows:
vars:
  - entity_var:
      name: n_active_days_total # Alias for the temp feature
      select: count(distinct(timestamp::date)) # SQL function to first convert timestamp to date, then count the distinct dates
      from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
      description: Total number of distinct days on which the user was active in the app.
features:
  # This block is required to declare that the feature needs to be stored in feature table.
  # Often, we may want entity_vars that are temp features, which are not required later (see below).
  # Such vars are not added to the features list.
  - n_active_days_total
  • orders_completed_in_past_365_days: This feature counts the orders completed in the past 365 days. Often, you may want to apply some filters to the table; you can do so with a where clause:
vars:
  - entity_var:
      name: orders_completed_in_past_365_days # Alias for the temp feature
      select: count(*) # The SQL function to count all rows from the tracks table where event = 'order_completed' (see below)
      # If there's an order_id field, you can also use count(distinct order_id) to prevent duplicate records for the same order
      from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
      where: event = 'order_completed' and datediff(day, date(timestamp), current_date()) <= 365
      description: Number of orders completed in the past 365 days.
features:
  - orders_completed_in_past_365_days
warning
This feature is a continuation of the vars block shown above; the vars key need not be repeated. The same applies to all the features described below.
  • first_seen_date: Takes the earliest date across both the tracks and identifies tables. To do so, compute the minimum from each table separately and then take the earlier of the two:
vars:
  - entity_var:
      name: first_seen_tracks # Alias for the temp feature
      select: min(timestamp::date) # The SQL function to get the earliest date from the tracks table
      from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
      description: First seen date from the tracks table.
  - entity_var: # The prev block is repeated with identifies table as source
      name: first_seen_identifies
      select: min(timestamp::date)
      from: inputs/rsIdentifies
      description: First seen date from the identifies table.
  - entity_var:
      # Once the min dates from both tracks and identifies are defined, we pick the earlier of the two here.
      # The prev two are temp features used to derive this feature.
      # In the profiles.yaml file, we can also see an attribute called features;
      # We declare this name as a feature, so it doesn't get deleted and is stored in the final feature table
      name: first_seen_date # This is the name with which the feature gets stored
      select: to_date(least(first_seen_tracks, first_seen_identifies))
      # It uses the names defined in the above entity_vars and performs a SQL operation to get the earliest of the two
features:
  - first_seen_date   # Ensures that the feature is stored in the final table (unlike first_seen_tracks or first_seen_identifies)
  • campaign_source_first_touch: From the identifies table, picks the earliest available campaign_source. Sometimes, the first touch may be organic (null source), but the same user may arrive later through a campaign. In such cases, it is desirable to capture the first non-organic touch. For this, you can filter out the empty rows with a where clause:
vars:
  - entity_var:
      name: campaign_source_first_touch # Name of the feature
      select: first_value(context_campaign_source) # Window functions are supported in entity_vars, as long as the value is unique for a given user ID
      window:
        # All window functions in entity_vars are partitioned by main_id by default. The window can take only order_by as a parameter.
        order_by:
          # In this example, we take the oldest occurrence of the campaign source, so we sort the rows by timestamp in ascending order
          - timestamp asc
      from: inputs/rsIdentifies
      where: context_campaign_source is not null and context_campaign_source != ''
      # We can filter rows with a where condition. In this case, we ignore all rows where the campaign source is null or blank - to ensure we ignore organic visits
features:
  - campaign_source_first_touch  
  • avg_session_length_in_sec_last_7_days: RudderStack captures a session_id and a session_start_time by default in the page and track calls. You can use this session_id to compute each session's length and the average across sessions.

To add a time-bounded feature, for example, considering only the sessions from the past week, entity_var is not sufficient. The output of an entity_var is assumed to be a single row per user_id / main_id. However, calculating the average session length requires computing the length of each session first and then averaging across all sessions. To do so, you can use input_var. With input_var, you can apply transformations on (a copy of) the input table itself, where row-level modifications are possible. You can also use window functions with arbitrary partition_by clauses:

vars:
  - input_var:
      name: session_start_time
      select: min(timestamp) # Keep the full timestamp (not just the date) so the session length in seconds is accurate
      window:
        partition_by:
          - session_id # Partitioning on session_id (or any key other than main_id) is not possible in an entity_var
          - main_id # input_vars do not assume main_id is the partition key of the table, so it must be listed in partition_by if required
      from: inputs/rsTracks
  - input_var:
      name: session_end_time
      select: max(timestamp)
      window:
        partition_by:
          - session_id
          - main_id
      from: inputs/rsTracks
  - input_var:
      name: session_length
      select: datediff(second, session_start_time, session_end_time)
      from: inputs/rsTracks
  - entity_var:
      # The final feature: average of the per-session lengths computed above
      name: avg_session_length_in_sec_last_7_days
      select: avg(session_length)
      from: inputs/rsTracks
      where: datediff(day, date(session_start_time), current_date()) <= 7
features:
  - avg_session_length_in_sec_last_7_days

Questions? Contact us by email or on Slack