Create unified customer profiles with RudderStack Profiles.
This guide walks you through RudderStack Profiles, which lets you build a unified customer profile from all the cross-platform data in your warehouse.
Introduction
Companies gather user data across various digital touchpoints like websites, mobile apps, enterprise systems like CRMs, marketing platforms, etc. Users are also identified with multiple identifiers across their product journey, like their email ID, phone number, device ID, anonymous ID, account usernames, and more. They might also accumulate multiple identifiers over time as they change devices and use work and personal emails interchangeably.
All of this data is typically scattered across multiple tables within a warehouse. Profiles is RudderStack's solution to tie all user data from these tables into a unified view, giving you a 360-degree view of the user, often called Customer 360.
Customer 360 is best represented as a table with one row per user, with multiple features or attributes computed for that user.
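For example, a Customer 360 table with features like the ones you will build in this guide might look as follows (the values are purely illustrative):

| main_id | first_seen_date | orders_completed_in_past_365_days | avg_session_length_in_sec_last_7_days |
| --- | --- | --- | --- |
| user-1 | 2022-11-02 | 4 | 312 |
| user-2 | 2023-01-15 | 0 | 145 |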
10-step guide to using Profiles
To understand how Profiles works, you will implement a use case by leveraging a sample dataset based on RudderStack’s ecommerce tracking plan. This data is available in Snowflake, which you can access by following the steps in this guide.
You will create a simple Profiles project with the following capabilities:
ID stitch anonymous_id, user_id, and email so that the user activity before signup can be tied to their post-signup activity. Also, connect different user IDs if they share the same email.
Create the following features:
| Feature | Description |
| --- | --- |
| n_active_days_total | Total number of distinct days on which the user was active in the app. |
| orders_completed_in_past_365_days | Number of orders completed in the past 365 days. |
| first_seen_date | The date of the first user activity, including the user's time on the app as an anonymous user. |
| campaign_source_first_touch | The first campaign source, captured from the campaign data that RudderStack tracks automatically. |
| avg_session_length_in_sec_last_7_days | Average length (in seconds) of the user's sessions in the past week. |
These are some basic features to demonstrate how you can use Profiles. You can add many more features with complex logic to explore its full capabilities.
The 10 steps are divided into three parts: setting up the project (Steps 1 to 3), setting up ID stitching (Steps 4 to 8), and creating feature tables (Steps 9 and 10).
Part 1: Setting up the project
1. Install the Profile Builder tool
Install the Profile Builder (pb) tool by running the following command:
pip3 install profiles-rudderstack
If you get a Requirement already satisfied message, uninstall the Profile Builder tool and retry the installation. Alternatively, you can run pip3 install profiles-rudderstack -U to upgrade to the latest version.
2. Set up a warehouse connection
Next, you need to set up a connection to the RudderStack warehouse that hosts the test dataset for this use case. Go to your terminal and run pb init connection. Then, follow the prompts to set up your connection.
A sample connection for a Snowflake account is as follows:
Enter Connection Name: test
Enter target: (default:dev): # Just press enter, leaving it to default
Enter account: vmskhef-customer_sample_data
Enter warehouse: wh_guide
Enter dbname: profiles_sample
Enter schema: ecommerce_output
Enter user: rudder_profiles
Enter password: <password>
Enter role: rudder_profiles_role
Append to /Users/<user_name>/.pb/siteconfig.yaml? [y/N]
yes
When prompted for a profile name in the CLI, enter test.
We will share the password with you separately.
You will see the last line only if you have already set up Profiles earlier. Enter yes in this case.
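For reference, the resulting entry in siteconfig.yaml looks roughly like the following. This is a sketch based on the values entered above; the exact key layout can vary by Profile Builder version:

connections:
  test:
    target: dev
    outputs:
      dev:
        account: vmskhef-customer_sample_data
        dbname: profiles_sample
        password: <password>
        role: rudder_profiles_role
        schema: ecommerce_output
        type: snowflake
        user: rudder_profiles
        warehouse: wh_guide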
3. Create a sample project
Running the following command automatically creates a folder called HelloPbProject and a sample project in it:
pb init pb-project
To choose another name for the project folder like ProfilesDemo, run the following command:
pb init pb-project -o ProfilesDemo
Your project folder will have the following structure. These files contain some default prefilled values that you need to modify (explained in the next sections).
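The layout looks roughly like this (a sketch based on the files used in this guide; minor files are omitted):

HelloPbProject
├── pb_project.yaml
└── models
    ├── inputs.yaml
    └── profiles.yaml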
Part 2: Setting up ID stitching
4. Modify project entities
Go to the pb_project.yaml file in your project folder. This file contains the project name, schema version, project entities, ID types, etc. Under entities, change the id_stitcher name from models/domain_profile_id_stitcher to models/<your_org_name>_user_id_stitcher (for example, models/rudderstack_user_id_stitcher; this guide uses models/user_id_stitcher). This is the name of the table created in the warehouse.
In the pb_project.yaml file, the entities and id_types fields play an essential role. All the features are computed at an entity level. A user is one of the most common entity types; other examples are organizations, companies, products, plans, etc.
Also, an entity can have multiple identifiers like email, phone number, user ID, etc. These identifiers are listed under id_types.
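For reference, the relevant part of pb_project.yaml looks roughly like this after the rename (a trimmed sketch; exact fields vary by schema version):

entities:
  - name: user
    id_stitcher: models/user_id_stitcher
    id_types:
      - user_id
      - anonymous_id
      - email
id_types:
  - name: user_id
  - name: anonymous_id
  - name: email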
In the above example, both id_types lists are the same. The first instance, nested under entities, ties the ID types to the entity; the second instance contains the identifier definitions.
5. Connect to your input tables
Check the inputs.yaml file, shown below with some lines hidden. It outlines how the model obtains values from the various tables. Rename the identifies table by replacing rudder_events_production.web.identifies with profiles_sample.ecommerce.identifies. Similarly, rename the tracks table by replacing rudder_events_production.web.tracks with profiles_sample.ecommerce.tracks.
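The following is a trimmed sketch of the file after renaming. The ids mappings shown here are an assumption based on the identifiers used in this guide, and field names can differ across schema versions:

inputs:
  - name: rsIdentifies
    table: profiles_sample.ecommerce.identifies
    occurred_at_col: timestamp
    ids:
      - select: user_id
        type: user_id
        entity: user
      - select: anonymous_id
        type: anonymous_id
        entity: user
      - select: lower(email)
        type: email
        entity: user
  - name: rsTracks
    table: profiles_sample.ecommerce.tracks
    occurred_at_col: timestamp
    ids:
      - select: user_id
        type: user_id
        entity: user
      - select: anonymous_id
        type: anonymous_id
        entity: user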
identifies and tracks are the test tables in the dataset that you set up in Step 2.
Leave the rest of the file unchanged. However, you can add any new identifier for ID stitching or any new tables to the project, as required.
6. Label ID stitcher table
Next, go to the profiles.yaml file where the models are defined. You will find two models here - id_stitcher and feature_table_model.
For ID stitching, retain the first model (model_type: id_stitcher) and comment out the entire second model (model_type: feature_table_model). Also, change the model’s name from domain_profile_id_stitcher to <your_org_name>_user_id_stitcher - the same name used in Step 4.
In the name field, it is important to use the same model name set in Step 4. This links the id_stitcher reference in the pb_project.yaml file to the model spec in profiles.yaml.
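After these edits, the remaining uncommented model in profiles.yaml should look like the following (matching the full file shown later in Step 9):

models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      main_id_type: main_id
      edge_sources:
        - inputs/rsIdentifies
        - inputs/rsTracks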
7. Generate ID stitcher table
Go to your terminal and run the following command:
pb run
The execution takes a couple of minutes. Subsequent runs will be much faster, as the PB tool picks up only the newly added rows.
Once completed, two things happen:
A SQL file gets written in the output/dev/seq_no folder.
This SQL runs on the warehouse and creates a table. The table name is shown in the terminal; an example output is shown below:
2023/01/27 16:19:24 Material Material_user_id_stitcher_27c35415_6 ran in 18.862061136s.
In the above example, the table name is Material_user_id_stitcher_27c35415_6, where:
Material is a common prefix.
user_id_stitcher is the name defined in Step 4 and Step 6.
27c35415_6 is a suffix generated by the Profile Builder tool. You can ignore this suffix for now.
8. Verify ID stitcher results
Go to the warehouse using the credentials defined in Step 2 to see the actual table. Run the following query after replacing <table_name> with the table name from your CLI output:
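For example, a minimal check looks like this (main_id is the column containing the stitched identifier):

select * from <table_name> order by main_id;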
You can see that five distinct MAIN_ID values are created, each representing an actual user. Also, user4 and user5 share the same MAIN_ID as they have the same email, highlighting successful ID stitching.
At this point, the ID stitching is complete.
Part 3: Creating feature tables
9. Define features
To create a feature table, you need to modify the feature table model (model_type: feature_table_model) in profiles.yaml.
Uncomment the lines you commented out in Step 6. Then, modify the name from domain_profile to <your_org_name>_user_profile (this guide uses user_profile without the prefix). Similar to user_id_stitcher, this is the name of the output table created in the warehouse.
Also, delete everything under the vars key, as you will write your own features for the model instead, as mentioned in the use case.
The Feature definitions section contains details on each feature definition and the associated logic.
Once you define all features, the profiles.yaml file should look something like this:
# This is a sample file, for detailed reference see: https://rudderlabs.github.io/pywht/
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      main_id_type: main_id
      edge_sources:
        - inputs/rsIdentifies
        - inputs/rsTracks
  # Remove the section below, if you don't want to generate a feature table
  - name: user_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      vars:
        - entity_var:
            name: n_active_days_total # Alias for the temp feature
            select: count(distinct(timestamp::date)) # The SQL function to first convert timestamp to date, then count distinct dates
            from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
            description: Number of days since the user first visited the app.
        - entity_var:
            name: orders_completed_in_past_365_days # Alias for the temp feature
            select: count(*) # Counts all rows from the tracks table where event = 'order_completed' (see below); if there's an order_id field, you can also use count(distinct order_id) to prevent duplicate records for the same order
            from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
            where: event = 'order_completed' and datediff(day, date(timestamp), current_date()) <= 365
            description: Number of orders completed in the past 365 days.
        - entity_var:
            name: first_seen_tracks # Alias for the temp feature
            select: min(timestamp::date) # The SQL function to get the oldest timestamp from the tracks table
            from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
            description: First seen timestamp from tracks table.
        - entity_var: # The previous block is repeated with the identifies table as the source
            name: first_seen_identifies
            select: min(timestamp::date)
            from: inputs/rsIdentifies
            description: First seen timestamp from identifies table.
        - entity_var:
            # Once the min timestamps from both tracks and identifies are defined, we pick the earlier of the two here.
            # The previous two are temp features used to derive this feature.
            # In the profiles.yaml file, we can also see an attribute called features;
            # we declare this name as a feature so it doesn't get deleted and is stored in the final feature table.
            name: first_seen_date # This is the name with which the feature gets stored
            select: to_date(least(first_seen_tracks, first_seen_identifies)) # Uses the names defined in the above entity_vars and performs SQL operations to get the lowest of these values
        - entity_var:
            name: campaign_source_first_touch # Name of the feature
            select: first_value(context_campaign_source) # Window functions are supported in entity_vars, as long as the value is unique for a given user ID (e.g., first_value, min, max - unlike rank, row_number, etc.)
            window: # All window functions in entity_vars are partitioned by main_id by default. It can take only order_by as a parameter.
              order_by: # In this example, we take the oldest occurrence of the campaign source, so we sort the rows by timestamp in ascending order
                - timestamp asc
            from: inputs/rsIdentifies
            where: context_campaign_source is not null and context_campaign_source != '' # We can filter rows with a where condition. In this case, we ignore all rows where the campaign source is null or blank - to ensure we ignore organic visits
        - input_var:
            name: session_start_time
            select: min(timestamp) # Full timestamp, so session lengths in seconds are meaningful
            window:
              partition_by:
                - session_id # Partitioning on session_id, or any key other than main_id, wouldn't have been possible in entity_var
                - main_id # input_vars do not make any assumptions around main_id being the primary key of the table, hence it needs to be given in partition_by if required
            from: inputs/rsTracks
        - input_var:
            name: session_end_time
            select: max(timestamp)
            window:
              partition_by:
                - session_id
                - main_id
            from: inputs/rsTracks
        - input_var:
            name: session_length
            select: datediff(second, session_start_time, session_end_time)
            from: inputs/rsTracks
        - entity_var: # This is the final query
            name: avg_session_length_in_sec_last_7_days
            select: avg(session_length)
            from: inputs/rsTracks
            where: datediff(day, date(session_start_time), current_date()) <= 7
      features:
        - n_active_days_total
        - orders_completed_in_past_365_days
        - first_seen_date # Ensures that the feature is stored in the final table (unlike first_seen_tracks or first_seen_identifies)
        - campaign_source_first_touch
        - avg_session_length_in_sec_last_7_days
10. Generate and verify feature table
Go to your terminal and run the following command:
pb run
In a few minutes, you will see output similar to Step 7. However, you will get two SQL files this time - one for the ID stitcher and the other for the feature table. You will also see two tables created in the warehouse.
The two SQL files are created in a new folder and are kept for debugging purposes. You can check the queries to understand what is happening behind the scenes.
The CLI output will look something like the following:
2023/01/27 21:10:39 Material Material_user_id_stitcher_27c35415_7 ran in 20.173565021s
2023/01/27 21:11:11 Material Material_user_profile_ac394282_7 ran in 26.993571473s
This output confirms successful completion of the job. You can now verify the tables in the warehouse.
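For example, you can inspect the feature table with a simple query, using the feature table name from your CLI output:

select * from Material_user_profile_ac394282_7;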
The feature table output shows that five rows are created - one per main_id. For each row, all five features are computed and filled in. There is one additional column, valid_at, indicating when the feature was computed.
How it works
The pb run command creates a new material table for each of the available models. The Profile Builder also creates a snapshot of the latest run under just the names given for the models - user_id_stitcher and user_profile in this case (or <your_org_name>_user_id_stitcher and <your_org_name>_user_profile, based on the org name prefix given). This ensures that the names are predictable for use in any downstream application.
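For instance, a downstream tool can always point at the stable name rather than a suffixed material table. A sketch, assuming the profiles_sample.ecommerce_output schema from Step 2:

select main_id, campaign_source_first_touch, avg_session_length_in_sec_last_7_days
from profiles_sample.ecommerce_output.user_profile;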
Feature definitions
You will write each feature in an entity_var field. It can be considered a declarative form of SQL for defining the model features.
You can define the features in profiles.yaml as follows:
n_active_days_total: This is a count of the distinct active days, that is, days on which the user visited the app. The presence of a tracks event is a good indicator for this, as every user visit triggers some events. This feature is defined as follows:
vars:
  - entity_var:
      name: n_active_days_total # Alias for the temp feature
      select: count(distinct(timestamp::date)) # SQL function to first convert timestamp to date, then count the distinct dates
      from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
      description: Number of days since the user first visited the app.
features:
  # This block is required to declare that the feature needs to be stored in the feature table.
  # Often, we may want entity_vars that are temp features, which are not required later (see below).
  # Such vars are not added to the features list.
  - n_active_days_total
orders_completed_in_past_365_days: This feature counts the orders completed in the past 365 days. Often, you may want to apply some filters to the table; you can do so with a where clause:
vars:
  - entity_var:
      name: orders_completed_in_past_365_days # Alias for the temp feature
      select: count(*) # The SQL function to count all rows from the tracks table where event = 'order_completed' (see below)
      # If there's an order_id field, you can also use count(distinct order_id) to prevent duplicate records for the same order
      from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
      where: event = 'order_completed' and datediff(day, date(timestamp), current_date()) <= 365
      description: Number of orders completed in the past 365 days.
features:
  - orders_completed_in_past_365_days
This feature is a continuation of the vars block shown above; the vars key need not be repeated. The same applies to all the features described below.
first_seen_date: Takes the earliest timestamp from both tracks and identifies. To do so, query both of these tables separately and then take the earliest value:
vars:
  - entity_var:
      name: first_seen_tracks # Alias for the temp feature
      select: min(timestamp::date) # The SQL function to get the oldest timestamp from the tracks table
      from: inputs/rsTracks # Referring to the tracks table defined in the inputs.yaml file
      description: First seen timestamp from tracks table.
  - entity_var: # The previous block is repeated with the identifies table as the source
      name: first_seen_identifies
      select: min(timestamp::date)
      from: inputs/rsIdentifies
      description: First seen timestamp from identifies table.
  - entity_var:
      # Once the min timestamps from both tracks and identifies are defined, we pick the earlier of the two here.
      # The previous two are temp features used to derive this feature.
      # In the profiles.yaml file, we can also see an attribute called features;
      # we declare this name as a feature so it doesn't get deleted and is stored in the final feature table.
      name: first_seen_date # This is the name with which the feature gets stored
      select: to_date(least(first_seen_tracks, first_seen_identifies)) # Uses the names defined in the above entity_vars and performs SQL operations to get the lowest of these values
features:
  - first_seen_date # Ensures that the feature is stored in the final table (unlike first_seen_tracks or first_seen_identifies)
campaign_source_first_touch: From the identifies table, picks the earliest available campaign_source. Sometimes, the first touch may be organic (null source) but the same user may arrive later through a campaign. In such cases, it is desirable to capture the first non-organic touch. For this, you can filter out the empty rows through a where operator:
vars:
  - entity_var:
      name: campaign_source_first_touch # Name of the feature
      select: first_value(context_campaign_source) # Window functions are supported in entity_vars, as long as the value is unique for a given user ID
      window: # All the window functions in entity_vars are partitioned by main_id by default. It can take only order_by as a parameter.
        order_by: # In this example, we take the oldest occurrence of the campaign source, so we sort the rows by timestamp in ascending order
          - timestamp asc
      from: inputs/rsIdentifies
      where: context_campaign_source is not null and context_campaign_source != '' # We can filter rows with a where condition. In this case, we ignore all rows where the campaign source is null or blank - to ensure we ignore organic visits
features:
  - campaign_source_first_touch
avg_session_length_in_sec_last_7_days: RudderStack captures a session_id and a session_start_time by default in the page and track calls. You can use this session_id to compute each session's length and then average the lengths.
To add a time-bounded feature, for example, considering only the sessions in the past week, entity_var is not sufficient. The output of an entity_var is assumed to be a single row per user_id/main_id. However, calculating the average session length requires first computing the length of each session and then averaging it across all sessions. To do so, you can use input_var. With input_var, you can apply transformations on (a copy of) the input table itself, where row-level modifications are possible, and you can use window functions with arbitrary partition_by clauses:
vars:
  - input_var:
      name: session_start_time
      select: min(timestamp) # Full timestamp, so session lengths in seconds are meaningful
      window:
        partition_by:
          - session_id # Partitioning on session_id, or any key other than main_id, wouldn't have been possible in entity_var
          - main_id # input_vars do not make any assumptions around main_id being the primary key of the table, hence it needs to be given in partition_by if required
      from: inputs/rsTracks
  - input_var:
      name: session_end_time
      select: max(timestamp)
      window:
        partition_by:
          - session_id
          - main_id
      from: inputs/rsTracks
  - input_var:
      name: session_length
      select: datediff(second, session_start_time, session_end_time)
      from: inputs/rsTracks
  - entity_var: # This is the final query
      name: avg_session_length_in_sec_last_7_days
      select: avg(session_length)
      from: inputs/rsTracks
      where: datediff(day, date(session_start_time), current_date()) <= 7
features:
  - avg_session_length_in_sec_last_7_days