Structure of a Profile Builder project and the parameters used in different files.
Sample project
This sample project takes multiple user identifiers from different warehouse tables and ties them together to create a unified user profile. The following sections describe how to define your PB project files:
Project detail
The pb_project.yaml file defines the project details such as the name, schema version, connection name, and the entities that represent different identifiers.
You can define all the identifiers from the different input sources that you want to stitch together into a rudder_id (main_id in this example):
# Project name
name: sample_id_stitching
# Project's yaml schema version
schema_version: 42
# Warehouse connection
connection: test
# Allow inputs without timestamps
include_untimed: true
# Folder containing models
model_folders:
  - models
# Entities in this project and their ids.
entities:
  - name: user
    id_stitcher: models/user_id_stitcher # name of the ID stitcher model, prefixed with the relative path of the models folder where it is defined
    id_types:
      - main_id # You need to add main_id to the list only if you have defined main_id_type: main_id in the ID stitcher buildspec.
      - user_id # one of the identifiers from your data source.
      - anonymous_id # needed because the inputs below map anonymous_id to this entity.
      - email
id_types:
  - name: main_id
  - name: user_id
    filters:
      # Multiple filters like exclude and include can be added using a value or regex match.
      - type: exclude
        value: ""
  - name: anonymous_id
  - name: email
    filters:
      # An include filter example to consider only values matching the provided regex. Note that the regex is automatically anchored in SQL with the beginning and ending special chars (^ and $).
      - type: include
        regex: ".+@.+" # Automatically anchored, so equivalent to "^.+@.+$"
Input
The input file (models/inputs.yaml) includes the input table references and the corresponding SQL for the above-mentioned entities:
inputs:
  - name: rsIdentifies
    contract: # constraints that a model adheres to
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
        - name: email
    app_defaults:
      table: rudder_events_production.web.identifies # one of the warehouse tables RudderStack generates when processing identify or track events.
      occurred_at_col: timestamp
      ids:
        - select: "user_id" # SQL (or column name) used to pick this value from the above table.
          type: user_id
          entity: user # as defined in the project file
          to_default_stitcher: true
        - select: "anonymous_id"
          type: anonymous_id
          entity: user
          to_default_stitcher: true
        - select: "lower(email)" # you can use SQL expressions.
          type: email
          entity: user
          to_default_stitcher: true
  - name: rsTracks
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
    app_defaults:
      table: rudder_events_production.web.tracks # another warehouse table maintained by RudderStack for processing track events.
      occurred_at_col: timestamp
      ids:
        - select: "user_id"
          type: user_id
          entity: user
          to_default_stitcher: true
        - select: "anonymous_id"
          type: anonymous_id
          entity: user
          to_default_stitcher: true
As seen in the above file (for example, lower(email)), you can also use SQL expressions in the select field to handle more complex scenarios.
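For instance, here is a hedged sketch of an id mapping that applies a SQL expression to clean values before stitching; the case expression and the test- prefix are illustrative and not part of the sample project:

    ids:
      - select: "case when user_id like 'test-%' then null else lower(user_id) end" # illustrative: ignore synthetic test IDs and normalize case
        type: user_id
        entity: user
        to_default_stitcher: true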
Model
The Profiles identity stitching model maps and unifies all the identifiers specified in the pb_project.yaml file across different platforms. It tracks the user journey across all the data sources and stitches the identifiers together into a single rudder_id.
A sample profiles.yaml file specifying an identity stitching model (user_id_stitcher) with relevant inputs:
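The exact buildspec varies by schema version; the following is a minimal sketch that assumes the usual models list with model_type: id_stitcher and a model_spec block, and the edge_sources from: syntax:

models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # output stays usable for 24 hours
      entity_key: user # entity defined in pb_project.yaml
      edge_sources: # inputs defined in models/inputs.yaml
        - from: inputs/rsIdentifies
        - from: inputs/rsTracks

The fields that make up model_spec are described below.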
validity_time
Specifies how long the model's output remains valid relative to its timestamp. For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with validity_time: 24h is still considered potentially valid and usable between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC for any run request that does not require a precise timestamp. Once the validity expires, scheduling takes care of generating new tables. Example values: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days.
entity_key
String
Specifies the entity this model belongs to, as defined in the pb_project.yaml file. For example, here it should be set to user.
materialization
List
Add the key run_type: incremental to run the project in incremental mode. This mode only considers rows inserted or updated in the edge_sources inputs since the previous run, inferred by checking the timestamp column. You can also provide a buffer time so that the next incremental run accounts for any lag in warehouse data, such as rows that arrive while a run is in progress. If you do not specify this key, it defaults to run_type: discrete. See the snippet after these fields for an example.
incremental_timedelta
List
(Optional) If the materialization key is set to run_type: incremental, this field sets how far back data should be fetched prior to the previous material for a model (to handle data lag, for example). The default value is 4 days.
main_id_type
ProjectRef
(Optional) ID type reserved for the output of the identity stitching model, often set to main_id. It must not be used in any of the inputs and must be listed as an id type for the entity being stitched. If you do not set it, it defaults to rudder_id. Add this key only when explicitly required, for example, when you want the main_id column of your identity stitcher table to be named main_id.
edge_sources
List
Specifies inputs for the identity stitching model as mentioned in the inputs.yaml file.
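For example, here is a hedged sketch of switching the ID stitcher to incremental runs; the exact placement of incremental_timedelta within model_spec is an assumption and may differ across schema versions:

    model_spec:
      entity_key: user
      materialization:
        run_type: incremental # default is discrete
      incremental_timedelta: 4d # look back 4 days before the previous material to absorb late-arriving rows
      edge_sources:
        - from: inputs/rsIdentifies
        - from: inputs/rsTracks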
Output
After running the project, you can view the generated material tables in your Snowflake warehouse:
Log in to your Snowflake console.
Click Worksheets from the top navigation bar.
In the left sidebar, click the relevant Database and Schema to view the list of all tables. Hover over a table to see its full name along with its creation date and time.
Write a SQL query like select * from <table_name> limit 10; and execute it to see the results:
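For example, the following query summarizes the stitched output by identifier type; the main_id and other_id_type columns reflect the typical ID stitcher output and the table name is a placeholder, so substitute the actual material table name shown in the sidebar:

select other_id_type, count(distinct main_id) as stitched_users
from <your_database>.<your_schema>.<id_stitcher_material_table> -- replace with the full table name from your schema
group by other_id_type
limit 10;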