Structure of a Profile Builder project and the parameters used in different files.
Sample project
The following sections describe how to define your PB project files:
Project detail
The pb_project.yaml file defines the project details such as name, schema version, connection name and the entities which represent different identifiers.
You can define all the identifiers from different input sources you want to stitch together as a single ID (main_id in this example):
Profiles SQL model lets you write custom SQL queries to achieve advanced use-cases to create desired output tables.
A sample profiles.yaml file specifying a SQL model (test_sql):
models:- name:test_sqlmodel_type:sql_templatemodel_spec:validity_time:24h# 1 daymaterialization:// optionalrun_type:discrete // optional [discrete, incremental]single_sql:| {%- with input1 = this.DeRef("inputs/tbl_a") -%}
select id1 as new_id1, id2 as new_id2, {{input1}}.*
from {{input1}}
{%- endwith -%}occurred_at_col:insert_ts // optionalids:- select:"new_id1"type:test_identity:userto_default_stitcher:true- select:"new_id2"type:test_identity:userto_default_stitcher:true- select:"id3"type:test_identity:userto_default_stitcher:true
Model specification fields
Field
Data type
Description
validity_time
Time
Time Specifies the validity of the model with respect to its timestamp. For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with validity_time: 24h would still be considered potentially valid and usable for any run requests, which do not require precise timestamps between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC. This specifies the validity of generated feature table. Once the validity is expired, scheduling takes care of generating new tables. For example: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days.
materialization
List
Adds the key run_type: incremental to run the project in incremental mode. This mode considers row inserts and updates from the edge_sources input. These are inferred by checking the timestamp column for the next run. One can provide buffer time to consider any lag in data in the warehouse for the next incremental run like if new rows are added during the time of its run. If you do not specify this key then it’ll default to run_type: discrete.
single_sql
List
Specifies the SQL template which must evaluate to a single SELECT SQL statement. After execution, it should produce a dataset which will materialize based on the provided materialization.
multi-sql
List
Specifies the SQL template which can evaluate to multiple SQL statements. One of these SQL statements (typically the last one) must be a CREATE statement which shall be responsible for materializing the model into a table.
Note: You should set only one of single_sql or multi_sql.
occurred_at_col
List
Name of the column which contains the timestamp value in the output of sql template.
ids
List
Specifies the list of all IDs present in the source table along with their column names (or column SQL expressions). It is required in case you want to use SQL models as an input to the input_var or entity_var fields.
Output
After running the project, you can view the generated material tables in your Snowflake:
Log in to your Snowflake console.
Click Worksheets from the top navigation bar.
In the left sidebar, click Database and the corrosponding Schema to view the list of all tables. You can hover over a table to see the full table name along with its creation date and time.
Write a SQL query like select * from <table_name> limit 10; and execute it to see the results:
SQL template
You can pass custom SQL queries to the single_sql or multi_sql fields, which is also known as a SQL template. It provides the flexibility to write custom SQL by refering to any of the input sources listed in the inputs.yaml or any model listed in models/profiles.yaml.
The SQL templates follow a set query syntax which serves the purpose of creating a model. Follow the below rules to write SQL templates:
Avoid circular referencing while referencing the models. For example, sql_model_a references sql_model_b and sql_model_b references sql_model_a.
Use timestamp variable (refers to the start time of the current run) to filter new events.
this refers to the current model’s material. You can use the following methods to access the material properties available for this:
DeRef("path/to/model"): Use this syntax {{ this.DeRef("path/to/model") }} to refer to any model and return a database object corresponding to that model. The database object, in return, gives the actual name of the table/view in the warehouse. Then, generate the output, for example:
GetMaterialization(): Returns a structure with two fields: MaterializationSpec{OutputType, RunType}.
OutputType: You must use OutputType with ToSQL() method: For example, CREATE OR REPLACE {{this.GetMaterialization().OutputType.ToSQL()}} {{this.GetSelectTargetSQL()}} AS ...
RunType: For example, this.GetMaterialization().RunType
Model Contracts
With model contracts, you can declare constraints that the model adheres to. A model having a dependency on another model would also need to declare a contract specifying what columns and entities the input model must have. For contract validation, these columns should be present in the referenced model.
For an input of a project like a library project, the model contract is used to enforce constraints on tables/views that get wired to it downstream.
In SQL model, the contract would contain all the columns from IDs and features.
Each internal model also publishes the contract it promises to adhere to. Suppose rsSessionTable has an input shopify_session_features.
Model contracts enable rsSessionTable to specify the constraints that shopify_session_features must adhere to.
Here, rsSessionTable declares that its input shopify_session_features must have columns user_id and anonymous_id. This helps in improving data quality and error handling.
Internally, this requested contract is validated against shopify_session_features’s actual contract. For validation to pass, input_shopify_session_features_contract must be a subset of shopify_session_features’s published contract.
This enables more comprehensive static and dynamic validations of our projects.
This site uses cookies to improve your experience while you navigate through the website. Out of
these
cookies, the cookies that are categorized as necessary are stored on your browser as they are as
essential
for the working of basic functionalities of the website. We also use third-party cookies that
help
us
analyze and understand how you use this website. These cookies will be stored in your browser
only
with
your
consent. You also have the option to opt-out of these cookies. But opting out of some of these
cookies
may
have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This
category only includes cookies that ensures basic functionalities and security
features of the website. These cookies do not store any personal information.
This site uses cookies to improve your experience. If you want to
learn more about cookies and why we use them, visit our cookie
policy. We'll assume you're ok with this, but you can opt-out if you wish Cookie Settings.