Familiarize yourself with the commonly used terms across RudderStack Profiles.
edge_sources field provides the input sources for an identity stitching model. You can specify it in the models/profiles.yaml file to list the input sources defined in the models/inputs.yaml file.
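A minimal sketch of how this might look in models/profiles.yaml, assuming a typical Profiles project layout; the model name and input names below are hypothetical, and exact keys can vary across project schema versions:

```yaml
models:
  - name: user_id_stitcher        # hypothetical model name
    model_type: id_stitcher
    model_spec:
      entity_key: user
      edge_sources:
        # each entry points to an input defined in models/inputs.yaml
        - from: inputs/rsIdentifies
        - from: inputs/rsTracks
```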
Entity refers to an object for which you can create a profile. An entity can be a user, customer, product, or any other object that requires profiling.
entity_var field provides the inputs for a feature table model along with the feature name, column names, description, values, etc. You can specify it in the models/profiles.yaml file and reference it in other parts of the model. You can also define an entity_var as a feature.
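A minimal sketch of an entity_var, assuming the common Profiles layout; the variable name, SQL expression, and input reference are hypothetical, and the exact nesting (for example, under var_groups) depends on the project schema version:

```yaml
var_groups:
  - name: default_vars
    entity_key: user
    vars:
      - entity_var:
          name: days_active                          # feature name (hypothetical)
          select: count(distinct date(timestamp))    # SQL aggregation over the input
          from: inputs/rsTracks                      # input defined in models/inputs.yaml
          description: Number of distinct days the user was active
```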
Features are inputs for the machine learning model. In a general sense, features are pieces of user information we already know. For example, number of days they opened the app in the past week, items they left in the cart, etc.
Inputs refers to the input data sources used to create the material (output) tables in the warehouse. The inputs file (models/inputs.yaml) lists the tables/views you use as input sources, including the column name and SQL expression for retrieving the values.
You can use data from various input sources such as event stream (loaded from event data), ETL extract (loaded from Cloud Extract), and any existing tables in the warehouse (generated by external tools).
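A minimal sketch of a models/inputs.yaml entry; the table and input names are hypothetical, and the field names follow the typical Profiles schema, which may differ across versions:

```yaml
inputs:
  - name: rsTracks                     # hypothetical input name
    app_defaults:
      table: rudder_events.tracks      # existing warehouse table to read from
      occurred_at_col: timestamp
      ids:
        - select: user_id              # column holding the identifier
          type: user_id
          entity: user
        - select: lower(anonymous_id)  # SQL expression normalizing the value
          type: anonymous_id
          entity: user
```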
input_var field is similar to entity_var, except that each of its values is associated with a row of the specified input instead of a row of the feature table. While an entity_var acts as a helper column for the feature table, an input_var acts as a helper column for the input. Also, an input_var can't be defined as a feature, nor is it stored in the final output table.
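A minimal sketch of an input_var, assuming it accepts the same basic keys as an entity_var (name, select, from); the names and SQL expression below are hypothetical:

```yaml
vars:
  - input_var:
      name: is_purchase_row       # helper column (hypothetical)
      select: case when event = 'order_completed' then 1 else 0 end
      from: inputs/rsTracks       # value computed per row of this input
```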
Label is the output of the machine learning model and is the metric we want to predict. In our case, it is the unknown user trait we want to know in advance.
Machine learning model
A machine learning model can be thought of as a function that takes in some input parameters and returns an output.
Unlike regular programming, this function is not explicitly defined. Instead, a high-level architecture is defined and several pieces are filled in by looking at the data. This process of filling in the gaps is driven by optimization algorithms that try to learn complex patterns in the input data that explain the output.
Materialization refers to the process of creating output tables/views in a warehouse using PB models. You can define the run_type to control how the output table is created. Possible values are:
discrete: Calculates the model result from its inputs on every run (default mode). A SQL model supports only the discrete run type.
incremental: The model reads updates from the input sources along with the results of the previous run. It only updates or inserts data and does not delete anything. The incremental mode is more efficient; however, only the identity stitching model supports it.
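As a sketch, the run_type might be set under a materialization block in models/profiles.yaml; the model and input names are hypothetical, and the exact key layout may vary by schema version:

```yaml
models:
  - name: user_id_stitcher         # hypothetical model name
    model_type: id_stitcher
    model_spec:
      entity_key: user
      materialization:
        run_type: incremental      # id_stitcher supports incremental runs
      edge_sources:
        - from: inputs/rsIdentifies
```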
When you run the PB models, they produce materials - that is, tables/views in the database that contain the results of that model run. These output tables are known as material tables.
The model’s output is called a prediction. A good model makes predictions that are close to the actual label. You generally need predictions where the labels are not available. In our case, most often the labels come a few days later.
Prediction horizon days
This refers to the number of days in advance that the prediction is made.
For example, statements like “A user is likely to sign-up in the next 30 days, 90 days, etc.” are often time-bound, that is, the predictions are meaningless without the time period.
Profile Builder (PB)
Profile Builder (PB) is a command-line interface (CLI) tool that simplifies data transformation within your warehouse. It generates customer profiles by stitching data together from multiple sources. It takes existing tables or the output of other transformations as input to generate output tables or views based on your specifications.
A PB project is a collection of interdependent warehouse transformations. These transformations are run over the warehouse to query the data for sample outputs, discover features in the warehouse, and more.
Each transformation that can be applied to the warehouse data is called a PB model.
Every PB release supports a specific set of project schemas. A project schema determines the correct layout of a PB project, including the exact keys and their values in all project files.
Training refers to the process of a machine learning model looking at the available data and trying to learn a function that explains the labels.
Once you train a model on historic users, you can use it to make predictions for new users. You need to keep retraining the model as you get new data so that the model continues to learn any emerging patterns in the new users.