

Familiarize yourself with the commonly used terms across RudderStack Profiles.


Cohort

A Cohort refers to a subset of instances of an entity that meet a specified set of characteristics, behaviors, or attributes. Using cohorts, you can define core customer segments, drive targeted marketing campaigns, and deliver personalized experiences.

For example, you can define a cohort for users who purchased items in the last 90 days, users based in a specific geographical location, etc.

Custom Models (Python)

You can build custom Python models for machine learning by downloading pre-defined Python templates. The results are usually saved as attributes of related entities (for example, churnProbability).

Python models and pymodels work differently. In the case of Python models, developers (both at RudderStack and external) develop new model types in Python using the profiles-rudderstack package. An example Python package implementing a number of Python models is profiles-pycorelib.

Edge sources

The edge_sources field provides the input sources for an identity stitching model. You can specify it in the models/profiles.yaml file to list the input sources defined in the inputs.yaml file.
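For instance, an identity stitching model might list its edge sources roughly as follows. This is a minimal sketch: the model name, entity name, and input names are illustrative, and the exact keys can vary with your project's schema version.

```yaml
# models/profiles.yaml (illustrative sketch)
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      entity_key: user
      edge_sources:
        # Each entry references an input defined in models/inputs.yaml
        - from: inputs/rsIdentifies
        - from: inputs/rsTracks
```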


Entity

An entity is a digital representation of a class of distinct real-world objects for which you can create a profile. An entity can be a user, account, customer, product, or any other object that requires profiling.

Entity var/Entity features

These are the various attributes related to the entity whose profile you are creating, for example, name, city, or LastVisitTimestamp for the user entity. Each attribute is called an entity_var and is derived by performing a calculation or aggregation on a set of values. Together, all the attributes create a complete picture of the entity. By default, every entity_var is stored as a feature, such as days_active, last_seen, etc.
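For example, an entity_var that counts a user's active days could be defined roughly as follows. The group name, SQL expression, and input name are illustrative, and the exact layout depends on your project's schema version.

```yaml
# models/profiles.yaml (illustrative sketch)
var_groups:
  - name: user_vars
    entity_key: user
    vars:
      # Aggregates event rows into one value per user
      - entity_var:
          name: days_active
          select: count(distinct date(timestamp))
          from: inputs/rsTracks
```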

Feature Views

If the features/traits of an entity are spread across multiple entity vars and ML models, you can use Feature Views to bring them together into a single view. These views are usually defined in the pb_project.yaml file by creating entries under the feature_views key for the corresponding entity.
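A sketch of what such an entry might look like under an entity is shown below. The ID type and view name are illustrative, and the exact keys and placement depend on your schema version, so treat this only as an orientation.

```yaml
# pb_project.yaml (illustrative sketch)
entities:
  - name: user
    feature_views:
      using_ids:
        # Builds a view of all user features keyed by email
        - id: email
          name: user_profile_by_email
```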


Features

Features are the inputs to a machine learning model. In a general sense, features are pieces of user information you already know, for example, the number of days a user opened the app in the past week, or the items they left in their cart.

Feature tables (legacy)

Feature tables are outputs generated from events, user attributes, and other defined criteria across any data set in your data warehouse. You can define models that create feature tables for users using ID stitching, ML notebooks, external sources, etc.

ID Stitcher

Data usually comes from different sources, and these sources may assign different IDs. To uniquely track a user’s journey (or that of any other entity) across all these data sources, you need to stitch together all these IDs. ID stitching maps the different IDs of the same user (or any other entity) to a single canonical ID. It does this by performing connected-component analysis over the ID-to-ID edge graph specified in its configuration.

ID Collator

ID Collator is similar to ID Stitcher. It is used when an entity has only a single ID type associated with it (for example, session IDs). In such cases, connected-component analysis is not required, so a simpler model type called ID Collator is used instead. It consolidates all entity IDs from multiple input tables into a single collated list.


Inputs

Inputs refer to the input data sources used to create the material (output) tables in the warehouse. The inputs file (models/inputs.yaml) lists the tables/views you use as input sources, including the column names and SQL expressions for retrieving the values.

You can use data from various input sources such as event stream (loaded from event data), ETL extract (loaded from Cloud Extract), and any existing tables in the warehouse (generated by external tools).
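An input definition might look roughly like this. The table and column names are illustrative, and the exact keys may differ across schema versions.

```yaml
# models/inputs.yaml (illustrative sketch)
inputs:
  - name: rsIdentifies
    app_defaults:
      table: rudder_events.identifies
      occurred_at_col: timestamp
      ids:
        # Column (or SQL expression) holding each ID, its type, and its entity
        - select: user_id
          type: user_id
          entity: user
        - select: lower(email)
          type: email
          entity: user
```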

Input var

Instead of a single value per entity ID, an input_var represents a single value per row of an input model. Think of it as adding an extra column to an input model. You can use it to define entity features. However, it is not itself an entity feature because it does not represent a single value per entity ID.
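For example, a per-row session length could be derived with an input_var roughly as follows. The names and SQL expression are illustrative, and the exact layout depends on your schema version.

```yaml
# models/profiles.yaml (illustrative sketch)
var_groups:
  - name: session_vars
    entity_key: user
    vars:
      # One value per row of the input model, like an added column
      - input_var:
          name: session_length_minutes
          select: datediff(minute, session_start, session_end)
          from: inputs/rsSessions
```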


Label

Label is the output of the machine learning model and the metric you want to predict. In this case, it is the unknown user trait you want to know in advance.

Machine learning model

A machine learning model can be thought of as a function that takes in some input parameters and returns an output.

Unlike regular programming, this function is not explicitly defined. Instead, a high-level architecture is defined and several pieces are filled in by looking at the data. This process of filling in the gaps is driven by different optimization algorithms as they try to learn complex patterns in the input data that explain the output.


Materialization

Materialization refers to the process of creating output tables/views in a warehouse by running models. You can define the following fields for materialization:

  • output_type: Determines the type of output you want to create in your warehouse. Allowed values are:

    • table: Output is built and stored in a table in the warehouse.
    • view: Output is built and stored as a view in the warehouse.
    • ephemeral: Output is created as temporary data that serves as an intermediate stage to be consumed by another model.
  • run_type: Determines the run type of models. Allowed values are:

    • discrete (default): In this mode, the model runs in a full refresh mode, calculating its results from the input sources. A SQL model supports only the discrete run type.
    • incremental: In this mode, the model calculates its results from the previous run and only reads row inserts and updates from the input sources. It only updates or inserts data and never deletes anything, making it efficient. However, only the identity stitching model supports this mode.
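Both fields are typically set under a model's spec, roughly as follows. The model and input names here are illustrative, and exact placement may vary by schema version.

```yaml
# models/profiles.yaml (illustrative sketch)
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      entity_key: user
      materialization:
        output_type: table       # table | view | ephemeral
        run_type: incremental    # discrete (default) | incremental
      edge_sources:
        - from: inputs/rsIdentifies
```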

Material tables

When you run the PB models, they produce materials - that is, tables/views in the database that contain the results of that model run. These output tables are known as material tables.


Prediction

The model’s output is called a prediction. A good model makes predictions that are close to the actual label. You generally need predictions where the labels are not yet available; most often, the labels arrive a few days later.

Prediction horizon days

This refers to the number of days in advance for which the prediction is made.

For example, statements like “A user is likely to sign-up in the next 30 days, 90 days, etc.” are often time-bound, that is, the predictions are meaningless without the time period.

Profile Builder (PB)

Profile Builder (PB) is a command-line interface (CLI) tool that simplifies data transformation within your warehouse. It generates customer profiles by stitching data together from multiple sources. It takes existing tables or the output of other transformations as input to generate output tables or views based on your specifications.

PB project

A PB project is a collection of interdependent warehouse transformations. These transformations are run over the warehouse to query the data for sample outputs, discover features in the warehouse, and more.

PB model

Any transformation that can be applied to the warehouse data is called a PB model. RudderStack supports various types of models like ID stitching, feature tables, Python models, etc.

Schema versions

Every PB release supports a specific set of project schemas. A project schema determines the correct layout of a PB project, including the exact keys and their values in all project files.

SQL template models

Sometimes the standard model types provided by Profiles are insufficient to capture complex use cases. In such cases, RudderStack supports the use of SQL template models to explicitly templatize SQL.

SQL template models can be used as an input to an entity_var/input_var, or as an edge source in the ID stitcher.


Time grain

Using the time_grain parameter for a model, you can restrict that model’s context timestamp to the specified time boundary. In other words, a feature v1 with a time_grain value of a day only looks at data up to 00:00 hrs UTC of any particular day.

If you compute that feature at 3:00 PM or 5:00 PM, the result is the same because its inputs change only at the very beginning of the day. Similarly, if the time_grain for a model is set to a week, it needs to run only once a week; running it twice within the week won’t change its results.


Training

Training refers to the process of a machine learning model looking at the available data and trying to learn a function that explains the labels.

Once you train a model on historic users, you can use it to make predictions for new users. You need to keep retraining the model as you get new data so that the model continues to learn any emerging patterns in the new users.
