Glossary

Familiarize yourself with the commonly used terms across RudderStack Profiles.

Edge sources

The edge_sources field provides the input sources for an identity stitching model. You can specify it in the models/profiles.yaml file to list the input sources defined in the models/inputs.yaml file.

Entity

Entity refers to an object for which you can create a profile. An entity can be a user, customer, product, or any other object that requires profiling.

Entity var

The entity_var field provides the input sources for the feature table model along with the feature name, column names, description, values, etc. You can specify it in the models/profiles.yaml file and reference it in other parts of the model. You can also define an entity_var as a feature.

Features

Features are inputs for the machine learning model. In a general sense, features are pieces of user information we already know, for example, the number of days a user opened the app in the past week or the items they left in the cart.

Inputs

Inputs refers to the input data sources used to create the material (output) tables in the warehouse. The inputs file (models/inputs.yaml) lists the tables/views you use as input sources, including the column name and SQL expression for retrieving the values.

You can use data from various input sources such as event stream (loaded from event data), ETL extract (loaded from Cloud Extract), and any existing tables in the warehouse (generated by external tools).

Input var

The input_var field is similar to entity_var, except that each value in it is associated with a row of the specified input instead of a row of the feature table.

While an entity_var acts as a helper column to the feature table, an input_var acts as a helper column to the input. Also, an input_var can’t be defined as a feature, nor does it get stored in the final output table.
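To make the distinction concrete, the following sketch mimics the two kinds of helper columns in plain Python. It is only an illustration under assumed data, not PB syntax or PB's implementation: the orders rows, the is_large_order and large_order_count names, and the threshold of 100 are all hypothetical.

```python
from collections import defaultdict

# Hypothetical input rows: one row per order.
orders = [
    {"user_id": "u1", "amount": 250},
    {"user_id": "u1", "amount": 40},
    {"user_id": "u2", "amount": 120},
]

# input_var analogue: one value per row of the input.
for row in orders:
    row["is_large_order"] = row["amount"] > 100

# entity_var analogue: one value per entity (one row of the feature table).
large_order_count = defaultdict(int)
for row in orders:
    large_order_count[row["user_id"]] += int(row["is_large_order"])

# Only the entity-level value reaches the output table;
# the row-level helper stays with the input.
feature_table = [
    {"user_id": user_id, "large_order_count": count}
    for user_id, count in sorted(large_order_count.items())
]
```

Here the per-row flag exists only to compute the per-user count, which mirrors how an input_var supports an entity_var without appearing in the output itself.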

Label

Label is the output of the machine learning model and is the metric we want to predict. In our case, it is the unknown user trait we want to know in advance.

Machine learning model

A machine learning model can be thought of as a function that takes in some input parameters and returns an output.

Unlike regular programming, this function is not explicitly defined. Instead, a high-level architecture is defined and the remaining pieces are filled in by looking at the data. This process of filling the gaps is driven by optimisation algorithms that try to learn complex patterns in the input data that explain the output.
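As a minimal sketch of this idea, the example below fixes the architecture as a straight line, y = w * x + b, and fills in the two gaps (w and b) from data. Closed-form least squares is used here as a simple stand-in for the optimisation algorithms mentioned above; the function names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Learn w and b of y = w * x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(w, b, x):
    """Apply the learned function to a new input."""
    return w * x + b


# Hypothetical data: a single feature per user and the observed label.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

w, b = fit_line(xs, ys)
prediction = predict(w, b, 5)
```

Nothing in fit_line states that the relationship is "double the input and add one"; those values are recovered from the data alone.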

Materialization

Materialization refers to the process of creating output tables/views in a warehouse using PB models. You can define the run_type to create the output table:

  • run_type: Possible values are:
    • discrete: Calculates the model result from its inputs whenever it is run (default mode). A SQL model supports only the discrete run type.
    • incremental: The model reads updates from the input sources and the results from the previous run. It only updates or inserts data and does not delete anything. The incremental mode is more efficient. However, only the identity stitching model supports it.
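The difference between the two run types can be sketched in a few lines of Python. This is a conceptual illustration only, not how PB is implemented: the function names, the id/value row shape, and the sample data are assumptions made for the example.

```python
def run_discrete(input_rows):
    """Rebuild the result from the full input on every run."""
    return {row["id"]: row["value"] for row in input_rows}


def run_incremental(previous_result, new_rows):
    """Start from the previous result and upsert only the new rows."""
    result = dict(previous_result)
    for row in new_rows:
        result[row["id"]] = row["value"]  # insert or update, never delete
    return result


# Hypothetical input data for two consecutive runs.
day1 = [{"id": "a", "value": 1}, {"id": "b", "value": 2}]
day2_updates = [{"id": "b", "value": 20}, {"id": "c", "value": 3}]

first_run = run_discrete(day1)
second_run = run_incremental(first_run, day2_updates)
```

The incremental run touches only the two updated rows, whereas a discrete run on the second day would have to read all of the input again to reach the same result.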

Material tables

When you run the PB models, they produce materials - that is, tables/views in the database that contain the results of that model run. These output tables are known as material tables.

Predictions

The model’s output is called a prediction. A good model makes predictions that are close to the actual label. You generally need predictions where the labels are not available. In our case, most often the labels come a few days later.

Prediction horizon days

This refers to the number of days in advance for which we make the prediction.

For example, statements like “A user is likely to sign-up in the next 30 days, 90 days, etc.” are often time-bound, that is, the predictions are meaningless without the time period.
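The sketch below shows why the horizon matters: the same user gets a different label depending on the time window. The function name, the dates, and the choice to count the window from the day after the prediction date are assumptions made for illustration.

```python
from datetime import date, timedelta


def label_for_horizon(prediction_date, event_date, horizon_days):
    """Return True if the event falls within horizon_days after prediction_date."""
    if event_date is None:
        return False  # the event never happened
    window_end = prediction_date + timedelta(days=horizon_days)
    return prediction_date < event_date <= window_end


# Hypothetical user who signs up 45 days after the prediction date.
prediction_date = date(2024, 1, 1)
signup_date = date(2024, 2, 15)

signed_up_within_30 = label_for_horizon(prediction_date, signup_date, 30)
signed_up_within_90 = label_for_horizon(prediction_date, signup_date, 90)
```

With a 30-day horizon this user counts as not having signed up, while with a 90-day horizon the same user counts as a sign-up.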

Profile Builder (PB)

Profile Builder (PB) is a command-line interface (CLI) tool that simplifies data transformation within your warehouse. It generates customer profiles by stitching data together from multiple sources. It takes existing tables or the output of other transformations as input to generate output tables or views based on your specifications.

PB project

A PB project is a collection of interdependent warehouse transformations. These transformations are run over the warehouse to query the data for sample outputs, discover features in the warehouse, and more.

PB model

Each transformation that can be applied to the warehouse data is called a PB model.

Schema versions

Every PB release supports a specific set of project schemas. A project schema determines the correct layout of a PB project, including the exact keys and their values in all project files.

Training

Training refers to the process of a machine learning model looking at the available data and trying to learn a function that explains the labels.

Once you train a model on historic users, you can use it to make predictions for new users. You need to keep retraining the model as you get new data so that the model continues to learn any emerging patterns in the new users.

