
Profiles FAQ

Commonly asked questions on RudderStack Profiles.

This guide contains solutions for some of the commonly asked questions on Profiles.

For queries or issues not listed in this guide, contact RudderStack Support.

Setup and installation

I have installed Python3, yet when I install and execute pb it doesn’t return anything on screen.

Restart your Terminal/Shell/PowerShell and try again.

You can also try locating your Python executable; pb is installed in the same directory as the executables bundled with other Python packages.

I am an existing user who updated to the new version and now I am unable to use the PB tool. On Windows, I get the error: 'pb' is not recognized as an internal or external command, operable program or batch file.

Execute the following commands to do a fresh install:

  1. pip3 uninstall profiles-rudderstack-bin
  2. pip3 uninstall profiles-rudderstack
  3. pip3 install profiles-rudderstack --no-cache-dir

I am unable to download, getting ERROR: Package 'profiles-rudderstack' requires a different Python: 3.7.10 not in '>=3.8, <=3.10'

Update your Python 3 to a version greater than or equal to 3.8 and less than or equal to 3.10.

I am unable to download profile builder by running pip3 install profiles-rudderstack even though I have Python installed.

Firstly, make sure that Python3 is correctly installed. You can also try to substitute pip3 with pip and execute the install command.

If that doesn’t work, it is highly likely that Python3 is only accessible from a local directory. In that case:

  1. Navigate to that directory and try the install command again.
  2. After installation, PB should be accessible from anywhere.
  3. Validate that you’re able to access the path using which pb.
  4. You may also execute echo $PATH to view current path settings.
  5. If it doesn’t show the path, you can find out where Profile Builder is installed using pip3 show profiles-rudderstack. This command displays a list of files associated with the application, including its installation location. Navigate to that directory.
  6. Navigate to the /bin subdirectory and execute ls to confirm that pb is present there.
  7. To add the path of the location where PB is installed via pip3, execute: export PATH=$PATH:<path_to_application>. This will add the path to your system’s PATH variable, making it accessible from any directory. It is important to note that the path should be complete and not relative to the current working directory.
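The PATH-related steps above can be sketched in a shell session. Here, $HOME/.local is only an assumed install prefix; substitute the location reported by pip3 show profiles-rudderstack on your machine:

```shell
# Assumed install prefix; replace with the location reported by
# `pip3 show profiles-rudderstack` on your machine.
PB_DIR="$HOME/.local"

# Add its bin subdirectory to PATH (use the full path, not a relative one).
export PATH="$PATH:$PB_DIR/bin"

# Verify the directory is now on PATH.
case ":$PATH:" in
  *":$PB_DIR/bin:"*) echo "on PATH" ;;
  *) echo "missing" ;;
esac
```

To make the change permanent, add the export line to your shell profile (for example, ~/.bashrc or ~/.zshrc).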

If you still face issues, you can install it manually. Contact us for the executable file and download it on your machine.

When I try to install the Profile Builder tool using pip3, I get an error message saying: Requirement already satisfied

Try the following steps:

  1. Uninstall PB using pip3 uninstall profiles-rudderstack.
  2. Install again using pip3 install profiles-rudderstack.

Note that this won’t remove your existing data such as models and siteconfig files.

I have multiple models in my project. Can I run only a single model?

Yes, you can. In your spec YAML file, for the model you don’t want to run, set enable_status to disabled under the materialization key:

    materialization:
      enable_status: disabled

A sample profiles.yaml file highlighting a disabled model:

- name: test_sql
  model_type: sql_template
  model_spec:
    validity_time: 24h # 1 day
    materialization:
      run_type: discrete
      enable_status: disabled # Disables running the model.
    single_sql: |
      {%- with input1 = this.DeRef("inputs/tbl_a") -%}
        select id1 as new_id1, {{input1}}.*
        from {{input1}}
      {%- endwith -%}
    occurred_at_col: insert_ts
    ids:
      - select: "new_id1"
        type: test_id
        entity: user

Warehouse permissions

I have two separate roles: one to read from input tables and another to write to output tables. How should I define the roles?

You need to create an additional role as a union of those two roles. A PB project needs to read the input tables and write the results back to the warehouse schema.

Furthermore, each run is executed using a single role as specified in the siteconfig.yaml file. Hence, it is best in terms of security to create a new role which has read as well as write access for all the relevant inputs and the output schema.
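As a hedged Snowflake-style sketch of such a combined role (the role, database, and schema names are placeholders, and your warehouse may require different grants):

```sql
-- Create one role that can both read inputs and write outputs.
CREATE ROLE IF NOT EXISTS pb_role;

-- Read access on the input schema.
GRANT USAGE ON SCHEMA my_db.input_schema TO ROLE pb_role;
GRANT SELECT ON ALL TABLES IN SCHEMA my_db.input_schema TO ROLE pb_role;

-- Write access on the output schema used by Profiles.
GRANT ALL ON SCHEMA my_db.rs_profiles TO ROLE pb_role;
```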

How do I test if the role I am using has sufficient privileges to access the objects in the warehouse?

You can use the pb validate access command to validate the access privileges on all the input/output objects.

Compile command

I am trying to execute the compile command by fetching a repo via GIT URL but getting this error: making git new public keys: ssh: no key found

You need to add the OpenSSH private key to your siteconfig.yaml file. If you get the error could not find expected afterwards, try correcting the spacing in your siteconfig.yaml file.
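A hedged sketch of the relevant siteconfig.yaml section; the exact key names may differ across versions, so verify them against the SiteConfiguration reference. Note the two-space indentation, which is what the could not find expected error usually points to:

```yaml
gitcreds:
  - reponame: git@github.com:<org>/<repo>.git
    credentials:
      type: privatekey
      key: |
        -----BEGIN OPENSSH PRIVATE KEY-----
        ...
        -----END OPENSSH PRIVATE KEY-----
```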

While trying to segregate identity stitching and feature table in separate model files, I am getting this error: mapping values are not allowed in this context

This is due to a spacing issue in the siteconfig.yaml file. You may create a new project to compare the spacing. Also, make sure you haven’t missed any keys.

Command progress & lifecycle

I executed a command and it is taking too long. Is there a way to kill a process on data warehouse?

It could be due to other queries running simultaneously on your warehouse. To clear them up, open the Queries tab in your warehouse and manually kill the long-running processes.

Due to the huge data, I am experiencing long execution times. My screen is getting locked, thereby preventing the process from getting completed. What can I do?

You can use the screen command on UNIX/MacOS to detach your screen and allow the process to run in the background. You can use your terminal for other tasks, thus avoiding screen lockouts and allowing the query to complete successfully.

Here are some examples:

  • To start a new screen session and execute a process in detached mode: screen -L -dmS profiles_rn_1 pb run. Here:
    • -L flag enables logging.
    • -dmS starts as a daemon process in detached mode.
    • profiles_rn_1 is the process name.
  • To list all the active screen sessions: screen -ls.
  • To reattach to a detached screen session: screen -r [PID or screen name].

The CLI was running earlier but it is unable to access the tables now. Does it delete the view and create again?

Yes, every time you run the project, Profiles creates a new materials table and replaces the view.

Hence, you need to grant a select on future views/tables in the respective schema and not just the existing views/tables.
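In Snowflake, for example, the future grants can look like this (schema and role names are placeholders):

```sql
GRANT SELECT ON FUTURE TABLES IN SCHEMA my_db.rs_profiles TO ROLE reader_role;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA my_db.rs_profiles TO ROLE reader_role;
```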

Does the CLI support downloading a git repo using siteconfig before executing pb run? Or do I have to manually clone the repo first?

You can pass the Git URL as a parameter instead of the project’s path, as shown:

pb run -p git@.....

When executing the run command, I get a message: Please use sequence number ... to resume this project in future runs. Does it mean that a user can exit using Ctrl+C, and later, if they provide this seq_no, the run will continue from where it was cancelled earlier?

The pb run --seq_no <> flag allows for the provision of a sequence number to run the project. This flag can either resume an existing project or use the same context to run it again.

With the introduction of time-grain models, multiple sequence numbers can be assigned and used for a single project run.

What flag should I set to force a run for same end time, even if a previous run exists?

You can execute pb run --force --model_refs models/my_id_stitcher,entity/user/user_var_1,entity/user/user_var_2,...

Can the hash change even if schema version did not change?

Yes, as the hash depends on the project’s implementation, while the schema version reflects the project’s YAML layout.

Identity stitching

There are many large connected components in my warehouse. To increase the accuracy of stitched data, I want to increase the number of iterations. Is it possible?

The default value of the largest diameter, that is, the longest path length in connected components, is 30.

You can increase it by defining a max_iterations key under model_spec of your ID stitcher model in models/profiles.yaml, and specifying its value as the max diameter of connected components.

Note that the algorithm can give incorrect results in case of a large number of iterations.
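A minimal sketch of where the key goes in models/profiles.yaml (the model name, input, and the value 50 are illustrative):

```yaml
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      entity_key: user
      max_iterations: 50 # max diameter of connected components; default is 30
      edge_sources:
        - from: inputs/rsIdentifies
```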

Do I need to write different query each time for viewing the data of created tables?

No, you can instead use a view name, which always points to the latest created material table. For example, if you’ve defined user_stitching in your models/profiles.yaml file, then execute SELECT * FROM MY_WAREHOUSE.MY_SCHEMA.user_stitching.

In my model, I have set the key validity_time: 24h. What happens when the validity of generated tables expire? Will re-running the identity stitching model generate the same hash until the validity expires?

Firstly, the hash does not depend on the timestamp; it depends on the YAML in the underlying code. That’s why the material name is material_name_HASH_SEQNO. The sequence number (SEQNO) depends on the timestamp.

Secondly, a material generated for a specific timestamp (aside from the timeless timestamp) is not regenerated unless you do a pb run --force. The CLI checks if the material you are requesting already exists in the database and, if it does, returns that. The validity_time is an extension of that.

For a model with validity_time: 24h and inputs having the timestamp columns, if you request a material for the latest time, but one was generated for that model 5 minutes ago, the CLI will return that one instead. Running a model with the CLI always generates a material for a certain timestamp; if you don’t specify a timestamp, it uses the current one.

So, for a model with validity_time (vt), and having the timestamp columns, if you request a material for t1, but one already exists for t0 where t1-vt <= t0 <= t1, the CLI will return that one instead.

If multiple materials exist that satisfy the requirement, then it returns the one with the timestamp closest to t1.

I want to use customer_id instead of main_id as the ID type. So I changed the name in pb_project.yaml, however now I am getting this error: Error: validating project sample_attribution: listing models for child source models/: error listing models: error building model domain_profile_id_stitcher: main id type main_id not in project id types.

In addition to making changes in the file pb_project.yaml file, you also need to set main_id_type: customer_id in the models/profiles.yaml file.
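A sketch of the second change in models/profiles.yaml (the model name is taken from the error message above; the other keys are illustrative):

```yaml
models:
  - name: domain_profile_id_stitcher
    model_type: id_stitcher
    model_spec:
      main_id_type: customer_id
      entity_key: user
```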

I ran identity stitching model but not able to see the output tables under the list of tables in Snowflake. What might be wrong?

In Snowflake, you can check the Databases > Views dropdown from the left sidebar. For example, if your model name is domain_profile_id_stitcher, you should be able to see the view with this name. In case it is still not visible, try changing the role using the dropdown menu from the top right section.

I am using a view as an input source but getting an error that the view is not accessible, even though it exists in the DB.

Views need to be refreshed from time to time. You can try recreating the view in your warehouse and also execute a SELECT * on the same.

What might be the reason for following errors:

  • processing no result iterator: pq: cannot change number of columns in view. The output view name already exists in some other project. To fix this, try dropping the view or changing its name.

  • creating Latest View of model 'model_name': processing no result iterator: pq: cannot change data type of view column "valid_at". Drop the view domain_profile in your warehouse and execute the command again.

  • processing no result iterator: pq: column "rudder_id" does not exist. This occurs when you execute a PB project with a model name, having main_id in it, and then you run another project with the same model name but no main_id. To resolve this, try dropping the earlier materials using cleanup materials command.

I have a source table in which email gets stored in the column for user_id, so the field has a mix of different ID types. I have to tie it to another table where email is a separate field. When doing so, I have two separate entries for email, as type email and user_id. What should I do?

You can add the following to the relevant input tables:

  - select: case when lower(user_id) like '%@%' THEN lower(user_id) else null end
    type: email 
    entity: user
    to_default_stitcher: true

How do I validate the results of identity stitching model?

Contact RudderStack Support if you need help in validating the clusters.

Which identifiers would you recommend that I include in the ID stitcher for an ecommerce Profiles project?

We suggest including identifiers that are unique for every user and can be tracked across different platforms and devices. These identifiers might include, but are not limited to:

  • Email ID
  • Phone number
  • Device ID
  • Anonymous ID
  • User names

You can specify these identifiers in the profiles.yaml file in the identity stitching model.

Remember, the goal of identity stitching is to create a unified user profile by correlating all of the different user identifiers into one canonical identifier, so that all the data related to a particular user or entity can be associated with that user or entity.

Feature Table

How can I run a feature table without running its dependencies?

Suppose you want to re-run the user entity_var days_active and the rsTracks input_var last_seen for a previous run with seq_no 18.

Then, execute the following command:

pb run --force --model_refs entity/user/days_active,inputs/rsTracks/last_seen --seq_no 18

Is it possible to run the feature table model independently, or does it require running alongside the ID stitcher model?

You can provide a specific timestamp while running the project, instead of using the default latest time. PB recognizes if you have previously executed an identity stitching model for that time and reuses that table instead of generating it again.

You can execute a command similar to: pb run --begin_time 2023-06-02T12:00:00.0Z --end_time 2023-06-03T12:00:00.0Z. Note that:

  • To reuse a specific identity stitching model, the timestamp value must exactly match the time it was run.
  • If you have executed the identity stitching model in incremental mode and do not have an exact timestamp to reuse, you can select any timestamp later than that of a non-deleted run, as subsequent incremental stitching takes less time.
  • To perform another identity stitching using PB, pick a timestamp (for example, 1681542000) and stick to it while running the feature table model. For example, the first time you execute pb run --begin_time 2023-06-02T12:00:00.0Z --end_time 2023-06-03T12:00:00.0Z, it will run the identity stitching model along with the feature models. However, in subsequent runs, it will reuse the identity stitching model and only run the feature table models.

While trying to add a feature table, I get an error at line 501, but my YAML file does not have that many lines.

The line number refers to the generated SQL file in the output folder. Check the console for the exact file name with the sequence number in the path.

While creating a feature table, I get this error: Material needs to be created but could not be: processing no result iterator: 001104 (42601): Uncaught exception of type 'STATEMENT ERROR': 'SYS _W. FIRSTNAME' in select clause is neither an aggregate nor in the group by clause.

This error occurs when you use a window function (such as LAST_VALUE) that requires a window frame clause, without specifying one. Make sure the entity_var includes a window key with an order_by clause. For example:

  - entity_var:
      name: email
      select: LAST_VALUE(email)
      from: inputs/rsIdentifies
      window:
        order_by:
          - timestamp desc

Is it possible to create a feature out of an identifier? For example, I have a user_main_id with two user_ids stitched to it. Only one of the user_ids has a purchase under it. Is it possible to show that user_id in the feature table for this particular user_main_id?

If you know which input/warehouse table served as the source for that particular ID type, then you can create features from any input and also apply a WHERE clause within the entity_var.

For example, you can create an aggregate array of user_id’s from the purchase history table, where total_price > 0 (exclude refunds, for example). Or, if you have some LTV table with user_id’s, you could exclude LTV < 0.
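A hedged sketch of such an entity_var; the table, column, and feature names here are assumptions:

```yaml
- entity_var:
    name: purchasing_user_ids
    select: array_agg(distinct user_id) # Snowflake aggregate; adjust for your warehouse
    from: inputs/purchase_history
    where: total_price > 0 # exclude refunds
```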


Are there any best practices I should follow when writing the PB project’s YAML files?

  • Use spaces instead of tabs.
  • Always use proper casing. For example: id_stitching, and not id_Stitching.
  • Make sure that the source table you are referring to exists in the data warehouse and that data has been loaded into it.
  • If you’re pasting table names from your Snowflake Console, remove the double quotes in the inputs.yaml file.
  • Make sure your syntax is correct. You can compare with the sample files.
  • Indentation is meaningful in YAML; make sure the spacing matches the levels shown in the sample files.

How do I debug my YAML file step-by-step?

You can use the --model_args parameter of the pb run command to do so. It lets you run your YAML file up to a specific feature/var. For example:

$ pb run -p samples/attribution --model_args domain_profile:breakpoint:blacklistFlag

See run command for more information.

This is only applicable to versions prior to v0.9.0.

Can I use double quotes when referencing another entity_var in a macro?

You can use an escape character. For example:

  - entity_var:
      name: days_since_last_seen
      select: "{{macro_datediff('{{user.Var(\"max_timestamp_bw_tracks_pages\")}}')}}"

Also, if you have a case statement, you can add something like the following:

select: CASE WHEN {{user.Var("max_timestamp_tracks")}}>={{user.Var("max_timestamp_pages")}} THEN {{user.Var("max_timestamp_tracks")}} ELSE {{user.Var("max_timestamp_pages")}} END

ML / Python Models

Despite deleting WhtGitCache folder and adding keys to siteconfig, I get this error: Error: loading project: populating dependencies for project:base_features, model: churn_30_days_model: getting creator recipe while trying to get ProjectFolder: fetching git folder for git@github.com:rudderlabs/rudderstack-profiles-classifier.git: running git plain clone: repository not found

If your token is valid, then replace git@github.com:rudderlabs/rudderstack-profiles-classifier.git with https://github.com/rudderlabs/rudderstack-profiles-classifier.git in the profile-ml file.


Why am I getting Authentication FAILED error on my data warehouse while executing the run/compile commands?

Some possible reasons for this error might be:

  • Incorrect warehouse credentials.
  • Insufficient user permissions to read and write data. You can ask your administrator to change your role or grant these privileges.

Why am I getting Object does not exist or not authorized error on running this SQL query: SELECT * FROM "MY_WAREHOUSE"."MY_SCHEMA"."Material_domain_profile_c0635987_6"?

You must remove double quotes from your warehouse and schema names before running the query, that is SELECT * FROM MY_WAREHOUSE.MY_SCHEMA.Material_domain_profile_c0635987_6.

Is there a way to obtain the timestamp of any material table?

Yes, you can use the GetTimeFilteringColSQL() method to get the timestamp column of any material. It filters out rows based on the timestamp. It returns the occurred_at_col in case of an event_stream table, or valid_at in case the material has that column. In the absence of both, it returns an empty string. For example:

  SELECT * FROM {<from_material>}
      WHERE <from_material>.GetTimeFilteringColSQL() > <some_timestamp>;

What is the difference between setting up Profiles in the RudderStack dashboard and Profile Builder CLI tool?

You can run Profiles in the RudderStack dashboard or via Profile Builder CLI.

The main difference is that the RudderStack dashboard only generates outputs based on the pre-defined templates. However, you can augment those outputs by downloading the config file and updating it manually.

On the other hand, the CLI tool lets you achieve the end-to-end flow by creating a Profile Builder project.

Does the Profiles tool have logging enabled by default for security and compliance purposes?

Logging is enabled by default for nearly all the commands executed by the CLI (init, validate access, compile, run, cleanup, etc.). Logs for all the output shown on screen are stored in the logfile.log file in the logs directory of your project folder. This includes logs for both successful and failed runs. RudderStack appends new entries to the end of this file every time a command is executed.

Some exceptions where the logs are not stored are:

  • query: The log file stores the printed output but not the actual database output. However, you can access the SQL query logs in your warehouse.
  • help: For any command.

In the warehouse, I see lots of material_user_id_stitcher_ tables generated in the rs_profiles schema. How do I identify the latest ID stitched table?

The view user_id_stitcher will always point to the latest generated ID stitcher. You may check its definition to see the exact table name it is referring to.
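In Snowflake, for example, you can inspect the view definition to see the exact material table it points to (the schema name is a placeholder):

```sql
SELECT GET_DDL('view', 'MY_SCHEMA.user_id_stitcher');
```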

How can I remove the material tables that are no longer needed?

To clean up all the materials older than a specific duration, for example 10 days, execute the following command:

pb cleanup materials -r 10

The minimum value you can set here is 1. So if you have run the ID stitcher today, then you can remove all the older materials using pb cleanup materials -r 1.

Which tables and views are important in Profiles schema that should not be deleted?

  • material_registry
  • material_registry_<number>
  • pb_ct_version
  • ptr_to_latest_seqno_cache
  • wht_seq_no
  • wht_seq_no_<number>
  • Views whose names match your models in the YAML files.
  • Material tables from the latest run (you may use the pb cleanup materials command to delete materials older than a specific duration).

I executed the auto migrate command and now I see a bunch of nested original_project_folder. Are we migrating through each different version of the tool?

This is a symlink to the original project. Click on it in the Finder (Mac) to open the original project folder.

I am getting an ssh: handshake failed error when referring to a public project hosted on GitHub. It throws an error for the https:// path and works fine for the ssh: path. I have set up a token in GitHub and added it to the siteconfig.yaml file, but I still get this error.

You need to follow a different format for gitcreds: in siteconfig. See SiteConfiguration for the format.

After changing siteconfig, if you still get an error, then clear the WhtGitCache folder inside the directory having the siteconfig file.

If I add filters to id_types in the project file, do all rows that include any of those values get filtered out of the analysis, or is it just the specific value of that ID type that gets filtered?

The PB tool does not extract rows. Instead, it extracts pairs from rows.

So if a row contains an email, a user_id, and an anonymous_id, and the anonymous_id is excluded, the PB tool still extracts the (email, user_id) edge from that row.

In the material registry table, what does status: 2 mean?

  • status: 2 means that the material has successfully completed its run.
  • status: 1 means that the material did not complete its run.

I am using Windows and get the following error: Error: while trying to migrate project: applying migrations: symlink <path>: A required privilege is not held by the client.

Your user requires privileges to create a symlink. You may either grant extra privileges to the user or run PowerShell as a user with Admin privileges. In case that doesn’t help, try to install and use it via WSL (Windows Subsystem for Linux).

Can I specify any git account like CommitCode or BitBucket while configuring a project in the web app?

Currently, the Profiles UI only supports repos hosted on GitHub.

Questions? Contact us by email or on Slack