Profiles Copilot

Get a high-level aggregate analysis of your ID graph across entities, plus a deep dive into individual IDs.

Profiles Copilot is a feature embedded in the PB core CLI tool that helps you analyze the health of your ID graph.

info
Profiles Copilot is only available in pb version 0.19.0.

This feature provides the following insights:

  • A high-level aggregate analysis of your overall ID graph across your entities.
  • The option to select a single customer and get a drill-down analysis and visualization of that customer’s individual graph.
  • An optional interactive LLM session where you can ask questions about your ID graph and get answers to them.

Prerequisites

  • Profiles v0.19.0 installed locally.
  • profiles-mlcorelib library (v0.6.0 or above) installed in your Python environment (v3.9.0 to v3.11.10) using pip install profiles-mlcorelib. It must be installed in the same Python environment as the profiles-rudderstack library.
  • Generate a Personal Access Token from the RudderStack dashboard. Then, add it to your site configuration under the rudderstack_access_token key (similar to the connections key).
  • A successful Profiles project run on the same pb version at least once that has generated an ID graph. Also, the ID graph definition should not have changed after the run.
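The version requirements above can be checked with a small script before running the audit. This is a minimal sketch, not part of the pb tooling: the helper names (parse_version, python_supported, meets_minimum, installed_version) are hypothetical, and the version parser assumes plain dotted versions such as 0.6.0 with no pre-release suffix.

```python
import sys
from importlib import metadata

# Bounds taken from the prerequisites list above.
MIN_PYTHON = (3, 9, 0)
MAX_PYTHON = (3, 11, 10)
MIN_MLCORELIB = "0.6.0"


def parse_version(text):
    """Turn a plain dotted version such as '0.6.0' into a tuple of ints.

    Assumption: no pre-release or local suffixes (for example 'rc1').
    """
    return tuple(int(part) for part in text.split("."))


def python_supported(version_info):
    """Return True if (major, minor, micro) lies within the supported range."""
    return MIN_PYTHON <= tuple(version_info[:3]) <= MAX_PYTHON


def meets_minimum(installed, minimum=MIN_MLCORELIB):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(distribution):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python version supported:", python_supported(sys.version_info))
    # Both libraries should be found in the same environment.
    for name in ("profiles-mlcorelib", "profiles-rudderstack"):
        print(name, "->", installed_version(name) or "not installed in this environment")
```

Run the script with the same Python interpreter that you use for pb, so that it inspects the environment where both libraries are expected to live.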

Workflow overview

To use this feature, go to your Profiles project directory and run the following command in your terminal:

pb audit id_stitcher

The audit tool then:

  1. Scans the project to look for ID stitcher models.

  2. Asks you to select one model if the project contains multiple ID stitcher models.

  3. For the selected model, presents the following three audit stages:

Static ID graph summary

The pb audit id_stitcher command outputs a detailed report on the ID stitcher. The report contains an ID graph summary that helps you understand:

  • ID graph’s definition, like ID types, main ID, etc.

  • Count of values, like:

    • Counts of each ID type before and after ID stitching.
    • Most widely connected IDs.
    • Resolved main ID stats on the number of IDs connected together, like average, 99th percentile, etc.
    • IDs that are never connected to any other ID.

Note that for all the counts, the tool shows some sample IDs, but the counts themselves are at an aggregate level and not at an individual main_id level.
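To make the summary metrics concrete, the following is a minimal sketch of how such numbers can be derived from ID graph rows. It is not the audit tool's implementation: the row shape (main_id, other_id, other_id_type), the function names, and the nearest-rank percentile method are assumptions chosen for illustration, and the rows are toy data.

```python
import math
from collections import Counter


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (an assumed method)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def distinct_ids_by_type(rows):
    """Count distinct (other_id, type) pairs per ID type."""
    distinct = {(other_id, id_type) for _, other_id, id_type in rows}
    return Counter(id_type for _, id_type in distinct)


def cluster_size_stats(rows):
    """Summarize how many distinct IDs are attached to each main ID."""
    sizes = Counter(main_id for main_id, _, _ in set(rows))
    ordered = sorted(sizes.values())
    return {
        "main_ids": len(ordered),
        "singletons": sum(1 for size in ordered if size == 1),
        "average": sum(ordered) / len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": nearest_rank_percentile(ordered, 50),
        "p99": nearest_rank_percentile(ordered, 99),
    }


# Toy rows: (main_id, other_id, other_id_type). All values are made up.
rows = [
    ("rid1", "a1", "anonymous_id"),
    ("rid1", "u1", "user_id"),
    ("rid1", "alex@example.com", "email"),
    ("rid2", "a2", "anonymous_id"),
    ("rid3", "a3", "anonymous_id"),
    ("rid3", "u3", "user_id"),
]

print(distinct_ids_by_type(rows))
print(cluster_size_stats(rows))
```

In this toy data, rid2 is the only main ID with a single attached ID, which corresponds to what the report calls a singleton.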

Sample report (note that the exact text and format may differ):
main id for entity user: USER_MAIN_ID


Analyzing ID Stitcher for entity: user
        Node types: ['id', 'anonymous_id', 'contact_id', 'email', 'lead_id', 'user_id']
        Unique IDs of each type and their counts:
                id: 10958
                anonymous_id: 4294899
                contact_id: 79854
                email: 122661
                lead_id: 75482
                user_id: 296346
Total Distinct IDs
        Before stitching: 4880200
        After stitching: 4290186


Top 10 nodes by edge count:
                ID: alex@example.com, Type: email, Edges: 2503
                ID: alex@example.com, Type: user_id, Edges: 2140
                ID: alice@example.com, Type: email, Edges: 1725
                ID: abc@example.com, Type: email, Edges: 946

Top 5 nodes of different types, before id stitching, by edge counts:

        type id by edge count:
                ID: a2ab0609d6c10187ff11, Edges: 2
                ID: a2ab060a860ee7b512, Edges: 2
                ID: a2ab0605e232d3c609, Edges: 2
                ID: a2ab060bb95dcaef14, Edges: 2
                ID: a2ab0600d159d259, Edges: 2

        type anonymous_id by edge count:
                ID: 49a8bebe-4d9d-47e5-8ea3-29153fc33dcb, Edges: 71
                ID: d81d2d84-71a8-46db-9d30-81502051c214, Edges: 61
                ID: 3421a313-189e-45d2-9270-e0ae3181e07d, Edges: 47

        type contact_id by edge count:
                ID: 003QP000001SRUjYAO, Edges: 7
                ID: 003QP000001LpBGYA0, Edges: 6

        type email by edge count:
                ID: alex@example.com, Edges: 2503
                ID: alice@example.com, Edges: 1725

        type lead_id by edge count:
                ID: 00Q5x00001zW1Y9EAK, Edges: 3
                ID: 00QQP00000800rR2AQ, Edges: 3

        type user_id by edge count:
                ID: alex@example.com, Edges: 2140
                ID: 28HhwSx1WEU56QM5sY7r74MEf9n, Edges: 1526

Average edge count per node (before stitching): 1.331919

Average edge count by node type (before stitching):
                id: 2.518149
                anonymous_id: 3.122989
                contact_id: 3.068089
                email: 3.406679
                lead_id: 2.453102
                user_id: 4.522037

                POST ID STITCHING ANALYSIS


Cluster size after stitching:
                No:of main ids with a single other id (Singleton nodes): 3998772
                Average other id counts: 1.137526
                Min other id counts for a single main id: 1
                Max other id counts for a single main id: 13711
                25th percentile: 1
                50th percentile: 1
                75th percentile: 1
                90th percentile: 1
                99th percentile: 4

Average number of ids of different id types, per main id, after stitching:
                id: 1.059563
                anonymous_id: 1.026212
                contact_id: 1.031319
                email: 1.049362
                lead_id: 1.104071
                user_id: 1.149782

Top 5 biggest clusters after id stitching (and the distinct id types in each cluster):
        Main ID: rid0e06a33cfdbccc7679e635e78f0f391f, Size: 13711
                anonymous_id: 11503
                user_id: 1205
                email: 457
                lead_id: 375
                contact_id: 117
                id: 54
        Main ID: rid039ea1d83a8484fa66f153143f013ac7, Size: 1526
                anonymous_id: 1525
                user_id: 1
        Main ID: rid25ddb86517babaa3f45dceef3c400330, Size: 964
                anonymous_id: 949
                user_id: 9
                email: 2
                contact_id: 2
                lead_id: 2
        Main ID: ridf0fef860592708f92f211df285bef625, Size: 833
                anonymous_id: 618
                user_id: 88
                email: 47
                lead_id: 47
                contact_id: 32
                id: 1
        Main ID: rid2e9139338ef8ccfec085e259da0ef3fb, Size: 719
                anonymous_id: 718
                user_id: 1

Singleton Node Analysis (after stitching):
anonymous_id: 3963171 nodes (92.28%) not connected to any other ID type
user_id: 28686 nodes (9.68%) not connected to any other ID type
contact_id: 5376 nodes (6.73%) not connected to any other ID type
lead_id: 1374 nodes (1.82%) not connected to any other ID type
id: 165 nodes (1.51%) not connected to any other ID type

Checking for potential issues:
WARNING:__main__:Potential overstitching detected: Node alex@example.com of type email has 2503 edges

Checking for missing direct edges between node types (but indirect edges can exist):
WARNING:__main__:Warning: No direct edges found between id and anonymous_id
WARNING:__main__:Warning: No direct edges found between id and contact_id
WARNING:__main__:Warning: No direct edges found between id and lead_id
WARNING:__main__:Warning: No direct edges found between id and user_id
WARNING:__main__:Warning: No direct edges found between anonymous_id and id
WARNING:__main__:Warning: No direct edges found between anonymous_id and contact_id
WARNING:__main__:Warning: No direct edges found between anonymous_id and lead_id
WARNING:__main__:Warning: No direct edges found between contact_id and id
WARNING:__main__:Warning: No direct edges found between contact_id and anonymous_id
WARNING:__main__:Warning: No direct edges found between contact_id and user_id
WARNING:__main__:Warning: No direct edges found between lead_id and id
WARNING:__main__:Warning: No direct edges found between lead_id and anonymous_id
WARNING:__main__:Warning: No direct edges found between lead_id and user_id
WARNING:__main__:Warning: No direct edges found between user_id and id
WARNING:__main__:Warning: No direct edges found between user_id and contact_id
WARNING:__main__:Warning: No direct edges found between user_id and lead_id

 Check for missing edges between node types (direct or indirect) after id stitching:
No missing edges found between node types. GREAT!!


ANALYSIS COMPLETE FOR ENTITY: user
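The two checks at the end of the report (potential overstitching and missing direct edges between node types) can be approximated as follows. This is a sketch under stated assumptions, not the tool's own logic: the edge representation, the function names, and the threshold value are hypothetical, and the tool's actual overstitching threshold is not documented here. Unlike the report, which lists each direction separately, this sketch lists each unordered pair of types once.

```python
from collections import Counter
from itertools import combinations


def edge_counts(edges):
    """Count edges per node; each node is an (id, id_type) tuple."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def overstitching_suspects(edges, threshold):
    """Return nodes with more edges than the (assumed) threshold, busiest first."""
    suspects = [(node, n) for node, n in edge_counts(edges).items() if n > threshold]
    return sorted(suspects, key=lambda item: -item[1])


def missing_direct_type_pairs(edges, id_types):
    """Return unordered ID type pairs that no single edge connects directly."""
    seen = {frozenset((a[1], b[1])) for a, b in edges}
    return [
        pair
        for pair in combinations(sorted(id_types), 2)
        if frozenset(pair) not in seen
    ]


# Toy edges between (id, id_type) nodes. All values are made up.
edges = [
    (("a1", "anonymous_id"), ("u1", "user_id")),
    (("a2", "anonymous_id"), ("u1", "user_id")),
    (("a3", "anonymous_id"), ("u1", "user_id")),
    (("u1", "user_id"), ("alex@example.com", "email")),
    (("alex@example.com", "email"), ("c1", "contact_id")),
]
id_types = ["anonymous_id", "contact_id", "email", "user_id"]

print(overstitching_suspects(edges, threshold=3))
print(missing_direct_type_pairs(edges, id_types))
```

A missing direct edge is not necessarily a problem: as the report notes, two ID types can still be connected indirectly through a third type.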

Cluster analysis

The pb audit id_stitcher command also lets you understand specific clusters (main_ids), for example, how an email ID E1 is linked to a Salesforce ID S1.

The audit tool gives a snapshot of the complete cluster and also points to the IDs that may be problematic.

Sample:
You can explore specific clusters by entering an ID to see how the other ids are all connected and the cluster is formed.
The ID can be either the main ID or any other ID type.
Enter an ID to visualize (or 'skip' to skip this step): 
Acme

Your network visualization is ready! We've saved an interactive map of your data connections here:
/Users/user_name/git_repos/profiles-prod/graph_outputs/Acme_graph.html
You can open this file in your web browser to explore the network visually.
Cluster Summary:

        Network overview:
        -----------------
        1. Size: 7 ids connected by 9 links
        2. Connectivity: On average, each id is connected to 2.6 other ids
        3. Spread: The farthest connected ids are 2 steps apart

        Key Players:
        ------------       

        Most connected ids:
        - ID: acme.com, No:of edges: 6
        - ID: Acme, No:of edges: 2
        - ID: 000123XYZ, No:of edges: 2
        - ID: Acme America, No:of edges: 2
        - ID: 000456ABC, No:of edges: 2
        - ID: 000789DEF, No:of edges: 2
        - ID: Acme Inc, No:of edges: 2

        Critical Linking Nodes, and percent of id pairs they are connecting (these are the nodes that act as bridges across all nodes):
        - ID: acme.com, % of shortest paths: 80.00%

The audit tool also outputs an HTML file that helps you understand the cluster visually. The file is stored locally, and the tool prints its path so you can revisit it later. A sample file is shown:

(Figure: Cluster analysis)

In the above example, you can see how acme.com joins the Acme Inc and Acme America entities, which are probably different. You can address this by removing the acme.com domain.
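The network overview numbers and the effect of removing acme.com can be reproduced with a short graph sketch. The cluster below is a hypothetical reconstruction: the sample output shows only edge counts, so the specific pairings between the non-hub IDs are assumptions, as are the function names. The sketch uses a plain breadth-first search and assumes the cluster is connected.

```python
from collections import deque

HUB = "acme.com"
OTHERS = ["Acme", "000123XYZ", "Acme America", "000456ABC", "000789DEF", "Acme Inc"]

# Assumption: the hub links to every other id, and the rest pair up as below.
EDGES = [(HUB, other) for other in OTHERS] + [
    ("Acme", "000123XYZ"),
    ("Acme America", "000456ABC"),
    ("Acme Inc", "000789DEF"),
]


def build_adjacency(nodes, edges):
    """Undirected adjacency sets, ignoring edges that touch unknown nodes."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def distances_from(adjacency, start):
    """Breadth-first search: number of steps from start to each reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def cluster_overview(edges):
    """Size, link count, average connections, and farthest distance in steps."""
    nodes = {node for edge in edges for node in edge}
    adjacency = build_adjacency(nodes, edges)
    links = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    max_steps = max(max(distances_from(adjacency, n).values()) for n in nodes)
    return {
        "size": len(nodes),
        "links": links,
        "avg_connections": 2 * links / len(nodes),
        "max_steps": max_steps,
    }


def components_without(edges, removed):
    """Connected components that remain after dropping one node."""
    nodes = {node for edge in edges for node in edge} - {removed}
    adjacency = build_adjacency(nodes, edges)
    seen, components = set(), []
    for node in sorted(nodes):
        if node not in seen:
            component = set(distances_from(adjacency, node))
            seen |= component
            components.append(component)
    return components


print(cluster_overview(EDGES))
print(components_without(EDGES, HUB))
```

Under these assumed pairings, dropping the hub splits the single cluster into separate smaller ones, which is the effect that removing the acme.com domain is meant to achieve.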

Interactive LLM session

info
RudderStack currently supports this feature only for the Snowflake warehouse. Support for the other warehouses is coming soon.

The audit tool also provides an open-ended text2sql session where you can ask arbitrary questions related to the ID graph. Each question goes to an LLM, which prepares a SQL query.

The SQL query then runs on your data warehouse and generates an answer.

Sample:
Question:
        how many accounts do we have with more than one domain?

Thought:
        Ran following queries to get the answer:

        SELECT COUNT(*) AS accounts_with_multiple_domains
FROM (
    SELECT ACCOUNT_MAIN_ID
    FROM account_id_graph
    WHERE other_id_type = 'domain'
    GROUP BY ACCOUNT_MAIN_ID
    HAVING COUNT(DISTINCT other_id) > 1
) subquery;;

        SELECT COUNT(DISTINCT ACCOUNT_MAIN_ID) AS total_accounts_with_domains
FROM account_id_graph
WHERE other_id_type = 'domain';;

Response:

Now we have more context to provide a comprehensive answer:

There are 100 accounts that have more than one domain associated with them. This is out of a total of 25,000 accounts that have at least one domain.

To put this into perspective:
- Approximately 0.40% of accounts with domains (100 out of 25,000) have multiple domains associated with them.
- The vast majority (about 99.60%) of accounts with domains are associated with just a single domain.

This distribution suggests that while most accounts in the system are linked to a single domain, there is a small but significant number of accounts that operate across multiple domains. These could represent larger organizations, companies with multiple brands, or entities that have expanded their online presence across different domain names.

You can ask next question, type exit to quit, or back to select a different entity

Q: what are these accounts? why do they have multiple domains?
INFO:api_server.sessions.chat_session@2024-09-30 11:02:28,987: Getting response for prompt: what are these accounts? why do they have multiple domains?
INFO:api_server.sessions.chat_session@2024-09-30 11:02:28,988: Converted history to prompt
INFO:api_server.sessions.chat_session@2024-09-30 11:02:28,988: Making llm calls with the extended prompt

Response: 

Based on the query results, we can answer the user's question:

These are the top 10 accounts with multiple domains, ordered by the number of domains they have:

1. An account (ACCOUNT_MAIN_ID: rid0a615df48d4ae923f518ab3d052ded1f) with 8 domains, including <domain1>, <domain2>, <domain3>, and others.
2. Apex (ACCOUNT_MAIN_ID: rid7492d64d982d1e6b8531e12a39513d6e) with 4 domains across different countries.
3. Aurora (ACCOUNT_MAIN_ID: rid149ba1fd2a2c59b30663fb22d4aabfaa) with 4 domains in different countries.
4. Acme (ACCOUNT_MAIN_ID: rid49a8de71ef3cbe73d0359965e000a20b) with 3 domains in different Nordic countries.
5. A company (ACCOUNT_MAIN_ID: rid0a256a02fc2aaf3bf812db4df78b95fa) with 3 domains.
6. Clement (ACCOUNT_MAIN_ID: rid44cb9148b8c8b5b608d5a8ecebc11e4b) with 3 related domains.
7. Hero (ACCOUNT_MAIN_ID: rid301a9453dddc0bca3a7315af38370c3e) with 3 domains in different countries.
8. Buffalo (ACCOUNT_MAIN_ID: rid0c22fb973e9b75dd2356072f29c4f91e) with 3 related domains.
9. Majority (ACCOUNT_MAIN_ID: rid5d63f8165e5fa418cef359f0dcde888a) with 2 domains.
10. Super (ACCOUNT_MAIN_ID: rid21bb08ad7fd688b23e43d00b0b54d3c0) with 2 domains.

As for why these accounts have multiple domains, we can infer several reasons based on the patterns observed:

1. International presence: Companies like Apex, Aurora, Acme, and Buffalo have domains for different countries or regions.

2. Brand diversification: The first account with 8 domains seems to represent multiple brands or services under one parent company.

...response continued...

These multiple domains allow companies to tailor their online presence to different markets, products, or strategies while maintaining a unified backend infrastructure for user identification and data management.
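To see what the generated SQL in the sample computes, you can run the same two queries against a small in-memory table. This is an illustrative sketch only: it uses SQLite instead of Snowflake, the table and column names are copied from the sample queries, and the rows are made-up toy data.

```python
import sqlite3

# Queries adapted from the sample session above.
MULTI_DOMAIN_SQL = """
SELECT COUNT(*) AS accounts_with_multiple_domains
FROM (
    SELECT ACCOUNT_MAIN_ID
    FROM account_id_graph
    WHERE other_id_type = 'domain'
    GROUP BY ACCOUNT_MAIN_ID
    HAVING COUNT(DISTINCT other_id) > 1
) subquery
"""

WITH_DOMAIN_SQL = """
SELECT COUNT(DISTINCT ACCOUNT_MAIN_ID) AS total_accounts_with_domains
FROM account_id_graph
WHERE other_id_type = 'domain'
"""

# Toy rows: (ACCOUNT_MAIN_ID, other_id, other_id_type). All values are made up.
TOY_ROWS = [
    ("rid_a", "acme.com", "domain"),
    ("rid_a", "acme.se", "domain"),
    ("rid_a", "Acme", "name"),
    ("rid_b", "apex.com", "domain"),
    ("rid_c", "aurora.com", "domain"),
    ("rid_c", "aurora.no", "domain"),
    ("rid_d", "Hero", "name"),
]


def run_sample_queries(rows):
    """Load rows into an in-memory table and run both sample queries."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE account_id_graph "
            "(ACCOUNT_MAIN_ID TEXT, other_id TEXT, other_id_type TEXT)"
        )
        conn.executemany("INSERT INTO account_id_graph VALUES (?, ?, ?)", rows)
        multi = conn.execute(MULTI_DOMAIN_SQL).fetchone()[0]
        total = conn.execute(WITH_DOMAIN_SQL).fetchone()[0]
    finally:
        conn.close()
    return multi, total


multi, total = run_sample_queries(TOY_ROWS)
print(f"{multi} of {total} accounts with a domain have more than one ({multi / total:.2%})")
```

Recomputing the share from the two counts, as in the last line, is a quick way to sanity-check the percentages quoted in an LLM response.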

Privacy policy

For an interactive LLM session, RudderStack sends the user prompt and some data (for example, table and column names, SQL query response, etc.) to an external LLM service.

RudderStack currently uses two LLM service providers:

  • OpenAI (Privacy policy: here)
  • Anthropic (Privacy policy: here)

warning
Both the OpenAI and Anthropic privacy policies state that API data is retained for 30 days to monitor for policy violations.

The data is not used for training purposes and is deleted after 30 days.

The tool asks for your consent to use these LLMs before making the first LLM call. You can opt out of the LLM services if there are concerns about sending your data to the external providers.

info
If you opt out of the LLM services, the audit tool never makes any calls to the LLM provider.

Any consent you provide (opt-in or opt-out) is stored locally in the ~/.pb/preferences.yaml file on the same device and is reused in all future pb audit runs.

To reset your consent, manually delete the consent from the ~/.pb/preferences.yaml file.

Tracking

Note that for every command run, the tool tracks the audit_start and audit_end events.

It creates an anonymous_id in the ~/.pb/preferences.yaml file and uses this ID for future runs. All the runs get associated with this ID. On the very first run when anonymous_id is created, you will see the following disclaimer:

  Privacy Policy regarding Telemetry:
  - This tool collects usage statistics to help us improve the product.
  - We do not collect or store any sensitive information from your models.
  - You can opt out by adding 'telemetry_opted_out: true' to ~/.pb/preferences.yaml
info
The audit tool also collects the IP address of the device as a part of the standard RudderStack call. However, it does not collect any other user identifier.

To opt out of tracking, add the line telemetry_opted_out: true to the ~/.pb/preferences.yaml file.
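If you prefer to script the opt-out, the following is a minimal sketch that adds the line to the preferences file. It is not part of pb: the function names are hypothetical, and it treats the file as plain text, assuming that telemetry_opted_out is a top-level key written at the start of a line. Review the file afterwards if it contains other settings.

```python
from pathlib import Path

OPT_OUT_KEY = "telemetry_opted_out"


def with_telemetry_opt_out(text):
    """Return the file text with a single 'telemetry_opted_out: true' line.

    Any existing top-level telemetry_opted_out line is replaced.
    """
    kept = [
        line
        for line in text.splitlines()
        if not line.startswith(OPT_OUT_KEY + ":")
    ]
    kept.append(f"{OPT_OUT_KEY}: true")
    return "\n".join(kept) + "\n"


def opt_out(path=Path.home() / ".pb" / "preferences.yaml"):
    """Rewrite the preferences file in place, creating it if needed."""
    path = Path(path)
    current = path.read_text() if path.exists() else ""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(with_telemetry_opt_out(current))


if __name__ == "__main__":
    # Preview the result on an empty file; call opt_out() to apply it.
    print(with_telemetry_opt_out(""), end="")
```

The text transformation is kept separate from the file write so that you can preview the result before changing anything on disk.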
