
How we track multi-agent AI systems without losing visibility into agent orchestration

Anudeep

SDET at RudderStack

10 min read | Published: January 14, 2026


When deploying multi-agent AI systems—orchestrators, agent swarms, specialized agent teams—we faced a new analytics challenge. Single-agent systems follow a straightforward user → AI → response pattern, but multi-agent architectures introduce complexity: agent-to-agent handoffs, parallel execution, tool calling, and dynamic routing.

Without proper instrumentation, we were flying blind.

The core problem: Traditional AI analytics capture the user's question and final answer but create a black box around everything that happens in between. You know your multi-agent system is working, but you can't answer critical questions like which agents are bottlenecks, whether you're using the right models for each agent, where costs are accumulating, or how to optimize orchestration strategies.

This visibility gap had real business consequences. Teams wasted budget running expensive models for simple tasks, missed optimization opportunities because they couldn't identify slow agents, and struggled to prove ROI when they couldn't connect agent performance to user outcomes.

So we decided to extend RudderStack's AI Product Analytics Spec with multi-agent tracking capabilities. We defined a standardized event schema that now captures both user-facing interactions and internal agent orchestration, plus practical implementation patterns using RudderStack's data infrastructure.

💡 Follow this GitHub discussion for future updates

By following the same framework, you can track your own agent systems, revealing optimization opportunities, controlling costs, and connecting agent performance to business metrics.

Why multi-agent systems need different analytics

Single-agent vs multi-agent: A different paradigm

Single-agent systems are linear. A user asks a question, an LLM processes it, and you get a response. Analytics is straightforward: track the prompt, the response, latency, and cost.

Multi-agent systems are orchestrated workflows. A user asks a question, an orchestrator analyzes it, routes to specialized agents (sometimes in parallel), those agents may call tools or other LLMs, results get synthesized, and finally the user receives an answer. The path from question to answer involves multiple decision points, handoffs, and processing steps.

| Aspect | Single-Agent | Multi-Agent |
| --- | --- | --- |
| Flow | Linear: User → LLM → Response | Orchestrated: User → Router → [Agent1, Agent2, Agent3] → Synthesis → Response |
| Analytics focus | Prompt quality, response quality, latency | Agent selection, orchestration patterns, step-level performance |
| Optimization | Model choice, prompt engineering | Agent specialization, routing logic, parallel vs sequential execution |
| Cost structure | Single model cost per interaction | Variable costs across multiple agents and models |

What you're missing without agent-level visibility

Without agent-level analytics, you can't measure the unit economics of your AI. This can cause various issues:

Performance bottlenecks: You see that responses take 8 seconds on average, but you don't know if that's because your query-building agent is slow, your validation agent is making redundant calls, or your synthesis step is inefficient.

Cost optimization opportunities: You're running GPT-4 for every agent, but your validator agent could use GPT-3.5-turbo and save 90% on costs without impacting quality. Your data extraction agent could use a specialized model that's 10x cheaper. You won't discover these opportunities without per-agent cost visibility.

Model selection: Different agents have different requirements. Your planner needs reasoning capability (use a frontier model), but your formatter just structures data (use a fast, cheap model). Without metrics showing which agents are cost/latency bottlenecks, you can't make informed model choices.

Orchestration patterns: Are your agents executing sequentially when they could run in parallel? Is your routing logic sending requests to the wrong specialists? Without tracking agent collaboration patterns, you can't optimize your orchestration strategy.

Tool calling efficiency: Your agents make external API calls—to databases, knowledge bases, or other services. These tool calls often dominate latency and can fail silently. Without step-level tracking, you can't identify problematic integrations.

Real business impact: With proper agent-level analytics, we discovered:

  • 30-50% cost reduction by using appropriate models per agent
  • 40-60% latency improvement by optimizing orchestration
  • 2-3x increase in success rates by identifying and fixing failing agent steps
  • Clear attribution of outcomes to specific agent improvements

Understanding the ID hierarchy for multi-agent tracking

Multi-agent systems require a three-level identification structure to connect all events in a workflow:

Level 1: Conversation ID

The complete user conversation spanning multiple questions and answers. A user might have a 10-minute conversation with your AI assistant, asking several questions. All of those interactions share one conversation_id.

Level 2: Interaction ID

A single user question and the corresponding final answer. Within a conversation, each time the user asks a question and receives a response, that's one interaction with a unique interaction_id. This is the fundamental unit of user experience.

Level 3: Agent session ID

An individual agent's execution within an interaction. When the orchestrator routes work to a specialized agent, that agent's complete execution (including all its LLM calls and tool calls) gets a unique agent_session_id.

Real-world example: Customer Data Platform Assistant

conversation_id: "conv_abc123"
├── interaction_id: "int_1"
│   User: "Build me a segment of high-value customers who purchased in the last 30 days"
│   │
│   ├── agent_session_id: "session_1" (Planner Agent)
│   │   └── Analyzes request, creates execution plan
│   │
│   ├── agent_session_id: "session_2" (Query Builder Agent)
│   │   ├── Step 1: LLM call to generate SQL
│   │   ├── Step 2: Tool call to validate SQL syntax
│   │   └── Step 3: Tool call to execute query
│   │
│   ├── agent_session_id: "session_3" (Results Validator Agent)
│   │   ├── Step 1: LLM call to check data quality
│   │   └── Step 2: Tool call to get sample records
│   │
│   └── Final answer: "I've created a segment with 1,247 high-value customers..."
└── interaction_id: "int_2"
    User: "Can you add only customers from California?"
    ├── agent_session_id: "session_4" (Query Modifier Agent)
    │   └── Updates existing query with geographic filter
    └── Final answer: "Updated! Now showing 342 California customers..."

This hierarchy enables powerful analytics:

  • Track end-to-end user experience at the interaction level
  • Analyze which agents contribute to slow interactions
  • Understand agent collaboration patterns across sessions
  • Identify friction points across the combined user and agent journey
  • Calculate costs and token usage per agent type

How to map technical execution to user journeys:

  • Conversation: The full customer relationship/session.
  • Interaction: The specific user intent or 'Job to be Done.'
  • Agent Session: The cost and effort required to fulfill that intent.
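
To make the hierarchy concrete, here is a minimal sketch of how you might generate and propagate these IDs through an orchestrator. The helper names and the uuid dependency are illustrative, not part of the spec; use whatever ID scheme your stack already provides.

JAVASCRIPT
// Illustrative sketch (not part of the spec): generating and propagating
// the three-level ID hierarchy so every tracked event shares the same context.
import { v4 as uuidv4 } from 'uuid';

function newConversation() {
  // One conversation_id per user conversation with the assistant
  return { conversation_id: `conv_${uuidv4()}` };
}

function newInteraction(conversation) {
  // One interaction_id per user question / final answer cycle
  return { ...conversation, interaction_id: `int_${uuidv4()}` };
}

function newAgentSession(interaction, agentName) {
  // One agent_session_id per agent execution within the interaction
  return {
    ...interaction,
    agent_session_id: `session_${uuidv4()}`,
    agent_name: agentName
  };
}

// Usage: build the context once, then pass it down through the orchestrator
// so every event carries the same conversation_id and interaction_id.
const conversation = newConversation();
const interaction = newInteraction(conversation);
const plannerSession = newAgentSession(interaction, 'planner');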

The multi-agent analytics spec

This specification provides comprehensive tracking for multi-agent systems at two levels:

  1. Standard events capture the complete user journey and outcomes, and
  2. Agent Performance events unlock granular visibility into unit economics and latency drivers. This allows you to align your data strategy with your optimization goals.

Standard events

NOTE: This spec extends the original spec for AI Product Analytics, which was aimed at tracking user-facing interactions only. The events from that spec are called “standard events” and must be included to ensure comprehensive tracking.

These events track the user-facing interaction and should be implemented by all teams tracking multi-agent systems.

| Event name | Description |
| --- | --- |
| ai_user_prompt_created | User submits a question to the agent system |
| ai_agent_response_received | Individual agent completes and returns its output |
| ai_interaction_completed | Final response delivered to user, marking end of the interaction cycle |
| ai_user_action | User interaction with the response (feedback, copy, share) |

1. ai_user_prompt_created (capture user input)

Tracks when a user submits a prompt to your multi-agent system.

When to track: Immediately after user submits their question

Implementation example:

JAVASCRIPT
// Using RudderStack JavaScript SDK
rudderanalytics.track('ai_user_prompt_created', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1', // New interaction starts
  prompt_text: 'Build a segment of high-value customers',
  input_method: 'text', // 'text', 'voice', 'button'
  ...
});

Key properties:

  • conversation_id: Required. Links all interactions in a conversation
  • interaction_id: Required. Unique identifier for this Q&A cycle
  • prompt_text: The user's input (consider privacy - see best practices)

2. ai_agent_response_received (capture individual agent output)

Tracks when an individual agent completes its task and returns output. Multiple agents may generate responses for a single user question.

When to track: Each time an agent completes its execution and returns results

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_agent_response_received', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_name: 'query_builder',
  response_text: 'Generated SQL query: SELECT * FROM customers WHERE...',
  response_status: 'success', // 'success', 'error', 'partial_success'
  duration_ms: 1850,
  steps_count: 3, // number of turns agent took (tool calls, thinking)
  model_used: 'gpt-4-turbo',
  cost: 0.018,
  tokens_used: {
    prompt_tokens: 150,
    completion_tokens: 95,
    cache_read: 245,
    cache_write: 12236
  },
  tools_used: ['sql_validator', 'warehouse_query']
});

Key properties:

  • conversation_id, interaction_id: Links this agent response to the user interaction
  • agent_name: Identifies which agent produced this output
  • response_text: Output from this agent
  • response_status: Whether this agent succeeded or failed
  • duration_ms: How long this specific agent took
  • model_used: The LLM model used by this agent
  • cost, tokens_used: This agent's resource consumption
  • tools_used: List of tools this agent called

3. ai_interaction_completed (capture final response and summary)

Captures the final synthesized response delivered to the user, along with aggregated metrics across all agents involved.

When to track: After the orchestrator delivers the final response to the user

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_interaction_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  status: 'success', // 'success', 'error', 'timeout', 'user_cancelled'
  final_response_text: 'I\'ve created a segment with 1,247 high-value customers...',
  agents_used: ['planner', 'query_builder', 'validator'],
  duration_ms: 4250,
  agents_count: 3,
  steps_count: 8, // Total steps across all agents
  total_cost: 0.045,
  tokens_used: {
    prompt_tokens: 450,
    completion_tokens: 380,
    cache_read: 245,
    cache_write: 12236
  }
});

Key properties:

  • final_response_text: The synthesized response shown to the user
  • status: Overall success/failure of the interaction
  • total_* metrics: Aggregated across all agents (duration, cost, tokens, steps)
  • agents_used: List of agents that contributed to this response
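
One way to produce these aggregate fields is to collect each agent's results in the orchestrator and roll them up when the final response goes out. The following is a hedged sketch, assuming your orchestrator keeps an in-memory array of per-agent results shaped like the ai_agent_response_received payload:

JAVASCRIPT
// Sketch: rolling per-agent results up into ai_interaction_completed.
// `agentResults` is an assumed in-memory structure; each entry mirrors
// what was sent in ai_agent_response_received for that agent.
function trackInteractionCompleted(ctx, agentResults, finalResponseText, status) {
  const totals = agentResults.reduce(
    (acc, r) => ({
      duration_ms: acc.duration_ms + r.duration_ms,
      total_cost: acc.total_cost + r.cost,
      steps_count: acc.steps_count + r.steps_count,
      prompt_tokens: acc.prompt_tokens + r.tokens_used.prompt_tokens,
      completion_tokens: acc.completion_tokens + r.tokens_used.completion_tokens
    }),
    { duration_ms: 0, total_cost: 0, steps_count: 0, prompt_tokens: 0, completion_tokens: 0 }
  );

  rudderanalytics.track('ai_interaction_completed', {
    conversation_id: ctx.conversation_id,
    interaction_id: ctx.interaction_id,
    status,
    final_response_text: finalResponseText,
    agents_used: agentResults.map(r => r.agent_name),
    agents_count: agentResults.length,
    duration_ms: totals.duration_ms, // use wall-clock time instead if agents run in parallel
    steps_count: totals.steps_count,
    total_cost: totals.total_cost,
    tokens_used: {
      prompt_tokens: totals.prompt_tokens,
      completion_tokens: totals.completion_tokens
    }
  });
}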

4. ai_user_action (capture user feedback)

Tracks user interactions with the agent response—feedback, sharing, copying, etc.

When to track: When user provides any action on the response

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_user_action', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  action_type: 'feedback_given', // 'feedback_given', 'copied', 'shared', 'regenerated'
  action_details: {
    feedback_type: 'rating',
    feedback_value: 5,
    feedback_text: 'Perfect! Exactly what I needed'
  }
});

Agent Performance & Optimization Events

These events capture the 'effort' behind the answer. They allow you to attribute costs to specific steps and identify exactly where latency is hurting user retention.

| Event name | Description |
| --- | --- |
| ai_agent_session_started | Agent begins execution |
| ai_agent_session_completed | Individual agent completes and returns its output |
| ai_agent_step_completed | Individual operation within an agent (LLM call, tool call, etc.) |

5. ai_agent_session_started (track agent initialization)

Tracks when an individual agent starts its execution within an interaction.

When to track: When the orchestrator delegates work to a specific agent

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_agent_session_started', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2', // Unique for this agent execution
  agent_name: 'query_builder',
  prompt_text: 'Generate SQL for high-value customer segment with lookback_days: 30',
  model_used: 'gpt-4-turbo'
});

Key properties:

  • agent_session_id: Unique identifier for this agent's execution
  • agent_name: Identifies which agent is starting
  • prompt_text: Input given to this agent (consider privacy)
  • model_used: The LLM model this agent will use

6. ai_agent_session_completed (track agent results)

Tracks when an agent completes its execution, successfully or with errors.

When to track: When agent finishes processing and returns results

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_agent_session_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2',
  agent_name: 'query_builder',
  status: 'success', // 'success', 'error', 'timeout'
  response_text: 'Generated and validated SQL query: SELECT * FROM...',
  duration_ms: 1850,
  steps_count: 3, // Number of steps this agent executed
  model_used: 'gpt-4-turbo',
  tools_used: ['sql_validator', 'warehouse_query'],
  cost: 0.018,
  tokens_used: {
    prompt_tokens: 150,
    completion_tokens: 95,
    cache_read: 245,
    cache_write: 12236
  }
});

Key properties:

  • agent_session_id: Links to the session started event
  • response_text: Output produced by this agent
  • status: Success or failure state
  • duration_ms: How long this agent took
  • steps_count: Number of operations this agent executed
  • model_used: The LLM model used by this agent
  • tools_used: List of tools this agent called
  • cost, tokens_used: Per-agent resource consumption
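
To keep the started/completed pair consistent, including on failures, you can wrap agent execution in a small helper that emits both events. This is a sketch under assumptions: the agent object exposing name, model, and a run() method that returns output, step count, tools, tokens, and cost is hypothetical, not a real API.

JAVASCRIPT
// Sketch: emitting ai_agent_session_started / ai_agent_session_completed around
// an agent run. The agent.run() interface below is an assumption.
async function runAgentWithTracking(ctx, agent, promptText) {
  const agentSessionId = `session_${crypto.randomUUID()}`;

  rudderanalytics.track('ai_agent_session_started', {
    conversation_id: ctx.conversation_id,
    interaction_id: ctx.interaction_id,
    agent_session_id: agentSessionId,
    agent_name: agent.name,
    prompt_text: promptText,
    model_used: agent.model
  });

  const startedAt = Date.now();
  try {
    const result = await agent.run(promptText);
    rudderanalytics.track('ai_agent_session_completed', {
      conversation_id: ctx.conversation_id,
      interaction_id: ctx.interaction_id,
      agent_session_id: agentSessionId,
      agent_name: agent.name,
      status: 'success',
      response_text: result.responseText,
      duration_ms: Date.now() - startedAt,
      steps_count: result.stepsCount,
      model_used: agent.model,
      tools_used: result.toolsUsed,
      cost: result.cost,
      tokens_used: result.tokensUsed
    });
    return result;
  } catch (err) {
    // Record the failure so error rates and durations stay attributable per agent
    rudderanalytics.track('ai_agent_session_completed', {
      conversation_id: ctx.conversation_id,
      interaction_id: ctx.interaction_id,
      agent_session_id: agentSessionId,
      agent_name: agent.name,
      status: 'error',
      duration_ms: Date.now() - startedAt,
      model_used: agent.model
    });
    throw err;
  }
}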

7. ai_agent_step_completed (track individual operations)

Tracks individual operations within an agent session—LLM calls, tool calls, planning, validation, etc.

When to track: After each discrete operation completes within an agent

Implementation examples:

Example A: LLM call step

JAVASCRIPT
rudderanalytics.track('ai_agent_step_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2',
  step_type: 'llm_call',
  // LLM-specific properties
  model_name: 'gpt-4-turbo',
  prompt_tokens: 150,
  completion_tokens: 95,
  total_tokens: 245,
  response_text: 'Generated SQL query structure',
  // Common properties
  status: 'success', // 'success', 'error', 'timeout'
  duration_ms: 1200
});

Example B: Tool call step

JAVASCRIPT
rudderanalytics.track('ai_agent_step_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2',
  step_type: 'tool_call',
  // Tool-specific properties
  tool_name: 'warehouse_query',
  tool_input: {
    query: 'SELECT * FROM customers WHERE value > 1000'
  },
  tool_output: {
    rows_returned: 1247,
    execution_time_ms: 120
  },
  // Common properties
  status: 'success', // 'success', 'error', 'timeout'
  duration_ms: 650
});

Key properties:

  • step_type: 'llm_call', 'tool_call'
  • Type-specific properties: Different fields based on step_type
  • Always include: status, duration_ms

Why step-level tracking matters: This is where you unlock unit economics. You might discover that 80% of your interaction cost comes from a single 'Research' step that rarely changes the final outcome—an immediate opportunity to switch models and improve margins without hurting the user experience.
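
As a rough illustration of that analysis, once step events land in your warehouse or analytics tool, a simple aggregation by step surfaces where latency and tokens concentrate. The stepEvents array below is a stand-in for whatever rows you pull downstream; it is not a RudderStack API.

JAVASCRIPT
// Sketch: finding the dominant step from exported ai_agent_step_completed events.
// `stepEvents` is a stand-in for rows queried from your warehouse or analytics tool.
function summarizeSteps(stepEvents) {
  const byStep = {};
  for (const e of stepEvents) {
    const key = e.step_type === 'tool_call' ? `tool:${e.tool_name}` : `llm:${e.model_name}`;
    byStep[key] = byStep[key] || { count: 0, duration_ms: 0, tokens: 0 };
    byStep[key].count += 1;
    byStep[key].duration_ms += e.duration_ms;
    byStep[key].tokens += e.total_tokens || 0;
  }
  // Sort by total latency so the biggest contributor surfaces first
  return Object.entries(byStep).sort((a, b) => b[1].duration_ms - a[1].duration_ms);
}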

Your framework for multi-agent analytics

Multi-agent AI systems represent a paradigm shift from linear AI interactions to orchestrated workflows. Without proper analytics, you're operating blind: unable to optimize performance, control costs, or prove value.

This guide introduced a standardized event schema that extends RudderStack's AI Product Analytics Spec to multi-agent architectures:

Standard events (ai_user_prompt_created, ai_agent_response_received, ai_interaction_completed, ai_user_action) give you fundamental metrics on system performance and user satisfaction.

Advanced events (ai_agent_session_started, ai_agent_session_completed, ai_agent_step_completed) provide granular visibility into orchestration patterns, agent performance, and step-by-step execution.

The three-level ID hierarchy (conversation_id, interaction_id, agent_session_id) connects all events in a workflow, enabling complete traceability from user question to agent execution to business outcome.

With this framework, you can:

  • Identify which agents are bottlenecks and optimize their performance
  • Choose appropriate models per agent to reduce costs by 30-50%
  • Track token consumption and cost at the agent level for precise budget control
  • Understand orchestration patterns and identify opportunities for parallelization
  • Identify cost efficiencies with complete visibility into agent sessions and steps
  • Connect agent performance to business outcomes and user satisfaction

Implementation approach: Start with Standard events to understand overall performance. Add Advanced events when you need to optimize costs or debug reliability issues. Use step-level tracking for agents where you need granular visibility into LLM versus tool call performance.

Leverage RudderStack's data infrastructure, including SDKs for event collection, Transformations for privacy-safe data processing, warehouse destinations for SQL analysis, and integrations with your existing analytics stack.
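
For example, privacy-safe processing can happen in a Transformation before events reach downstream destinations. The following is a minimal sketch, assuming the standard transformEvent signature for JavaScript Transformations; the email-redaction rule is illustrative only, so extend it with your own PII handling.

JAVASCRIPT
// Sketch of a Transformation that redacts raw prompt and response text before
// events are forwarded downstream. The regex below is illustrative only.
export function transformEvent(event, metadata) {
  const sensitiveFields = ['prompt_text', 'response_text', 'final_response_text'];
  for (const field of sensitiveFields) {
    if (event.properties && typeof event.properties[field] === 'string') {
      // Replace email addresses; add your own redaction rules as needed
      event.properties[field] = event.properties[field].replace(
        /[\w.+-]+@[\w-]+\.[\w.]+/g,
        '<redacted_email>'
      );
    }
  }
  return event;
}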

This guide is part of RudderStack's early work on AI product analytics. Share your use cases, challenges, and ideas to help shape the future of AI analytics.

Want to see how RudderStack can power your AI analytics? Book a demo to explore Transformations, warehouse integrations, and multi-agent tracking in action.

FAQs

  • Multi-agent AI analytics measures how orchestrated agent workflows behave end-to-end, including agent handoffs, parallel execution, tool calls, per-agent latency, and per-agent cost, not just the user prompt and final answer.

  • Single-agent analytics typically captures prompt, response, and overall latency. Multi-agent systems introduce routing decisions, multiple model calls, tool dependencies, and synthesis steps. Without agent-level visibility, you cannot locate bottlenecks or cost drivers.

  • A practical hierarchy is:

    • conversation_id for the full user conversation

    • interaction_id for one user question plus final response

    • agent_session_id for each agent’s execution within that interaction
      This structure makes it easy to attribute cost, latency, and failures to specific agents.

  • Start with user-facing “standard” events like ai_user_prompt_created, ai_interaction_completed, and ai_user_action. Then add agent performance events like ai_agent_session_started, ai_agent_session_completed, and ai_agent_step_completed to understand orchestration and unit economics.

  • Track tokens and cost at the agent and step level, then roll up totals per interaction. This lets you compute metrics like cost per successful interaction, cost per agent type, and the marginal cost of additional steps or tools.

  • At minimum: step_type (llm_call vs tool_call), status, duration_ms, and step-specific fields (model name and token counts for LLM calls; tool name, inputs/outputs, and error info for tool calls). This pinpoints whether latency is compute-driven or tool-driven.

  • Once you can see cost and latency by agent and step, you can right-size models per agent (cheap models for formatting/validation, frontier models for planning), remove redundant steps, and parallelize work where safe.

  • Treat prompt_text and tool inputs as sensitive by default. Use redaction, hashing, or intent classification before sending to downstream tools, and store raw text only where you have explicit consent and strong access controls.


Start delivering business value faster

Implement RudderStack and start driving measurable business results in less than 90 days.
