
How we track multi-agent AI systems without losing visibility into agent orchestration

Anudeep

SDET at RudderStack

10 min read | Published: January 14, 2026


When deploying multi-agent AI systems—orchestrators, agent swarms, specialized agent teams—we faced a new analytics challenge. Single-agent systems follow a straightforward user → AI → response pattern, but multi-agent architectures introduce complexity: agent-to-agent handoffs, parallel execution, tool calling, and dynamic routing.

Without proper instrumentation, we were flying blind.

The core problem: Traditional AI analytics capture the user's question and final answer but create a black box around everything that happens in between. You know your multi-agent system is working, but you can't answer critical questions like which agents are bottlenecks, whether you're using the right models for each agent, where costs are accumulating, or how to optimize orchestration strategies.

This visibility gap had real business consequences. Teams wasted budget running expensive models for simple tasks, missed optimization opportunities because they couldn't identify slow agents, and struggled to prove ROI when they couldn't connect agent performance to user outcomes.

So we decided to extend RudderStack's AI Product Analytics Spec with multi-agent tracking capabilities. We defined a standardized event schema that now captures both user-facing interactions and internal agent orchestration, plus practical implementation patterns using RudderStack's data infrastructure.

💡 Follow this GitHub discussion for future updates

By following the same framework, you can track your own agent systems, revealing optimization opportunities, controlling costs, and connecting agent performance to business metrics.

Why multi-agent systems need different analytics

Single-agent vs multi-agent: A different paradigm

Single-agent systems are linear. A user asks a question, an LLM processes it, and you get a response. Analytics is straightforward: track the prompt, the response, latency, and cost.

Multi-agent systems are orchestrated workflows. A user asks a question, an orchestrator analyzes it, routes to specialized agents (sometimes in parallel), those agents may call tools or other LLMs, results get synthesized, and finally the user receives an answer. The path from question to answer involves multiple decision points, handoffs, and processing steps.

| Aspect | Single-Agent | Multi-Agent |
| --- | --- | --- |
| Flow | Linear: User → LLM → Response | Orchestrated: User → Router → [Agent1, Agent2, Agent3] → Synthesis → Response |
| Analytics focus | Prompt quality, response quality, latency | Agent selection, orchestration patterns, step-level performance |
| Optimization | Model choice, prompt engineering | Agent specialization, routing logic, parallel vs sequential execution |
| Cost structure | Single model cost per interaction | Variable costs across multiple agents and models |

What you're missing without agent-level visibility

Without agent-level analytics, you can't measure the unit economics of your AI. This can cause various issues:

Performance bottlenecks: You see that responses take 8 seconds on average, but you don't know if that's because your query-building agent is slow, your validation agent is making redundant calls, or your synthesis step is inefficient.

Cost optimization opportunities: You're running GPT-4 for every agent, but your validator agent could use GPT-3.5-turbo and save 90% on costs without impacting quality. Your data extraction agent could use a specialized model that's 10x cheaper. You won't discover these opportunities without per-agent cost visibility.

Model selection: Different agents have different requirements. Your planner needs reasoning capability (use a frontier model), but your formatter just structures data (use a fast, cheap model). Without metrics showing which agents are cost/latency bottlenecks, you can't make informed model choices.

Orchestration patterns: Are your agents executing sequentially when they could run in parallel? Is your routing logic sending requests to the wrong specialists? Without tracking agent collaboration patterns, you can't optimize your orchestration strategy.

Tool calling efficiency: Your agents make external API calls—to databases, knowledge bases, or other services. These tool calls often dominate latency and can fail silently. Without step-level tracking, you can't identify problematic integrations.

Real business impact: With proper agent-level analytics, we discovered:

  • 30-50% cost reduction by using appropriate models per agent
  • 40-60% latency improvement by optimizing orchestration
  • 2-3x increase in success rates by identifying and fixing failing agent steps
  • Clear attribution of outcomes to specific agent improvements

Understanding the ID hierarchy for multi-agent tracking

Multi-agent systems require a three-level identification structure to connect all events in a workflow:

Level 1: Conversation ID

The complete user conversation spanning multiple questions and answers. A user might have a 10-minute conversation with your AI assistant, asking several questions. All of those interactions share one conversation_id.

Level 2: Interaction ID

A single user question and the corresponding final answer. Within a conversation, each time the user asks a question and receives a response, that's one interaction with a unique interaction_id. This is the fundamental unit of user experience.

Level 3: Agent session ID

An individual agent's execution within an interaction. When the orchestrator routes work to a specialized agent, that agent's complete execution (including all its LLM calls and tool calls) gets a unique agent_session_id.

Real-world example: Customer Data Platform Assistant

conversation_id: "conv_abc123"
├── interaction_id: "int_1"
│   User: "Build me a segment of high-value customers who purchased in the last 30 days"
│   │
│   ├── agent_session_id: "session_1" (Planner Agent)
│   │   └── Analyzes request, creates execution plan
│   │
│   ├── agent_session_id: "session_2" (Query Builder Agent)
│   │   ├── Step 1: LLM call to generate SQL
│   │   ├── Step 2: Tool call to validate SQL syntax
│   │   └── Step 3: Tool call to execute query
│   │
│   ├── agent_session_id: "session_3" (Results Validator Agent)
│   │   ├── Step 1: LLM call to check data quality
│   │   └── Step 2: Tool call to get sample records
│   │
│   └── Final answer: "I've created a segment with 1,247 high-value customers..."
└── interaction_id: "int_2"
    User: "Can you add only customers from California?"
    ├── agent_session_id: "session_4" (Query Modifier Agent)
    │   └── Updates existing query with geographic filter
    └── Final answer: "Updated! Now showing 342 California customers..."

This hierarchy enables powerful analytics:

  • Track end-to-end user experience at the interaction level
  • Analyze which agents contribute to slow interactions
  • Understand agent collaboration patterns across sessions
  • Identify friction points across the combined user and agent journey
  • Calculate costs and token usage per agent type

How to map technical execution to user journeys:

  • Conversation: The full customer relationship/session.
  • Interaction: The specific user intent or 'Job to be Done.'
  • Agent Session: The cost and effort required to fulfill that intent.
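
To make the hierarchy concrete, here is a minimal sketch of how you might generate and propagate these IDs through an orchestrator. The helper names and the uuid dependency are illustrative, not part of the spec; use whatever ID scheme your stack already provides.

JAVASCRIPT
// Illustrative sketch (not part of the spec): generating and propagating
// the three-level ID hierarchy so every tracked event shares the same context.
import { v4 as uuidv4 } from 'uuid';

function newConversation() {
  // One conversation_id per user conversation with the assistant
  return { conversation_id: `conv_${uuidv4()}` };
}

function newInteraction(conversation) {
  // One interaction_id per user question / final answer cycle
  return { ...conversation, interaction_id: `int_${uuidv4()}` };
}

function newAgentSession(interaction, agentName) {
  // One agent_session_id per agent execution within the interaction
  return {
    ...interaction,
    agent_session_id: `session_${uuidv4()}`,
    agent_name: agentName
  };
}

// Usage: build the context once, then pass it down through the orchestrator
// so every event carries the same conversation_id and interaction_id.
const conversation = newConversation();
const interaction = newInteraction(conversation);
const plannerSession = newAgentSession(interaction, 'planner');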

The multi-agent analytics spec

This specification provides comprehensive tracking for multi-agent systems at two levels:

  1. Standard events capture the complete user journey and outcomes, and
  2. Agent Performance events unlock granular visibility into unit economics and latency drivers. This allows you to align your data strategy with your optimization goals.

Standard events

NOTE: This spec extends the original spec for AI Product Analytics, which was aimed at tracking user-facing interactions only. The events from that spec are called “standard events” and must be included to ensure comprehensive tracking.

These events track the user-facing interaction and should be implemented by all teams tracking multi-agent systems.

| Event name | Description |
| --- | --- |
| ai_user_prompt_created | User submits a question to the agent system |
| ai_agent_response_received | Individual agent completes and returns its output |
| ai_interaction_completed | Final response delivered to user, marking end of the interaction cycle |
| ai_user_action | User interaction with the response (feedback, copy, share) |

1. ai_user_prompt_created (capture user input)

Tracks when a user submits a prompt to your multi-agent system.

When to track: Immediately after user submits their question

Implementation example:

JAVASCRIPT
// Using RudderStack JavaScript SDK
rudderanalytics.track('ai_user_prompt_created', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1', // New interaction starts
  prompt_text: 'Build a segment of high-value customers',
  input_method: 'text', // 'text', 'voice', 'button'
  ...
});

Key properties:

  • conversation_id: Required. Links all interactions in a conversation
  • interaction_id: Required. Unique identifier for this Q&A cycle
  • prompt_text: The user's input (consider privacy - see best practices)

2. ai_agent_response_received (capture individual agent output)

Tracks when an individual agent completes its task and returns output. Multiple agents may generate responses for a single user question.

When to track: Each time an agent completes its execution and returns results

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_agent_response_received', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_name: 'query_builder',
  response_text: 'Generated SQL query: SELECT * FROM customers WHERE...',
  response_status: 'success', // 'success', 'error', 'partial_success'
  duration_ms: 1850,
  steps_count: 3, // number of turns agent took (tool calls, thinking)
  model_used: 'gpt-4-turbo',
  cost: 0.018,
  tokens_used: {
    prompt_tokens: 150,
    completion_tokens: 95,
    cache_read: 245,
    cache_write: 12236
  },
  tools_used: ['sql_validator', 'warehouse_query']
});

Key properties:

  • conversation_id, interaction_id: Links this agent response to the user interaction
  • agent_name: Identifies which agent produced this output
  • response_text: Output from this agent
  • response_status: Whether this agent succeeded or failed
  • duration_ms: How long this specific agent took
  • model_used: The LLM model used by this agent
  • cost, tokens_used: This agent's resource consumption
  • tools_used: List of tools this agent called

3. ai_interaction_completed (capture final response and summary)

Captures the final synthesized response delivered to the user, along with aggregated metrics across all agents involved.

When to track: After the orchestrator delivers the final response to the user

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_interaction_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  status: 'success', // 'success', 'error', 'timeout', 'user_cancelled'
  final_response_text: 'I\'ve created a segment with 1,247 high-value customers...',
  agents_used: ['planner', 'query_builder', 'validator'],
  duration_ms: 4250,
  agents_count: 3,
  steps_count: 8, // Total steps across all agents
  total_cost: 0.045,
  tokens_used: {
    prompt_tokens: 450,
    completion_tokens: 380,
    cache_read: 245,
    cache_write: 12236
  }
});

Key properties:

  • final_response_text: The synthesized response shown to the user
  • status: Overall success/failure of the interaction
  • total_* metrics: Aggregated across all agents (duration, cost, tokens, steps)
  • agents_used: List of agents that contributed to this response
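
One way to produce these aggregate fields is to collect each agent's results in the orchestrator and roll them up when the final response goes out. The following is a hedged sketch, assuming your orchestrator keeps an in-memory array of per-agent results shaped like the ai_agent_response_received payload:

JAVASCRIPT
// Sketch: rolling per-agent results up into ai_interaction_completed.
// `agentResults` is an assumed in-memory structure; each entry mirrors
// what was sent in ai_agent_response_received for that agent.
function trackInteractionCompleted(ctx, agentResults, finalResponseText, status) {
  const totals = agentResults.reduce(
    (acc, r) => ({
      duration_ms: acc.duration_ms + r.duration_ms,
      total_cost: acc.total_cost + r.cost,
      steps_count: acc.steps_count + r.steps_count,
      prompt_tokens: acc.prompt_tokens + r.tokens_used.prompt_tokens,
      completion_tokens: acc.completion_tokens + r.tokens_used.completion_tokens
    }),
    { duration_ms: 0, total_cost: 0, steps_count: 0, prompt_tokens: 0, completion_tokens: 0 }
  );

  rudderanalytics.track('ai_interaction_completed', {
    conversation_id: ctx.conversation_id,
    interaction_id: ctx.interaction_id,
    status,
    final_response_text: finalResponseText,
    agents_used: agentResults.map(r => r.agent_name),
    agents_count: agentResults.length,
    duration_ms: totals.duration_ms, // use wall-clock time instead if agents run in parallel
    steps_count: totals.steps_count,
    total_cost: totals.total_cost,
    tokens_used: {
      prompt_tokens: totals.prompt_tokens,
      completion_tokens: totals.completion_tokens
    }
  });
}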

4. ai_user_action (capture user feedback)

Tracks user interactions with the agent response—feedback, sharing, copying, etc.

When to track: When user provides any action on the response

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_user_action', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  action_type: 'feedback_given', // 'feedback_given', 'copied', 'shared', 'regenerated'
  action_details: {
    feedback_type: 'rating',
    feedback_value: 5,
    feedback_text: 'Perfect! Exactly what I needed'
  }
});

Agent Performance & Optimization Events

These events capture the 'effort' behind the answer. They allow you to attribute costs to specific steps and identify exactly where latency is hurting user retention.

| Event name | Description |
| --- | --- |
| ai_agent_session_started | Agent begins execution |
| ai_agent_session_completed | Individual agent completes and returns its output |
| ai_agent_step_completed | Individual operation within an agent (LLM call, tool call, etc.) |

5. ai_agent_session_started (track agent initialization)

Tracks when an individual agent starts its execution within an interaction.

When to track: When the orchestrator delegates work to a specific agent

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_agent_session_started', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2', // Unique for this agent execution
  agent_name: 'query_builder',
  prompt_text: 'Generate SQL for high-value customer segment with lookback_days: 30',
  model_used: 'gpt-4-turbo'
});

Key properties:

  • agent_session_id: Unique identifier for this agent's execution
  • agent_name: Identifies which agent is starting
  • prompt_text: Input given to this agent (consider privacy)
  • model_used: The LLM model this agent will use

6. ai_agent_session_completed (track agent results)

Tracks when an agent completes its execution, successfully or with errors.

When to track: When agent finishes processing and returns results

Implementation example:

JAVASCRIPT
rudderanalytics.track('ai_agent_session_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2',
  agent_name: 'query_builder',
  status: 'success', // 'success', 'error', 'timeout'
  response_text: 'Generated and validated SQL query: SELECT * FROM...',
  duration_ms: 1850,
  steps_count: 3, // Number of steps this agent executed
  model_used: 'gpt-4-turbo',
  tools_used: ['sql_validator', 'warehouse_query'],
  cost: 0.018,
  tokens_used: {
    prompt_tokens: 150,
    completion_tokens: 95,
    cache_read: 245,
    cache_write: 12236
  }
});

Key properties:

  • agent_session_id: Links to the session started event
  • response_text: Output produced by this agent
  • status: Success or failure state
  • duration_ms: How long this agent took
  • steps_count: Number of operations this agent executed
  • model_used: The LLM model used by this agent
  • tools_used: List of tools this agent called
  • cost, tokens_used: Per-agent resource consumption
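
To keep the started/completed pair consistent, including on failures, you can wrap agent execution in a small helper that emits both events. This is a sketch under assumptions: the agent object exposing name, model, and a run() method that returns output, step count, tools, tokens, and cost is hypothetical, not a real API.

JAVASCRIPT
// Sketch: emitting ai_agent_session_started / ai_agent_session_completed around
// an agent run. The agent.run() interface below is an assumption.
async function runAgentWithTracking(ctx, agent, promptText) {
  const agentSessionId = `session_${crypto.randomUUID()}`;

  rudderanalytics.track('ai_agent_session_started', {
    conversation_id: ctx.conversation_id,
    interaction_id: ctx.interaction_id,
    agent_session_id: agentSessionId,
    agent_name: agent.name,
    prompt_text: promptText,
    model_used: agent.model
  });

  const startedAt = Date.now();
  try {
    const result = await agent.run(promptText);
    rudderanalytics.track('ai_agent_session_completed', {
      conversation_id: ctx.conversation_id,
      interaction_id: ctx.interaction_id,
      agent_session_id: agentSessionId,
      agent_name: agent.name,
      status: 'success',
      response_text: result.responseText,
      duration_ms: Date.now() - startedAt,
      steps_count: result.stepsCount,
      model_used: agent.model,
      tools_used: result.toolsUsed,
      cost: result.cost,
      tokens_used: result.tokensUsed
    });
    return result;
  } catch (err) {
    // Record the failure so error rates and durations stay attributable per agent
    rudderanalytics.track('ai_agent_session_completed', {
      conversation_id: ctx.conversation_id,
      interaction_id: ctx.interaction_id,
      agent_session_id: agentSessionId,
      agent_name: agent.name,
      status: 'error',
      duration_ms: Date.now() - startedAt,
      model_used: agent.model
    });
    throw err;
  }
}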

7. ai_agent_step_completed (track individual operations)

Tracks individual operations within an agent session—LLM calls, tool calls, planning, validation, etc.

When to track: After each discrete operation completes within an agent

Implementation examples:

Example A: LLM call step

JAVASCRIPT
rudderanalytics.track('ai_agent_step_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2',
  step_type: 'llm_call',
  // LLM-specific properties
  model_name: 'gpt-4-turbo',
  prompt_tokens: 150,
  completion_tokens: 95,
  total_tokens: 245,
  response_text: 'Generated SQL query structure',
  // Common properties
  status: 'success', // 'success', 'error', 'timeout'
  duration_ms: 1200
});

Example B: Tool call step

JAVASCRIPT
rudderanalytics.track('ai_agent_step_completed', {
  conversation_id: 'conv_abc123',
  interaction_id: 'int_1',
  agent_session_id: 'session_2',
  step_type: 'tool_call',
  // Tool-specific properties
  tool_name: 'warehouse_query',
  tool_input: {
    query: 'SELECT * FROM customers WHERE value > 1000'
  },
  tool_output: {
    rows_returned: 1247,
    execution_time_ms: 120
  },
  // Common properties
  status: 'success', // 'success', 'error', 'timeout'
  duration_ms: 650
});

Key properties:

  • step_type: 'llm_call', 'tool_call'
  • Type-specific properties: Different fields based on step_type
  • Always include: status, duration_ms

Why step-level tracking matters: This is where you unlock unit economics. You might discover that 80% of your interaction cost comes from a single 'Research' step that rarely changes the final outcome—an immediate opportunity to switch models and improve margins without hurting the user experience.
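
As a rough illustration of that analysis, once step events land in your warehouse or analytics tool, a simple aggregation by step surfaces where latency and tokens concentrate. The stepEvents array below is a stand-in for whatever rows you pull downstream; it is not a RudderStack API.

JAVASCRIPT
// Sketch: finding the dominant step from exported ai_agent_step_completed events.
// `stepEvents` is a stand-in for rows queried from your warehouse or analytics tool.
function summarizeSteps(stepEvents) {
  const byStep = {};
  for (const e of stepEvents) {
    const key = e.step_type === 'tool_call' ? `tool:${e.tool_name}` : `llm:${e.model_name}`;
    byStep[key] = byStep[key] || { count: 0, duration_ms: 0, tokens: 0 };
    byStep[key].count += 1;
    byStep[key].duration_ms += e.duration_ms;
    byStep[key].tokens += e.total_tokens || 0;
  }
  // Sort by total latency so the biggest contributor surfaces first
  return Object.entries(byStep).sort((a, b) => b[1].duration_ms - a[1].duration_ms);
}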

Your framework for multi-agent analytics

Multi-agent AI systems represent a paradigm shift from linear AI interactions to orchestrated workflows. Without proper analytics, you're operating blind: unable to optimize performance, control costs, or prove value.

This guide introduced a standardized event schema that extends RudderStack's AI Product Analytics Spec to multi-agent architectures:

Standard events (ai_user_prompt_created, ai_agent_response_received, ai_interaction_completed, ai_user_action) give you fundamental metrics on system performance and user satisfaction.

Advanced events (ai_agent_session_started, ai_agent_session_completed, ai_agent_step_completed) provide granular visibility into orchestration patterns, agent performance, and step-by-step execution.

The three-level ID hierarchy (conversation_id, interaction_id, agent_session_id) connects all events in a workflow, enabling complete traceability from user question to agent execution to business outcome.

With this framework, you can:

  • Identify which agents are bottlenecks and optimize their performance
  • Choose appropriate models per agent to reduce costs by 30-50%
  • Track token consumption and cost at the agent level for precise budget control
  • Understand orchestration patterns and identify opportunities for parallelization
  • Identify cost efficiencies with complete visibility into agent sessions and steps
  • Connect agent performance to business outcomes and user satisfaction

Implementation approach: Start with Standard events to understand overall performance. Add Advanced events when you need to optimize costs or debug reliability issues. Use step-level tracking for agents where you need granular visibility into LLM versus tool call performance.

Leverage RudderStack's data infrastructure, including SDKs for event collection, Transformations for privacy-safe data processing, warehouse destinations for SQL analysis, and integrations with your existing analytics stack.
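
For example, privacy-safe processing can happen in a Transformation before events reach downstream destinations. The following is a minimal sketch, assuming the standard transformEvent signature for JavaScript Transformations; the email-redaction rule is illustrative only, so extend it with your own PII handling.

JAVASCRIPT
// Sketch of a Transformation that redacts raw prompt and response text before
// events are forwarded downstream. The regex below is illustrative only.
export function transformEvent(event, metadata) {
  const sensitiveFields = ['prompt_text', 'response_text', 'final_response_text'];
  for (const field of sensitiveFields) {
    if (event.properties && typeof event.properties[field] === 'string') {
      // Replace email addresses; add your own redaction rules as needed
      event.properties[field] = event.properties[field].replace(
        /[\w.+-]+@[\w-]+\.[\w.]+/g,
        '<redacted_email>'
      );
    }
  }
  return event;
}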

This guide is part of RudderStack's early work on AI product analytics. Share your use cases, challenges, and ideas to help shape the future of AI analytics.

Want to see how RudderStack can power your AI analytics? Book a demo to explore Transformations, warehouse integrations, and multi-agent tracking in action.

FAQs

  • Multi-agent AI analytics measures how orchestrated agent workflows behave end-to-end, including agent handoffs, parallel execution, tool calls, per-agent latency, and per-agent cost, not just the user prompt and final answer.

  • Single-agent analytics typically captures prompt, response, and overall latency. Multi-agent systems introduce routing decisions, multiple model calls, tool dependencies, and synthesis steps. Without agent-level visibility, you cannot locate bottlenecks or cost drivers.

  • A practical hierarchy is:

    • conversation_id for the full user conversation

    • interaction_id for one user question plus final response

    • agent_session_id for each agent’s execution within that interaction
      This structure makes it easy to attribute cost, latency, and failures to specific agents.

  • Start with user-facing “standard” events like ai_user_prompt_created, ai_interaction_completed, and ai_user_action. Then add agent performance events like ai_agent_session_started, ai_agent_session_completed, and ai_agent_step_completed to understand orchestration and unit economics.

  • Track tokens and cost at the agent and step level, then roll up totals per interaction. This lets you compute metrics like cost per successful interaction, cost per agent type, and the marginal cost of additional steps or tools.

  • At minimum: step_type (llm_call vs tool_call), status, duration_ms, and step-specific fields (model name and token counts for LLM calls; tool name, inputs/outputs, and error info for tool calls). This pinpoints whether latency is compute-driven or tool-driven.

  • Once you can see cost and latency by agent and step, you can right-size models per agent (cheap models for formatting/validation, frontier models for planning), remove redundant steps, and parallelize work where safe.

  • Treat prompt_text and tool inputs as sensitive by default. Use redaction, hashing, or intent classification before sending to downstream tools, and store raw text only where you have explicit consent and strong access controls.


Start delivering business value faster

Implement RudderStack and start driving measurable business results in less than 90 days.
