One context layer, or many?

Soumyadeb Mitra
Founder and CEO of RudderStack
7 min read
May 12, 2026

The case for distributed metadata in the agentic era, and the hard parts that come with it.
Context is having a moment. And for good reason.
AI can only reason about data when it understands what the data means. A column called order_total is just a number until something tells the model whether it's gross or net, dollars or cents, pre-refund or final. An event called login is just a name until something explains where it fires, which app sends it, whether it's still firing, and which downstream systems consume it.
That something is context.
Everyone in the data stack has noticed. Data pipelines (RudderStack), modeling layers (dbt), warehouses (Snowflake, Databricks), BI tools (Hex), catalogs (DataHub)—every one of these is angling to be the context layer for the AI era. The land grab makes sense. Whoever owns the context layer becomes structurally critical to every AI workload that touches enterprise data.
Which raises the obvious question: should there be one context layer, or many?
The instinct, for most of us who grew up in the BI era, is to say one. Centralize the metadata. Single source of truth. One place to look. I think that instinct is wrong for the agentic era, but the case for it is stronger than its critics admit, and worth engaging with honestly.
What context actually means in the modern data stack
Context is the metadata that gives data meaning. But it goes deeper than the static field-level definitions most catalogs capture.
Take a behavioral event called login. The name is descriptive but it doesn't paint the full picture:
- Where is it fired from? Which app, which screen, which SDK?
- Is this all logins across all properties, or are there siblings (web_login, mobile_login, sso_login) you'd need to union?
- When did it first fire? Is it still firing? Are there gaps where instrumentation broke?
- Which destinations receive it (ad platforms, CRM, marketing automation) and what is it called once it lands there?
- Has the payload schema drifted over time?
This is the context an analyst (or an agent) needs to actually use the event without making a fool of themselves. And almost none of it lives in a catalog row or a column comment.
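To make that concrete, here's roughly what a complete context record for login would contain. This is a sketch; the shape and field names are illustrative, not any tool's actual schema:

```python
# Illustrative shape of a full context record for one behavioral event.
# Field names are hypothetical; no catalog or pipeline exposes exactly this.
login_context = {
    "event": "login",
    "sources": [
        {"app": "web", "sdk": "rudder-sdk-js", "screen": "/signin"},
        {"app": "ios", "sdk": "rudder-sdk-ios", "screen": "LoginViewController"},
    ],
    "siblings": ["web_login", "mobile_login", "sso_login"],  # union these for "all logins"
    "first_seen": "2024-03-02",
    "last_seen": "2026-05-11",                # still firing
    "gaps": [("2025-01-14", "2025-01-16")],   # instrumentation outage
    "destinations": {
        "google_ads": "sign_in",              # renamed once it lands
        "salesforce": "Login__c",
        "braze": "login",
    },
    "schema_versions": 3,                     # payload has drifted twice
}
```

Almost every field here comes from a different stage of the event's life, which is exactly why no single catalog row holds all of it.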
Where behavioral event context lives (and why it matters)
For behavioral event data, the context lives at the source: in the SDKs, the pipeline, the transformations, and the destination mappings. And it's continuously changing.
RudderStack happens to sit on exactly that trace. If a customer installs our GitHub app, we can see the event from the line of code that generated it, through every transformation, to every destination it lands in. We can answer the questions above, and a lot more, directly. Connect Claude to our MCP and ask.
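Mechanically, that looks like any other MCP client session. Here's a minimal sketch using the MCP Python SDK; the server launch command and the tool name are placeholders, not RudderStack's actual MCP surface:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Hypothetical launch command; substitute the real MCP server here.
    server = StdioServerParameters(command="npx", args=["-y", "rudderstack-mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # "get_event_provenance" is an illustrative tool name, not a real one.
            result = await session.call_tool(
                "get_event_provenance", {"event": "login"}
            )
            print(result.content)

asyncio.run(main())
```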
No other tool has this view. Not the warehouse, not the modeling layer, not the BI tool. They see the event after it has landed and been flattened. The provenance, the firing patterns, the destination semantics: that information is upstream of them by design.
You can copy this metadata into a central catalog. People have been doing it for a decade. The copy is always lossy and always lagging, because the source of truth keeps moving. The question isn't whether centralization is convenient; it obviously is. The question is whether the copy can ever be as good as querying the source. For provenance, schema drift, and destination behavior, the answer is no.
Why data catalog centralization was built for humans, not AI agents
This is the part worth sitting with.
The reason we built unified catalogs, single semantic layers, and one-throat-to-choke metadata stores is that humans are terrible at stitching information across multiple tools. Toggling between five UIs to answer one question is a productivity disaster. So we centralized, accepting staleness as the cost of colocation.
Agents do not have this limitation in the same way.
An agent can query the pipeline for event provenance, the warehouse for storage stats, the modeling tool for transformation lineage, and the BI tool for usage patterns, all in parallel, in seconds. The agent is the integration layer. The tools are the specialists.
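Here's a sketch of that fan-out, with hypothetical stand-ins for each system's metadata endpoint:

```python
import asyncio

# Stand-ins for tool calls against each specialist system (all hypothetical).
async def pipeline_provenance(event: str) -> dict:
    return {"source": "pipeline", "first_fired": "2024-03-02"}

async def warehouse_stats(event: str) -> dict:
    return {"source": "warehouse", "row_count": 12_400_311}

async def modeling_lineage(event: str) -> dict:
    return {"source": "dbt", "models": ["stg_logins", "fct_sessions"]}

async def bi_usage(event: str) -> dict:
    return {"source": "bi", "dashboards": ["Activation", "Retention"]}

async def gather_context(event: str) -> list[dict]:
    # One turn, four systems, in parallel: the agent is the integration layer.
    return await asyncio.gather(
        pipeline_provenance(event),
        warehouse_stats(event),
        modeling_lineage(event),
        bi_usage(event),
    )

print(asyncio.run(gather_context("login")))
```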
This flips the architectural logic. In a world where humans consumed metadata, centralization made sense even at the cost of freshness. In a world where agents consume metadata, distribution starts to win, because freshness and depth matter more than colocation.
What the centralization advocates get right
I want to take the opposite view seriously, because it has real points.
The case for a centralized context layer in the agentic era isn't actually about humans. It's about three things agents are still bad at:
Reconciliation. When the pipeline says an event has been firing since March and the warehouse says the earliest row is from July, who wins? Agents handle parallel fetching well. They handle conflicting answers from authoritative-looking sources poorly. A centralized layer with explicit reconciliation logic is, today, better at this than a model reasoning over raw responses.
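A centralized layer wins here because it can encode the precedence rule explicitly instead of leaving it to the model. A sketch of what that looks like; the rule itself is illustrative:

```python
from datetime import date

def reconcile_first_seen(pipeline_date: date, warehouse_date: date) -> tuple[date, str]:
    """Pick a first-seen date when sources disagree, and say why.

    Illustrative rule: the pipeline saw the event fire, so it wins on
    existence. A later warehouse date usually means early rows were
    dropped or the load started late.
    """
    if warehouse_date > pipeline_date:
        return pipeline_date, "pipeline wins: warehouse likely missed early rows"
    return warehouse_date, "warehouse has rows the pipeline never reported: investigate"

answer, reason = reconcile_first_seen(date(2026, 3, 1), date(2026, 7, 9))
```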
Governance. Access control, audit trails, lineage policy, PII tagging: these benefit from a single chokepoint, regardless of who the consumer is. Distributing them across N source systems means enforcing them N times, inconsistently.
Latency and availability. Hitting four source systems for every agent turn is slow and fragile. Caching the basics (the pipeline is up, the warehouse hasn't lost its mind) gives you faster, more reliable answers.
These are not small concerns. Anyone who's operated a federated query layer at scale knows the cost is real.
The hybrid data architecture for the agentic era
So the honest version of my thesis is narrower than "distribute everything."
For meaning (what does this event represent, where did it come from, how is it being used), go to the source. The pipeline owns event semantics. dbt owns transformation lineage. Hex owns analytical usage. Copy it into a catalog and you get something that looks right and is increasingly wrong.
For policy (access control, governance, audit, schema contracts), centralization still wins. These are inherently cross-cutting concerns and they benefit from a single enforcement point.
For performance (caching, denormalization, materialization), do it selectively, with TTLs short enough that the cache doesn't quietly become its own source of truth.
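In code, that discipline is small but non-negotiable: every cached answer expires quickly enough to force a trip back to the source. A minimal sketch:

```python
import time

class TTLCache:
    """Cache metadata answers briefly; the source system stays authoritative."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str, fetch):
        entry = self._store.get(key)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # fresh enough: serve the copy
        value = fetch()      # stale or missing: go back to the source
        self._store[key] = (time.monotonic(), value)
        return value

cache = TTLCache(ttl_seconds=60)
is_up = cache.get("pipeline_up", lambda: True)  # stand-in health check
```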
This is closer to how mature federated systems work generally: distributed authority, centralized policy, caching as an optimization rather than an architecture.
Best-of-breed context at the source: A new data stack principle
The part of the stack that processes a category of information holds the freshest, deepest context about that information. dbt for transformations. Snowflake and Databricks for storage and query patterns. Hex for analytical usage. RudderStack for behavioral events. Each is canonical for its own slice. Each is stale and lossy when it tries to be canonical for someone else's slice.
The right architecture for the agentic era is not one context layer to rule them all, and it's not pure federation either. It's best-of-breed context at the sources, exposed via MCP or whatever the equivalent protocol turns out to be, with a thin centralized layer for governance, and agents doing the stitching for everything else.
The data stack doesn't need to consolidate for AI. It needs to expose itself well, and be honest about which parts of the old centralization argument still apply.