Scaling data products starts with fixing the foundation: Five lessons we’ve learned

TL;DR
- Most scaling failures come from weak data foundations, not fancy models.
- Treat pipelines like products: owners, backlogs, SLOs, and versioned changes.
- Put one accountable data owner over a governed, cloud-first pipeline.
- Standardize “the plumbing”: tracking plans, schema enforcement, transforms-as-code, drift detection, replay.
- Design backward from decisions; route one high-quality stream to every tool.
- Keep humans in the loop with observability and explainability.
- Result: faster iteration, lower toil, and data products that actually scale.
What a solid data foundation looks like in 2025 and beyond
AI-driven features, real-time engagement, and cost pressure have raised the bar: Data products don’t fail because the model is bad. They fail because the plumbing is brittle.
McKinsey said as much earlier this year: Teams chase dashboards before fixing access, trust, and reusability. We agree, and we’ll go a step further to define what “foundation” means in practice. It starts with standardized event schemas and tracking plans, schema enforcement at ingestion to prevent drift, and transformations as code (versioned, testable, reviewable in Git).
Add cloud-first identity resolution and profiles built in your data cloud so every team works from the same customer view. Make pipelines observable end-to-end with lineage, delivery health, drift detection, and dead-letter + replay, so incidents are diagnosable and reversible.
Finally, deliver one governed stream to every destination (analytics, activation, reverse ETL, and ML) with clear SLOs for latency and delivery success. McKinsey’s five lessons map cleanly onto this reality. In the sections that follow, we’ll show how RudderStack customers made these fundamentals the default, so treating data like a product, assigning ownership, and scaling reliably became easier, faster, and far less risky.
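To make “schema enforcement at ingestion” concrete, here’s a minimal sketch of validating an incoming event against a tracking plan before it is accepted. The event name, property types, and validation rules are illustrative assumptions, not RudderStack’s actual validation engine:

```python
# Minimal sketch of schema enforcement at ingestion. The tracking plan,
# event name, and property types are illustrative assumptions.
TRACKING_PLAN = {
    "Order Completed": {
        "required": {"order_id": str, "revenue": float},
        "optional": {"coupon": str},
    }
}

def validate_event(event: dict) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    plan = TRACKING_PLAN.get(event.get("event"))
    if plan is None:
        return [f"unplanned event: {event.get('event')!r}"]
    violations = []
    props = event.get("properties", {})
    for name, expected in plan["required"].items():
        if name not in props:
            violations.append(f"missing required property: {name}")
        elif not isinstance(props[name], expected):
            violations.append(f"{name} should be {expected.__name__}")
    for name, expected in plan["optional"].items():
        if name in props and not isinstance(props[name], expected):
            violations.append(f"{name} should be {expected.__name__}")
    return violations
```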
Lesson 1: Treat your pipeline like a product
McKinsey’s recommendation to treat data assets like products, with owners, roadmaps, and lifecycle management, is spot on. But it’s hard to do that when inputs are scattered across custom scripts, brittle integrations, and one-off ETL jobs. RudderStack helps data teams turn pipelines into reusable infrastructure, so changes are intentional, reviewable, and reversible, instead of heroic one-offs.
A good example is Joybird, a company that makes customizable, handcrafted furniture with an emphasis on eco-friendly materials and sustainability. Joybird retooled its customer data stack using RudderStack for real-time event collection and routing, Snowflake as the data cloud, and Iterable for email automation. Standardizing on RudderStack turned what used to be weeks of engineering work into minutes. The result is a pipeline that behaves like a product: consistent, observable, and easy to iterate on.
Bonus: Their data engineering team reduced time spent building integrations and managing pipelines by 93%.
Treat your pipeline the way a product team treats a user-facing app. Assign a single owner who maintains a living backlog (e.g., new events, identity rules, destination requests, and fixes) and publish service level objectives so everyone knows what “good” looks like.
Useful SLOs include p95 event-to-destination latency (for example, under five seconds), delivery success rate (for example, above 99.5%), schema drift rate (trending toward zero), and time to add a new destination (measured in hours, not weeks).
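If you want a starting point for measuring the first two SLOs, here’s a small sketch that computes them from delivery records. The record fields (sent_at, delivered_at, delivered) are hypothetical; map them to whatever your delivery logs actually expose:

```python
# Sketch of computing two SLOs from delivery records. The fields sent_at and
# delivered_at are assumed to be datetimes; delivered is assumed to be a bool.
from statistics import quantiles

def slo_report(records: list[dict]) -> dict:
    latencies = [
        (r["delivered_at"] - r["sent_at"]).total_seconds()
        for r in records
        if r["delivered"]
    ]
    delivered = sum(1 for r in records if r["delivered"])
    return {
        # p95 event-to-destination latency in seconds (target: under 5s)
        "p95_latency_s": quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else None,
        # delivery success rate (target: above 99.5%)
        "delivery_success_rate": delivered / len(records) if records else None,
    }
```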
Run every change—tracking plan updates, transformations, and mappings—through Git pull requests (PRs) with code review and CI checks. Cut versioned releases of the tracking plan and transformations, and ship release notes so downstream teams can plan safely. When you need to retire fields or events, set a clear deprecation window and provide a migration path to avoid breaking models and campaigns.
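As one example of a CI check on tracking-plan PRs, a small pytest-style test can lint the plan file on every pull request. The tracking_plan.json file name and its structure are assumptions for illustration; point the test at wherever your team actually stores its plan:

```python
# Sketch of a CI check for tracking-plan PRs (pytest-style). The file name
# tracking_plan.json and its structure are illustrative assumptions.
import json

REQUIRED_KEYS = {"description", "owner", "properties"}
ALLOWED_TYPES = {"string", "number", "boolean", "object", "array"}

def test_tracking_plan_is_well_formed():
    with open("tracking_plan.json") as f:
        plan = json.load(f)
    for event_name, spec in plan.items():
        missing = REQUIRED_KEYS - spec.keys()
        assert not missing, f"{event_name} is missing {sorted(missing)}"
        for prop_name, prop in spec["properties"].items():
            assert prop.get("type") in ALLOWED_TYPES, (
                f"{event_name}.{prop_name} has an invalid or missing type"
            )
```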
If you’re starting from ad-hoc scripts, copy this approach in one sprint:
- Publish a tracking plan for your top 10 events and enforce it at ingestion
- Move transformations into code (JavaScript, Python, or SQL) under version control
- Add CI checks for schema validation, types, and required properties
- Stand up a dead-letter queue with replay so bad events don’t silently fail (see the sketch after this list)
- Adopt three SLOs—latency, delivery success, and drift—and review them weekly
- Begin lightweight release notes tied to tagged versions in Git
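Here’s what the dead-letter queue and replay from the checklist above can look like in miniature. Storage is an in-memory list purely for illustration; in practice you would persist failed events to object storage, a queue, or a warehouse table, and the validate function stands in for whatever schema check runs at ingestion:

```python
# Sketch of a dead-letter queue with replay. In-memory storage and the
# validate/deliver callables are illustrative stand-ins.
from typing import Callable

dead_letter_queue: list[dict] = []

def ingest(event: dict,
           validate: Callable[[dict], list[str]],
           deliver: Callable[[dict], None]) -> bool:
    violations = validate(event)
    if violations:
        # Park the bad event with its reasons instead of dropping it silently
        dead_letter_queue.append({"event": event, "violations": violations})
        return False
    deliver(event)
    return True

def replay_dead_letters(validate: Callable[[dict], list[str]],
                        deliver: Callable[[dict], None]) -> int:
    """Re-attempt delivery after the schema or the producer has been fixed."""
    replayed = 0
    for item in list(dead_letter_queue):
        if not validate(item["event"]):
            deliver(item["event"])
            dead_letter_queue.remove(item)
            replayed += 1
    return replayed
```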
To keep changes disciplined, use a simple acceptance template whenever you add or modify an event (a filled-in example follows the list), including:
- Why the event exists and what decision it enables
- The owner who approves future changes
- The schema: event name, properties with types, required/optional status, examples, and any PII flags
- Validation behavior on mismatch
- Routing targets (analytics, engagement, warehouse, ML)
- Expected impact on throughput/latency and which monitors you’ll add
- Explicit rollback steps if the change needs to be reverted
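For illustration, here’s a hypothetical filled-in acceptance entry, expressed as a dict so it can live next to the tracking plan and be linted in CI. Every name and value below is made up:

```python
# Hypothetical acceptance entry for a new event; all values are illustrative.
EVENT_ACCEPTANCE = {
    "event": "Trial Started",
    "decision_enabled": "Which acquisition channels produce activated trials",
    "owner": "growth-data@company.example",
    "schema": {
        "plan_id": {"type": "string", "required": True, "pii": False},
        "source": {"type": "string", "required": False, "example": "pricing_page"},
    },
    "on_mismatch": "route to dead-letter queue and alert the data-quality channel",
    "routing": ["analytics", "warehouse", "engagement"],
    "expected_impact": "~2k events/day; add a delivery-success monitor",
    "rollback": "revert the tracking-plan PR and replay affected events",
}
```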
Avoid the classic anti-patterns: unreviewed schema changes pushed straight from application code; “one-off” hotfix scripts that bypass the pipeline; multiple spreadsheets with no single source of truth; and identity rules scattered across services instead of centralized and versioned.
Finally, measure it like a product. Track p95 latency, delivery success rate, schema drift rate, identity match rate, time to add a destination, and mean time to recover from pipeline incidents.
When these metrics improve and stay there, you’ve moved from a fragile set of jobs to a durable product the whole company can trust.
Lesson 2: Put one owner over your customer data
Data product success requires clear ownership. But that owner can only be effective if they have access to clean, complete, and timely data.
RudderStack customers often centralize customer data into their warehouse (e.g., Snowflake) and build their Customer 360 from there. For example, Shippit, Australia’s leading multi-carrier shipping software, used RudderStack to finally unify fragmented event streams and solve a four-year attribution nightmare. Their data lead became the owner of a trusted pipeline that serves all teams, from marketing to finance, without conflicting versions of the truth.
As Nitt Chuenprateep, Shippit’s Head of Data and Analytics, told us: “Finally, there’s a way for us to bring all of the data together, to marry revenue data from the application with activity data, and to actually govern it from one central source. It’s something we’ve tried to have for years.”
Shippit’s data team envisioned a world where “you are all able to look at everything together,” as Nitt told us. “And that’s what RudderStack Profiles delivered.”
Lesson 3: Standardize the plumbing with tracking plans and transformations
McKinsey notes that without consistent data architecture, teams duplicate effort and struggle to scale.
This is where RudderStack’s Event Stream and transformation engine helps most. Companies like Kajabi moved from Segment to RudderStack precisely because their old stack couldn’t support consistent schemas or easy debugging. With RudderStack, they get version-controlled transformations, schema validation, and a shared foundation that everyone can build on.
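To make “version-controlled transformations” concrete, here’s a small sketch in Python. The transformEvent(event, metadata) entry point follows RudderStack’s Python transformation convention; the property renaming and event filtering are illustrative assumptions, not Kajabi’s actual logic:

```python
# Minimal transformation-as-code sketch. Returning None drops the event;
# the property names and the "Debug Ping" filter are illustrative only.
def transformEvent(event, metadata):
    props = event.get("properties", {})

    # Normalize a legacy property name so downstream schemas stay consistent
    if "order_value" in props and "revenue" not in props:
        props["revenue"] = props.pop("order_value")

    # Filter out events that downstream tools should never see
    if event.get("event") == "Debug Ping":
        return None

    event["properties"] = props
    return event
```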
Lesson 4: Design backward from decisions, then route everywhere
Data teams often build pipelines before knowing how the data will be used. RudderStack flips that by letting teams route the same high-quality stream to multiple destinations: analytics, activation, reverse ETL, even machine learning models.
This flexibility means you can iterate fast. If marketing needs a new behavioral signal in Braze, or the ML team wants more granular product events, the data is already flowing. You just route it.
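Conceptually, this is “instrument once, route by configuration”: the application emits one well-formed event, and adding a destination is a config change rather than new instrumentation. The sketch below is a simplified stand-in for that idea, with hypothetical destination names, not RudderStack’s actual connector code:

```python
# Conceptual fan-out: one event, many destinations, driven by configuration.
# Destination names and sender callables are hypothetical stand-ins.
from typing import Callable

ROUTES = {
    "Order Completed": ["analytics", "braze", "warehouse", "ml_feature_store"],
}

def route(event: dict, senders: dict[str, Callable[[dict], None]]) -> None:
    for destination in ROUTES.get(event["event"], []):
        senders[destination](event)  # same payload, fanned out per config
```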
Lesson 5: Keep humans in the loop with observability
Scaling data products requires not just automation, but trust. Teams need to know how the data was collected, transformed, and delivered.
RudderStack helps here by making pipelines observable and transparent. Data owners can trace every event, inspect payloads, and debug transformations without writing custom logging logic. In regulated industries or large enterprises, that level of visibility isn’t a nice-to-have. It’s required.
Putting it all together: Fix the data layer first
Every McKinsey recommendation assumes one thing: that your data foundation is solid. Without trustworthy pipelines, product thinking and ownership structures fall apart.
That’s why RudderStack focuses on helping data teams build reliable, real-time infrastructure that they control. Because when the foundation is right, scaling the rest gets a lot easier.
Explore how RudderStack can help you standardize your data infrastructure. Book a demo
FAQs
What does “fix the data foundation” actually mean?
Standardize event schemas, enforce them at ingestion, centralize identity resolution, and make all pipeline changes version-controlled with drift detection and replay—before you invest in new models or tools.
How do I choose a single data owner without creating a bottleneck?
Assign one accountable owner for governance and quality, with clear SLAs and a RACI. Keep implementation decentralized via PRs to tracking plans and transformations.
Learn more about eliminating data bottlenecks with RudderStack
What’s the fastest way to standardize our “plumbing”?
Publish a tracking plan, turn on schema validation at the edge, move transformations into code (JavaScript/Python/SQL), and add a dead-letter queue + replay. Most teams can pilot this in a single product area first.
What metrics show we’re ready to scale data products?
p95 event-to-destination latency, delivery success rate, schema drift rate, identity match rate, and time-to-add a new destination or property.
Can we do this if we’re mid-migration from Segment or Snowplow?
Yes. Start by mirroring events into a governed tracking plan, run transformations side-by-side, and cut over destination by destination to reduce risk.
Learn more about migrating to RudderStack