Data, data everywhere, and not a drop to drink

The world is grappling with how to fully harness the customer data explosion of the last few decades. At RudderStack, we have engaged with hundreds of companies across industries and sophistication levels, and we have recognized a trend: companies across the spectrum still struggle to apply their data to inform decision-making, build better products, and ultimately delight customers.

In this write-up, we describe the components of the modern data stack at each stage of the data maturity journey. Companies at the early stages of their data journey typically deploy a simple architecture focused on data collection and activation. On the other end of the spectrum, the most sophisticated B2C companies leverage machine learning to deliver real-time user experiences.

Here, we provide a simple framework to help you identify the right data infrastructure components for your company based on complexity of needs, level of data sophistication, maturity of existing solutions, and budget. Our framework focuses on customer data, as it's arguably the most valuable data a company possesses, and frames the data stacks along four phases:

  • Starter
  • Growth
  • Machine Learning
  • Real-Time

diagram of the data maturity journey showing each stage: starter, growth, machine learning, and real-time.

Naturally, there are overlaps among the phases since each organization has its own unique evolution path.

Phase 1: Starter Stack

diagram of the starter stack

In this phase, you’re collecting data from your websites and apps and sending it to multiple downstream destinations (such as Amplitude and Braze). Integration requests are typically lodged with IT or engineering by two teams:

  • Marketing: A marketer wants to send data to an analytics tool like Google Analytics or an ad network such as Facebook, or wants to send personalized emails to customers
  • Product: A product manager wants to get better insights on how customers interact with particular features on their application

The ad-hoc nature of these requests for point-to-point integrations strains your engineering and IT teams. This is where a one-time integration with RudderStack or Segment significantly reduces engineering bottlenecks: you instrument your sites and apps once, and the tool forwards the data to each destination.
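To make the idea concrete, here is a minimal sketch of "collect once, send everywhere." It is an illustration only, not RudderStack's or Segment's actual SDK: the `EventRouter` class, the `track` method shape, and the destination names are all hypothetical, and the destinations are plain in-memory lists rather than real tools.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Destination = Callable[[Event], None]


class EventRouter:
    """Hypothetical router: accept an event once, forward it to every destination."""

    def __init__(self) -> None:
        self._destinations: Dict[str, Destination] = {}

    def register(self, name: str, send: Destination) -> None:
        # Adding a tool becomes a registration, not a new point-to-point integration.
        self._destinations[name] = send

    def track(self, user_id: str, event: str, properties: Dict[str, object]) -> List[str]:
        payload: Event = {"user_id": user_id, "event": event, "properties": properties}
        delivered: List[str] = []
        for name, send in self._destinations.items():
            try:
                send(dict(payload))  # shallow copy per destination
                delivered.append(name)
            except Exception:
                # A failing destination should not block the others.
                continue
        return delivered


# Stand-ins for an analytics tool and an email tool (assumed, for illustration).
analytics_inbox: List[Event] = []
email_inbox: List[Event] = []

router = EventRouter()
router.register("analytics", analytics_inbox.append)
router.register("email", email_inbox.append)

delivered = router.track("user-1", "Product Viewed", {"sku": "A-100"})
```

The application code calls `track` once per event. When marketing or product asks for a new tool, the change is one more `register` call rather than another SDK in the website or app.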

The Starter Stack is right for companies at the beginning of their data journey, that is, companies with simple data use cases and limited budgets. Startups that are less than two years old or pre-Series B often fall into this category. The rapid adoption of this stack over the last decade has fueled the success of providers that pipe data into SaaS tools (e.g., Segment), SaaS tools that consume this data (e.g., Amplitude and Braze), and tools that provide analytics (e.g., Amplitude and Mixpanel).

Common starter stack challenges:

  • Different teams want data delivered to their preferred applications
  • Brittle data integrations create a drain on the engineering team
  • Multiple SDKs slow website and app performance

Phase 2: Growth Stack

diagram of the growth stack

As your company grows, the proliferation of data destinations leads to data silos. You’ll probably hire data analysts to make sense of the data across a wide range of business functions. Out of necessity, your company invests in centralizing data into a warehouse. The warehouse soon evolves into the single source of truth for analytics. To answer complex questions, the data science team starts running transformations in the warehouse and sending the resulting insights to downstream destinations.

In this phase, the warehouse becomes the center of your data stack. This warehouse-centric model catalyzed the success of Snowflake (the most successful cloud IPO to date), along with the companies that facilitate the movement of data into the warehouse (e.g., Fivetran, Stitch Data, and RudderStack). The emergence of the data warehouse as a data source also sparked the creation of companies that move data from the data warehouse to downstream apps. The prominent examples here are Census, Hightouch, and RudderStack.

Declining warehouse storage costs mean companies can now load raw data into their warehouse first and then run transformations on it there (instead of summarizing the data before loading to reduce storage costs). This shift led to the rise of dbt as the industry standard for data warehouse transformations. RudderStack helps you deploy your dbt models on customer data (i.e., operationalize dbt models) and pull data from the warehouse into downstream destinations via Reverse ETL.
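The load-first, transform-later pattern and the Reverse ETL step can be sketched in a few lines. This is a simplified illustration under stated assumptions: an in-memory SQLite database stands in for a cloud warehouse, the table names (`raw_events`, `high_value_customers`) and the 100-dollar threshold are made up, and the SQL plays the role a dbt model would play in a real project. The final step only builds payloads; it does not call any real destination API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumed stand-in for a cloud warehouse

# 1. Load: raw events land in the warehouse untouched.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("u1", "order_completed", 120.0),
        ("u1", "order_completed", 80.0),
        ("u2", "order_completed", 30.0),
        ("u3", "page_viewed", 0.0),
    ],
)

# 2. Transform: build a customer cohort inside the warehouse with SQL.
conn.execute(
    """
    CREATE TABLE high_value_customers AS
    SELECT user_id, SUM(amount) AS lifetime_value
    FROM raw_events
    WHERE event = 'order_completed'
    GROUP BY user_id
    HAVING SUM(amount) >= 100
    """
)


# 3. Reverse ETL: read the modeled table and shape rows for a downstream tool.
def build_sync_payloads(connection: sqlite3.Connection) -> list:
    rows = connection.execute(
        "SELECT user_id, lifetime_value FROM high_value_customers ORDER BY user_id"
    ).fetchall()
    return [
        {"user_id": user_id, "traits": {"lifetime_value": value, "cohort": "high_value"}}
        for user_id, value in rows
    ]


payloads = build_sync_payloads(conn)
```

Because the raw events stay in the warehouse, the cohort definition can change later (say, a different threshold) without re-collecting any data.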

The Growth Stack is ideal for companies with growing data complexity and commensurate sophistication in data engineering. At this stage, the organization starts hiring data analysts. This stack is common among startups two to five years old, going through a Series B to pre- or early-IPO growth phase. GitLab is a good example of a company in this phase.

Common growth stack challenges:

  • Point-to-point integrations create data silos
  • Better business intelligence and more personalized customer activation require data transformation in the warehouse
  • Data transformed in the warehouse (e.g., customer cohorts) needs to be delivered to downstream destinations

Phase 3: Machine Learning Stack

diagram of the machine learning stack

Once the warehouse meets your organization’s need to answer business questions about historical data, the need for predictive analytics arises.

Predictive analytics lets you anticipate user behavior based on early signals and optimize marketing activities accordingly. For example, you might want to predict which users are at risk of churn and send them an “engagement offer” email.

In this phase, we see the emergence of data lake platforms, such as Databricks, that are optimized for the storage of unstructured data and for machine learning workloads. In addition, a wide variety of MLOps companies have emerged here.

In this phase, there are two typical (idealized) workflows:

Training Flow

  1. Batch data from the warehouse and streaming data from user applications are ingested into the data lake, and features are defined using SQL or Python
  2. Feature values are generated on training data
  3. An output label (e.g., has a user churned or not) is generated
  4. A model is built to predict the output label from the features

Production Flow

  1. Feature values are generated from input data
  2. The model is used to predict the output label from the feature values
  3. The label is synced to a downstream destination (using Reverse ETL) to take action (e.g., send an email with a discount to customers labeled as likely to churn)
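The two flows above can be sketched end to end in plain Python. Treat this as a toy illustration rather than a production recipe: the feature names (`sessions_last_30d`, `days_since_last_seen`), the sample users, and the nearest-centroid classifier are all assumptions chosen to keep the example small. A real stack would typically use Spark or a dedicated ML library in place of the hand-rolled model.

```python
from typing import Dict, List, Tuple

Features = Tuple[float, float]


def compute_features(user: Dict[str, float]) -> Features:
    """Shared by both flows: turn raw user data into feature values."""
    return (float(user["sessions_last_30d"]), float(user["days_since_last_seen"]))


def train(rows: List[Tuple[Dict[str, float], bool]]) -> Dict[bool, Features]:
    """Training flow: average the feature vectors per label (nearest-centroid model)."""
    sums: Dict[bool, List[float]] = {}
    counts: Dict[bool, int] = {}
    for user, churned in rows:
        feats = compute_features(user)
        total = sums.setdefault(churned, [0.0] * len(feats))
        for i, value in enumerate(feats):
            total[i] += value
        counts[churned] = counts.get(churned, 0) + 1
    return {label: tuple(v / counts[label] for v in total) for label, total in sums.items()}


def predict(model: Dict[bool, Features], user: Dict[str, float]) -> bool:
    """Production flow: assign the label whose centroid is closest to the user."""
    feats = compute_features(user)

    def squared_distance(centroid: Features) -> float:
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Training data with a known output label (True = churned). Values are made up.
training_rows = [
    ({"sessions_last_30d": 20, "days_since_last_seen": 1}, False),
    ({"sessions_last_30d": 16, "days_since_last_seen": 3}, False),
    ({"sessions_last_30d": 2, "days_since_last_seen": 25}, True),
    ({"sessions_last_30d": 4, "days_since_last_seen": 21}, True),
]
model = train(training_rows)

# Production: score current users and collect the ones to sync downstream.
current_users = {
    "user-a": {"sessions_last_30d": 1, "days_since_last_seen": 28},
    "user-b": {"sessions_last_30d": 15, "days_since_last_seen": 2},
}
at_risk = [uid for uid, user in current_users.items() if predict(model, user)]
```

The `at_risk` list is what a Reverse ETL job would deliver to an email tool. Note that `compute_features` is shared by training and production, which keeps the feature definitions consistent between the two flows.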

At this stage, the organization is investing heavily in data engineering resources. Also, beyond the feature tuning done by data scientists, machine learning engineers start playing an important role in deploying and operationalizing tools such as Spark. RudderStack will soon allow you to deploy your Spark models on customer data (i.e., operationalize Spark) and send the results to downstream applications via Reverse ETL.

Typically, companies of meaningful scale are the ones able to invest in the machine learning stack and realize its value, because even a small optimization is multiplied across a large customer base. These companies also have the resources and volume of data to make substantial improvements through machine learning. B2C companies often transition faster to the machine learning stack because of the scale of their user base. However, it is not uncommon to see a large B2B business invest heavily in the machine learning stack.

Common ML stack challenges:

  • Marketing wants to predict user behavior (e.g., likelihood to churn) for better personalization

Phase 4: Real-Time Stack

diagram of the real-time stack

The business models of some of the most sophisticated companies, those that serve millions of customers, require predictive model results and insights from the data warehouse to be delivered back into the application. This means features and query results need to be stored not only in the data warehouse but also in an in-memory database that can serve the application directly.

During this phase, the data starts flowing not only through the “offline” pipeline into the data warehouse but also back into application services. This is often coupled with Kafka as a streaming bus (for coordination with other services) and Redis as an “online” in-memory store (to serve the application). RudderStack is able to sync this data into an in-memory store such as Redis to power your application.
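The split between the offline pipeline and the online store can be sketched as follows. This is an assumption-laden illustration: `InMemoryStore` is a hypothetical dictionary-backed stand-in for Redis (it is not the Redis client API), and the key format `recs:<user_id>`, the function names, and the sample products are invented for the example.

```python
from typing import Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for an online key-value store such as Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    def set(self, key: str, value: List[str]) -> None:
        self._data[key] = list(value)

    def get(self, key: str) -> Optional[List[str]]:
        return self._data.get(key)


def sync_recommendations(store: InMemoryStore, offline_results: Dict[str, List[str]]) -> int:
    """Offline side: push precomputed results from the warehouse into the online store."""
    for user_id, products in offline_results.items():
        store.set(f"recs:{user_id}", products)
    return len(offline_results)


def recommendations_for(store: InMemoryStore, user_id: str, fallback: List[str]) -> List[str]:
    """Online side: the application does a fast key lookup at request time."""
    cached = store.get(f"recs:{user_id}")
    return cached if cached is not None else fallback


store = InMemoryStore()
synced = sync_recommendations(
    store, {"user-1": ["tent", "stove"], "user-2": ["boots"]}
)

popular = ["water bottle"]  # default shown to users with no precomputed result
known = recommendations_for(store, "user-1", popular)
unknown = recommendations_for(store, "user-9", popular)
```

The expensive computation happens offline, and the application only performs a key lookup. That is why the result can be served within the latency budget of a page or screen load.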

An illustrative use case is an e-commerce application that uses browsing history to compute and serve product recommendations directly in the app. DoorDash, Stripe, and Uber have invested heavily to make offline and online predictions on user data and then serve them back into the application to customize the user experience. The important thing to note here is that real-time is less about real-time computation of insight – it’s about serving insights back into the application in real-time. We acknowledge that a discerning engineer might find the term “real-time” overloaded here.

Common real-time stack challenges:

  • Marketing wants to deliver personalized experiences in real-time to the application layer (e.g., in-app product recommendations)

At this point, you are collecting the data, analyzing it, and using both historical and predictive pipelines to power your marketing, your analytics, and your application itself. Congrats! It is time for a cold beverage ;)

About the author
Eric Omwega
Vice President of Marketing & Operations at RudderStack
