Your AI/ML success starts with data quality

Written by

Matt Kelliher-Gibson

Technical Product Marketing Manager

Machine learning has the potential to create enduring competitive advantage. However, there’s a painful gap between the hyped outcomes and the ground reality of achieving them. Companies have been placing big bets on AI/ML to enhance and validate their data infrastructure investments, but many struggle to see a positive return.

Leadership, seeing the rising costs of data collection and storage while watching AI/ML efforts stall, is growing frustrated. Data teams are under more scrutiny than ever because leadership is now laser-focused on data activation for business use cases – recognizing that data investment without activation is only a cost.

I’ve felt this pain firsthand as a data leader and scientist. Through ups and downs, I’ve learned success with machine learning always boils down to superior data quality. You’ve got to solve for data quality before your data gets to the data science team. Otherwise, they’ll be stuck trying to solve an endlessly growing list of data quality issues on their own.

If you’re a data leader struggling to deliver AI/ML wins, it’s easy to focus on the wrong issues. In this post, I’ll highlight the pitfalls that distract us from the core issue – data quality – and share best practices for solving the problem at the source. I’ll also detail how solving for data quality frees up your data scientists to do their best work.

Misdiagnosing the problem with lackluster AI/ML results

When you look at stalled or unsuccessful AI/ML projects, it’s tempting to conclude that the problem lies in the model-building process. This thinking is easy to fall into because there often are issues in and around model building, but I’ve learned that these are rarely the root cause of unsuccessful projects. Here are two of the common pitfalls I’ve seen.

First, it’s easy to get distracted by shiny new tools. The marketplace is always keen to treat pain points as opportunities, so when vendors saw companies struggling to produce AI/ML wins, we got an explosion of tools focused on the model-building process. We now have tools for automating and streamlining every phase of the AI/ML life cycle:

EDA
Feature engineering
Model training and evaluation
Model retraining
Model deployment and monitoring

These tools are valuable for maturing your data science function but don’t address the core problem – poor data quality. I’ve seen them left vastly underutilized as teams struggle to get AI/ML projects off the ground.

Second, it’s easy to look at companies that have had massive success on the back of machine learning and conclude they must have either superior talent or special knowledge about the model-building process. This assumption can lead you to hire data science talent away from those companies, thinking the right people can kickstart your efforts. However, hiring new talent also fails to address the core problem – poor data quality.

"Look at all the different architectures of the models coming out. There's not a lot different, actually. They're just trained on better data or more data."

—Mark Huang, Chief Architect at Gradient from The Data Stack Show

When it comes down to it, it’s not about the tools, talent, or frameworks. It’s all about the data. Lackluster AI/ML results stem from a problem that originates well before the data science team takes over – poor data quality. Here’s how it can happen.

How data science gets stuck

With pressure to become data-driven as quickly as possible, many companies begin collecting as much data as possible, regardless of quality or structure. But just like automakers can’t use iron ore straight from the mine to make the steel body of a car, data scientists can’t turn raw data into profitable AI/ML products. Data science and AI teams instinctively know this, but often it is not accounted for until the project is handed off to them, so they are stuck trying to squeeze it into an already tight project plan.

Playing the hand they’re dealt, they labor, trying to hack together just enough quality data to get a model build process started. While building a production data cleaning process for their rough code, they wrestle with siloed data, making notes like “adding in later iterations” and “needing to flag data engineering.” But no amount of work changes the reality that they’re playing a bad [data] hand. Bad data lowers the chances of a successful AI/ML project before it begins and dooms many from the start.

Dealing with bad data does take work, but driving toward superior data quality across the data activation lifecycle always pays dividends. The downstream benefits, most pronounced for AI/ML work, far outweigh the initial setup and maintenance cost. In fact, incorporating data quality measures is one of the best ways to shorten time to value for customer data. So, if superior data quality is the goal, let’s take a look at what it is and how to achieve it.

"When you have the power of RudderStack in hand, you can blast off right away. It’s so much easier to build a machine learning model once your designs are driven by clean data, useful user features, and 360 customer views."

—Wei Zhou, Wyze Director of Data Engineering

The components of superior data quality

Superior data quality comes from data with three important characteristics. Your data must be clean, complete, and current to achieve superior data quality.

Clean data is accurate without errors, consistent in structure and data type, relevant to the data science goal (with no data leakage or data that is unavailable at the decision time), and structured for the desired AI/ML workflow.
Complete data has all the relevant data joined together to give a comprehensive view of the target (e.g. customer).
Current means that the data set you’re working with is up to date and that all your incoming data is clean and complete, not just the original data used for training. So, superior data quality is an ongoing process and commitment, not a one-time event.
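
To make the three characteristics concrete, here is a minimal sketch that scores a batch of customer rows on each one. The field names, the expected types, and the one-day freshness threshold are assumptions made up for illustration, and real checks would cover far more than types and nulls.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical required fields and freshness threshold, for illustration only.
REQUIRED_FIELDS = {"customer_id": str, "plan": str, "monthly_spend": float}
MAX_AGE = timedelta(days=1)


def quality_report(rows, now):
    """Return the share of rows that pass each of the three checks."""
    clean = complete = current = 0
    for row in rows:
        # Clean: every populated required field has the expected type.
        if all(
            isinstance(row[field], expected)
            for field, expected in REQUIRED_FIELDS.items()
            if row.get(field) is not None
        ):
            clean += 1
        # Complete: no required field is missing or null.
        if all(row.get(field) is not None for field in REQUIRED_FIELDS):
            complete += 1
        # Current: the row was updated within the allowed window.
        updated = row.get("updated_at")
        if updated is not None and now - updated <= MAX_AGE:
            current += 1
    total = len(rows) or 1  # avoid dividing by zero on an empty batch
    return {
        "clean": clean / total,
        "complete": complete / total,
        "current": current / total,
    }
```

Running a report like this on every incoming batch, not only on the original training set, is what turns "current" from a one-time check into an ongoing commitment.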

Life with bad data

When your data science team is forced to work with bad data, your ML projects will underperform at best. At worst, they’ll never get off the ground. Beneath these negative outcomes, your data science team is probably struggling with the consequences of bad data, some obvious and some more subtle.

The obvious consequence is that bad data forces your data scientists to spend their time on data quality instead of model exploration and development. This might look like:

Chasing down data from silos across the organization and trying to get access to as much as possible (even if only a CSV extract)
Trying to figure out how or if they can connect the data together
Reformatting and transforming the data for the project/model

When data quality isn’t solved upstream, your data science team will get stuck in an endless cycle of discovering new data quality issues and manually attempting to fix them. Here’s a small sample of the type of issues that create this painful, ad hoc work:

Incorrect coding/mapping in source systems
Open text fields for identifiers
Duplicate records from incorrect joins
Future data leaking into training
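
As one example from the list above, duplicate records from an incorrect join usually come from a join key that is not unique on one side. The sketch below shows the row count growing after a join and a simple check that would have flagged it beforehand. The tables, the column names, and the naive `left_join` helper are hypothetical stand-ins, not a recommended join implementation.

```python
from collections import Counter


def fan_out_keys(right_rows, key):
    """Return join keys that appear more than once on the right-hand side.

    Joining on any of these keys duplicates the matching left-hand rows.
    """
    counts = Counter(row[key] for row in right_rows)
    return {k: n for k, n in counts.items() if n > 1}


def left_join(left_rows, right_rows, key):
    """Naive left join, used here only to show the row count growing."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        # Unmatched left rows are kept once, with no right-hand columns.
        for right in index.get(left[key], [None]):
            joined.append({**left, **(right or {})})
    return joined


accounts = [{"account_id": "a1"}, {"account_id": "a2"}]
# Hypothetical CRM extract in which account a1 was exported twice.
crm = [
    {"account_id": "a1", "owner": "sam"},
    {"account_id": "a1", "owner": "sam"},
    {"account_id": "a2", "owner": "lee"},
]

joined = left_join(accounts, crm, "account_id")  # 3 rows from 2 accounts
duplicated = fan_out_keys(crm, "account_id")     # flags a1 before the join runs
```

A check like `fan_out_keys` is cheap to run upstream. When it is missing, the data scientist typically discovers the problem much later, as inflated counts in a training set.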

In spite of their best efforts to create superior data quality on the fly, there will be gaps (missing data sources), shortcuts, and compromises (deleting poor-quality records). That will lead to stalled model development or underperformance.

The worst part about this for you is that these issues are hard to explain upstream. Leadership can get the impression that nothing is happening as deadlines continue to slide.

There’s also a more significant and subtler problem you’ll eventually encounter. When data science teams get bogged down with data quality issues, pressure to show some kind of value increases, so they begin to prioritize projects based on data availability instead of impact and business need.

These compromises manifest in different ways but always involve settling for suboptimal outcomes. Here are a few examples of the types of decisions bad data can drive:

Making a calculated choice not to bring in siloed data – like behavioral data stuck in a SaaS platform – because of the difficult nature of the work involved.
Settling to answer a less impactful question because it has more easily obtainable data. For example, predicting the likelihood an account is inactive instead of the likelihood it will become inactive.

The issue here is not that your data science team is incapable or lazy. On the contrary – they may be brilliant folks burning the candle at both ends. The problem is they’re not the ones best suited or best positioned to solve the problem. Let’s go back to the automotive example.

The most skilled workers in the most state-of-the-art facility could not produce a quality vehicle if they had to start with raw materials. Nor could they make a quality vehicle if they had to start with bad or wrong parts. To do the work they’re capable of and fully utilize their state-of-the-art tools, these workers must begin with correct, quality parts. Similarly, for data scientists dealing with bad data, the solution is not to try harder. It’s to solve the problem upstream.

Learn how Wyze increased AI/ML productivity 3X with RudderStack

Watch our webinar with Wyze Senior Data Scientist Pei Guo and Director of Data Engineering Wei Zhou to learn how they use RudderStack to ensure data quality and streamline the handoff from data engineering to science, enabling their customer engagement team to ship 3X the number of AI-driven campaigns.

The remedy around the warehouse

If you want to get strong performance from your AI/ML models, the solution is simple: You have to make superior data quality a priority, and you have to own it upstream. This will take the handcuffs off of data science so they can focus on solving business problems instead of prepping and cleaning data. It’s the only way to create an efficient data science function that can ship impactful models quickly and embrace experimentation.

The pathway to superior data quality is known and simple in concept. First, you need to get all of your data in the same place – your data warehouse or data lake – where it’s complete, accessible, and easy to work with. You also have to stop bad data before it pollutes your data lake/warehouse because once your warehouse gets polluted it’s almost impossible to fix or eliminate. Finally, you need to standardize the data internally and align it to your business’s internal operating model and strategy.
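
Stopping bad data before it reaches the warehouse can be sketched as a gate that checks each incoming event against an agreed schema and holds back anything that fails. The tracking plan, event names, and function names below are hypothetical and only illustrate the idea; they are not the API of any particular tool.

```python
# Hypothetical tracking plan: event name -> required properties and their types.
TRACKING_PLAN = {
    "order_completed": {"order_id": str, "total": float},
    "page_viewed": {"path": str},
}


def violations(event):
    """Return the reasons an event fails the tracking plan (empty if valid)."""
    schema = TRACKING_PLAN.get(event.get("name"))
    if schema is None:
        return [f"unplanned event: {event.get('name')!r}"]
    props = event.get("properties", {})
    problems = []
    for field, expected in schema.items():
        if field not in props:
            problems.append(f"missing property: {field}")
        elif not isinstance(props[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


def route(events):
    """Split events into those safe to load and those held back for review."""
    accepted, quarantined = [], []
    for event in events:
        problems = violations(event)
        if problems:
            quarantined.append((event, problems))
        else:
            accepted.append(event)
    return accepted, quarantined
```

Quarantining rather than silently dropping failed events is a deliberate choice in this sketch: the warehouse stays clean, and the team that owns the source still gets a concrete list of what to fix.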

Historically, the difficulty in achieving superior data quality has been in prioritization and execution. However, there is good news. The goal of superior data quality is more attainable than it has ever been:

We’re finally addressing data silos with the tooling shift towards the data warehouse and away from walled SaaS platforms.
More and more tools are focused on catching and correcting low-quality data before it hits the data warehouse.
Data modeling, a function that was once thought obsolete in the modern data stack, is making a return.

A whole new world

When you take the data quality handcuffs off of your data science team, you free them up to get closer to the business and bring their data science skills to bear. With more bandwidth, they can begin shipping models that make an impact, helping position the data org as a value creator, not a cost center. In my experience, there are three primary areas where these teams should focus their energy.

First, they can proactively ensure the correct question is being asked and answered. Translating from a business problem to an AI/ML problem can be difficult. It’s easy to start a project off by only a few degrees and end up solving the wrong problem. For example, the business goal might be to predict churn likelihood. Without proper upfront understanding and alignment, a data scientist could sample currently churned and active accounts to build their model, forgetting to account for timing and leaving out data on what the accounts looked like before they churned. Focusing on alignment up front and identifying the correct datasets prevents these issues and refines the project goals and objectives.
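
One way to account for timing in the churn example is to build each training row as of a snapshot date: features use only what was known on that date, and the label looks at what happened afterward. The sketch below assumes a 30-day label window, a single event-count feature, and hypothetical field names such as `cancelled_on`; the right window and features depend on the business question.

```python
from datetime import date, timedelta

# Assumed for illustration: an account counts as churned if it cancels
# within 30 days after the snapshot date.
LABEL_WINDOW = timedelta(days=30)
FEATURE_WINDOW = timedelta(days=30)


def build_example(account, events, snapshot):
    """Build one training row as of `snapshot`, or None if ineligible."""
    cancelled = account.get("cancelled_on")
    # Accounts that had already churned at the snapshot are not prediction targets.
    if cancelled is not None and cancelled <= snapshot:
        return None
    # Features may only use events observed on or before the snapshot.
    recent = [
        e
        for e in events
        if e["account_id"] == account["account_id"]
        and e["on"] <= snapshot
        and snapshot - e["on"] <= FEATURE_WINDOW
    ]
    # The label comes from what happened after the snapshot.
    label = cancelled is not None and cancelled <= snapshot + LABEL_WINDOW
    return {
        "account_id": account["account_id"],
        "events_last_30d": len(recent),
        "churned_next_30d": label,
    }
```

Built this way, the model learns what accounts looked like before they churned, which is the question the business is actually asking.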

Next, with the data quality burden offloaded, data science can spend time ensuring model accuracy. Especially for AI use cases, models need to have superhuman accuracy. The largest impact on model performance comes from ensuring the training data has proper coverage of all possible outcomes and edge cases. With major data quality issues no longer taking up time, data scientists can look for where predictions are off by large margins and adjust the training data to accommodate. Higher accuracy means fewer and smaller prediction errors, which makes the model better and increases stakeholder confidence.

"AI/ML can define more features and train and deploy models a lot more efficiently. It enables rapid experimentation which is the key in our business growth and revenue growth."

—Wei Zhou, Wyze Director of Data Engineering

Finally, when you solve for superior data quality, your data science team can take advantage of the compounding nature of faster development cycles. Data science is won through iteration, experimenting, and learning. The biggest advantage of companies that are winning with AI/ML is not superior algorithms but years or decades of continuous improvement and experimentation with their proprietary data. The compounding, multiplying effect of small experiments and improvements over time creates a moat that competitors can’t replicate.

You can deliver strong AI/ML results with superior data quality

If you’re a data leader or data scientist under pressure to prove more ROI from AI/ML but struggling to drive the results you know your team is capable of, there’s a good chance you’ve got a data quality problem. The pressure for velocity can be overwhelming, so it’s tempting to just put your head down and try harder, but you can’t jump directly from collecting raw data to AI/ML. Working with bad data will only ever bring more frustration and disappointing results. You need superior data quality to deliver the transformational work you know your team is capable of.

The first step is simple: You have to commit to data quality and do whatever it takes to ensure it starts upstream. It will take work and collaboration, but there’s never been a better time to start. And while chasing shiny new tools can be a distraction, there are tools that can reduce friction and streamline your workflow. The AI/ML team at Wyze increased productivity 3X and made the handoff from data engineering smoother after implementing RudderStack.

Poor data quality is a solvable problem. When you begin driving towards superior data quality, data science can spend more time aligning with the business, optimizing models, and focusing on continuous experimentation and improvement. So, stop spinning your wheels and start driving towards superior data quality today. Check out our page on data quality to see how RudderStack can help.

November 17, 2023
