August 5, 2022
It’s important to have a scalable layer that moves data from point A to point B via data integrations. This is true wherever you are on your data maturity journey; it just gets more complex as you progress. The problem is that building this layer for scalability and sustainability is difficult.
Teams want to glean valuable information from all types of data, and they typically aren’t worried about technical debt; they just want access to their data, stat. Different teams want data from different sources, in different formats, for different use cases, and they all use different business applications.
Sales and marketing might want attribution data from complex user journeys to understand where customers are coming from, and finance needs data from various systems to ensure that the economics of your company's strategy work. Each of these examples is fairly complex on its own, and they’re only a small sample of the data integration challenges you might encounter. Even small organizations can have a dozen disparate sources, all requiring some type of integration solution.
Plus, companies are now looking to go beyond these kinds of numbers: they want to operationalize analytics data to drive optimizations. Without a thoughtful data integration strategy, that means point-to-point integrations in every direction.
Data integration challenges
It takes a lot of work to integrate data from these disparate sources, but nobody said data management was easy. While you’re building your data infrastructure, you’ll likely run into several common data integration challenges. Do any of these sound familiar?
- APIs constantly changing their endpoints and data schemas
- Your company changing business processes and replacing an underlying system (think CRM or ERP), creating a need for new integrations
- Ad hoc requests from various teams for new columns and other small tasks that quietly build up a massive backlog
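The first challenge, upstream schemas shifting under you, is often the sneakiest. As a rough sketch (the field names here are hypothetical), a small drift check run before loading can surface a renamed or dropped column before it corrupts downstream models:

```python
# Sketch: detect schema drift in incoming API records before loading them.
# EXPECTED_FIELDS and all record fields are hypothetical examples.

EXPECTED_FIELDS = {"id", "email", "created_at", "plan"}

def check_schema_drift(record: dict) -> dict:
    """Report which fields the upstream API dropped or added vs. our expectations."""
    actual = set(record)
    return {
        "missing": sorted(EXPECTED_FIELDS - actual),    # fields the API dropped
        "unexpected": sorted(actual - EXPECTED_FIELDS), # fields the API added
    }

# If upstream renames 'plan' to 'tier', it shows up as one missing
# field plus one unexpected field instead of a silent null column.
drift = check_schema_drift(
    {"id": 1, "email": "a@b.co", "created_at": "2022-08-05", "tier": "pro"}
)
```

A check like this doesn't prevent the change, but it turns a silent modeling error into a loud, early alert.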
It’s not uncommon for data teams to find themselves redesigning integrations in a never-ending fight to keep up with business needs, all thanks to unforeseen data integration debt.
What is integration debt?
You don’t know what you don’t know, and what you don’t know can hurt you. That's a big part of data integration debt. Unforeseen changes kill a lot of a team's productivity, and they don’t just bog down backlogs; they carry real downstream consequences. For example, changes to how the source data is modeled can lead you to make bad assumptions when building out your data warehouse.
The data integration process is fraught with pitfalls. Still, when we first build data integrations, it’s tempting to take the custom route because it lets us start immediately instead of evaluating different integration tools and then getting stuck waiting for budget approval.
Initial paths to integrate
I often see companies try to build integrations internally when they first begin thinking about a data integration solution. This applies to building both standard ETL pipelines and streaming-event integrations.
So why not just build it yourself?
Like I said, there’s no need to talk to procurement or get stakeholders involved, and it feels good to start immediately. Just start building integrations with some Python scripts and a few consumers waiting on Kafka to publish events.
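To make this concrete, here's a rough sketch of the kind of hand-rolled extract-and-transform script teams often start with. The API URL and field names are made up for illustration; real scripts quickly grow pagination, retries, and auth on top of this:

```python
import json
from urllib.request import urlopen

# Sketch of a hand-rolled integration script. The endpoint and field
# names are hypothetical.
API_URL = "https://api.example.com/v1/orders"

def extract(url: str = API_URL) -> list[dict]:
    """Pull raw records from the source API."""
    with urlopen(url) as resp:
        return json.load(resp)

def transform(raw: list[dict]) -> list[dict]:
    """Keep only the columns the warehouse model expects."""
    return [
        {"order_id": r["id"], "amount_usd": r["total"] / 100}  # cents to dollars
        for r in raw
    ]
```

Twenty lines like these feel like a win on day one. The rest of this post is about what happens to them on day two hundred.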
Easy, right? As long as you have a few strong developers, it's not a problem.
Developing internal integrations lets you build your own infrastructure, one that isn't locked into a vendor or limited by the stipulations of a no-code/low-code integration platform. So, in theory, you can deliver whatever functionality your stakeholders are looking for.
Great. In no way could this possibly go wrong.
What you don’t see - the cons of custom-building data integrations
The problem with building your own integrations is everything that can go wrong: there are a lot of issues, and a lot of future work, you might not be considering. Whether you’re creating standard ETL integrations or dealing with events, it’s good to approach your data integration strategy with a healthy respect for Murphy's law.
Change is a constant when it comes to building integrations: changes to APIs, to schemas, even to the underlying systems. Here are a few examples of what I’m talking about:
Sudden migrations - If you’ve worked on a data or enterprise engineering team, you’re more than accustomed to migrations: migrations in underlying systems, migrations in tooling, migrations in just about anything and everything. Any change in your company's systems means spending time updating your custom-built integrations. Perhaps the new ERP system only sends out CSV reports, whereas the prior system exposed APIs. That means extensive rebuilds and, of course, remodeling of your data.
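The hidden cost of that ERP example is that the whole extract layer gets rewritten just to produce the same records the old API path did. A hedged sketch, with hypothetical column names, of what that rebuild looks like:

```python
import csv
import io

# Sketch: after a migration, the new ERP only emits CSV exports, so the
# extract layer is rebuilt to yield the same record shape the old API
# integration produced. Column names are hypothetical.

def extract_from_csv(csv_text: str) -> list[dict]:
    """Parse an ERP CSV export into the records the warehouse already expects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"order_id": int(row["OrderID"]), "amount_usd": float(row["Total"])}
        for row in reader
    ]
```

None of this code adds business value; it exists purely to keep downstream models unchanged after an upstream decision you didn't make.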
APIs are deprecated - Over time, API versions get deprecated. It’s inevitable, and it applies to authentication as well as endpoints. When it happens, you may have to pause current projects and high-ROI initiatives to put out the fire started by the sudden API change.
The future developer - The future developer may not have the same skill set or want to invest time in maintaining your code. I’ve come across this in my consulting time and time again. A previous engineer or team of engineers built a system, left, and now a new hire needs to manage it. But for one reason or another the new engineer either doesn’t have the time or doesn’t want to spend their weekends maintaining the previous system.
Events change - A common issue data engineers and system developers have to deal with is events changing in their source applications. This often happens when software engineers modify events to add or remove features. In many cases software engineers won’t communicate these changes, and when they do, it’s often on the day of the change. This leads to failures in event streaming pipelines or standard integrations, all of which creates further data integration debt.
How to avoid data integration debt
To overcome these data integration challenges, you need a scalable and sustainable data integration layer. Luckily, you have options. With open source technologies and flexible off-the-shelf tooling, it’s easier than it used to be to create a data integration framework that reduces the integration debt you take on. There are, of course, many considerations and tradeoffs, especially when it comes to managing events, because events require a particularly robust framework or a powerful low-code solution. But when you do the work up front, you’ll be able to deliver on those demands for data access with velocity, and stay focused on your current projects. Stay tuned for part two, where I’ll walk you through a better way to approach integration management.
Seattle Data Guy, Data Science and Data Engineering Consultant