What is Identity Resolution?

Data engineers will tell you that most of the work of building a model is data preparation. Data in the real world is dispersed, irregular, and messy, which is exactly the kind of chaos that automation digests worst and that produces data silos. No data is more diverse than customer data, where a single user’s information might be spread across dozens of devices, accounts, products, and marketing campaigns.

Competitive companies seeking to create personalized experiences need to build a single view of the customer. To do this, they must turn a massive volume of scattered data points into an accurate model of each customer's behavior and relationship with the brand. The most important step in this process is called identity resolution, a term that encompasses the modern methods used to render the ambient noise of an internet-connected society into a coherent signal ready to deliver profit.

Identity resolution is the process of assembling a distributed mosaic of behaviors from a given customer (spread across social media, browsing history, brand engagement, etc.) into a concrete datapoint useful for sales leads, market research, or resale. The mix of data starts with the record of a company's own relationship with that customer, stored in customer relationship management (CRM) software. This starting point can be supplemented by third-party services that specialize in the detective work involved in connecting device IDs, geolocations, or purchasing behavior. The identity resolution process should give your company an accurate view into every individual customer relationship, enabling not only decisions about individual customers but also high-level, strategic insights from fine-tuned demographic knowledge.

Ultimately, identity resolution leverages the interconnected web of modern life to supercharge your business and deliver an instantaneous personal relationship with every one of your customers.

How does identity resolution work?

Identity resolution starts with an identity graph. In certain fields of mathematics, a graph refers to a network of nodes connected to one another by lines (like a subway map, or a spider web). Identity graphs sew individual scraps of a single customer’s information into a quilt that represents their whole identity. If one device is frequently linked by Bluetooth to another, and both are logged into a given account, an identity graph would represent connections between those two device IDs and the account's email address. This lets you develop a complete picture of an individual customer.
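The Bluetooth example above can be sketched in a few lines of Python. This is only an illustration: the identifier strings, the function names `build_identity_graph` and `connected_identifiers`, and the sample edges are all hypothetical, and a production identity graph would live in a warehouse or graph store rather than in memory.

```python
from collections import defaultdict

def build_identity_graph(edges):
    """Return an adjacency map built from undirected (identifier, identifier) edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected_identifiers(graph, start):
    """Collect every identifier reachable from `start` (iterative depth-first search)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen

# Hypothetical observations: two devices paired over Bluetooth,
# both logged into the same account, plus one unrelated customer.
edges = [
    ("device:phone-123", "device:laptop-456"),       # Bluetooth pairing
    ("device:phone-123", "email:ana@example.com"),   # account login
    ("device:laptop-456", "email:ana@example.com"),  # account login
    ("device:tablet-789", "email:ben@example.com"),  # different customer
]

graph = build_identity_graph(edges)
profile = connected_identifiers(graph, "device:laptop-456")
print(sorted(profile))
```

Starting from either device, the traversal reaches the other device and the account's email address, while the unrelated tablet stays in its own cluster.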

Simple databases are not an efficient way of storing complex graphs, and as online footprints become larger and more complex, it has become necessary to shift away from simple database solutions to more complex representations of data. Data warehouses are a good tool for storing the sensitive and high-demand data in an identity graph.

Deterministic or probabilistic matching

The difference between first- and third-party data is always relevant in issues of customer data. In the case of identity resolution, your data’s provenance can have a deep impact on the type of identity resolution that can be performed. Loosely speaking, first-party data (directly from customers to your databases) allows for more comprehensive identity resolution, whereas third-party data (large-scale anonymized data from a vendor) enables a more nebulous version of identity matching.

Deterministic identity matching

Many of the classic cases of identity resolution fall into the category of deterministic identity resolution. Deduplication and device stitching are necessary to get a clear view of everything from customer journeys to ad campaign efficacy. Clickstream data resolution, which retroactively matches the actions a user took to an identity provided later, is a crucial tool for observing onboarding and understanding the impression your brand makes.
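As a rough sketch of clickstream data resolution, the snippet below backfills a user ID onto earlier anonymous events once the same anonymous ID is identified. The event fields (`anonymous_id`, `user_id`, `action`) and the function name `resolve_clickstream` are assumptions made for illustration, not the schema of any particular tool.

```python
def resolve_clickstream(events):
    """Backfill user_id onto anonymous events once the same anonymous_id is identified."""
    # Pass 1: learn which anonymous IDs were later tied to a known user.
    known = {}
    for event in events:
        if event.get("user_id"):
            known[event["anonymous_id"]] = event["user_id"]
    # Pass 2: copy each event, filling in the user_id where one is now known.
    return [
        {**event, "user_id": event.get("user_id") or known.get(event["anonymous_id"])}
        for event in events
    ]

# Hypothetical event stream: anon-1 browses, then signs up; anon-2 never identifies.
events = [
    {"anonymous_id": "anon-1", "action": "view_pricing", "user_id": None},
    {"anonymous_id": "anon-1", "action": "start_trial", "user_id": None},
    {"anonymous_id": "anon-1", "action": "sign_up", "user_id": "user-42"},
    {"anonymous_id": "anon-2", "action": "view_home", "user_id": None},
]

resolved = resolve_clickstream(events)
for event in resolved:
    print(event["action"], event["user_id"])
```

The pre-signup page views now belong to the same user as the signup itself, which is what makes onboarding analysis possible; the visitor who never identified stays anonymous.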

All of that utility is generated by first-party data — data collected from a customer and stored only for your own company’s purposes. By holding a microscope to user behaviors, you gather “known identifiers,” useful data which allows for a highly personalized relationship with your customers.

An example of this in action would be a user visiting your website and receiving a certain anonymous ID. If that user signs up for an account or makes a purchase, this anonymous ID is now connected to an email address, a phone number, or an internal user ID. When that same user visits your website from a different device like a smartphone, they would receive a new anonymous ID and initially appear as a second, separate user. If the user logs into their existing account, their new anonymous ID is now also connected to the same email address or user ID. In deterministic matching, one can now reliably connect the first anonymous ID to the second one, and an identity graph begins to form.
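The two-device scenario above maps naturally onto a union-find (disjoint-set) structure, sketched here under stated assumptions: the class name `IdentityStitcher`, its methods, and the ID strings are hypothetical, and union-find is one reasonable way to implement deterministic stitching rather than the method any specific product uses.

```python
class IdentityStitcher:
    """Minimal union-find over identifiers; each linked group represents one person."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        """Return the representative of the group containing `identifier`."""
        self.parent.setdefault(identifier, identifier)
        while self.parent[identifier] != identifier:
            # Path halving: point this node at its grandparent to flatten the tree.
            self.parent[identifier] = self.parent[self.parent[identifier]]
            identifier = self.parent[identifier]
        return identifier

    def link(self, a, b):
        """Record an exact, observed connection (for example a login) between two IDs."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_person(self, a, b):
        return self.find(a) == self.find(b)

stitcher = IdentityStitcher()

# Two visits from different devices initially look like two unique users.
before_login = stitcher.same_person("anon:laptop", "anon:phone")

# Sign-up on the laptop, then a login on the phone, both tied to one user ID.
stitcher.link("anon:laptop", "user:42")
stitcher.link("anon:phone", "user:42")
after_login = stitcher.same_person("anon:laptop", "anon:phone")

print(before_login, after_login)
```

Links are only ever created from exact matches on a known identifier, which is what makes this approach deterministic.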

Probabilistic identity matching

Unlike deterministic matching, probabilistic matching draws conclusions about likely customer identity using hints from non-deterministic data sources. These hints might be device co-location suggesting shared usage, IP addresses or digital fingerprints that help narrow device identity, or fuzzy matching algorithms that collate additional ambiguous variables into a sensible guess about user identity.
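A toy version of this idea combines several weak signals into a single score and merges only above a cutoff. Everything specific here is an assumption chosen for illustration: the field names, the weights in `WEIGHTS`, and the `MATCH_THRESHOLD` value. Real systems typically tune or learn these from data. The string comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Assumed weights for each signal; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {"name": 0.5, "ip": 0.3, "city": 0.2}
MATCH_THRESHOLD = 0.8  # assumed cutoff for treating two records as one person

def name_similarity(a, b):
    """Fuzzy string similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted sum of a fuzzy name match and exact IP / city agreement."""
    score = WEIGHTS["name"] * name_similarity(rec_a["name"], rec_b["name"])
    score += WEIGHTS["ip"] * (rec_a["ip"] == rec_b["ip"])
    score += WEIGHTS["city"] * (rec_a["city"].lower() == rec_b["city"].lower())
    return score

def is_probable_match(rec_a, rec_b):
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD

# Hypothetical records from different sources.
a = {"name": "Jon Smith", "ip": "203.0.113.7", "city": "Austin"}
b = {"name": "John Smith", "ip": "203.0.113.7", "city": "austin"}
c = {"name": "Maria Garcia", "ip": "198.51.100.2", "city": "Denver"}

print(is_probable_match(a, b), is_probable_match(a, c))
```

Unlike the deterministic case, no single field proves the match; the decision rests on how much evidence accumulates.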

In many cases, the raw data needed to effectively use probabilistic matching comes from a third-party vendor. Third-party data typically arrives in large volumes, which makes guesses based on data distribution easier and more reliable. Vendors like LiveRamp Identity or Softcrylic are brands you may be familiar with in this field.

Neither approach is necessarily better for identity resolution, but deterministic matching covers most use cases and is generally cheaper (because you already own the raw data). Deterministic matching is also the more conclusive approach, which matters when data security or precision is an important requirement.

For example, if your organization uses identity resolution to comply with privacy legislation, “accidentally” merging customer profiles based on probabilistic matching could cause irreparable harm. That being said, when handling large-scale identity needs, especially with anonymous data (which deterministic matching handles poorly), probabilistic solutions might be the better or even the only feasible approach. It all depends on which type of error is worse for your business: missing out on potential opportunities by relying on deterministic matching, or drawing the wrong conclusions and taking the wrong actions based on imperfect probabilistic matching.
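One way to make this trade-off concrete is to count both kinds of error at different match thresholds. The scored pairs below are invented purely for illustration, and the function name `evaluate` is hypothetical; in practice the labels would come from a manually reviewed sample.

```python
def evaluate(scored_pairs, threshold):
    """Return (false_merges, missed_matches) for a given match threshold.

    scored_pairs: list of (match_score, truly_same_person) tuples.
    """
    false_merges = sum(1 for score, same in scored_pairs if score >= threshold and not same)
    missed_matches = sum(1 for score, same in scored_pairs if score < threshold and same)
    return false_merges, missed_matches

# Made-up labeled sample: score from a probabilistic matcher, plus the ground truth.
pairs = [
    (0.95, True),
    (0.85, True),
    (0.80, False),
    (0.70, True),
    (0.60, False),
    (0.40, False),
]

strict = evaluate(pairs, threshold=0.90)  # few merges, so more missed matches
loose = evaluate(pairs, threshold=0.65)   # more merges, so more wrong ones

print("strict:", strict)
print("loose:", loose)
```

On this sample the strict threshold produces no false merges but misses two real matches, while the loose threshold catches every real match at the cost of one wrong merge. Which side to favor is the business decision described above.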

All customer information matters

The most basic function of identity resolution is filtering out duplicate data points from various devices and accounts — replacing anonymous device IDs with a human customer. This means the most fundamental data in the identity graph is the ID tag associated with the device, account, network, session, transaction, or other anonymous identifier that can engage with your company. Once you’ve collected these and associated them with a single customer identity where possible, your essential customer data becomes more reliable and you can move on to higher goals.

At this point, cookies, demographic information, geolocation, and other personal data become relevant. These are the details that will personalize your product outreach and facilitate the customer's journey. Most of these tidbits are warehoused and delivered to you by large identity resolution services, but it’s likely that you will also generate customer details in the course of your business relationship. All such details are valuable — with automation from a data warehouse or proprietary resolution software, ads and services can be updated in real time, supported by instant machine inference. Properly implemented, the customer relationship evolves in real time, resulting in more pleasing customer interactions and a more profitable revenue stream for the business.

Not only do the fine-grained details yield bounty for the individual customer, but the identity resolution process also serves to standardize and sanitize incoming customer data. This makes large-scale modeling and analysis much easier, giving you an unprecedented view of your entire audience. Especially important here is the efficient filtering of duplicate IDs. If buckets of redundant data can be reliably reduced to unique humans or households, any census of customers gains explanatory power.
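The standardizing and de-duplicating step can be illustrated with a small sketch that normalizes an identifier before grouping records. The record fields and function names are hypothetical, and lowercasing the whole email address is a simplifying assumption that suits most consumer data but is not guaranteed safe for every mail provider.

```python
from collections import defaultdict

def normalize_email(email):
    """Trim whitespace and lowercase so trivially different spellings compare equal."""
    return email.strip().lower()

def group_by_customer(records):
    """Map each normalized email to the list of source systems where it appeared."""
    customers = defaultdict(list)
    for record in records:
        customers[normalize_email(record["email"])].append(record["source"])
    return dict(customers)

# Hypothetical rows collected from several systems.
records = [
    {"email": "Ana@Example.com", "source": "crm"},
    {"email": " ana@example.com ", "source": "email_tool"},
    {"email": "ben@example.com", "source": "crm"},
    {"email": "ANA@EXAMPLE.COM", "source": "web_signup"},
]

customers = group_by_customer(records)
print(len(records), "records ->", len(customers), "unique customers")
```

Four raw rows collapse to two customers, so any count or average computed afterwards describes people rather than records.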

This kind of de-duplication is not limited to activity in your warehouse. Properly implemented identity resolution can enable your organization to clean data in source systems automatically. This has the benefit of preventing issues where one customer receives the same (or sometimes even worse, a set of different) marketing or sales emails because of a duplicate entry in an email tool or CRM system.

There is a snowballing effect with identity resolution: the more customer data is available, the more precise your product and outreach will be. This builds customer trust and deepens your relationship, in turn generating more customer data. Therefore, in the end, almost all customer details become relevant to the project of identity resolution.

Which product should I use for identity resolution?

The specific details of identity resolution can vary greatly from company to company. In the first place, you must consider how much customer data your firm has exclusive access to. Data is gold, and you may not want to share it with a third-party system, preferring instead to run identity resolution in house with an integrated data warehouse or a similar system. The general rule of thumb is to begin with deterministic matching, and then explore other options if your system has sufficient volume.

On the other hand, if you don't have sufficient data to generate demographic insights or filter redundancies, you should certainly look at third-party support. As identity resolution increases in power with more data, these services allow you to maximize your leverage in the data marketplace. Their expansive profiles can quickly render relevant demographic information when your own might be scarce.

Another consideration is data privacy law, such as GDPR and CCPA. Many jurisdictions already regulate personal data, and further laws are likely as the internet becomes ever more a part of our lives. Whether you handle identity resolution within your own company or contract it out, privacy must be respected, and tools should be flexible enough to accommodate changes to the scope of privacy regulations.

Identity resolution — a powerful tool for a digital age

Like Archimedes with his lever long enough to move the world, we have access today to unprecedented market leverage in the form of identity resolution. Through careful organization of existing data, we can apply a real-time microscope to our commerce and understand both the small- and large-scale implications of every customer interaction. Thanks to modern data processing systems, from enterprise cloud data warehouses to customer data platforms and beyond, identity resolution lets us not only understand individual customer interactions but also react to them, often in real time, in a way that is tailored to each customer. This leads to a better customer experience and, in turn, increased retention.

If you want to dive deeper into the cutting edge of data commerce, our learning center helps you unpack related concepts.
