December 31, 2020
Data control is more important than ever, especially as it relates to customer data. It’s increasingly discussed by companies, often at the C-suite and board levels, and the issue is top of mind for consumers. In fact, data control is such an important topic that some companies have built their entire value proposition on top of it. But what exactly does it mean to have control over your customer data?
According to Merriam-Webster, to have control over something means to exercise restraining or directing influence over it, or to have power over it. So, what is data control? Data control comprises three integral aspects:
- Data access aperture
- Data security control
- Data privacy control
By understanding and combining these pieces, organizations can tailor a comprehensive data control plan to their needs and exercise proper controls over both data access and use. We’ll look at each aspect individually first, then we’ll examine how they combine to help companies properly control their customer data. To illustrate our point, we’ll consider a concrete example. Imagine a company stores its customer data (specifically, the clickstream data from its app) in four different locations:
- Google Analytics
- Snowflake
- AWS S3
- Their own data center
Below, we’ll illustrate how each of these storage locations provides a different level of access, security, and privacy control.
Data access aperture
Data access aperture defines the various methods you can use to access and work with your datasets.
Consider the first three examples of data storage above: Google Analytics, Snowflake, and S3. In all of these cases, you’re storing the data with a cloud provider, but the level of control over how you interact with the data varies significantly.
The data stored in Google Analytics is only accessible through the dashboards that Google provides. So, you’re out of luck if you want something outside of the reporting offered by Google Analytics. An analysis that would take a few SQL commands in a data warehouse is simply unavailable (unless you want to pay Google a ridiculous amount of money for a dump of the raw data).
When you store the data in a Snowflake warehouse, you can interact with it via complex SQL. As a result, you can leverage the elastic computing power of a modern warehouse for advanced analysis. On the other hand, if you want to run a Spark job, you’re out of luck (or more accurately, it would be relatively inefficient and costly to achieve via Snowflake).
In this example, S3 provides the widest aperture of access. Not only can you connect S3 to BI tools like Tableau or Looker and build analyses in SQL, but you can also load the data into Spark or your applications.
This example also makes it clear that exposure of the data to other users and information systems is, on some level, a function of access aperture: the wider the aperture, the more ways there are to expose the data.
Data access aperture also defines how much control you have over data portability. Data portability simply means how easy (or difficult) it is to take your data elsewhere, say, to a different storage or SaaS system. In our example, S3 provides the highest portability support. You can easily move data from S3 to Google Cloud, for example; all you have to do is pay the network transfer cost. At the other end of the spectrum, Google Analytics does not provide direct access to the raw event stream data, so this data is not portable.
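The comparison above can be written down as a small data structure. The following is a minimal sketch in Python, not a vendor feature matrix: the access methods listed for each store are only the ones this post mentions, and the names `ACCESS_METHODS`, `aperture`, and `stores_supporting` are made up for illustration.

```python
# Illustrative model only: the access methods per store are the ones named in
# this post, not a complete or authoritative list of vendor capabilities.
ACCESS_METHODS = {
    "Google Analytics": {"dashboards"},
    "Snowflake": {"sql"},
    "AWS S3": {"sql", "bi_tools", "spark", "applications"},
}


def aperture(store: str) -> int:
    """Crude aperture score: the number of distinct ways to reach the data."""
    return len(ACCESS_METHODS[store])


def stores_supporting(method: str) -> list[str]:
    """Return the stores (sorted by name) that expose a given access method."""
    return sorted(s for s, methods in ACCESS_METHODS.items() if method in methods)


# The store with the widest aperture under this simplified model.
widest = max(ACCESS_METHODS, key=aperture)

if __name__ == "__main__":
    print("Widest aperture:", widest)
    print("Stores you can run Spark against:", stores_supporting("spark"))
```

Counting access methods is a deliberately crude measure. In practice you would weigh each method by how much your teams rely on it, but even this simple view makes the gap between a dashboard-only tool and raw object storage visible.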
Data security control
Data security control means having visibility and control over all access to the data.
Security is a crucial aspect of data control, as evidenced by repeated headlines about data breaches in the past few years.
When you store your data with any cloud provider, whether it’s a SaaS application like Google Analytics or an infrastructure service like S3, you give up some security control. You’re limited to the access control policies supported by the tools housing your data. For example, Google Analytics only supports user-based access, while S3 supports both user-based and Identity and Access Management (IAM) based access. Furthermore, when you trust a cloud provider with your data, you’re also trusting them to implement their policies effectively.
On the other hand, when you store data in your own data center you can set up arbitrary security control policies. But remember, how effectively you can enforce these controls depends on the sophistication and bandwidth of your IT security teams.
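To make the IAM-based access mentioned for S3 more concrete, here is a minimal sketch in Python that builds a read-only IAM policy document scoped to a single prefix in a bucket. The bucket name, the prefix, and the function name `read_only_policy` are hypothetical; the policy structure follows the standard IAM JSON policy format, but treat it as a starting point under those assumptions rather than a recommended production policy.

```python
import json

# Hypothetical names, for illustration only.
BUCKET = "example-clickstream-bucket"
PREFIX = "events/"


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document that allows listing and reading
    objects under one prefix of one bucket, and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, narrowed to the prefix.
                "Sid": "ListEventsPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is granted on the objects under the prefix.
                "Sid": "ReadEventObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_policy(BUCKET, PREFIX), indent=2))
```

A policy like this would be attached to a role or user rather than shared as a login, which is the practical difference from purely user-based access: the grant is expressed as data you can review, version, and audit.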
It’s important to carefully consider what data you entrust to third-party SaaS vendors and do your research to understand their security practices, especially when it comes to sensitive data and PII. The more comprehensive and critical the data is, the more stringent your security practices and workflows should be.
Data security control needs vary by industry and company. For example, a consumer gaming startup doesn’t collect much PII on its users and likely has a small user base, meaning it can remain secure without significant effort. On the other hand, a big bank or a healthcare company might not even be comfortable storing data with a cloud provider like AWS.
Data privacy control
The relationship between aperture and exposure necessitates the third aspect of data control: data privacy control. Data privacy refers specifically to how data is used and what it is used for. While there is a close relationship between data privacy control and data security control, there is a clear distinction. Security control manages who has access to the data. Privacy control considers what the data is used for and whether or not that usage meets the end user’s expectation of privacy and/or complies with legal standards such as GDPR or CCPA.
Let’s go back to our example from above. Google Analytics, Snowflake, and S3 are all cloud providers. This means someone inside those companies technically has access to your raw data, but what they can potentially do (or are already doing) with that data varies.
For example, at least some employees at Google can likely look at the same visualizations and metrics you have set up in your Google Analytics account (the system automatically generates them from the clickstream data). More likely than not, Google uses that data to create a user profile and uses it for marketing purposes. The fact that you anonymize the data before sending it to Google doesn’t help much if the end user is logged into some Google property like Gmail. Google can still target the users with cookies. Cookie targeting is being addressed by many platforms, but it will continue on in some form.
In the case of Snowflake and S3, their employees probably have some level of access to your raw data, but for them to make sense of clickstream data, there would need to be some reverse engineering to understand event semantics, etc. Furthermore, they don’t have any way of tying data across customers and creating a user profile the way Google can. Also, in S3, you can store the data encrypted, which would make it even harder for someone inside of AWS to reverse engineer: they would need to understand your app, get the keys, and then make sense of the data.
This spectrum highlights the challenge of balancing aperture and exposure. It’s clear that, from a vendor standpoint, S3 provides far more data privacy control than Google Analytics. The practical challenge for every business is that teams across the organization need tooling to do their jobs, and that tooling requires access to data, but, as we’ve seen, the level of data control afforded by these tools varies significantly. Luckily, there are an increasing number of options for ‘fully owned’ analytics and data tooling.
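As a rough picture of what anonymizing events before they leave your systems can look like, here is a minimal sketch in Python that replaces the user identifier with a keyed hash and drops obvious PII fields. The field names, the `PII_FIELDS` set, and both function names are assumptions made for this example, and the key is one you would hold yourself. As noted above, this kind of scrubbing does nothing about cookie-based correlation on the vendor’s side.

```python
import hashlib
import hmac

# Hypothetical field names; adjust to whatever your events actually carry.
PII_FIELDS = {"email", "ip_address"}


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a user ID with a keyed SHA-256 hash (HMAC).

    The same ID and key always give the same token, so events can still be
    grouped per user, but the token cannot be recomputed without the key.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_event(event: dict, secret_key: bytes) -> dict:
    """Return a copy of the event with PII fields dropped and the
    user ID replaced by its pseudonym."""
    cleaned = {k: v for k, v in event.items() if k not in PII_FIELDS}
    cleaned["user_id"] = pseudonymize(event["user_id"], secret_key)
    return cleaned


if __name__ == "__main__":
    key = b"replace-with-a-secret-you-control"  # placeholder, not a real key
    raw = {
        "user_id": "user-123",
        "email": "jane@example.com",
        "ip_address": "203.0.113.7",
        "event": "button_click",
    }
    print(scrub_event(raw, key))
```

A keyed hash is used instead of a plain hash so that someone holding only the stored events cannot confirm a guess at a user ID by hashing it themselves; they would also need the key, which stays on your side.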
Balancing data control requirements
When companies consider data control as a topic, they tend to focus on one or two of the aspects mentioned above, depending on their industry, stage, and even company culture.
At a bank, the security control aspect is paramount. At a fast-growing consumer startup with a sophisticated data science team, achieving more systems exposure is critical.
Data privacy control has become an exciting topic over the last few years. While it is heavily influenced by industry regulations, there is also significant growth in the number of developers who are passionate about fully protecting customer data and refuse to send any data to vendors like Google. This awareness also comes from company culture: several of our customers at RudderStack have strict requirements that customer data stay within their VPCs.
There is no prescription for data control that fits every company. Differing needs, beliefs, and regulations dictate placing differing levels of emphasis on each of the three aspects of data control. It’s the job of data leadership (and the data engineers who work with them) to navigate and manage this complexity.
We built RudderStack to help simplify data control
RudderStack’s foundational product design decisions focus on privacy and security to help companies take back control of their data. Our flexible, HIPAA-compliant tool provides data pipelines for data collection from every data source and data activation in every tool. Features like our data governance API and data transformations help you collect and activate data while meeting data management objectives, so you can confidently control your data across all three aspects of data control.
Founder and CEO of RudderStack