Make your downstream analytics low-maintenance and scalable with standardized schemas and dbt modeling
In this webinar, Eric Nelson, data engineer and analyst at Mattermost, shares a better way to clean and combine your customer event data from every source. Traditional methods involve constantly updating SQL or building inside of BI tools, which silos data.
Eric’s method uses RudderStack’s standardized event schemas and advanced dbt modeling to automatically generate clean tables for detailed clickstream reporting. The best part is that the dbt models are designed to automatically process and include new schemas as they show up in the event stream.
Here's what Eric will cover:
The problem: customer data is messy
How companies solve this and why traditional methods are sub-optimal
A better way: standardized RudderStack schemas and data modeling with dbt
Live model walkthrough
Eric leads growth at RudderStack and has a long history of helping companies architect customer data stacks to use their data to grow.
Scaling analytics infrastructure and data pipelines @ Mattermost.
Eric Dodds (00:00)
Welcome everyone to live webinar. We have Eric Nelson from Mattermost today, and we're going to dig into some pretty gutsy stuff related to dbt and managing schemas and all sorts of interesting stuff, which I'm super excited about. We'll do a couple of intros first if you want to go to the next slide. It might be the agenda.
Eric Nelson (00:32)
Yeah. It's got the agenda here, but we can give a few intros as well.
Eric Dodds (00:37)
Yeah. I think the intro slide is after this one. What we're going to talk about today is really how you manage tables that drive critical analytics related to customers. By drive tables, I mean the tables that you actually use as the foundation for building your BI, which ultimately turns into all sorts of reporting across the organization. Anyone who has worked with customer data knows that it's extremely messy, and there are a number of ways to solve for this. The reason we wanted to have Eric do a webinar with us was we've just noticed that he and the team at Mattermost have done some really incredible things around leveraging dbt at scale in a way that keeps things low maintenance. I think that's really important for companies, especially when they're in a growth phase or dealing with an enterprise-level volume of events. Mattermost collects a huge amount of events. Having a low-maintenance way to deliver rich analytics is pretty great, so that's what we're going to look at today.
If you want to click to the next slide, we can do some quick intros. I'll do mine first since it's really quick. I'm Eric. I run Growth at RudderStack and love digging into the technical side of things. Eric Nelson, do you want to give us a little bit of your background? Where were you before Mattermost? How did you get into data engineering and analytics, et cetera?
Eric Nelson (02:25)
Yeah. Sure. I started my career at Salesforce, and then moved over to Heroku, which is a Salesforce company. From there, I transitioned over to Mattermost, but basically it was a slow burn. Since my degree was in management information systems, I've always been heavily involved in database management and data engineering, so it was more or less honing my skills. It was a slow progression, and Heroku is where I started ramping up, doing a lot more collaborative work using open source, Git, the command line, and all of that. Shout out to Alex Dovenmuehle, who was my mentor as a data engineer. I'm more of a hybrid analytics engineer, so analyst and data engineer. I followed Alex over from Heroku to Mattermost, where we set up a whole analytics infrastructure.
Eric Dodds (03:22)
Very cool. All right. Let's dig in. Let's talk about messy customer data. This is a huge problem, even for companies that have really strictly defined tracking plans and data governance protocols in place across teams. Stuff just changes, especially when you're moving really quickly. Let's talk a little bit about why this is, if you want to click to the next slide. I can talk through a couple of these points, and then I would just love to hear about your experience across the companies that you've worked at. One of the reasons is that customer data is never static. Marketing, product, et cetera are constantly trying to optimize, which means that the actual interfaces that customers are using, i.e. websites, apps, whatever, are constantly changing, which means that the data they produce that represents customer behavior is also changing. That's just very challenging to keep track of, especially when you're trying to go through really fast iteration cycles. The reality is that the ops team, data engineering team, and analysts have to be able to iterate very quickly to react to that. Could you give us just a couple of specific examples of how customer data has changed at Heroku and Mattermost?
Eric Nelson (04:54)
Yeah. Especially at Mattermost, I think, because of the way that the product is designed, there's a lot of functionality and a lot of enhancements added constantly, given a monthly release cycle for the on-prem offering and a bimonthly or biweekly release cycle for the cloud offering. Really, what happens is there are a lot of new features and a lot of new things that we're tracking. That just produces a new column or a new property in an existing table, which is the base raw table. Then we have a transformed table that needs to be surfaced in BI tools. When that new column is added, there's no automated way, or there hasn't been in the past, to just incorporate that property and ensure that all of this data is bundled together, especially because of the way our product functions: we have a web app interface, a desktop app, and a mobile app, and all of those data sources go to different areas within our raw database. That causes a lot of issues trying to blend those schemas and that clickstream usage data together.
Eric Dodds (06:01)
Sure. If you think about the example that you just gave, where you have three different app infrastructures, those are running three different sets of code and a bunch of different platform-specific SDKs, and all of that makes data governance very hard. There's a point, and we want to say this very explicitly, whether you're doing your tracking plan in a Google Sheet or some sort of tool or whatever, where the amount of work it takes to align the payload structure across teams and across platforms sometimes doesn't actually benefit you as much as getting the feature out there and seeing if it works, seeing if it actually helps optimize the user experience. That's a challenge that every company faces. Now, to some extent, data governance tooling, I think, still has a long way to go to manage some of the complexities across schemas and across platforms, but it's hard because a lot of times the data engineer and analyst roles also have to pinch-hit on the data governance piece just to keep their jobs manageable, and a lot of times there isn't clear ownership. Has that been your experience?
Eric Nelson (07:19)
Yeah. I think, being a company that rapidly hires engineers as well, there is no clearly articulated line of ownership. It's a lot of ebb and flow, a lot of new hires are coming in and contributing, and being open source, we also have so many outside community contributions that things are changing constantly. Sometimes we even have hackathons where people test our infrastructure, and that produces properties in our raw data warehouse. All of these things work together to produce a lot of what would be intensive data and analytics engineering work.
Eric Dodds (07:57)
Sure. All right. Let's talk about some of the traditional ways that companies solve this, I think is the next section in the slide deck.
Eric Nelson (08:08)
Yeah. Let me move that for you. Historically, for me, I guess I'll just jump into this, a full data engineer is dedicated to SQL and the DDL aspect of things: making sure that all of these table definitions are clearly defined, that all of their columns are in there, and adding any new information. Sometimes there's latency between additions to the raw warehouse and actually managing the transformed one, because you have to go through a data engineering team or other people in order to get those things added, and they have a backlog of requests, so doing things dynamically just isn't really possible. Then I'll let you speak to the next two.
Eric Dodds (08:56)
Yeah. One thing that we see a lot is you can do this in a BI tool. I think the reason that throwing lots of SQL work at it is very, very common is that it gives you more flexibility. It's really hard to scale that, but once you've actually done the work, you can do way more with the data because you have a set of tables that you can use in a variety of ways throughout the organization. It sort of gets at the whole self-serve analytics thing as a foundational pillar there, but there's just a ton of work there, and it also requires a skill set that's in very short supply. Someone who has your skill set around data engineering, analytics engineering, et cetera, doing that at the SQL level, is very time intensive, and it's really hard to scale that from a skill set standpoint, but it can be way easier to do some of those things in a downstream BI tool.
You have all this raw data, you solve for all of the edge cases and all the cleanup in the BI tool, but the problem there is that all of that work is very hard to share across the organization. It just lives in the BI tool. You can't really action it. People get access to it, so it still kind of gets at the self-serve analytics thing, but it's far less flexible because essentially you have a set of reports or the ability to build reports based on this, but to actually do anything else with the data requires a lot of manual work. Exporting data, that can be a huge amount of data. There's just all sorts of challenges there. The insights that come from the foundational data aren't really shareable outside of what you can derive from the reports. It's interesting.
Then as simple as it sounds, a lot of people just throw a ton of headcount at it. Whether that's on the data engineering side, where you have a ton of people hammering out SQL or doing constant cleanup projects, or just hiring a ton of analysts who hammer on the BI side of things. At some point, that's just not cost effective. We say okay, for a company of our size, having four analysts just try to keep our basic reporting clean doesn't really make sense. We need to be answering [inaudible 00:11:21] questions. Those are all very common solutions and very understandable, but what we're here to talk about today is a better way to do this with the underlying structure. Enlighten us.
Eric Nelson (11:40)
All right. The better way that I've identified after having instrumented RudderStack throughout our web properties and throughout our product itself is really creating a set of standardized schemas that clearly articulate their purpose. We have product schemas, as well as release candidate and quality assurance or dev schemas for testing so that we're able to easily identify our targets and which tables need to be blended for which purpose, because obviously we still surface a lot of our quality assurance data when devs are testing to make sure that the properties are being collected properly and that we're collecting all of the data that we're intending to and that the events are firing properly.
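To make "easily identify our targets" concrete, here is a minimal sketch of how schemas could be grouped by a naming convention so that only the right ones feed each master table. The schema names and the `_prod`/`_rc`/`_qa` suffix convention are hypothetical illustrations, not Mattermost's actual naming scheme:

```python
import re

# Hypothetical warehouse schema names following a <source>_<env> convention.
SCHEMAS = ["mm_telemetry_prod", "mm_telemetry_rc", "mm_telemetry_qa", "portal_prod"]

def schemas_for(env, schemas=SCHEMAS):
    """Return only the schemas whose name ends with the given environment suffix."""
    pattern = re.compile(rf"_{env}$")
    return [s for s in schemas if pattern.search(s)]

print(schemas_for("prod"))  # schemas that feed the production master table
print(schemas_for("qa"))    # QA schemas surfaced separately for dev testing
```

With a convention like this, the same selection logic serves both purposes Eric describes: production schemas feed the master analytics tables, while QA and release-candidate schemas can still be surfaced for verifying that events fire correctly.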
What we do once we have those standardized schemas is really leverage dbt and the macros that you're able to build within dbt to create low-code master user analytics tables for various purposes, whether it's website interactions or our product usage. We have several master user analytics tables so that we can surface those insights, that blended clickstream or event data, in a single table in our [inaudible 00:13:03], where we do the majority of our dashboarding and reporting. Really, the benefit of data modeling with dbt and using these macros is that it's low code. It's modular, it's customizable. You have your list of schemas, you know your targets, you input your variables for table inclusions, schema inclusions, and exclusions, and from there it more or less allows you to blend all of these tables together without writing a massive union script that needs to be constantly updated and appended to when new properties are introduced.
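The core of a union macro like the one Eric describes (dbt-utils ships a similar `union_relations` macro) is taking the superset of columns across tables and padding each SELECT with NULLs where a column is missing. Here is a hedged sketch of that SQL-generation logic in Python; the table and column names are made up for illustration:

```python
def build_union_sql(tables):
    """Given {table_name: [column_names]}, emit a UNION ALL statement where
    every SELECT has the same column list, padding missing columns with NULL."""
    # Superset of columns, in first-seen order, so the output shape is stable.
    all_cols = []
    for cols in tables.values():
        for c in cols:
            if c not in all_cols:
                all_cols.append(c)
    selects = []
    for name, cols in tables.items():
        parts = [c if c in cols else f"NULL AS {c}" for c in all_cols]
        selects.append(f"SELECT {', '.join(parts)} FROM {name}")
    return "\nUNION ALL\n".join(selects)

# Hypothetical event tables from two sources with partially overlapping columns.
sql = build_union_sql({
    "web_app.page_viewed": ["user_id", "received_at", "page_url"],
    "mobile.page_viewed": ["user_id", "received_at", "screen_name"],
})
print(sql)
```

Because the column superset is recomputed every run, a new property showing up in any source table is picked up automatically, which is exactly why the model doesn't need to be hand-edited when the event stream changes.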
There's also other benefits where you're creating dummy columns or null values where properties don't exist for, say, one event type and they exist for another event type. It really allows for a simplified blending of all of this data into that single master table. Also, it allows you to account for things like more or less typos from devs who are working on one area of the product. Say they're missing an underscore in a column name and there's an underscore in the other column name for a separate prope