How Pachyderm Streamlines Lead Qualification with RudderStack Reverse ETL

What we will cover:

  • How data gets trapped in the warehouse
  • Lead qualification at Pachyderm
  • How was this happening before?
  • Using Reverse ETL to turn internal data into an event stream from the warehouse
  • Live demo of Reverse ETL
  • Q&A


Eric Dodds

Head of Product Marketing

Dan Baker

Technical Marketing Manager at Pachyderm, Inc.

Dan is a technology addict and automation specialist with over 20 years experience helping businesses realize the potential of their data.


Eric Dodds (00:00)

All right. We'll cover just a couple of housekeeping items. First of all, thanks for joining us today. We will be respectful of your time. So we'll breeze through this content and then leave plenty of time for Q&A. Just a couple of quick things, feel free to raise your hand by clicking the button in Zoom. And feel free to just type the questions in the Q&A as we go along, and then we can address all those at the end of the webinar. Also, if something's wrong or you can't see the screen, just feel free to post in the chat so we can see. And with that, I think we're ready. Are you ready, Dan?

Dan Baker (00:44)

I certainly am, Eric.

Eric Dodds (00:47)

All right. So here's just a quick overview of what we'll cover today. So we'll just do quick introductions, we'll do an overview of lead qualification at Pachyderm and the process there. So Dan will walk us through that. And then Dan and I will talk through the various ways that data gets trapped in silos in the stack and talk through some of the specific ways that that was happening at Pachyderm. And then we'll talk about a really cool tool called Reverse ETL from RudderStack that Pachyderm uses to streamline their lead qualification process. So with that, let's jump in. Dan, you're the star of the show, so I'm going to let you go first.

Dan Baker (01:36)

Thanks, Eric. So hi everyone, I'm Dan Baker. I work at Pachyderm. I'm kind of in marketing ops, like data ops, that kind of area. My role involves making sure everyone in every team internally within Pachyderm has the right data to do their job as effectively as they can. And I kind of came across RudderStack a while back while really interested in kind of consuming event data for various websites I was working on at the time. And it's proved to be a greatly successful product for where we're going with the Pachyderm customer data platform.

Eric Dodds (02:19)

Great. And just a little bit of detail on what Pachyderm does.

Dan Baker (02:25)

So probably the best description is Pachyderm's an enterprise-grade open-source data science platform that makes explainable and repeatable and scalable ML and AI a reality. We run a SaaS product as well as having an open-source version. And we have several quite large clients running on both open source and enterprise and our SaaS version.

Eric Dodds (02:56)

That's really cool. I can say firsthand, you should check out Pachyderm if you need tooling around your data science workflow. I'm Eric, I'm the Director of Customer Success at RudderStack. And I do a lot of work with many teams, but I have the privilege of working with customers like Pachyderm, and tons of other customers who are trying to build a data infrastructure. Quick overview on RudderStack, so RudderStack helps you easily build customer data pipelines. So we do that a number of ways in the stack. So event streaming, so from real-time use cases to other use cases with event streams. We also can pull structured data from cloud sources.

And then, what we'll talk about today, one of our most exciting features is Reverse ETL, which allows you to pull tables from your warehouse and turn them into an event stream. So that's RudderStack. So let's dive in, lead qualification at Pachyderm. So Dan, talk us through the challenges you were having in terms of the lead qualification process. And I think the context of what you were doing will be helpful. I know we have a lot of technical people in the audience, but the business case I think is really helpful.

Dan Baker (04:20)

So yeah, I think it's a pretty typical kind of business case: we've got a bunch of cloud services, with data being generated in those services via incoming leads or outgoing communication with those leads or any of those general marketing type processes and sales processes. And we were looking for a solution. We were having issues with getting aggregated data from all of those cloud sources back into specific cloud sources. So like in this case, we'd want to know metrics on our SaaS, our Pach Hub, and know stuff like total workspaces created. So this is like an internal value of how important a customer is. And in the same case, last workspace creation date, and account spend within the Hub.

And aside from that, we'd also want to know how engaged open-source users were in our Slack community and many, many more use cases for this that we haven't even explored yet, but we kind of know where we're going with this and know where our problems were. And until this point, we were spending our time looking around through Slack and through Hub database records, or Metabase, or those kinds of tools. And we were trying to onboard our own internal people into various tools that they weren't really comfortable with. So salespeople using Metabase isn't ideal and people using Slack as a source for understanding what our customers are doing really isn't ideal either. We need to combine that stuff in one place to give this unified view on a particular customer or lead. And so that was our issue. And that's how we came across what RudderStack was doing with their Reverse ETL and solved that by that solution.
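To make the metrics Dan lists concrete (total workspaces created, last workspace creation date, account spend), here is a minimal Python sketch of that kind of per-account aggregation. It uses an in-memory SQLite database as a stand-in for the warehouse. The `workspaces` table, its column names, and the sample rows are assumptions made for illustration and are not Pachyderm's actual Hub schema.

```python
import sqlite3


def build_account_metrics(rows):
    """Aggregate hypothetical workspace rows into one record per account.

    rows: iterable of (account_email, created_at ISO date string, spend).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workspaces (account_email TEXT, created_at TEXT, spend REAL)"
    )
    conn.executemany("INSERT INTO workspaces VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        """
        SELECT account_email,
               COUNT(*)        AS total_workspaces_created,
               MAX(created_at) AS last_workspace_created_at,
               SUM(spend)      AS account_spend
        FROM workspaces
        GROUP BY account_email
        ORDER BY account_email
        """
    )
    # Turn each result tuple into a dict keyed by column name.
    columns = [description[0] for description in cursor.description]
    metrics = [dict(zip(columns, record)) for record in cursor.fetchall()]
    conn.close()
    return metrics


# Made-up sample data, for illustration only.
sample_rows = [
    ("ada@example.com", "2021-03-01", 40.0),
    ("ada@example.com", "2021-04-15", 60.0),
    ("bob@example.com", "2021-02-10", 25.0),
]
account_metrics = build_account_metrics(sample_rows)
```

A query of this shape, saved as a view in the warehouse, is the sort of aggregated per-customer record that the rest of the discussion is about getting back into sales and marketing tools.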

Eric Dodds (06:33)

Great. And just a quick question on that, your experience from the sort of... You have a wide purview because you work on the data and the ops side of things, what were the conversations like internally as people were feeling the pain of this, and did that translate into a lot of requests for data from your team?

Dan Baker (06:55)

Yeah, I think there was just an internal lack of understanding, really. And it became obvious that we needed to combine this stuff in one form. And what I was seeing was that yeah, we could go down the dashboard route and we could give people all of this data in one place, but it was more about certain teams were happy with the tools that they were using. So marketing was happy with using HubSpot, sales were happy using Salesforce, neither wanted to switch to each other. And it became more about let's get that aggregated data or that unified view of data back into those tools so that people can see a true view of that customer at that point. And so yeah, I guess it was more about people not really wanting to budge from where they're at because it's just outside their normal workflow.

Eric Dodds (07:48)

Sure. That is really helpful context. So Dan and I thought it'd be good to quickly run through before we get to the specific use case with Reverse ETL, and Dan can talk through their stack, which is really cool. I thought we'd just quickly go through how data gets siloed in the stack. This is something I think every company experiences on some level. So I'll just go through a quick overview. And then Dan, maybe you can talk about the dynamic of this at Pachyderm. So this is something that I think is really common at a lot of companies: data trapped in cloud tools. So you have data flowing from websites, apps, internal systems, et cetera. And usually, that lives in some sort of cloud toolset, right? So as Dan was saying, maybe marketing is using a tool like HubSpot or Marketo, sales are using Salesforce. And then the marketing team is using Google Analytics and Google Ads.

And what ends up happening is you have each of those driving reporting in different ways. And then of course that creates misalignment because not everyone is using the exact same definitions and the reporting is different and you have all these challenges. And then when data is trapped in cloud tools and you have that misalignment, and it's hard to do the analysis to your point, Dan, I think a lot of companies end up in a place where you're just trying to get data out of a lot of those systems. And as crazy as it sounds, a lot of it just ends up in spreadsheets because you have to make decisions, you have to hit quarterly numbers. But that's not scalable, especially with the amount of data that many companies are collecting. Can you talk about just briefly a couple of examples of what were the problems caused by data being trapped in cloud tools?

Dan Baker (09:38)

Yeah. As you quite rightly said Eric, the obvious go-to at this point is that you pull data directly from these sources, these tools, into spreadsheets, like Google Sheets is a great source for that kind of stuff. And we all know there are integrations between all of these tools and Google Sheets, so we can get data out and we can present it to the right people, but it doesn't allow the joining of that data, which is a key part from our perspective. And it was more about... It's kind of a twofold approach, right? So people want to use the tool they're already using, as I mentioned previously, whether it's HubSpot, Salesforce, whatever tool, whatever everyone's used to using.

And that might be in some cases, it might be people that are just used to using Google Sheets. So let's not discount that as a great way of presenting data. But you want to get that data. You not only want the data from those sources, just as a direct export, but you also want to be able to join that stuff and give people that more true view of that picture. And that's exactly what we were seeing at Pachyderm. When it comes to our Hub system, our SaaS product, that database was more closed off in the sense that it was just a Postgres database and a lot of people don't have access to that. And there weren't the tools in place just for direct exports into Google Sheets. And we needed that data in all those other tools.

Eric Dodds (11:16)

Sure. Well, so let's say you get the data unified. So as Dan and I were talking before the webinar, we said, "Okay well, one of the great things about modern stacks is you can solve that problem by either sending data directly from your websites and apps to the warehouse, which is really common, but then also pulling data from those tools into your warehouse." So now you have unified data and you can do the analysis. So Dan, I would love to know what that process is like for you. But then also you were talking about how even though that was really helpful, you still face the problem of teams wanting that data in their own tools. Which even though you've unified it and solved the problem of data being trapped in cloud tools, you get it out of there and into the warehouse. But then the analysis is sort of trapped in the warehouse, which is its own unique problem.

Dan Baker (12:16)

Yeah. It's exactly as you said there, Eric. Moreover, the more tools you put in front of any kind of team, the less they're going to get used. People are comfortable with one, possibly two, possibly three tools that they use on a daily basis. But the more stuff we chuck at them, the less it's going to get used. And we see this with an awful lot of companies. We see this with dashboards: they're great from a high-level overview, but they don't really get used in the way that you would expect on a daily basis by most of the internal teams.

And so from our perspective, we were trying to get that data not only to be visible outside of those dashboards but more to be actionable outside of those dashboards, within marketing drips, within Salesforce reports so that we can see the status of a lead, how active they are, what they're doing, essentially lead score, based on our own internal metrics to build a bigger picture of that. And that was definitely an issue, and the idea of having that data in the database is great and the dashboards are great, but it feels like where we were at maybe 10 years ago. Maybe not 10 years, maybe five years ago, that was kind of the state of play as a standard thing five years ago. And I think we need to move on from that and be in a place where that data is now freely available in all tools that need that data.

Eric Dodds (13:43)

Sure. Yeah, it's interesting. We've heard lots of companies talk about how unified business intelligence from the warehouse is really powerful because it can really change decision-making in a powerful way. But it's hard to use that to drive the actual tactical customer journey. Like you said, triggering marketing emails or helping SDRs prioritize. Well, let's talk about how you broke those data silos at Pachyderm. So we'll talk about both of those use cases because you had data trapped in the cloud tools and then data trapped, or analysis trapped in the warehouse. So talk through how you use different tools to get the data A, out of the cloud silos, and then how you got it from the warehouse back into the tools that teams were using.

Dan Baker (14:43)

Yeah. So I kind of looked at this as being a problem from the get-go really, we were working on creating a data platform for Pachyderm and I knew that this was ultimately going to be a problem that I needed to solve. And when choosing tools to implement as part of our customer data platform, that was a key consideration. How are we going to use this data rather than it just being sat within our warehouse or within our BI tool? That was a key part of it. And so from a kind of data source perspective, like we were having issues with people not really understanding Google Analytics, HubSpot wasn't being really used to its full potential because people didn't know how to access that stuff inside HubSpot. And the same applies to all of those other tools.

And so as I mentioned, the key part of that was how do we get that data on one side, we want that in a dashboard so that people can see that stuff. And that works to a degree and that was working for us to a degree, but we needed that more aggregated view on that data, so total numbers of things, and cost analysis and that kind of stuff. And so the key part was really piping that data back into... Processing out of the warehouse, storing it, materializing back to the warehouse using the tool that we were using Sigma, and then processing that back through with Reverse ETL to solve that problem that we were currently under.

Eric Dodds (16:26)

Yeah. And just give us a quick overview of what does it look like to use Reverse ETL? I'll do a quick demo, but we'd just love to hear from your perspective. I mean, what is it like to use that day to day? So you have some sort of analysis in the warehouse that you've derived using Sigma and then what does it look like if you want to take that and push it back to the stack?

Dan Baker (16:47)

Yeah so, it's incredibly simple. We imagined this was going to be way more difficult and I architected a bunch of solutions that were going around the houses or as I like to say, sticky tape solutions that would almost work or do what we needed. And when RudderStack came to us with this feature, this saved the day really. It's so easy, we just turn it on, we point it to a view within our database that Sigma already generates. It writes that straight back out to the database. We process data in Sigma, we store that back to BigQuery, and RudderStack is able to read directly from BigQuery that view, which is essentially just an SQL query. And it pipes that data back into whatever destinations we're choosing in the same way that we were already used to doing with RudderStack in terms of connecting sources and destinations.

We just have a small config level that lets us map fields. So we can say this field from the SQL result maps into this field, in this destination. And we can pipe that through to any destination that we like within RudderStack within reason, but all the ones that we're looking at. And it just unlocks that process in a really simple way, it lets us do that on a scheduled basis. So we can choose a time frequency at which we want to send that data back in. And our stuff is pretty flexible, we're going maybe hourly in some cases and daily in other cases for this data because it's not critical at that point. But it gives us that true flexibility to use a tool that we're already used to and in a very simple way.
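The field mapping and scheduling Dan describes can be pictured with a short Python sketch. This is not RudderStack's configuration format or API. The mapping dictionary, the destination field names, and both helper functions are hypothetical, and only illustrate the idea of renaming columns from a SQL result into destination fields and re-running the sync on a fixed interval.

```python
from datetime import datetime, timedelta

# Hypothetical mapping: warehouse view column -> destination (CRM) field.
FIELD_MAPPING = {
    "total_workspaces_created": "hub_total_workspaces",
    "last_workspace_created_at": "hub_last_workspace_date",
    "account_spend": "hub_account_spend",
}


def map_row(row, mapping):
    """Rename the columns of one query-result row to destination field names.

    Columns that have no entry in the mapping are dropped.
    """
    return {dest: row[src] for src, dest in mapping.items() if src in row}


def sync_is_due(last_run, now, interval):
    """Return True if a scheduled sync should run (first run or interval elapsed)."""
    return last_run is None or now - last_run >= interval


row = {
    "account_email": "ada@example.com",
    "total_workspaces_created": 2,
    "last_workspace_created_at": "2021-04-15",
    "account_spend": 100.0,
}
mapped = map_row(row, FIELD_MAPPING)

# Hourly schedule: one hour after the last run, the sync is due again.
due = sync_is_due(
    last_run=datetime(2021, 5, 1, 9, 0),
    now=datetime(2021, 5, 1, 10, 0),
    interval=timedelta(hours=1),
)
```

In the product itself both steps are configured in the UI, as the demo later in the webinar shows. The sketch only spells out what that configuration expresses.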

Eric Dodds (18:39)

Yeah. We'll take a look at that in a second. One quick side note, you had mentioned this and I just thought we have to include it because I thought it was such a good point, when we were talking earlier this week, you said data validation was a big piece for you when you started to get all of the data into the warehouse and analyze it. Could you talk through the disparities that you saw using cloud tools and what that looked like when you actually unified it and then started leveraging Sigma to do deeper analysis on the raw data itself?

Dan Baker (19:13)

Yeah, we're in a strange position where I think anyone who's just using these kinds of cloud tools is blindly accepting the data that they're presented in the admin panels or the metrics panels for those tools without any question. And one thing that we were suspicious of, and a lot of people seem to be suspicious of this is the ad traffic data, PPC data. We would spend a large budget on PPC campaigns and we'd see a sizable number of impressions and clicks coming through to our Hub product. But we were suspicious of whether those were real clicks, whether they were bots, or whether the metrics were being inflated.

So all along, we were thinking well we can go back to the old school, weblog type option and try and track this down. And a lot of people online talk about doing that route and you can claim money back from various providers and that kind of thing. But by this point, we'd already implemented RudderStack's event tracking on all of our sites. So aside from the standard metrics that we were getting from Google Analytics, we had page tracking metrics from RudderStack directly being dumped in our warehouse. And we were suddenly able to access those and compare those to the Google Ad metrics and the Google Analytics metrics and see that there was a huge disparity in the values there, some were in the region of 10X difference between what Google was reporting to us and what we were actually seeing on our site.

And so we're now looking at building up a lot of this stuff and seeing what we can do in terms of reclaiming some of that budget back from those tools. And so it just highlights to me the blindness of our belief in the data that we're being told by these cloud sources. And I think one, if we can get that data straight out of a cloud source, we can at least examine that for ourselves, rather than looking at the way in which they've presented that in the dashboard. And two, if we're able to double-check that by using some of our event tracking or some other way of providing a data source for that same stuff, then it gives us a way of doing that directly within the warehouse and within the BI and that sort of thing.
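The validation Dan describes amounts to comparing what an ad platform reports against what your own event tracking recorded for the same campaigns. A minimal Python sketch of that comparison follows. The campaign names and all numbers are made up for illustration, and the function is hypothetical rather than anything Pachyderm or RudderStack ships.

```python
def click_disparity(reported_clicks, tracked_visits):
    """Compare ad-platform reported clicks with first-party tracked visits.

    Both arguments map campaign name -> count. The ratio is reported / tracked,
    or None when no tracked visits exist for a campaign.
    """
    report = {}
    for campaign, reported in reported_clicks.items():
        tracked = tracked_visits.get(campaign, 0)
        ratio = reported / tracked if tracked else None
        report[campaign] = {
            "reported": reported,
            "tracked": tracked,
            "ratio": ratio,
        }
    return report


# Illustrative numbers only, not real campaign data.
reported = {"brand": 1000, "docs": 300, "retargeting": 50}
tracked = {"brand": 100, "docs": 250}

disparity = click_disparity(reported, tracked)

# Campaigns where the platform reports at least 5x more than was tracked.
suspicious = sorted(
    name
    for name, entry in disparity.items()
    if entry["ratio"] is not None and entry["ratio"] >= 5
)
```

A ratio near 1 suggests the two sources roughly agree. A ratio around 10, like the disparity Dan mentions, is the kind of gap worth investigating.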

Eric Dodds (21:46)

Great. Well, I know that was slightly off-topic, but I thought you explained that so well, I wanted to make sure we covered that. Well before we jump into a quick live demo, because I want to leave plenty of time for Q&A, here's the Pachyderm stack. And first I have to say it was really cool just seeing this together. I think you've really architected a really cool stack here. Do you just want to give us a quick walkthrough here and I want to make sure everyone has a chance to see it, so I'll leave it up on the screen so they can ask any questions, but do you just want to give us a quick walkthrough here, Dan?

Dan Baker (22:19)

Certainly, certainly. We're pretty standard by the looks of things from my research online into other kinds of startups in a similar position. We're definitely a victim of signing up for solutions based on internal preference for tools. So there's definitely some kind of duplication going on with cloud sources and that kind of stuff. But it does build a nice picture out of all of the standard type tools that everyone's using. And so we have a bunch of data sources on the left-hand side, the obvious ones are event data stuff from the website, like HTTP tracking and device kind of stuff that's going on there. From our perspective, the data is routed via RudderStack into BigQuery using RudderStack's JavaScript SDK, and its Go SDK for our backend products.

And then we also have a bunch of cloud sources, which ideally we were looking to get the data from those into our warehouse. So we use some of those ETL tools, Stitch and Fivetran in between to do that kind of thing. And that enables us to extract that stuff straight out. And I know we're going to be investigating the RudderStack solution to that. They've just launched the extract tool that RudderStack has got. And I think that will replace some of what we've got going on with Stitch and Fivetran. We also have as I mentioned previously, we have an open-source product. We have quite a large open-source user base. And part of the open-source product that is completely configurable is some metrics reporting that comes back to us. And that's a call home, checks on the version check to see if they need to update their product and that kind of thing. And we receive quite a large volume of that.

This is something that's easily configurable within the Pachyderm config, you can just turn this off so you're completely anonymous, or you can leave it on and get the updates from us. But that large volume of data we were ideally looking for is streamed directly into BigQuery on a real-time basis. And because of the volume of it and the real-time status of it, we went with a tool, Webhook Relay, which enables us to do that. And it directly streams data into Google BigQuery. From that point, so we've got our data now in the cloud warehouse, we use Sigma's tool there, Sigma Computing, and that tool is an equivalent of tools like Looker and those kinds of products, but way more user-friendly. Way more cost-effective and way more user-friendly for small startups. And it enables us to onboard people very quickly, because it's a spreadsheet-type interface that everyone's familiar with, but lets you query billions of rows of data, real-time as you're looking at it and get the results that you want.

So Sigma was a major part of that stack. The next part of that is, as the whole part of this webinar is all about, we wanted to get that data back out of those two tools back into our cloud sources. And that's where the RudderStack Reverse ETL fits in. And so we've got this purple line that goes back from BigQuery to RudderStack and RudderStack pipes that data straight back into those cloud sources that we've selected via that tool. Those being, at this stage, HubSpot, Salesforce and Outreach in our case, because that's the most important in terms of communication with our customers. I think that explains everything that's going on there.

Eric Dodds (26:24)

That was an extremely efficient explanation of a graph with so many lines and boxes. Just one quick clarification. So you use Sigma to build the tables that are populated in BigQuery that then get pushed back to the stack through Reverse ETL?

Dan Baker (26:45)

Yeah, that's right. So obviously Stitch and Fivetran and Webhook Relay all write the data into their own data sets in BigQuery. So we needed at that point to be able to build out a report from that with that aggregated data that we wanted to go back into our cloud sources. So Sigma allows us to do that. It's incredibly powerful and by default will allow us to save stuff out as a materialized view back to BigQuery, and RudderStack just ingests that. So incredibly simple to set up and amazingly powerful.

Eric Dodds (27:22)

Very cool. Well, I'm going to do a very quick demo here of Reverse ETL. Can you see my screen, Dan? Just want to make sure.

Dan Baker (27:33)

Yeah, I certainly can Eric.

Eric Dodds (27:37)

Great. So this is RudderStack, this is our directory of sources. So you can see we have warehouse sources down here. And I'll just show you, we won't go end to end here, but I'm just going to add Snowflake as a source here. I've already put my credentials in. So I'll choose that instance. And then you can see here that we give you the ability to select a schema and a table. So I have a schema called Eric Data and then I am going to grab a table called multiple forms to submit. So let's just say, we want to see people who have submitted multiple forms on our website and send that data to a downstream tool. So what you'll see here is that we pulled in a sample of this table. So this is a derived table in Snowflake, and we require an identifier, so a unique identifier. It can be the user ID or the anonymous ID. And so you can actually just select that from this dropdown list. And you can see that RudderStack already read the columns and presents you with options.

So we require a unique identifier and then you can see that that is passed through as the unique identifier. And then the other columns in the table are passed through as traits. And so what's really neat about RudderStack Reverse ETL is that it takes this table and actually translates it into an event and pushes that event to your downstream tools just as if the event were coming from an SDK, which is really, really useful in terms of managing conformity and downstream destinations. One other note is that you can add keys here. So maybe you would want to say forms submit count, and you can actually include additional keys if downstream tools need to receive data in a particular format.
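As a rough picture of the translation Eric describes, the Python sketch below turns one table row into an identify-style event: the chosen identifier column becomes the user or anonymous ID, and every other column becomes a trait. The event shape is loosely modelled on the Segment-compatible identify call mentioned later in the Q&A. The function, its parameters, and the sample column names are assumptions for illustration, and the exact payload RudderStack emits may differ.

```python
def row_to_identify_event(row, id_column, id_kind="userId", extra_traits=None):
    """Convert one warehouse table row (a dict) into an identify-style event.

    id_column: column holding the unique identifier.
    id_kind: "userId" or "anonymousId", matching the choice in the demo.
    extra_traits: optional additional keys to include alongside the columns.
    """
    if id_kind not in ("userId", "anonymousId"):
        raise ValueError("id_kind must be 'userId' or 'anonymousId'")
    if id_column not in row:
        raise KeyError(f"row has no identifier column {id_column!r}")

    # Every column except the identifier is passed through as a trait.
    traits = {key: value for key, value in row.items() if key != id_column}
    if extra_traits:
        traits.update(extra_traits)

    return {"type": "identify", id_kind: row[id_column], "traits": traits}


# Hypothetical row from a derived table of repeat form submitters.
sample_row = {"user_id": "u-123", "email": "ada@example.com", "form_submits": 3}
event = row_to_identify_event(
    sample_row,
    id_column="user_id",
    extra_traits={"source": "warehouse"},
)
```

Because each row becomes an ordinary event, downstream destinations can receive it in the same way they receive events from an SDK, which is the conformity benefit Eric points out.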

And then once you complete the setup, you can configure the schedule and add destinations. And on the schedule, say every 30 minutes or 24 hours, whatever you want, we will send this whole table as an event stream to your downstream tools and update everything. And that is how Dan gets the data from BigQuery into HubSpot, Salesforce, et cetera. So with that, let's do some Q&A. We'd love to hear any questions from any of the attendees. And if you want, you can raise your hand by clicking the button and I can unmute you if you want to ask your question in audio and if not, feel free to just put your question in the Q&A, and Dan and I will address it.

All right. One question came in. What was the process like of installing the SDKs, Dan? Did you work with your devs and what did that look like?

Dan Baker (31:06)

Yeah, so there were kind of several options for that, but it was incredibly simple. My experience was previously using Segment and obviously, the RudderStack system is completely compatible with the way in which Segment does stuff. So from the JavaScript point of view, it's just a standard JavaScript include in the head of the document and that enabled us to do all the tracking that we needed. And then you can obviously fire off custom events per element within your page or any of that kind of stuff. But for me, it was an incredibly easy integration point. And our front-end team I don't think had any issues with that as well from a product point of view. I mean, I did the website side of stuff and the docs and they did the product so very easily.

Eric Dodds (32:01)

Cool. All right. We had another question come in. Where do you set the frequency of updating data to RudderStack from the data warehouse? So I can actually just show you that here I'll click on next. So you can see the run frequency here. So it defaults to 30 minutes, but you can actually configure it to run even faster. So we have several customers who have sort of low latency use cases. And then of course you can schedule the time as well. So that's what it looks like. And the way that works is we'll just go pull the table every 30 minutes and translate that into an event stream. Great question. Any other questions? Alrighty. Well, we are at the 40-minute mark, and as I said, we want to be respectful of your time. Thank you to everyone who joined today. Feel free to send... Oh, we had another question come in. What's next for RudderStack?

Great question. Well, I'll actually tell you the thing that we're really excited about is continuing to build out API first features. So one of the things we're working on right now that's very exciting, it's probably one of our most requested features, is we have a transformations feature that allows you to run JavaScript on the event stream that's coming through, including events that are coming from Reverse ETL. And you can configure all of that in the UI, it's very powerful. But we have had many customers ask for transformations as an API so that they can manage that as part of their workflow and leverage version control. And that is going to be really, really neat. So the team is currently working on that, and we're really excited about that coming up. And you'll see more and more of that sort of translating features into APIs that make it easier to integrate the product into your existing workflow. And of course, we'll continue to add additional integrations as destinations and sources. We're over 100 now, and we do one about every two weeks.

Any other questions? We'll send the deck and a link to the video as well if you want to watch it and... Oh, another question came in, this is great. Do we assist with integration? Yes. So the customer success team... Well, I guess the question I would have is what do you mean by integration? But in terms of setup, we work really closely with our customers to help them go live and get into production. On the integration side if what you're referring to is integration, as far as the destination... Okay, there we go, got some clarification there. We don't actually... So if you don't have software developers, we don't actually do the implementation for you, but we have several partners that we connect you with who can help you build out all the implementation. So yeah, really common, we have several really good partners, both here in the States and in Europe and they would be happy to work with you to do the actual instrumentation and setup.

All right. I'm going to wait again because the questions keep coming in right as I... I don't wait quite long enough.

Dan Baker (35:55)

Now we've got the awkward wait.

Eric Dodds (36:04)

That's right. I'm okay with awkward silence though.

Dan Baker (36:08)

I'm not.

Eric Dodds (36:10)

Do you have any questions, Dan? I mean, you're an expert on this.

Dan Baker (36:13)

I'd just like to say the working relationship with RudderStack from Pachyderm's perspective has been amazing. The team's been great. We've had very few issues with any of the integration side of things and any of their destination integration side of things as well, like making sure that data's flowing through to our destination as it should. Anything we've found has been incredibly quickly dealt with and I can't fault them in one sense for that. So, brilliant. Thanks, Eric and the rest of the team there for helping us be successful with this.

Eric Dodds (36:50)

Of course. All right. Well, that is a great note to end on. Feel free to shoot any questions our way via email. You can just send them to me, and I can loop Dan in if you have a question for Dan. But I really appreciate everyone participating today. Thank you for the questions and we will follow up via email.

Dan Baker (37:14)

Thanks, Eric.

Eric Dodds (37:16)

Thanks, Dan. Thanks to everyone who joined.