
WEBINAR

Transforming Data: RudderStack Vs. Segment

Duration: 1 hour

In this technical session, RudderStack Customer Success Engineer Ryan McCrary teams up with Max Werner, a senior data engineer who has experience implementing Segment at companies like Proposify and Warner Media, to compare RudderStack Transformations with Segment Functions.

They cover the limitations of Segment Functions, which lock you into a proprietary platform, and they detail the benefits of RudderStack's flexible approach to data transformation, which lets you easily manipulate data in flight. Once they lay the groundwork, they dive into specific use cases for RudderStack Transformations, like data cleaning and modifying event payloads.

What we will cover:

  • The limitations of Segment Functions
  • The benefits of RudderStack’s approach to data transformation
  • Using Transformations for data cleaning, data enrichment, and modifying payloads
  • Live demo of RudderStack Transformations

Speakers

Ryan McCrary

Customer Success Engineer

Maximilian Werner

Owner at Obsessive Analytics Consulting
Max lives and breathes data, from collection to transformation and analysis. He's the digital plumber that keeps data infrastructure running.

Transcript

Brooks Patterson:

All right. Well, welcome, everyone. I am Brooks. I'm on the growth team here at RudderStack and excited to bring this technical session to you this afternoon, at least East Coast afternoon. And today, obviously, you know we'll be talking about data transformation comparing RudderStack vs. Segment, and excited to have Ryan McCrary and Max Werner here to be your guides for the next hour or so. So, with that said, I will let them take over from here and do some better introductions, and we will get things moving.

Ryan McCrary:

Cool. Well, thanks for having us. Max, thanks for being here. Do you want to go ahead and introduce yourself and let everyone know who you are?

Max Werner:

Yeah. Well, I'm Max, and I do CDP consulting, helping people plan out their CDP implementations, consider the various buying options, and, a lot of the time, move away from spreadsheets. That's what I do.

Ryan McCrary:

Cool. And I'm Ryan, I'm a Customer Success Engineer here at RudderStack. So, admittedly, I have much less experience with Segment and the like than Max. But I've worked with Max on a couple of projects, and we've gotten to do some pretty cool things together. And, yeah, we'll share some of that today over the next few minutes. So, Max, we'll jump right in. I'm going to defer to Max mostly for the Segment side of things. Like I said, he's got a little more experience on that front. We'll go through what Segment Functions and Protocols are and the way that RudderStack approaches those with Transformations. Then we'll go through a couple of use cases for Transformations and how those work within RudderStack. And from there, we'll do a live demo of a couple of things that Max and I have used in day-to-day data processing to get some stuff done with our Transformations. So we'll start with the limitations of Segment Functions and Protocols. Max, you want to take that one?

Max Werner:

Yeah. All right. Well, the benefit of being a consultant is I get to play with all the tools, both on the Rudder side and the Segment side, as well as [inaudible 00:02:26]. And so, today's theme is data transformation. Segment offers the functionality of basically writing your own source: for most Segment Functions they give you a webhook URL, and you can just write some code in there and basically have data flow into Segment from there. Sounds great, sounds really great, and it is, but it has some pretty severe limitations. The first part is vendor lock-in: you're building a large chunk of your data pipeline inside somebody else's platform. So, you're locked to that.

Max Werner:

Then there are the billing pieces. Depending on what your contract looks like, different plans give you so-and-so many compute hours of function time. So, you could potentially run into overage charges if you use that feature as a core piece of your pipeline. And you also have no control over the finer configuration details of these functions. Under the hood, they're Lambdas, Lambda as a service basically, but you can't fine-tune them in terms of giving them more time to run, more RAM, or uploading your own packages or anything like that. So, if you want to do something simple, cool. If you need to do more complex things, you're running into some problems.

Max Werner:

The next part flows together with the vendor lock-in, but it's the infrastructure that you're getting locked into here if you build your Segment Functions out, because they give you that endpoint URL, which is basically just fn.segmentapis.com/ and then a hash value. The problem is that that is both an identifier for your function and the write key that the source represents. So, if you have some sort of key rotation requirement as part of your cybersecurity policy, rotating it will create a new function URL. And if you have that in a mission-critical thing that is set up in five different tools that send data there, have fun updating them all with the new URL without things breaking.

Max Werner:

Yeah, I have not run into that problem at all before, cough, cough. And as I just mentioned, you can't upload your own packages; it's basically just the version of Node that they provide you with, which, curiously enough, also has some missing pieces that are normally part of Node. Most notably, I ran into the issue not too long ago that the crypto module was missing. So, if you want to build a UUID or something, that's a really handy module to have, and it didn't exist in that Node version. And yeah, leading more into the development piece, you're locked into that graphical user interface. There are ways to programmatically deploy things, but for the most part, you have to edit and test your things inside that UI, and you also have to make sure that it sends along events that you have to listen for live, so you can inspect payloads and stuff. So, it's a certain experience.


Ryan McCrary:

Sure, yeah. But still, nonetheless, it allows you to modify your data as you're routing it in flight, so there are some benefits as well. What about Segment Functions as destinations?

Max Werner:

So, where on the source side you basically get an endpoint URL and you call your identify, track, or whatever inside that function, the destination works the same way, but you hook into those events. So, you say on track, do something; on identify, do something. The idea here is that you can implement your own API connections. If there's some service that you're working with that has an API but doesn't have a Segment connector, you can just say, "Well, if I get an identify call or a track call, there's my 'subscription started.'" I call the API endpoint for that system and switch a flag there so that somebody gets paid, or who knows what. The limitations are mostly the same, of course, in terms of the development experience and what you're limited to. The main difference here on the destination side, if you're...

Ryan McCrary:

Oh sorry.

Max Werner:

... is that you're now almost always working with external APIs, which means you will need some sort of API key, some secrets. So, you have keys that you need to, again, put into that Segment UI, and we come back to the whole key rotation thing. Let's say that API key is invalidated or is something you need to change now. That's you spreading your keys further around from the centralized key store that you might have. And of course, when you're building your own integrations, error monitoring is even more important here, and it's somewhat limited on the Segment side compared to something like CloudWatch.

Ryan McCrary:

Right, right. Yeah.

Max Werner:

Yeah. And the third piece of transforming data is, of course, the transformations piece inside Protocols in Segment, their gated data governance module, right? Again, it's a graphical user interface. So, you can click things together, which is inherently always going to be more limited than writing code to transform your data, almost regardless of what that language is. The transformations feature in Protocols gives you three options. You can rename an event, or you can edit the properties of a track, identify, or group event. So, properties or traits, but you can only rename the key. If you have revenue, you can rename it to purchase information or something, but you can't really do anything with the payload itself.

Max Werner:

There's a little bit where you can unserialize a JSON string, but for the most part, that's it. So, if that is your use case, it's great. If that is not your use case, and you need to do a little bit of data cleanup, or even want to do something as simple as dropping a couple of keys, well, you can't. And then, of course, the last piece is part of how Segment is built overall: instead of connecting one piece to the other, you're always instantiating a new destination or a new source of the same type. The same thing goes for these transformations. So, your transformation library might have five transformations that do the exact same thing, hooked up to different pieces, which is just not super great for maintainability.

Ryan McCrary:

Sure. Cool. Yeah, so I can jump in, and feel free to interrupt me at any point if I miss anything, Max, but to be on the other side of that: what we see as the benefits of RudderStack Transformations. It's just baked into the product, right? It's part of what we do. It's a core tenet. And really, the high-level idea is that everything that goes through RudderStack, whether it's a traditional event from an event stream, something from an ETL cloud extract source, or even our reverse ETL solution, meaning pulling in the warehouse as a source, everything gets converted to JSON, a JavaScript object. And so, everything we're going to do is just going to be good old JavaScript. We're able to operate on that object and do really anything we want to do in JavaScript, whether it's any of the things that Max mentioned, or using third-party APIs from within. Really a lot of extensibility there. So, yeah.
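As a sketch of what "everything is a JavaScript object" looks like in practice: `transformEvent` is the entry point RudderStack's user transformations use (in the RudderStack editor it would be exported rather than called directly), but the property added below is made up purely for illustration.

```javascript
// A RudderStack user transformation is just a JavaScript function that
// receives each event as a plain object and returns it (or null to drop it).
function transformEvent(event) {
  // Read any part of the payload like a normal object...
  if (event.type === "track" && event.properties) {
    // ...and mutate it in flight before it reaches the destination.
    // "processedBy" is an illustrative key, not a RudderStack convention.
    event.properties.processedBy = "rudderstack-transformation";
  }
  return event; // returning null would drop the event entirely
}

// Example payload, shaped like a typical track call:
const out = transformEvent({
  type: "track",
  event: "Order Completed",
  properties: { revenue: 49.99 },
});
```

Because the event is an ordinary object, anything JavaScript can do to an object, a transformation can do to an event.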

Ryan McCrary:

That second point, I mean, these run in real-time over your event stream. So, everything's happening in flight. Again, it's customizable per destination, and even on the connection, so to the point that Max said we're able to reuse these or allow customers to reuse these on various destinations and just different connections that they might have. And then we have some nonstandard connections as well. So, if there's a destination that we don't support in RudderStack, we can use just the traditional webhook destination that we have in conjunction with a transformation to essentially shape that payload into what that downstream API is expecting, and really build a connection based on the combination of a webhook destination and a JavaScript transformation.

Ryan McCrary:

And then we have recently added some code maintainability features with function libraries. We have some shared libraries that we provide for RudderStack, a repo of shared libraries that people can grab and use on their own, and then really anything that you find yourself reaching for over and over again. Max mentioned crypto libraries. We have a couple of solutions for that where, as long as there are no external dependencies, we can load a crypto library and call it from within a transformation. Whether we're hashing for a downstream tool or trying to generate a UUID, we can do those from within.

Ryan McCrary:

And then the last piece is our mindset here. RudderStack is trying to be API-first and as developer-friendly as possible. We have these transformations typically managed in a GUI, which, as you mentioned, is not most people's favorite place to manage code. And so, we've added the ability to upload these transformations via API, and even, recently, a GitHub Action, which we'll go through in a little bit, where these transformations can be built in whatever environment you choose on your own local machine, pushed to a repo in GitHub, and then tested against an input and output that you provide. And unless that action passes, they don't get uploaded via the API. So, kind of a CI flow for transformations and the code that they use. Anything...

Max Werner:

Yeah, especially the library part is super exciting because there are these things that you will always want to do. There's a certain set of functions that you will always want to run over your event stream to clean some things up, and we'll have some examples later. So, instead of writing that into every single transformer for every destination that you need, you can just factor that out into your one clean this payload function, and then just reuse that all over the place.
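The reuse Max describes can be sketched like this: factor the shared cleanup into one function and have each destination's transformation call it. In RudderStack the shared piece would live in a library and be imported; here everything is inlined, and the "drop empty values" rule is just one illustrative choice of cleanup.

```javascript
// Hypothetical shared "library" function: one place to clean payloads,
// instead of repeating the same code in every destination transformation.
function cleanPayload(event) {
  if (event.properties) {
    for (const [key, value] of Object.entries(event.properties)) {
      // Drop empty junk values so no destination ever sees them.
      if (value === null || value === undefined || value === "") {
        delete event.properties[key];
      }
    }
  }
  return event;
}

// A destination-specific transformation then just reuses the shared piece.
function transformEvent(event) {
  return cleanPayload(event);
}

const out = transformEvent({
  type: "track",
  event: "Form Submitted",
  properties: { plan: "pro", coupon: "", referrer: null },
});
```

Adding a new cleanup rule to `cleanPayload` then updates every transformation that imports it, which is the maintainability win being described.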

Ryan McCrary:

Yep, exactly. Cool, we'll go through a couple of use cases. And these are some examples that I've pried out of Max and some things that we've used internally in the past here at RudderStack, and with some customers that we work with. But this first one is some data cleaning. This is a Max one as you can tell by his email address being in there. Max, do you want to walk through what this was and what you're trying to solve?

Max Werner:

Yeah, I mean, everyone loves entering 30 different field mappings in their destination configurations, which is awesome. Your engineering team implements a trait or a payload property in a certain way. Let's say you're getting sales amount or subscription amount, whatever, and the custom field in your Salesforce system needs to be revenue or something. You can change that in a transformation super easily, just for that destination, and your warehouse or anything else still receives that sales amount piece. That was actually a purchasing decision for Segment Protocols at a client of mine: they just needed to rename some properties, and instead of just writing one line of code, they had to buy that entire software suite, or an extra module. So it's much nicer here.

Max Werner:

And then, yeah, eventually there's your customer email. I get emails every so often, newsletters or various things, and I sometimes get them twice or three times because when I enter something on my phone, it capitalizes the first letter or something like that. So I tend to be in some people's systems multiple times based on what's obviously the same email address, just with a different casing. In many systems, that's more than enough to create a second lead. And it's always great when you get the sales emails and you're like, "Man, I've been a customer for three years. It's all right." It's just a different casing.

Max Werner:

So that's, for instance, something that you would be extremely hard-pressed, if able at all, to achieve in Segment Protocols or something. But, as I mentioned before, these are standard cleaning things. Just downcase emails: look for an email trait in your traits or properties, and just lowercase it. That way you're immediately eliminating some of these very low-hanging data hygiene fruits.
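Max's downcasing fix fits in a few lines. The sketch below follows the RudderStack `transformEvent` convention and checks the usual places an email can live (traits, properties, context traits); the sample payload is made up.

```javascript
// Hygiene fix: lowercase the email wherever it appears, so
// "Max@Example.com" and "max@example.com" stop creating duplicate leads.
function transformEvent(event) {
  const bags = [
    event.traits,
    event.properties,
    event.context && event.context.traits,
  ];
  for (const bag of bags) {
    if (bag && typeof bag.email === "string") {
      bag.email = bag.email.toLowerCase();
    }
  }
  return event;
}

const out = transformEvent({
  type: "identify",
  traits: { email: "Max@ObsessiveAnalytics.com" },
});
```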

Ryan McCrary:

Right. Yep. Yeah, and because these are, again, just JavaScript objects, we can access them however we need, and that connects with what you mentioned about the data governance piece. Some of these are fixes that we're thinking of beforehand. As you can see here, Max loves to lowercase every email that he gets, regardless of whether it's already lowercase, and that's proactive. But some of it is reactive, where someone on the team is sticking with that sales amount key, even though we've discussed as a team that we're going to go to revenue now. Obviously, you want to clean that stuff up on the front end, but we can also build a mapping in each destination, and even use that data governance piece to see what people are trying to send through as data, and anticipate some of it. So, it's pretty common on our team, if I'm correcting something like this, for example, the sales amount where it's coming in snake case and I want camel case, I'll go ahead and fix that. But then I'm also going to add the sentence case, or whatever else it is, to go ahead and assume someone's going to throw that in at some point. I already know that I want it in camel case, so I go ahead and get it where I want it to be.

Max Werner:

Because if you try to change the API names on a CRM system without telling anybody, things break. If you just fix the data in flight, that saves you a lot of trouble.

Ryan McCrary:

Yeah, absolutely. Yep. Another thing we do quite often with these transformations is data enrichment. I mentioned earlier the ability to use a third-party API in flight. There are multiple use cases for this; this is just an example that we use internally here at RudderStack. But we have a number of customers doing what I mentioned earlier with the custom webhook destination: if they have to get an authorization token from some type of secret store, or generate a JWT, they're able to do that from within the transformation, reach out, get it, and pass it on in the headers or in the body of that request in real time.

Ryan McCrary:

And so, this is just a... Sorry, and then a lot of customers will actually have internal data stores where they may be enriching or validating contacts before they go to the CRM. The only limitation with these transformations, and this comes into effect much more with this enrichment piece, is that each one basically runs in a small isolated VM [00:18:17] that's going to close up when it's done. Each one is limited to about eight megs of memory and four seconds of execution time, which hopefully is plenty of time.

Max Werner:

And it's per event.

Ryan McCrary:

Yes, per event. Yep, it dies after each event. At the moment, we don't have any way to cache in between, but that's something that we're looking at as well. And so, this is a pretty straightforward one. You can see here, for specific identify events, we are basically using Clearbit to look at that email and give our sales team a little head start with, as you can see, some specific traits we're adding in to Salesforce. This is going to Salesforce. The sales team is looking at these and reaching out to leads. So, this isn't happening on a large volume of events from our side; this is a pretty low-volume source. It's a very specific action where someone's looking for a quote or demo.

Ryan McCrary:

And so, we're able to hit the Clearbit API, look up that person based on the information that Clearbit provides for us, and just put in some basic geographic info and a company employee count for the sales team to begin to sort those leads before they do their own research on them. And probably so they can fight over the ones that have higher counts. Who knows? I don't know what sales guys do; I'm just in the data world. But yeah, this is a pretty common use case, and being able to hit a third-party API really opens up a lot of uses. Anything to add on that, Max?
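The enrichment step Ryan describes can be sketched as below. In a real transformation the person data would come from an HTTP call to Clearbit's enrichment API inside `transformEvent`; here the lookup is stubbed out so the mapping logic stands alone, and the response shape and Salesforce-style trait names are illustrative, not Clearbit's or Salesforce's actual schemas.

```javascript
// Stub for the Clearbit person lookup. In production this would be an
// outbound API call keyed on the identify email; the shape below is
// an assumption for illustration only.
function lookupClearbit(email) {
  return { geo: { city: "Halifax", country: "Canada" }, company: { employees: 250 } };
}

function transformEvent(event) {
  if (event.type !== "identify" || !event.traits || !event.traits.email) {
    return event; // only enrich identify calls that carry an email
  }
  const person = lookupClearbit(event.traits.email);
  if (person) {
    // Add Salesforce-ready traits so sales can sort leads before
    // doing their own research (field names are hypothetical).
    event.traits.City = person.geo.city;
    event.traits.Country = person.geo.country;
    event.traits.Employee_Count = person.company.employees;
  }
  return event;
}

const out = transformEvent({ type: "identify", traits: { email: "lead@example.com" } });
```

Keeping the lookup behind its own function also makes the four-second execution budget easier to reason about: one external call per event, only on the low-volume identify path.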

Max Werner:

Yeah, for sure. I mean, all these examples are purposefully, if you look at the code, very simple pieces, because, as I said earlier, code is better than a GUI. You can do these things, and then the payload going into your CRM system already has that enriched information in it when it creates a record. You don't have to try to enrich it in the CRM system, and Clearbit is pretty good in terms of integrations with things like that. But building those flow steps, like a lead comes in, enrich it, then wait for the lead scoring until you get the enrichment, and so on. You don't need any of that. The record directly comes in with it. So, these very easy-to-instrument pieces save you a lot of complexity and time buried further downstream. It's really, really neat stuff.

Ryan McCrary:

Yeah, and from a sales perspective, this is a more simple version of what we do, but we actually have, and this gets into other parts out of the scope of this webinar, a lead scoring engine in the warehouse for a lot of these leads, generating a high-value lead table and then piping that back through RudderStack via a warehouse action. And so, we actually enrich those events a little bit differently as well as they're coming in. After the fact, they're not directly going to affect the creation of leads, but they're updating the leads with the lead scores for certain tools, and then enriching those with additional information as well. So, the idea is just doing things that can help the sales team without me having to help the sales team. This next example is another Max example. I'll let you go through this, but this is a pretty common, very common use case, I would say, for transformations, and it has two different features highlighted in it.

Max Werner:

Yeah. I mean, Ryan mentioned before that, at the end of the day, the payload that you're both getting and sending on is a JSON object. Salesforce is always the fun example where it's very particular about how the keys in the traits or the identify properties have to be in order for it to accept them. So, again, trying to change something on the CRM side is usually difficult or can create other downstream issues. But you can just maintain this very easy JavaScript object. I call it a Salesforce mapping. You can say, "This is my event key that comes in. This is what it needs to be for Salesforce." And then you just loop through and move it over.

Max Werner:

So, if you get this demo object there with sales amount and term length, and you run that through that script, boom, you get your proper sales revenue and contract duration, or whatever matches Salesforce, without messing up that event stream for other destinations. If you want to do a little extra math on it, again, you can do that. If it's 1337 for a contract of 12 months, that's 111.42 a month. You can directly do these basic things right in the event stream, and you don't have to make calculated fields in Salesforce or whatever in order to do that math. It's just: here it is. So, it saves you, again, downstream steps, especially if you have to work with salespeople and marketers who don't like to do these things. They just want to have their Salesforce dashboard and go from there, and now they don't have to.
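The mapping-object approach Max describes can be sketched like this, including the extra in-flight math on his 1337-over-12-months example. The incoming and Salesforce-side key names are illustrative; the pattern is the point.

```javascript
// One plain object maps incoming event keys to what Salesforce expects.
// Adding a field mapping is a one-line change here, not a CRM change.
const salesforceMapping = {
  sales_amount: "Revenue",
  term_length: "Contract_Duration",
};

function transformEvent(event) {
  const props = event.properties || {};
  for (const [from, to] of Object.entries(salesforceMapping)) {
    if (from in props) {
      props[to] = props[from];
      delete props[from]; // only this destination sees the renamed key
    }
  }
  // A little extra math in flight: 1337 over 12 months is 111.42 a month,
  // so no calculated field is needed in Salesforce.
  if (typeof props.Revenue === "number" && typeof props.Contract_Duration === "number") {
    props.Monthly_Revenue =
      Math.round((props.Revenue / props.Contract_Duration) * 100) / 100;
  }
  event.properties = props;
  return event;
}

const out = transformEvent({
  type: "track",
  event: "Contract Signed",
  properties: { sales_amount: 1337, term_length: 12 },
});
```

Because the rename happens only in this destination's transformation, the warehouse and every other destination still receive the original `sales_amount` key.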

Ryan McCrary:

Yep. Yeah, I mean, we obviously pick on Salesforce a lot, but there are a lot of APIs like this that are very opinionated about custom fields and even field types. And so, one of the benefits that we get out of using something like RudderStack, or any kind of CDP or data pipeline, is, from an engineering standpoint, a single implementation. We're instrumenting these events one time. Every time we add a destination, we're not having to go back and deploy code or test code or mess with production code. And so, if we're planning ahead well, thinking through these use cases and building these events how we want them in our code base, they're often going to have a lot more traits than each individual destination needs.

Ryan McCrary:

So, for something like Salesforce, if we send all of our custom keys through, A, they're not going to be formatted correctly, and Salesforce will reject them, and B, even if they are, we're sending extra data points that they're going to kick the event back for. And then you have something like Braze, where Braze will take all of those data points that you send, but you get charged per data point and per change. So, the idea is instrumenting the event once, and then we can use these transformations not only to reformat but to filter, whitelist, whatever you want to call it, to get rid of some of those keys that don't need to go to that destination, whether it's for API limitations or just for cost savings.
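The filtering idea reads naturally as a per-destination allowlist: instrument the event once with everything, then strip it down for opinionated or per-data-point-billed destinations. The key list below is made up for illustration.

```javascript
// Per-destination allowlist: only these keys reach this destination,
// whether the reason is API strictness (Salesforce) or billing (Braze).
const ALLOWED_KEYS = ["email", "plan", "Revenue"];

function transformEvent(event) {
  if (!event.properties) return event;
  event.properties = Object.fromEntries(
    Object.entries(event.properties).filter(([key]) => ALLOWED_KEYS.includes(key))
  );
  return event;
}

const out = transformEvent({
  type: "track",
  event: "Subscription Started",
  properties: { email: "a@b.com", plan: "pro", Revenue: 99, internal_debug_id: "x-123" },
});
```

An allowlist fails safe here: a newly instrumented property never reaches the destination until someone deliberately adds it, which matters when every new data point costs money or trips an API limit.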

Max Werner:

Intercom is a fun example of that as well, where any trait that you pass in will automatically create a custom field in Intercom, which you are, A, limited on how many of you can have, and B, can't delete, which is just a fantastic combo. So, being able to clean that up in your event stream ahead of time, especially when new things get added, again saves you that next support call with your sales rep or account manager: "Can you please increase the limit?"

Ryan McCrary:

Yep. Yeah, that's a good point. I hadn't thought about... I forgot about Intercom. That reminds me of another integration we have. I can't quite remember which one, but they have a created field that's actually mutable. So, if you're not able to clean that data in flight, and you have that key in your event, you're going to continue to update the created date of that contact in an email marketing tool. And like you mentioned earlier, that's how you're going to find someone that's been a customer for three years getting a sales email.

Ryan McCrary:

Yeah, so pretty useful. And a lot of times we don't know these things when we're instrumenting the events initially. These are things that come up as we add new products or get deeper into our use of some of these downstream tools. And then the last one before we jump into some demos. This is a pretty common one as well, another Max one. There's also an example of this in the repo that's linked here with some sample transformations. We mentioned scrubbing fields for downstream tools specifically, but as far as privacy and data retention issues go, scrubbing PII is pretty important. So, anything you want to talk about on this one, Max?

Max Werner:

Yeah, I mean, there's CCPA, GDPR, there's so many [inaudible 00:26:25]. There's only going to be more. This is especially for the marketers here: when they come in and say, here's this cool new machine learning service that we just purchased, and we need to pipe all our data into it, and they'll build models for us. Well, they don't need to have the PII. They're building it off the event stream. So, being able to directly remove things, again, in this programmatic, very easy to maintain way, saying, "Here's my set of fields that should be removed for this destination," directly saves you from having any of that stuff in those systems. You can either remove or obfuscate, whatever you want. So, it's just really, really handy for preventing problems in the future. That's kind of the theme of transformations: preventing problems in the future.
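The "here's my set of fields that should be removed for this destination" idea can be sketched as below. The field list is illustrative, and deleting is just one option; as Max notes, you could replace each value with a hash instead if the downstream tool needs a stable identifier.

```javascript
// One maintained list of PII fields that must never reach this destination.
// The list itself is illustrative; extend it per your compliance needs.
const PII_FIELDS = ["email", "phone", "ssn", "address"];

function transformEvent(event) {
  for (const bag of [event.traits, event.properties]) {
    if (!bag) continue;
    for (const field of PII_FIELDS) {
      // Remove outright; obfuscating (e.g. hashing) is the alternative.
      delete bag[field];
    }
  }
  return event;
}

const out = transformEvent({
  type: "identify",
  traits: { email: "max@example.com", plan: "pro", phone: "555-0100" },
});
```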

Ryan McCrary:

Yeah. Ideally, we're doing these proactively, but a lot of times they become retroactive as well. Cool. Well, we'll jump to a couple of examples, and Max, again, feel free to interrupt me on any of these. We've shown some code examples, and we've referred to these JSON objects, but we haven't really shown what an event looks like. This is the RudderStack UI. This is where a lot of people are primarily modifying, creating, and managing these transformations, as well as via the API.

Ryan McCrary:

This is the example that I took a snippet from earlier. This is going to have the enrichment piece of it. So, again, and this is something I'll jump into as well, this is basically grabbing from a shared library that we have. Again, this is one of those small things, and this is just an example I use. But as Max mentioned, there are oftentimes things that we're going to do on every single destination that we're using. And so, we can abstract that stuff out, put it in shared libraries, and import them just as you would with any other JavaScript. So, this is the one we're using where we're hitting the Clearbit API. It's doing just a minimal amount of enrichment, but it's reaching out, grabbing that data, and then enriching the event.

Ryan McCrary:

In this test framework, you'll see that there's an example identify call, and you can toggle between all the different call types. And we'll show in a second, this is editable. So, you can paste in an actual event from your front end, or from your warehouse action or ETL tool, whatever you're using, and see the actual output that you're going to get from one of these transformations. So, again, we're going to test on this. You can see that there's a little bit of lag because there is a third-party API call there. But on the input, we have just these traits that were being sent in. And then you'll notice down here on the output, we do have these Clearbit keys that are added, formatted and ready to go for Salesforce, and we can release the sales hounds on these guys.

Ryan McCrary:

Again, this is just showing just the UI and how the test input and output work in the API. You don't really have direct access to this other than a local workflow that you may put together. But in the GitHub action, this is exactly basically what you would do for generating those anticipated test results. So, you would provide, and we have some, just like we do here in the UI, we have just a random identify call included here. You can use the one that's in the GitHub action, or you can generate your own, and then just any testing framework you'll put your expected output, and then that'll have to pass for that to be sent to the API.

Ryan McCrary:

I also mentioned this shared library. So, I'll open a new tab, and take a look at these. But in the transformations part of RudderStack, we have all these different transformations. You can see the destinations that are connected to. This is a demo account, so it's not a great indicator of what this would look like, and my naming is what it is, add an event. But we have these libraries down here. And so, this is a simple one that I was importing from. This is really just a way to not have this regex scattered all about all of my different transformations. And then, again, we're filtering out just RudderStack email addresses. And then if someone's using their personal email address with a broader test, and it will filter that out as well. But this was one thing where I came into the company.

Ryan McCrary:

RudderLabs was our old domain. And so, I had built this library to do that. I hadn't actually... I lie. I had not built a library. I just had this regex scattered about. And we had some leads making it through from our internal emails, and I realized, "Oh, shoot, we used to use RudderLabs as our email address, and we saw some people that are using it for forms." And so, by extracting it into a function like this, easy enough to just go in, add that domain. And then it's already included in all of our other transformations that are pulling that.

Max Werner:

Yeah, one use case that I've seen for this kind of email checking, especially, is from a lead scoring perspective: you have your domain names for all the public email providers like Gmail, Hotmail, Apple Mail, whatever it is. There are still people with Hotmail out there as well. [crosstalk 00:31:29]. You can do what you want, right? You can filter those out and say, "Those are never going to be converting leads," or you score them differently. Again, you keep that list somewhere separate. So, if someone wants to add a domain to the list of undesirables, so to speak, they can just go into that one spot, add the new thing in, and it is automatically used everywhere. So, very handy here.
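Ryan's shared library, filtering out internal RudderStack and RudderLabs email addresses, can be sketched as a simple domain list rather than scattered regexes. Returning `null` from a RudderStack transformation drops the event; the exact list and payloads here are illustrative.

```javascript
// Shared "library" list: add a domain here once and every transformation
// that uses this filter picks it up (e.g. the old rudderlabs.com domain).
const BLOCKED_DOMAINS = ["rudderstack.com", "rudderlabs.com"];

function transformEvent(event) {
  const traits = event.traits || {};
  const email = typeof traits.email === "string" ? traits.email.toLowerCase() : "";
  const domain = email.split("@")[1];
  if (domain && BLOCKED_DOMAINS.includes(domain)) {
    return null; // internal lead: drop the event before it reaches the CRM
  }
  return event;
}

const kept = transformEvent({ type: "identify", traits: { email: "lead@example.com" } });
const dropped = transformEvent({ type: "identify", traits: { email: "ryan@RudderLabs.com" } });
```

The same list could just as easily feed Max's lead-scoring use: instead of dropping public-provider domains, tag or score them differently.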

Ryan McCrary:

Some sales guy somewhere is screaming at you right now, Max, because of a Hacker News article of some big-time Fortune 500 CEO who uses the Hotmail email address, so we're missing hot leads right now.

Max Werner:

That's why I'm not in sales.

Ryan McCrary:

This is an example. I'll just do a quick and dirty one, showing what we mentioned earlier about data cleaning. Where is it? This is a product click event, so this is going to be sentence cased. Say we've used the Data Governance API and found that someone was sending that, and we do everything underscored. We could go ahead, and this is kind of what I mentioned earlier, look for that sentence case. And I'm even going to go ahead and preemptively assume that someone out there on our team is snake casing it differently. It's just a simple transformation that's really converting that into what we want to see.
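
A sketch of that data-cleaning transformation might look like this, assuming the team convention is snake_case (the helper name and the exact convention are illustrative assumptions):

```javascript
// Sketch: normalize an event name like "Product Clicked" (sentence case)
// or "productClicked" (camel case) to a single snake_case convention,
// so the warehouse doesn't grow a separate table per spelling.
function normalizeEventName(name) {
  return name
    .trim()
    .replace(/([a-z])([A-Z])/g, "$1 $2") // split camelCase boundaries
    .toLowerCase()
    .replace(/[\s-]+/g, "_");            // spaces/dashes -> underscores
}

function transformEvent(event, metadata) {
  if (event.type === "track" && typeof event.event === "string") {
    event.event = normalizeEventName(event.event);
  }
  return event;
}
```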

Ryan McCrary:

This is important for a number of reasons, but one of the big things, especially with a tool like RudderStack, is if you're warehousing this data, specifically these track calls. As you know, Max, we're creating tables for each event key. So now, if someone is sending Product Clicked as sentence case, then forever when you're pulling reports in the warehouse, unless you go and clean this up, you're having to-

Max Werner:

Join tables.

Ryan McCrary:

... join across both of those tables. Yeah, exactly. So, it has far-reaching impacts on individual destinations, as well as just the hygiene in your warehouse. We actually had a pretty critical issue recently with a larger customer of ours. They were using a third-party browser extension for some of their signup referrals. In the RudderStack JavaScript SDK, as is pretty common in this world, we add any UTM parameters, specifically the utm_-prefixed ones, to the campaign object. So, they would automatically land under context.campaign; we would already go ahead and parse out those utm_ keys. Well, this browser extension was appending a ton of utm_-prefixed keys. Not a ton, but a handful per event, and for some reason, they had some kind of weird UIDs in them.

Ryan McCrary:

And so, what was happening is, the way the schema works, we add a new column for a new key. When that event is coming in, we fetch the schema, and we say, "Oh, there's no utm_product in this table. We'll go ahead and add that column." Well, each event had its own unique UTM params. And so, eventually, they filled up all of the available columns in BigQuery, and then they had events failing because they had new UTM params. [crosstalk 00:34:54]. We couldn't load the events because we were trying to create columns and couldn't. So, they were losing data because of a bad data hygiene issue where they were literally just generating tons and tons and tons of keys.

Ryan McCrary:

They're a FinTech company. And so, we're on a call with their team, and they're panicking about whether they can... They don't know what these columns are. So, we're holding this data while they're discussing as a team whether they can remove these columns, because maybe this is an old campaign from somewhere way back that they need to attribute something to. Just things you don't want to deal with. So, the fix for that P0 issue was a transformation. We got together with their team, they decided they could confidently say they weren't using UTM params in this specific case, and so we were able to just delete the whole campaign object from all those events. Then all they had to do was clear a couple of columns, and all of a sudden data started flowing through again. So, the kind of things you don't think of as being nightmares.
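
The fix Ryan describes is about as small as transformations get. A sketch, assuming the team really has confirmed the campaign object is safe to discard:

```javascript
// Sketch of the hotfix: a browser extension was injecting unique utm_*
// keys into context.campaign on every event, creating a new warehouse
// column per key. Since the team confirmed these events carried no UTM
// attribution they cared about, the whole campaign object is deleted in flight.
function transformEvent(event, metadata) {
  if (event.context && event.context.campaign) {
    delete event.context.campaign;
  }
  return event;
}
```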

Max Werner:

Yeah, for sure. I mean, fixing that in five lines of JavaScript is, of course, much, much simpler than the three days you just spent in SQL to fix that table and backfill data from S3 or wherever it's stored. Yeah, for sure. That sounds fun.

Ryan McCrary:

Yeah. Cool. And then, the last example, this is one of yours that I'm going to steal, Max. This is specifically for warehouse actions. When we're talking through most of these events, you come down here, you look at the event structure, and it's like, "Man, this is a very pretty JSON object." Everything...

Max Werner:

From an event stream.

Ryan McCrary:

Yeah, it's from an event stream. It's from an SDK that's made to do this. So, it looks exactly like we would expect. We can traverse all these keys exactly how we want to, and everything is great. But as you may know, and I've got one of your events here, events coming from an ETL source, our Cloud Extract, or from reverse ETL with warehouse actions are not quite as pretty. They're coming from a warehouse table, so we're going to have some different column names and mappings. And so, we've got this not-nearly-as-pretty event here, and we don't want to dump it into a tool as is because it's just not going to work.

Max Werner:

In this case, you just end up with a JSON string in your downstream tool, and that's not helping anybody.

Ryan McCrary:

Yeah, then we're back in a situation of having to clean up with a bunch of SQL downstream instead of a couple of lines of JavaScript.

Max Werner:

The idea here is, you see there are two keys in there, old truth and new truth. It's basically a warehouse action that looks at a table that holds, between two different runs, the two different states of an account or a user object, and makes a JSON string out of each. The function is designed to take this all in and compare the two to see if you actually need to do anything downstream with it. Because even if you do the string comparison at the warehouse level and say, "Is this JSON string the same," and only send the ones that are different, you can still end up with things that you don't need.

Max Werner:

In the new run, there could be things that have been dropped, like an event key [inaudible 00:38:19]. That would make the string different compared to the last run, but it doesn't mean that you need to send that event downstream, which is especially important for API rate-limited tools. So, we basically built this couple of functions that rebuild that JSON string into a proper payload object and then compare the two, and see: are there actually things that need to be synced or not? And then the output becomes a much, much nicer event stream.
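
A stripped-down sketch of that comparison might look like the following. The property names `old_truth` and `new_truth` and the payload shape are assumptions based on the description, not the actual code Max wrote:

```javascript
// Sketch: the warehouse-action row carries two JSON strings, the previous
// state ("old truth") and the current state ("new truth") of a user record.
// Parse both and compare key by key. Iterating only the NEW keys means a
// key that merely disappeared does not count as a change, matching the
// point that a dropped key shouldn't trigger a downstream sync.
function changedFields(oldObj, newObj) {
  const changes = {};
  for (const key of Object.keys(newObj)) {
    if (JSON.stringify(oldObj[key]) !== JSON.stringify(newObj[key])) {
      changes[key] = newObj[key];
    }
  }
  return changes;
}

function transformEvent(event, metadata) {
  const oldTruth = JSON.parse(event.properties.old_truth || "{}");
  const newTruth = JSON.parse(event.properties.new_truth || "{}");
  const changes = changedFields(oldTruth, newTruth);
  if (Object.keys(changes).length === 0) return null; // nothing to sync
  // rebuild a tidy payload holding only what actually changed
  event.properties = { userId: event.userId, ...changes };
  return event;
}
```

Dropping the no-op records here is what keeps rate-limited downstream APIs from being hammered with unchanged rows.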

Ryan McCrary:

Yeah. So, yeah, more of what our downstream tools are expecting to see. Max, correct me if I'm wrong, and this may be a simplified version of it, but at least part of this was something that you built with some of our team to mimic SQL Traits within Segment based on a warehouse, is that right?

Max Werner:

Yes, that's it. And it's not even simplified. That is that code.

Ryan McCrary:

Cool.

Max Werner:

Again, what is that? 53 lines of JavaScript?

Ryan McCrary:

Yeah.

Max Werner:

That solves that problem. Because the problem behind it was that, of course, we only want to send the changed records, partially to not incur unnecessary RudderStack events, but also to not hammer downstream APIs with hundreds of thousands of records if only 5,000 have changed and actually need something updated. So, this, with a little SQL magic to get these different runs together, basically mimics SQL Traits from Segment Personas very, very nicely. Yeah.

Ryan McCrary:

And I mean, this is clearly the RudderStack voice speaking, but 53 lines of JavaScript is a little cheaper than SQL Traits.

Max Werner:

Yeah, I mean, it depends on how much you charge per JavaScript line as a consultant, I guess. If SQL Traits is the only thing that you use Segment Personas for, hit me up, and I'll explain to you how I did this, because this is a lot easier and cheaper.

Ryan McCrary:

There we go. Get a little sales time in for Max as well. Cool. Before we move on to the last little bit, anything about transformations or anything you want to walk through or show, Max?

Max Werner:

The metadata object is a neat one. You can see from the pieces that you get there, the event and the metadata. The event is fairly self-explanatory; it's obviously your JSON object. But the metadata piece is the neat part, which I think you can show. It tells you everything extra, like the source, that you can log out.

Ryan McCrary:

Yeah, I'll go ahead and show it. And then let's do event properties. Yeah, so this is a good look at what this metadata object is. As Max mentioned, you get it as one of the arguments in this transformer. And I guess, while we're talking about this, I'll walk through it, because this is one of those things that I'm a little too close to; it just makes sense to me because it's what I do every day. But like we said earlier, this is just JavaScript, right? So, the only expectation of a transformation is that we have to have a function called transformEvent. And we have to return... Well, we don't actually have to return an event, we just need to have some type of return.

Ryan McCrary:

So, as you can see, these inputs and outputs are all an array of a single object. And so, really, again, as long as we have this transformEvent... You can see, I think in this one that you have, Max, this is the transformEvent function, and then we've got additional functions outside of it that we can reference from within. So, it really is just truly vanilla JavaScript. And so, yes, I glossed over that. But as long as there's this. There's also, and this is in our documentation, a transformBatch function that you can use to do some sampling or aggregating across events in a batch.
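
That minimal contract might be sketched like this (hedged: consult the Transformations docs for the exact expected signatures, and note the editor uses `export` on these functions):

```javascript
// The only contract described above: define transformEvent (called per
// event) and/or transformBatch (called with the whole batch). Return the
// event(s) to pass along, or null to drop an event entirely.
function transformEvent(event, metadata) {
  // plain JavaScript; helper functions defined outside are fine to call
  return event;
}

function transformBatch(events, metadata) {
  // e.g. crude sampling: keep every other event in the batch
  return events.filter((_, i) => i % 2 === 0);
}
```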

Ryan McCrary:

And so, the second argument... So, the first argument is always going to be that JSON object, that event. And then the second is optional, but it's the metadata object, which is going to give us some freebies. And so, you can see I just added those to the properties key. I would show you in the logs, but this is my test account, and I think the logs are broken right now. But we do have a log function down here where you can log from within the transformation, like you would expect to do with a log in a browser setting, and output data there. But it's going to give us the source ID, the destination ID, and the message ID, which is something that's going to be unique to each message coming in.

Ryan McCrary:

It's common that you might want to generate multiple events from a single event. And so, it's important that we're incrementing this message ID, because if we're generating two events with the same message ID we can run into deduplication issues. We do this internally a couple of times, where we basically just take the message ID and append a one to it or something, just to make it slightly different. There's also the job ID, which is more of an internal reference for us debugging at RudderStack, and then the destination type, and the session ID, which you can use to calculate your own sessions.
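
The messageId trick Ryan mentions could be sketched as follows. Fan-out is shown via transformBatch here; the derived event name and suffix scheme are illustrative assumptions:

```javascript
// Sketch: fan one incoming event out into two, making sure the derived
// copy gets a distinct messageId (original ID plus a suffix) so that
// downstream deduplication doesn't silently drop one of the pair.
function transformBatch(events, metadata) {
  const out = [];
  for (const event of events) {
    out.push(event); // original passes through untouched
    out.push({
      ...event,
      event: `${event.event}_enriched`,      // hypothetical derived event
      messageId: `${event.messageId}-1`,     // slightly different, still deterministic
    });
  }
  return out;
}
```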

Ryan McCrary:

These are valuable for... The source and destination IDs, if we go back and look at our connections page, are going to be, not to be confused with the write key, in the address bar up here. So, this is the source ID here. We can actually use that metadata object to do any kind of operation we want on an event based on whether it is or is not from a specific source, or is or is not going to a specific destination, which goes back to what you were saying earlier about being able to share some of these transformations across connections.

Max Werner:

Yes. For some people that might not know, transformations are attached at the destination level. So, you can of course have multiple sources flow into the same destination, right? With the help of that metadata object, you can say, okay, for instance for this test source or the mobile app or whatever, where I get tons of bad signups or something, I want to do specific logic only related to that source. That helps you make a transformation not just at the destination level, but almost at a connection level, because you can do these combos of source and destination via that source ID piece in the metadata.
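
Max's per-connection pattern could be sketched like this (the source ID below is a made-up placeholder; real ones come from the address bar on the connections page, as Ryan showed):

```javascript
// Sketch: the transformation is attached to the destination, but
// metadata.sourceId scopes the logic to one source flowing into it,
// effectively giving a per-connection transformation.
const MOBILE_TEST_SOURCE_ID = "1xYzMobileTestSource"; // hypothetical placeholder

function transformEvent(event, metadata) {
  if (metadata.sourceId === MOBILE_TEST_SOURCE_ID && event.event === "signup") {
    return null; // drop the noisy test signups from just this source
  }
  return event; // everything else flows through unchanged
}
```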

Ryan McCrary:

Yeah, that's a good point. I mean, here in the UI, you can see the pipelines as they go. And then, these little purple targets are the actual transformations that are applied. So, you can see they do live on the destination, as Max said. Sometimes it may be, like we mentioned earlier with something like Salesforce, that we're cleaning these custom fields regardless of the source. Or it may be that we're only sending specific events from specific sources, dropping or filtering the rest, so it gives you that power.

Max Werner:

You could add something so that if an event comes into Google Analytics from this one test source, it automatically gets prefixed with something, and that helps you filter out your own test data.
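
Max's suggestion is only a couple of lines (the source ID and prefix are made-up placeholders):

```javascript
// Sketch: prefix event names coming from a known test source so they're
// easy to filter out in Google Analytics reports.
const TEST_SOURCE_ID = "1xYzTestSource"; // hypothetical placeholder

function transformEvent(event, metadata) {
  if (metadata.sourceId === TEST_SOURCE_ID && typeof event.event === "string") {
    event.event = `test_${event.event}`;
  }
  return event;
}
```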

Ryan McCrary:

Yep. We actually have some customers that use this where they're basically collecting and aggregating data on behalf of other customers. They may be a consultancy or similar, where they have six or seven customers they're serving from a single dashboard, and they're piping that data to their customers' specific Google Analytics destinations. They can have a number of Google Analytics destinations because we don't limit how many destinations of a single type you can have in here. And so, they can send to all of those, and then use the same transformation to basically say, if the source is this, then it goes to this destination, if it's not filtered out. So, you can use it for some unique things like that as well. Yeah, that was a good catch, Max. I didn't even mention the-

Max Werner:

It's one of those things. I mean, lots of these transformation-type things are things you don't know you needed until you do. And then it comes in real handy.

Ryan McCrary:

Yep. And one thing I'll hit on briefly. I don't know if I have this transformation in here. But this is a little bonus feature that we have. No, it's not in here, but I'll mock one up from scratch. With a webhook destination, I mentioned earlier that in a lot of cases, between the combination of a webhook destination and a transformation, you can do a lot of things. You can get that event or that data into the shape that you want. When you set up a webhook, you can add static headers to the webhook destination from within the UI, but within a transformation you can actually access those headers and change them dynamically.

Ryan McCrary:

So, again, if you're reaching out to an API to enrich events, or, from the previous example, if you're routing data for other customers, you can change some of those headers based on a secret mapping or some type of mapping in here for those different URLs, and then add those as well. And then, if it's a GET webhook, or even I think with a POST, you can append query params from within the transformation as well. You can access that object dynamically from here. To that point, our goal is to make the webhook destination as extensible as possible. So, if anyone out there watching, Max included, has any thoughts on other things that would be nice to grab or modify from within the transformation for a webhook destination, we're all ears on that.
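
The shape of that dynamic-header idea might look something like the sketch below. To be clear, the exact mechanism for attaching headers from a transformation is documented with the webhook destination; the `header` key and the per-customer mapping here are assumptions for illustration only:

```javascript
// Sketch (assumed API): pick per-customer credentials from a mapping and
// attach them to the outgoing webhook call in flight. The `header` key on
// the event is an ASSUMPTION; check the webhook destination docs for the
// actual field the destination reads. The secrets below are fake.
const API_KEYS = {
  customerA: "key-aaa", // hypothetical per-customer secrets
  customerB: "key-bbb",
};

function transformEvent(event, metadata) {
  const customer = event.properties?.customer;
  if (customer && API_KEYS[customer]) {
    event.header = { Authorization: `Bearer ${API_KEYS[customer]}` };
  }
  return event;
}
```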

Max Werner:

I mean, that's awesome. The other side of the stack would be the ingestion.

Ryan McCrary:

Yeah, so that's a good segue into something that was released literally today. We've talked mostly... And this makes sense, right? Because transformations live on the destination, they're transforming data that's already been ingested into RudderStack, so we typically think about them with regard to those downstream tools. But with the addition of warehouse actions, and now with this new webhook source, we have to consider the other side too, kind of like the example I showed of Max's interesting-looking event from the warehouse that we made a little prettier in a transformation.

Ryan McCrary:

We now have customers ingesting data from a cloud source through an ETL kind of action, but there are other cases where they don't want to do it as a batch, reach out, grab the data, and sync it. They want to be more reactive and have that data coming in in real time. Because of the limitations up until now of the RudderStack API, we needed a user ID or anonymous ID, and roughly the RudderStack API structure, to ingest those. But now we've got this idea of a custom source. You can attach it to a specific write key, and you can send any webhook your heart desires into RudderStack, and I'll show you... This example is MailChimp.

Ryan McCrary:

We don't have a MailChimp connector. But now, when any action happens in MailChimp... This is a test account, but you can put your RudderStack URL here and choose what you want sent to it, and all of a sudden those events are going to be coming in. They're going to look like the following: there'll be a track event, a generic webhook source event, and we'll give it a randomly generated ID. And then we will flatten out whatever properties are coming in. This only supports POST actions right now, but any of these are just going to be flattened. And you can see, this is going to look similar to what you get from a warehouse action, where the keys are not quite as pretty, but we can do whatever we want with this. So, it's pretty nice, whether it's something like MailChimp, or even if you're trying to attach customer data from something like Stripe or Shopify, you can now ingest that data into RudderStack in real time, and it comes in just as an event stream.

Ryan McCrary:

But as you can see, we mentioned this in conjunction with the idea of the transformations because this payload is clearly going to need some work. There are not a lot of downstream tools that are going to be thrilled about this, and even our warehouse is not going to look great unless we're specifically designing it around this. But, again, this is coming in just like anything else to RudderStack, just a JSON object. And so, we can do what we need to do from there, and you can see ways to pretty it up. We've got some sample transformations, and we're working on the docs around that. So, yeah, man, that was a great-
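
One generic way to pretty up a flattened webhook payload is to rebuild the nesting from the key names. This is a sketch only: the real flattening scheme and separator depend on the sender and the webhook source's behavior, and an underscore separator would collide with ordinary snake_case keys, so treat the shape below as an assumption:

```javascript
// Sketch: rebuild nested objects from flattened keys like "data_email"
// coming off a generic webhook source, before the event hits the
// warehouse or a downstream tool.
function unflatten(flat, separator) {
  const out = {};
  for (const [key, value] of Object.entries(flat)) {
    const parts = key.split(separator);
    let node = out;
    for (let i = 0; i < parts.length - 1; i++) {
      node = node[parts[i]] = node[parts[i]] || {};
    }
    node[parts[parts.length - 1]] = value;
  }
  return out;
}

function transformEvent(event, metadata) {
  if (event.properties) {
    // assumed separator; only safe if original keys don't contain it
    event.properties = unflatten(event.properties, "_");
  }
  return event;
}
```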

Max Werner:

This will be a great combo with the metadata, right? Because that's a transformation you probably have on your warehouse destination, where you need to do this logic of recursively going through event keys or something. But, just for your own sanity, you only want to do that for a very specific event source, which in this case would be that custom source or webhook source, because that saves you from testing a whole bunch of use cases from regular event streams going into your warehouse that you don't need to check that way. [crosstalk 00:51:46] do this weird stuff for the source.

Ryan McCrary:

Yeah, exactly. These are likely going to be very specific, hopefully, at least. And then, I know we're coming up on time, so I'll just hit on a couple of things real quick. I mentioned we won't go through this; it can be involved. In our documentation, this is the docs for the API specifically for these transformations. Creating libraries and creating transformations can all be done via API, so instead of using the GUI that we showed, you can manage the code however makes your heart the happiest and push it. And then further, we have this GitHub Action, and again I won't go specifically through it because we're close on time, but it is a pretty cool tool. It allows you to push these transformations, and libraries even. We have some baked-in test events for them, but you can specify a test input and expected output file, which, again, are just going to be JSON. And then, assuming these match up after you push your code, it'll automatically push and apply those via API. So, this is a much-clamored-for feature that we've finally released, and we're pretty excited about it on our end.
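
Conceptually, the test step of that workflow boils down to something like this sketch: run the transformation against a test input and diff the result against the expected output. File layout and names are whatever your repo uses; inline objects stand in for the JSON files here:

```javascript
// Sketch of the CI check conceptually: apply the transformation under test
// to a known input event and compare against the expected output JSON.
// The actual GitHub Action reads these from files you specify.
function transformEvent(event, metadata) {
  // the transformation under test: a simple event-name normalizer
  event.event = event.event.toLowerCase().replace(/\s+/g, "_");
  return event;
}

const testInput = { event: "Product Clicked", properties: {} };
const expected  = { event: "product_clicked", properties: {} };

const actual = transformEvent(testInput, {});
const pass = JSON.stringify(actual) === JSON.stringify(expected);
if (!pass) {
  throw new Error("transformation output does not match expected output");
}
```

If the comparison fails, the push is rejected before the broken transformation ever touches a live destination, which is the safety net Max calls out below.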

Max Werner:

Yeah, definitely. I mean, just from a scalability perspective, to have that in your own repository, and have it automatically tested. Especially when somebody with less JavaScript experience, shall we say, goes in and says, "It's an easy change," and then it throws eight errors, and they're like, "Well, not that easy a change."

Ryan McCrary:

Exactly, yes.

Max Werner:

Because, I mean, that's the one "downside" of the transformations: you can break your destination by accident if you just go in and add some code that is bad, or has null pointer errors or something like that, and then data will not go anywhere after that. So, this, especially combined with testing, I can see being a very, very helpful feature.

Ryan McCrary:

Yep. Absolutely. Cool. Well, I think that's it. Anything else to add on to that Max?

Max Werner:

Not that I can think of.

Ryan McCrary:

Cool. Well, thanks for joining me. Yeah, as mentioned, I'm on the RudderStack team, but Max, if you want to pitch yourself one more time so people know where to find you.

Max Werner:

Yeah, obsessiveanalytics.com, or you can find me on LinkedIn. If you want to nerd out about data engineering, and the frustration that can come with it, hit me up any time.

Ryan McCrary:

Cool.

Brooks Patterson:

All right. Well, Max, Ryan, great session; we covered a ton of ground today. Thank you, everyone, for joining us. We will follow up. If you registered, we'll send out an email with a link to today's session so you can go back through it. Also, if you're interested or have more questions, please join the RudderStack Slack channel. You can get help from Ryan and his team as well as our growing community of developers. And yeah, certainly follow us on social media, and we hope you'll join us next time. Thanks.

Ryan McCrary:

Cool. Thanks, guys

Max Werner:

See you.

Ryan McCrary:

Yeah, bye.
