Event Streaming: Why Beyond Batch is the Future

What we will cover:

  • How did we get where we are today with batch systems?
  • Why is batch still so popular?
  • What can you do with streaming?
  • Why is streaming the right approach?
  • What are the pitfalls of streaming?
  • Why is moving beyond batch the future?

Speakers

Kostas Pardalis, Head of Developer Experience at RudderStack
Joe Reis, CEO/Co-Founder at Ternary Data. Joe calls himself a "recovering data scientist". He's a data architect and data engineer helping companies build solid data foundations.

Transcript

Brooks (00:00)

Well, thank you so much for joining us today, everyone. I am Brooks, I'm with RudderStack. I'm a marketing manager, and I am excited to have you all here for our webinar today on why Beyond Batch is the future. We are very excited to have a great panel of folks to chat about data processing with you today. So what we are going to cover really is how we got where we are today with batch systems. Then we're going to talk about why it's still popular, move into talking a bit about streaming and why it's the right approach, cover what some of the pitfalls of event streaming are, and talk about what the future really looks like for moving beyond batch.

So to really give us the lowdown on all of this, like I said, we have a great group of panelists, Joe Reis and Kostas Pardalis. I will do quick intros and then let you both expand a bit, and I can quickly just hit the catalyst of our webinar today. Probably about a month ago, Joe was on The Data Stack Show podcast, which Kostas hosts, and we got into a conversation really about batch data versus real-time, and Joe made the point that essentially all data is real-time and that batching data is something that we as humans do because of technical limitations, but that's starting to change. So that's really, again, the catalyst for our webinar today, and we're really excited to dig deep on that subject.

Joe has a varied background in lots of different data roles: in operations, in data science, in data engineering, and he is now in a consulting role as the CEO and co-founder of Ternary Data. Kostas is Head of Product at RudderStack, where we are simplifying data pipelines and providing a unified solution for moving data. So with that said, Joe, I'll let you expand a little bit, and Kostas, you can do the same, and then we can jump into our conversation.

Joe Reis (02:38)

Cool. Hey, thanks. Yeah, so my name's Joe Reis, and as mentioned I'm with Ternary Data. We're a data engineering and data architecture consulting firm based in Salt Lake City. And so I think we're known for helping companies evaluate the right data stacks and architectures. Then, unlike a lot of other services companies that just come in and charge you hourly for a bunch of butts-in-chairs hours, we actually teach you how to use these tools in production. So we empower your teams to be better versions of your data selves.

Brooks (03:20)

Great. Kostas, do you want to share a little bit?

Kostas Pardalis (03:24)

Yeah, of course. I'm Kostas Pardalis. As Brooks said earlier, I'm Head of Product here at RudderStack. Before RudderStack, I was CEO and one of the co-founders of Blendo. Blendo was a cloud-based ETL solution. So again, I was building products around moving data around. I have an engineering background, and I think I've been working with data-related technologies for more than 10 years now. That's my passion, especially building products for data that are used by engineers and developers. So yeah, I had an amazing chat as part of the [inaudible 00:04:21] with Joe. Many insights came from there, and this is a good opportunity to expand on these insights and have a deeper conversation around the difference between batch and streaming, what's going on, and what's going to happen in the near future in terms of these technologies and practices.

Brooks (04:41)

All right. So I appreciate the intros, Joe and Kostas, and I'm excited to jump into our conversation about data processing today. And like I said, we really want to start with how we got to where we are today with batch systems. So Joe, I'll let you take this one first, and I also want to see: does either of you have a read on our "how did I get here" guy?

Joe Reis (05:12)

Yeah, it's Talking Heads, David Byrne.

Brooks (05:15)

Yeah, that's right. Well, Joe, can you help us figure out and help David Byrne here, understand how we got where we are with batch systems?

Joe Reis (05:27)

Well, I can certainly tell David Byrne how he got to where he is, and it's a classic video and song. But as far as how we got to batch, I think it's an interesting question, because on one hand we've always existed in a world where data has been generated as events. But batch systems, I think, accommodated the limitations of computational systems. And if we're talking about an analytics use case, which I think is the focus of today's chat, a lot of it had to do with the limitations of storage and compute historically.

So if you harken back to the early days of computing, even say the 50's and 60's and 70's and 80's and so forth, data might be entered and saved in real time, but getting the data out of those systems required a lot more overhead than you'd find with the capabilities of today's systems, where you can just read records automatically as changes occur. And so I think that's really where the limitations started. And if you look at older analytical systems, we'll start with maybe Teradata, for instance, I think that was one of the first modern data warehouses back in the late 70's.

It was a very expensive system and it was completely batch, right? So you had to load it in batch and you had to read from it very delicately because it didn't have a lot of horsepower, but for the time it was very sophisticated. As time went on, and as people started using commodity hardware to solve a lot of distributed computing problems, both for storing and for writing and reading data, I think the notions started changing in terms of what could be possible with batch systems. Of course, this also sort of parallels the use of logs in systems, right? So in databases you've always had the notion of time-based logs, but again, data from databases was typically read in batch. I think now that compute and storage have caught up with the needs of streaming, maybe things are changing a bit. So what's your take, Kostas?

Kostas Pardalis (08:10)

Yeah. I agree with you, Joe. I think it has to do with the evolution of how we work with data. And I think that to better understand why we are here today, we need to start from where we started, right? Databases are not something new. They're one of the first things that we created as part of the technology. But there are different workloads when it comes to databases. We have the OLAP case, which is mainly for analytics purposes. We have OLTP, which is the transactional databases, like Postgres for example, that we could use to drive an application, and these two things need to communicate. So traditionally what was happening was that we had a transactional database and data was created there. So of course the data is at rest at that point, right?

So you need to have a process that is going to pull the data from there and push it into a data warehouse, which is an OLAP system, to do the analytics. It made a lot of sense to do it in a batch fashion, not because the data did not have a streaming nature. Of course it did, but we were putting that buffer in between, which was the transactional database. And of course there were technological limitations there, and there were cost-related issues, all these things that Joe mentioned. So it made a lot of sense to just follow a batch process. And I would add another dimension here, which is how easy it is to reason about the process itself, how you can identify issues, and how you can react to issues on a streaming system compared to a batch system.

So a batch system is usually a bit safer. If something goes wrong, it's easier to repeat the process, and it's harder for data to get lost in the process. So it made a lot of sense to have this kind of, let's say, architecture, and of course the needs from the market were not that real-time, right? 20 years ago marketers didn't want to uncover things in real time. None of the functions of, say, the company were mature enough to go and use real-time data. Now these things are changing and today we are facing a completely different reality. And yeah, it makes much more sense, and it's much more feasible, to support and have streaming systems.

Brooks (11:00)

Kostas, we've hit on a lot of the reasons that it made sense to use batch: it's safer, and the needs really just weren't there for real-time. We're kind of alluding to streaming, but before we get there, Joe, tell us a little bit about why batch is still popular. If we do see use cases for more real-time, and if technology is advancing, why are folks still using batch?

Joe Reis (11:38)

I think that the simple answer is that it's really convenient for a lot of reasons. So let's unpack the convenient part. You have convenience in the sense that a lot of analytical systems, OLAP databases, still tend to favor a batch paradigm, right? So even if you're talking about popular data warehouses, a lot of those aren't built to ingest streaming data. You have to micro-batch it to ingest it, or you're probably going to break it. So columnar databases are widely used, but there are definitely some limitations. And I would say people have just grown to tolerate batch. I think we'll touch on that in a bit.

And I think that's going to be less and less of a tolerated thing. But on the source system end of things, if we're starting at the beginning of the data engineering life cycle, a lot of systems still, I wouldn't say favor batch, but when you're getting data out of these systems, whether you're talking about an OLTP database or an API, it's just a lot easier to export records in bulk and then, through an ingestion mechanism, load those into a target OLAP system.

It's also a lot easier to hire for batch. So if you look at the job requirements of maintaining a streaming system like Kafka or Flink or Spark, it's a pretty heavy-duty skillset that's very expensive. And so I just think the confluence of these factors, and I'm probably missing a few, lends a sense of convenience to batch, right? It's familiar to a lot of people and it's easy to wrap your head around. Yeah, you're maybe only going to get your reports up to date each hour or each day, but at least you'll get something. And so I think people tolerate it more than anything.
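
To make the micro-batching point above concrete, here is a minimal Python sketch of how a stream of events might be grouped into small bulk loads before reaching a warehouse. Everything in it is an assumption for illustration: `micro_batches`, `load_stream`, the `event_id` field, and the list standing in for a warehouse bulk-load call are hypothetical names, not any particular product's API.

```python
from typing import Callable, Iterable, Iterator, List


def micro_batches(events: Iterable[dict], max_size: int) -> Iterator[List[dict]]:
    """Group a (possibly unbounded) event stream into lists of at most max_size."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= max_size:
            yield batch
            batch = []  # start a fresh list so earlier batches are not mutated
    if batch:  # flush whatever is left when the stream ends
        yield batch


def load_stream(events: Iterable[dict],
                bulk_load: Callable[[List[dict]], None],
                max_size: int = 3) -> int:
    """Call bulk_load once per micro-batch and return the number of load calls."""
    calls = 0
    for batch in micro_batches(events, max_size):
        bulk_load(batch)
        calls += 1
    return calls


# Hypothetical usage: a plain list stands in for the warehouse's bulk-load API.
warehouse: List[List[dict]] = []
stream = ({"event_id": i} for i in range(7))
n_calls = load_stream(stream, warehouse.append, max_size=3)
```

Seven events become three load calls instead of seven single-row writes. A real pipeline would typically also flush on a timer so that a quiet stream does not hold events back indefinitely; that is left out here to keep the sketch short.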

Kostas Pardalis (13:53)

Yeah, absolutely. I totally agree with Joe on that. There's always inertia, right? The people, the market itself, the industry were used to working in batch and thinking in batch, right? Things do not happen that fast even in the tech industry. Although we always think that whatever happens in tech is happening really fast and change is always constant and all that stuff, you still have people who know how to set up these systems, there are technologies that are built around these systems, and all these things take time to change. So that's one of the reasons. And as I said earlier, it's just easier to work with batch systems. They're much easier to reason about, and they're safer.

It's also very interesting because I think there are some paradigms out there, like the Lambda architecture, for example. If someone looks at the typical example of the Lambda architecture, you have the streaming side of things and the batch side of things. So they are completely distinct in this architecture, which was, and probably still is, quite dominant when we are architecting data infrastructure. And the main idea was that you stream the data if you want to do things like notifications, like real-time troubleshooting, things like this, right?

So there was no concept of using these streams to feed systems where you could do analytics or go back in the past and use historical data to do that. And then you had the batch system, which was exactly for that reason. It was, okay, this is the slow process of pulling the data, pushing the data into the data warehouse or the data lake, and letting your data engineers and data scientists go there, build their models and all these things. And people follow, people are mainly followers, right? There is a paradigm out there and we tend to follow this paradigm.

I was doing the same thing and I'm still doing it, but things are changing, and I think that we are going to see some kind of merge of this Lambda architecture into something where the streaming and the batch processes are not going to be that different. By the way, because Joe mentioned Kafka: Confluent, the company behind Kafka, and the people behind Kafka came up with a new architecture called the Kappa architecture, and they try to do exactly that. They were trying to unify everything under a streaming paradigm. So it's happening; it just takes time, like everything else.

Joe Reis (16:50)

Yeah, it takes a lot of time. I think Kappa is an interesting one. I've seen some companies attempt it, and as I said, we'll talk more about this when we talk about some of the pitfalls of streaming, but I think there are some challenges around that architecture. It's a bit unwieldy if you're not really clear on how you're going to manage the systems. So again, that's why I think people just default to batch even when they try something like Kappa. "God, this is pretty hard." Then you end up with sort of Lambda and sort of Kappa, and then it's sort of none of the above and all of the above. So, yeah, it's interesting.

Kostas Pardalis (17:33)

Yeah, I think you're absolutely right, Joe. First of all, as everyone knows who has tried to set up something like Kafka or Spark and put it into an operational mode, these systems are not easy, right? It takes a lot of people with a lot of knowledge, and if we want to see broader adoption, and not just in the large enterprise, we still need more tools to be built out there that improve the ergonomics around working with data streams and batches and all these things. These are things that we need to work on to make them more approachable to more engineers out there.

Joe Reis (18:17)

Sure.

Brooks (18:18)

So things are moving past batch, but there are obviously still a lot of reasons to continue using batch. What are some of the things that we get excited about when thinking about streaming, thinking about moving beyond batch, Joe?

Joe Reis (18:39)

If we take a step back and understand, okay, so what's going to drive streaming and the growth of it? I think there's a stat from IDC. It was like 30% of data is going to be streaming data by 2025. I don't know what the real number is, and it depends on how you define streaming as well, but it's driven by use cases, right? What this means is that as more people are using mobile devices, for example, or just interacting with apps, IoT, self-driving cars, that kind of stuff, there's just a lot of data being generated. And so I'll talk about the cliched examples and then maybe some other stuff. So one of the cliched examples is that you can do anomaly detection on streams, right? That's a classic one.

What else? You can look at e-commerce transactions and respond to them: you can provide personalization in real time and react to inventory levels for your e-commerce site in real time. Maybe if you're running a promotion, you don't want to run promotions if inventory is at zero, for example. Self-driving cars, obviously, are a streaming use case. If you've ever thought about one, there's just a lot of data being generated off a self-driving car, and there are a lot of things you can do with the data there. So let's say there are a lot of things you can do with streaming. If you zoom out too, think of how you interact with your environment right now, and how you always have, even before we had the internet, right?

You interact with the environment in real time. There's not really a sense of past or present. You're just responding to the things that are occurring around you. And that, to me, has always been the holy grail of streaming: computers interacting with the environment in such a way that it's seamless. So obviously there's a bit of a ways to go with that in the universal sense, but if you can imagine everything in a business, for example, occurring and being responded to in real time, that'd be an interesting thought experiment. Why do you have to have lags? We have lags because you have bottlenecks, but as a thought experiment, if you just removed all the bottlenecks and you had no lag between when an event occurs and when something needs to be acted upon, I find that a very interesting thing to think about. So if I get asked, what can you do with streaming? I would say for business, that would be the ideal.
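
The anomaly detection example mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Welford's online algorithm to keep a running mean and standard deviation, and flags any value more than three standard deviations from what has been seen so far. The class and function names, the threshold, the warm-up length, and the sample readings are all hypothetical.

```python
import math
from typing import Iterable, Iterator


class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        # Sample standard deviation; defined as 0.0 until two values are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_anomalies(values: Iterable[float],
                     threshold: float = 3.0,
                     warmup: int = 5) -> Iterator[float]:
    """Yield values that deviate more than `threshold` standard deviations
    from the values seen before them, once `warmup` values have arrived."""
    stats = RunningStats()
    for x in values:
        if stats.n >= warmup:
            std = stats.std()
            if std > 0 and abs(x - stats.mean) > threshold * std:
                yield x
        # Every value, flagged or not, is folded into the running statistics.
        stats.update(x)


# Hypothetical sensor readings with one obvious spike.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0, 10.0, 9.0]
anomalies = list(detect_anomalies(readings))
```

Each event is checked the moment it arrives, with no need to wait for a nightly job over the full history. Production detectors are usually more careful, for example by using a sliding window or by excluding flagged values from the statistics.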

Kostas Pardalis (21:39)

Yeah. What I would add to what Joe is saying, regarding your question, is that you can pretty much do everything with streaming. It's not a matter of whether there's something you can do with streaming that you cannot do with batching. Even in more demanding situations, let's say, where you are training models or you have some advanced algorithms, in most cases you will find a streaming version of the algorithm. So it's not so much about the limitations. Again, I will go back to the concept of ergonomics: how easy is it to do something with streaming compared to batching today? Some of the difficulties might be because of the nature of the two paradigms, which we can debate. I think for some cases, one of them might be easier to work with.

But at the same time, for a lot of, let's say, the issues or the difficulties of working with streaming, again, I think that it's a lack of tooling. I'll give an example. I was experimenting and playing around with a technology called Materialize, which takes streams and can create, with very low latency, materialized views of this data. And you do that by executing SQL queries. So you get the sense of interacting with a database, but actually what you are doing is interacting with live data streams, right? And the experience was amazing. It was really easy to set it up. It was really easy to connect to a Kafka stream or a Kinesis stream, or even to connect to some local files. And it was magical. You could append entries to a file and you could see in milliseconds that the table, the view, was updated.

These are the types of tools that are missing. And as we create more and more of these tools, I think that we will see more of the stuff that we are traditionally doing with batch processing become available in a streaming version. And the biggest benefit, of course, is latency. Those are the bottlenecks that Joe was talking about. You remove all these bottlenecks and you can actually react to whatever is happening as soon as possible. And that's super valuable for a business. That's what we are trying to do with technology. Now, there might be some specific cases where batch will still be important, but we can discuss those later if you have questions, Brooks.
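
The behavior described above, where a view updates as soon as a new entry is appended, comes down to maintaining a query result incrementally instead of recomputing it. The Python sketch below is a toy illustration of that idea only. It is not how Materialize is implemented, and the `PageViewCounts` class and the `page` field are hypothetical.

```python
from collections import defaultdict
from typing import Dict


class PageViewCounts:
    """Incrementally maintained equivalent of the query:

        SELECT page, COUNT(*) FROM events GROUP BY page
    """

    def __init__(self) -> None:
        self._counts: Dict[str, int] = defaultdict(int)

    def apply(self, event: dict) -> None:
        # Touch only the affected row instead of re-running the whole query.
        self._counts[event["page"]] += 1

    def read(self) -> Dict[str, int]:
        # Reads are cheap: the answer is already up to date.
        return dict(self._counts)


# Hypothetical usage: events arrive one at a time, e.g. lines appended to a file.
view = PageViewCounts()
for event in [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]:
    view.apply(event)
snapshot = view.read()
```

The cost of each update depends on the size of the change rather than the size of the history, which is where the low latency comes from. Joins, deletes, and late-arriving data are the parts that make real systems hard, and none of them are handled here.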

Brooks (24:33)

Yeah. That's great. Joe, you mentioned that the holy grail really is that machines interact with the environment the same way we do as humans, in real time. So why is streaming the right approach? Is it just because it's the best way for us to make that leap from batch to real-time?

Joe Reis (25:01)

I think it's a right approach. So there might be other right approaches that we aren't talking about and haven't even thought about yet, but take a step back and look at what events are, right? The nature of an event, or the intention of an event, is that you can notify that something happened. The source system generating the notification doesn't need to do anything else. Or there might be some transfer of state, where a system tells you that it did something and that it expects a response back.

And then there are a lot of other variations of types of events. Well, let me take a step back and also maybe give an opposite example. Right now, when somebody enters data into a form, the data just goes into a database, and then what happens? No idea. Maybe I take the data and do something with it. Or a system may use that event to trigger some downstream action. But in the former case, if it just sits there, what good is that? You might use that data later, that's fine. But if you're able to have reactive systems that simply respond as events are generated, I would think that would give your organization a lot more agility to be responsive.

And there are endless tools out there that do this right now. Zapier comes to mind as one that's more of a consumer-end product where, when things happen, you can engineer pipelines to do stuff. I think what's starting to happen, though, is you're starting to see more intelligent automation with streams, and, as Kostas points out, incorporating machine learning. This has been missing in a lot of cases, because it's been very heuristic-based: if/else statements, for example, or case statements or something like that in code.

But having an intelligent system just decide and take action, a lot of bigger companies have been doing that for a while, and I think the democratization of that will be fascinating to watch unfold as those same capabilities start becoming available to small and medium businesses, for example. And so why is it, to me, the right approach? It's the right approach because it removes bottlenecks. Unless you like bottlenecks. I think bottlenecks actually serve a purpose in some cases. Bureaucracy is good in some cases. You may want things to move slower. But I think if you have the option of doing streaming, and implementing it when it's necessary, it should be the default approach.

Brooks (28:18)

Kostas, anything to add?

Kostas Pardalis (28:59)

Not really. To be honest, I think Joe gave an excellent description of why streaming is the right approach. I would just add something that's a bit more philosophical. I would say that, in the end, time is probably the most valuable resource that we have. That's why we want automation and that's why we have technology, right? We are trying to utilize our time in the best possible way. And streaming is exactly that. It's a paradigm where, when something happens, we react to it as fast as it happens. We don't wait for something else to happen and then see. So that's my opinion. Outside of this, I think that Joe gave an excellent description of why streaming is the right approach.

Brooks (29:19)

That's great. Well, Joe, you have hit on this a little bit, but tell us what are some of those pitfalls to streaming? What's a pitfall here you need to be on the lookout for when it comes to streaming?

Joe Reis (29:35)

Gosh, a flashback to my childhood in the 80's. There was a lot of Talking Heads, and I played this game a lot. So the pitfalls, okay, I think there are a few. Obviously, in the first part of the slides we talked about why batch is, by default, still here to stay. The obstacle to streaming, I would say, is basically inertia and organizational willpower to change. That said, some of the pitfalls that I see with streaming systems right now are on the source system end of things, where you may have to pull from APIs.

And so you're going to have an inherent lag there, because not a lot of systems are set up to provide default streaming outputs, right? Some of those systems rely on basically batching up data in a transactional database, for example, and then sending it off. Change data capture is one where I think people are obviously seeing a lot of success. And then there are other things, log systems and so forth. When we're talking about source systems, I think the pitfall is teams making sure that the source systems are capable of providing a good streaming experience without simultaneously crashing the production system that the data is coming from.

I've seen people try to do streaming out of source systems and end up compromising the production database that they're reading from. That's never a good idea. But then you start getting into ingestion. I think there's no shortage of tools right now that can help with streaming, but it's still immature. The closest thing to a true stream processing system that I've seen so far is Flink, and that's still got a lot of overhead to stand up. I think it's getting easier. And then there are other ones like Spark, which I think do more micro-batching and so forth. But you're starting to see more interest in SQL systems, and you brought up Materialize.

It's another one where I think there's a lot of promise in these types of solutions. But the question I have, and maybe you can take this, Kostas, is: okay, so say that you do streaming. One of the big cruxes I see right now is that one of the benefits of doing traditional OLAP is that you have the sense of dimensional modeling, the Kimball style, for as much as people deride it. I think there's still a lot of benefit in Kimball in the sense that you can reason about your business definitions, in a way that I think streaming, at least at this point, makes a lot more complicated when you reason about traditional notions of facts and dimensions.

And again, facts and dimensions, I think, are still very useful. There's a reason Kimball's been around for a really long time and people use it, and it's not because it sucks. It's because it works really well for a lot of business use cases. And no matter how hard people have tried, and I've been in this game for a really long time, I've seen people say, "Kimball methodology is on the way out, or dimensional modeling is old school and it should be thrown into the dustbin of history."

And every single time I see people come back to it. So it's like, well, how else am I going to reconcile business definitions? That's a big crux I see with streaming right now: even if you can stream into something like Druid, which I think is more capable of handling ingestion of streaming data, you've still got the business definitions to contend with.

Kostas Pardalis (33:26)

Yeah, that's a very interesting point you're making there. And I feel like we are going to have, let's say, some kind of repetition of what happened with NoSQL and SQL, where it was supposed to be the end of the SQL database, right? Who needs them? But I think when a paradigm in architecture or technology has been proven for 30 years, 40 years maybe, there is a reason, right? There's so much value that has been built on top of that. It's proven. So I don't think that we should be trying to dismiss it. We should try to see how it can fit into, let's say, a more agile or more flexible architecture in the future.

And I think one of the main differences that will happen is this. In the past you had a data warehouse and everything needed to be under this Kimball-style modeling over there, which is the reason that we had ETL and now we are talking more about ELT: we had to transform the data on the fly to fit the data model at the end. Instead, things will be much more relaxed in terms of the first layer of storage of the data. And I think that's where data lakes are going to become even more important, where you can store the data and be much more relaxed in terms of how the data is stored.

And I think with streaming data especially, and even with databases, using something like a CDC approach to pull the events from there, you can keep a whole history of exactly what happened on your database and when. You can replicate whatever state you want at any point in time, which is amazing. And it's cheap enough right now to do it with storage out there on the cloud.

And the data warehouse, with the dimensional modeling and all these things, is going to be not the centerpiece of what we are doing in our data stack, but just one part, where we take the data from something like a data lake, model it in the right way, and drive very specific use cases inside the organization, like traditional BI, which will always be there. It's not going anywhere. We need BI, we need to understand the past before we try to predict the future anyway. So that's what I see. And it took me a while to mature enough to understand that these kinds of technologies are not going away. They just need to be repositioned.
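
The point above about keeping the full change history and rebuilding the state at any moment can be sketched as a simple replay. This is a minimal Python illustration under assumptions: the change-event shape (`ts`, `op`, `key`, `row`) is hypothetical and simpler than what real CDC tools emit, and the log is assumed to be ordered by timestamp.

```python
from typing import Dict, List, Optional


def state_as_of(change_log: List[dict], as_of: int) -> Dict[int, Optional[dict]]:
    """Replay ordered change events up to and including `as_of`, returning
    the table contents at that moment, keyed by primary key."""
    state: Dict[int, Optional[dict]] = {}
    for change in change_log:
        if change["ts"] > as_of:
            break  # the log is assumed sorted by ts, so nothing later applies
        if change["op"] == "delete":
            state.pop(change["key"], None)
        else:
            # Inserts and updates are assumed to carry the full new row.
            state[change["key"]] = change["row"]
    return state


# Hypothetical change log for an orders table.
change_log = [
    {"ts": 1, "op": "insert", "key": 42, "row": {"status": "new"}},
    {"ts": 2, "op": "update", "key": 42, "row": {"status": "paid"}},
    {"ts": 3, "op": "insert", "key": 43, "row": {"status": "new"}},
    {"ts": 4, "op": "delete", "key": 42, "row": None},
]
at_t2 = state_as_of(change_log, 2)
at_t4 = state_as_of(change_log, 4)
```

Because the log is append-only, the same stored events can answer "what did the table look like at time 2" and "what does it look like now" without taking separate snapshots. In practice the log would live in object storage or a stream, and periodic snapshots would be used to avoid replaying from the very beginning.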

Joe Reis (36:13)

Yeah, I like the comparison with the NoSQL days, because it was almost like this bratty teenager that was rebelling against the parents. "We don't need you anymore." And then, yeah, it's interesting to take a look: Spark was supposed to kill the data warehouse, and now it has SQL in it. And Snowflake, they're doing pretty well. Last time I checked, they're SQL-first. So it's been an interesting renaissance, and even Druid, for example, has got some notion of SQL on it as well. So it's like all these highly performant databases all have superpowers in their respective areas, and they all use SQL on top.

Yeah, it is interesting. The other piece I would mention too, and there's a lot of research and development in this area, is continuous learning of machine learning models: the ability to retrain in real time and learn. I think it's still pretty early days in that area, but things like catastrophic forgetting, for example, are being addressed by researchers right now. And I think that'll be a fascinating area to observe.

What I'd mention too is the reconciliation of just what you described, from the source to the use cases, in sort of the mushy layer in between: what happens with the streaming, ingestion, storage, transformations and stuff. And data quality too, and data governance is another thing. Does that change in the streaming-first paradigm? I know that there are projects that are starting to look into this. Great Expectations is maybe going to start considering streaming use cases. I don't know, but I would imagine so. And dbt, I think they're starting to maybe try and look at transformations in streaming as well, but because it's so-

Kostas Pardalis (38:21)

Yeah, absolutely. And actually, something that I saw a couple of weeks ago: Materialize, for example, now also has support for dbt models, which makes for a very interesting transition for dbt to get into the streaming world. And that's something that I wanted to add. You can actually see the power of SQL, and how important it is, through technologies like dbt. dbt became extremely popular because it could give what was missing from SQL. And suddenly, when you add all these missing parts to the language, or to the ergonomics of the language, you see how powerful it can be, especially for people who are more on the analyst side, not pure engineers.

Joe Reis (39:17)

That's a good point too with respect to streaming. Right now, I think a lot of the streaming technologies, there's still quite a bit of gate keeping because they're very complicated pieces of technology. It's not like spinning up a database or something. There's a lot to these things. Circuited systems, also Kubernetes or Yarn or whatever you want, but there's a lot of moving parts. Then you got to understand. Okay, so I got to understand how to run a distributed system to highly got, you're going to be running these systems on your laptop. And I think the layers of abstraction, this are starting to be simplified for people maybe. And then you can always look at the cloud solutions too.

There are plenty of good things that the clouds provide out of the box that you might want to look at, like Kinesis in AWS. And that, I think, for a lot of people works just fine. And then even simple message brokers like SQS, those will work fine. And then Google has Pub/Sub. And I think when Spotify did their trial of it, I think they threw 2 million events per second at Pub/Sub or something. It didn't even skip a beat when they were testing it. It just worked.

So I don't think there are very many knobs to Pub/Sub at all. So the point being, if you're into streaming, there's a lot of complexity, and maybe the first thing you should do is try not to roll your own. Try and find something managed out there, or if you can find a third-party provider, RudderStack or somebody else, to handle your data ingestion, then simplify your life a bit. Yeah, there's no reason to roll your own stuff in this day and age, unless you have a competitive advantage in doing so.

Kostas Pardalis (41:12)

Yeah, that's an excellent point. And it's one of the main pitfalls that I also see with streaming. Because it's such a complex technology, a new technology that is still maturing, it's very easy for people to over-engineer solutions. As with everything else in engineering, you always have to be very careful not to over-engineer solutions. And it's easier to do that with a streaming platform compared to getting a traditional database and starting to use it.

Joe Reis (41:44)

I'd say there are other pitfalls too. Yeah, you bring up an interesting point, and what I've seen with engineering teams is the phenomenon of resume-driven development. I think it's a real thing, where it's like, oh, I can put Kafka on my resume or something like that. And so you're going to introduce Kafka into a system where perhaps data quality is a problem right now, or data is hard to reconcile even on a batch system, but now you're going to introduce Kafka on top of it. So all you're doing is basically doing dumb things faster. And you haven't actually solved the root issue. You're actually just perpetuating something, taking bad habits and putting them at light speed, which, if your boss is cool with that, go for it. But I don't think a business is going to like that too much.

Kostas Pardalis (42:35)

Yeah, 100%. I totally agree with that. Yeah.

Brooks (42:40)

We are at an exciting point, I think, with everything moving towards, Joe, as you said, streaming first, but batch is not going away. So bring us home, Joe, and just talk about why moving beyond batch is the future. Batch is likely always going to have its place, but why do we move beyond it?

Joe Reis (43:09)

Yeah, it's interesting because I always think of batch as a special case of streaming, actually. Back to our original discussion on the Data Stack Show, all data at some point originated as real-time event data. Whether it was input manually or through an automated process, those were events. And then somehow they got stuck, maybe in a database, and then you've got to move them out. But I think these are artificial constraints. And technology and, I would say, very gifted teams are now focusing on the problem of removing those bottlenecks.

Moving beyond batch is just going to be a natural consequence. If you can get as close to the event as possible and respond to it, then I think that notion is very attractive to individuals and companies. But so yeah, it's interesting because real-time has always been with us and will continue to be with us. And I would say for a lot of cases batch isn't going anywhere. It's a convenient construct, but it's an artificial construct. Your thoughts, Kostas?
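Joe's framing of batch as a special case of streaming can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not anything shown in the webinar: the `event_stream` generator and its click events are hypothetical stand-ins for a real source, and `micro_batches` is a made-up helper name. The same stream is consumed three ways, and the only thing that changes is the window size.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group an event stream into lists of at most `size` events."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # stream exhausted
        yield batch


def event_stream() -> Iterator[dict]:
    # Hypothetical source: seven click events, produced one at a time.
    for i in range(7):
        yield {"event_id": i, "type": "click"}


# Window of 1: every event is handled as it arrives (streaming).
per_event = list(micro_batches(event_stream(), 1))

# Window of 3: small groups of events (micro-batching).
small_groups = list(micro_batches(event_stream(), 3))

# Window larger than the stream: everything lands in one batch (classic batch).
one_big_batch = list(micro_batches(event_stream(), 1000))
```

Seen this way, a nightly batch job is the same pipeline with a very large window, which is one way to read the claim that batching is a constraint applied on top of event data.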

Kostas Pardalis (44:31)

Yeah. I think we gave many reasons during this conversation about why we need something more than just batch and why we might see some kind of rebalancing between the importance of batch and streaming inside the company. I'll repeat again that, for me, the most important reason is latency. It's how fast we can react to what is happening out there and do that at scale. And in order to do that, we definitely need to go beyond batch. Batch will still be there. We don't have to talk about that, to be honest, but more and more, we will see traditional use cases around batch processing become part of the streaming environment. So yeah, let's see. I think we're living in very interesting times for anything that has to do with data in general and with data processing more specifically.

Joe Reis (45:33)

Yeah. That said, it will be interesting to revisit these types of conversations in a few years because I think you're right. We're living at an inflection point right now, in real time, no pun intended. We are literally going through, as I say, almost this weird Renaissance phase of data architectures and even concepts. I think it's a recognition that batch isn't adequate for a lot of use cases, but we're still in the very early stages of what that next step looks like. And there are a lot of very smart people working to solve this problem, receiving a lot of money from VCs and big companies. And I don't know, I'm fascinated. I'm just along for the ride, frankly. Who knows? We could all be wrong. Maybe everything is just going to be batch for the next 50 years. I doubt it, but you never know, so.

Kostas Pardalis (46:37)

100%.

Brooks (46:39)

We shall see, time will tell, that's for sure. Well, Kostas, Joe, thank you so much for your time and insight today. I know we're coming up on time here, but we do have a few questions. So I want to move into a quick question-and-answer session, and really either of you can take this one. We do have a question from an attendee. Given that the limitations preventing real-time data are often not CDPs or warehouses, but rather destinations like Marketo, SFDC and so on, when do you think real-time streaming of event, user, and account data is feasible based on support from destinations?

Joe Reis (47:29)

That is a big bottleneck. SFDC, I think that's Snowflake Data Cloud or, okay, Salesforce Data Cloud? I don't know.

Kostas Pardalis (47:42)

Yeah, it's Salesforce.

Joe Reis (47:44)

Salesforce? Okay. I think it just depends on how willing those destinations are to treat streaming as a first-class citizen. Otherwise, there's not a lot you're going to be able to do about it. And that's a simple fact. So there needs to be enough market demand for these platforms to change. But a lot of it is probably architectural limitations in these underlying systems, and that's going to take a lot of time to reconstruct. You only have to look at Amazon Redshift, for example, which was built on a forked version of Postgres with ParAccel underneath it.

And I know AWS is working really hard to address some of the limitations of Redshift, but you see that even for a company like AWS, with the resources it has, modernizing Redshift is a fundamentally huge undertaking. It's going to just take time. And you're not going to make those investments unless there's competition that's forcing you to do this or a good business reason to do it. So the attendee is absolutely spot on with the observation. This is a big bottleneck, but.

Kostas Pardalis (49:01)

Yeah, it's also a big opportunity to build the new Salesforce, right?

Joe Reis (49:11)

Yup, that's great.

Brooks (49:12)

Well, we do have one more question, if you all have time for one more. So real-time data is often more costly due to API limits or, worse, error-prone due to rate limits. Do you see a shift in the market from larger players that makes this less of an issue?

Joe Reis (49:34)

You'll take that one, Kostas?

Kostas Pardalis (49:36)

Yeah, sure. I think the answer is similar to the previous question, to be honest. So yeah, the limitations are still there. And the best way to remove the limitations is by having the need out there for this kind of processing. So what I think is going to happen is that as more and more pressure is put on by customers for this kind of infrastructure, we will see these limitations going away and the prices will go down. I think it's just market dynamics in the end that are going to drive this. So yeah, I feel much more, let's say, confident that if we have these kinds of limitations from something like a cloud environment or cloud infrastructure, this is much easier to address.

So yeah, I feel like AWS and GCP are going to react much faster to these changes, and they have the capacity to do it. As Joe was saying, re-architecting something like Marketo is probably a much harder task. So that's something that probably has more risk and is probably going to be much slower. But I'm pretty sure that you can see that if you look at, for example, Snowflake, which started as a traditional data warehouse but now has streaming; right now you can stream data into the data warehouse. Same thing with Redshift, which is a much older technology. So I'm feeling much more confident in these technology providers, I would say, especially the cloud providers that provide infrastructure, that things are going to progress much faster for streaming solutions.

Joe Reis (51:34)

Yeah. And again, it comes back to just, what's the demand for it? I've been on teams where we built APIs. And typically these only happen when there are requests to do so, because as an engineering team, you can be very busy dealing with technical debt and new features, and unless the new feature from your product manager happens to be APIs that overcome the limits of pagination or rate limits, it's like, you're going to get what you're going to get.

And so, my advice is, if you have a favorite API you'd like to get real-time data from, just go bug the hell out of the company and keep asking for it, and get all your friends to do it too, and pay them money, this kind of thing. But it's not going to happen on its own. Your biggest bottleneck really is just, I think, human beings at the end of the day and the fact that we're pretty lazy. And unless there's an overwhelming reason for us to make our APIs more available to people, it's like, why are we going to do that? We are, it's a simple fact. So money talks, and that's a simple fact, so.

Brooks (52:50)

That's good. Well, I think the theme of today is that we're at a very exciting point, and time will tell as things continue moving towards a streaming-first paradigm. So yeah, I think it's just exciting. Joe, like you said, maybe in five years we will do another webinar looking back on this conversation and we'll see if we got it right.

Joe Reis (53:15)

Sure. It's my favorite clairvoyance. It's always fun to prognosticate about the future, but it's more fun to figure out where you got it right and wrong, so.

Brooks (53:26)

That's great.

Joe Reis (53:27)

Awesome. Well, thanks for having me on.

Brooks (53:28)

Well, Joe and Kostas, thank you again for your time. Everyone, thanks for joining us. Be sure to check out rudderstack.com and ternarydata.com and we'll see you next time. Thanks.

Joe Reis (53:40)

Thanks. All right. See y'all.

Brooks (53:48)

Bye.

[inaudible 00:53:48].